Parallel Processing
Prof. Chung-Ta King, Department of Computer Science, National Tsing Hua University
CS1103 Electrical Engineering and Computer Science Practicum
(Contents from Saman Amarasinghe, Katherine Yelick, N. Guydosh)
2
The “Software Crisis”
“To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem.”
-- E. Dijkstra, 1972 Turing Award Lecture
3
The First Software Crisis
- Time frame: the ’60s and ’70s
- Problem: assembly language programming
  - Computers could handle larger, more complex programs
- Needed to get abstraction and portability without losing performance
4
How Was the Crisis Solved?
- High-level languages for von Neumann machines: FORTRAN and C
  - Provided a “common machine language” for uniprocessors
- Common properties: single flow of control, single memory image
- Differences: register file, ISA, functional units
5
The Second Software Crisis
- Time frame: the ’80s and ’90s
- Problem: inability to build and maintain complex, robust applications requiring multi-million lines of code developed by hundreds of programmers
  - Computers could handle larger, more complex programs
- Needed to get composability, flexibility, and maintainability
  - High performance was not an issue; it was left to Moore’s Law
6
How Was the Crisis Solved?
- Object-oriented programming: C++, C#, and Java
- Also…
  - Better tools: component libraries
  - Better software engineering methodology: design patterns, specification, testing, code reviews
7
Today: Programmers are Oblivious to Processors
- Solid boundary between hardware and software: programmers don’t have to know the processor
- High-level languages abstract away the processors, e.g., Java bytecode is machine independent
- Moore’s Law allows programmers to get good speedup without knowing anything about the processors
- Programs are oblivious of the processor: they work on any processor
  - A program written in the ’70s still works, and is much faster today
- This abstraction provides a lot of freedom for the programmers
- But, not anymore!
8
Moore’s Law Dominates
- 2X transistors/chip every 1.5 years (called “Moore’s Law”)
- Microprocessors have become smaller, denser, and more powerful
- Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of IC chips would double roughly every 18 months
- Effects of shrinking sizes …
Slide source: Jack Dongarra
9
Transistors and Clock Rate
[Figure: growth in transistors per chip, 1,000 to 100,000,000 over 1970–2005 (i4004, i8080, i8086, i80286, i80386, Pentium, R2000, R3000, R10000)]
[Figure: increase in clock rate (MHz), 0.1 to 1000, 1970–2000]
Performance not as expected? Just wait a year or two…
11
Why? Hidden Parallelism Tapped Out
- The ’80s: superscalar expansion
  - 50% per year improvement in performance
  - Transistors applied to implicit parallelism: multiple instruction issue; dynamic scheduling (hardware discovers parallelism between instructions); speculative execution (look past predicted branches); non-blocking caches (multiple outstanding memory ops)
  - Pipelined superscalar processors (10 CPI → 1 CPI)
- The ’90s: era of diminishing returns
  - 2-way to 6-way issue, out-of-order issue, branch prediction (1 CPI → 0.5 CPI)
12
The Origins of a Third Crisis
- Time frame: 2005 to 20??
- Problem: we are running out of innovative ideas for hidden parallelism, and sequential performance is left behind by Moore’s Law
- Need continuous and reasonable performance improvements: to support new features, to support larger datasets
- While sustaining portability, flexibility, and maintainability without increasing the complexity faced by the programmer; critical to keep up with the current rate of evolution in software
13
Other Forces for a Change
- General-purpose unicores have stopped historic performance scaling
  - Diminishing returns of more instruction-level parallelism
  - Power consumption
  - Chip yield
  - Wire delays
  - DRAM access latency
- → Go for multicore and parallel programming
14
Limit #1: Power Density
[Figure: power density (W/cm²), 1 to 10,000 over 1970–2010, for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium®, and P6, compared against a hot plate, a nuclear reactor, a rocket nozzle, and the Sun’s surface. Source: Patrick Gelsinger, Intel]
- Scaling clock speed (business as usual) will not work
- Power Wall: we can put more transistors on a chip than we can afford to turn on
16
Multicores Save Power
- Multicores with simple cores decrease frequency and power
  - Dynamic power is proportional to V²fC
- Example: uniprocessor with power budget N
  - Increase frequency by 20%: power increases substantially (by more than 50%), but performance only increases by 13%
  - Decrease frequency by 20% (e.g., by simplifying the core): power decreases by 50%; can now add another simple core; power budget stays at N with increased performance
Source: John Cavazos
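The slide’s arithmetic can be checked against the dynamic-power relation P ∝ V²fC. One simplifying assumption here (mine, not the slide’s) is that supply voltage scales roughly linearly with frequency, so relative power goes as the cube of the frequency scale; a minimal sketch:

```python
def relative_power(freq_scale: float) -> float:
    """Relative dynamic power (P ~ V^2 * f * C) when frequency is scaled
    by freq_scale, assuming voltage scales linearly with frequency."""
    return freq_scale ** 3

# Overclock one core by 20%: roughly 73% more power (the slide says
# "more than 50%") for at best 20% more peak performance.
print(relative_power(1.2))   # ~1.73

# Underclock by 20% instead: power roughly halves ...
print(relative_power(0.8))   # ~0.51

# ... so two such simplified cores still fit the original budget N,
# with up to 2 * 0.8 = 1.6x the aggregate throughput.
print(2 * relative_power(0.8))   # ~1.02, about the same power as before
```

The exact figures depend on how much voltage must actually move with frequency, but the shape of the trade-off matches the slide: frequency cuts buy back power superlinearly, which is what makes the extra core affordable.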
17
Why Parallelism Lowers Power
- Highly concurrent systems are more power efficient
- Hidden concurrency burns power: speculation, dynamic dependence checking, etc.
- Push parallelism discovery to software (compilers and application programmers) to save power
- Challenge: can you double the concurrency in your algorithms every 2 years?
18
Limit #2: Chip Yield
- Manufacturing costs and yield limit use of density
  - Moore’s (Rock’s) 2nd law: fabrication costs go up
  - Yield (% usable chips) drops
- Parallelism can help
  - More smaller, simpler processors are easier to design and validate
  - Can use partially working chips, e.g., the Cell processor (PS3) is sold with 7 out of 8 cores “on” to improve yield
19
Limit #3: Speed of Light
- Consider a 1 Tflop/s sequential machine
  - Data must travel some distance, r, to get from memory to CPU
  - To get 1 data element per cycle, this means 10¹² trips per second at the speed of light, c = 3×10⁸ m/s. Thus r < c/10¹² = 0.3 mm
- Now put 1 Tbyte of storage in 0.3 mm × 0.3 mm
  - Each bit occupies about 1 square Angstrom, the size of a small atom
- No choice but parallelism
[Figure: 1 Tflop/s, 1 Tbyte sequential machine; r = 0.3 mm]
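The slide’s back-of-the-envelope argument can be reproduced numerically. One assumption of mine: 1 Tbyte is counted as 8×10¹² bits, which lands almost exactly on the slide’s “about 1 square Angstrom per bit.”

```python
# Limit #3 as arithmetic: a 1 Tflop/s machine fetching one operand per cycle.
c = 3.0e8        # speed of light, m/s
cycles = 1.0e12  # 10^12 memory->CPU trips per second

r = c / cycles   # farthest the data can sit from the CPU
print(r)         # 0.0003 m, i.e., 0.3 mm

# Pack 1 Tbyte (8e12 bits) of storage into that r x r square:
bits = 8.0e12
area_per_bit_m2 = (r * r) / bits
angstrom2 = 1e-20                   # 1 square Angstrom in m^2
print(area_per_bit_m2 / angstrom2)  # ~1.1 square Angstrom: about one atom per bit
```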
20
Limit #4: Memory Access
- Growing processor-memory performance gap
- Total chip performance still growing with Moore’s Law
- Bandwidth rather than latency will be the growing concern
21
Revolution is Happening Now
- Chip density is continuing to increase 2x every 2 years
- Clock speed is not; the number of processor cores may double instead
- There is little or no hidden parallelism (ILP) left to be found
- Parallelism must be exposed to and managed by software
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
22
Multicore in Products
- All processor companies switch to MP (2X CPUs / 2 yrs)
- And at the same time:
  - The STI Cell processor (PS3) has 8 cores
  - The latest NVidia Graphics Processing Unit (GPU) has 128 cores
  - Intel has demonstrated an 80-core research chip

  Manufacturer/Year    AMD/’05   Intel/’06   IBM/’04   Sun/’07
  Processors/chip          2         2           2         8
  Threads/processor        1         2           2        16
  Threads/chip             2         4           4       128
24
IBM Cell Processor
- Consists of a 64-bit Power core, augmented with 8 specialized SIMD co-processors (SPU, Synergistic Processor Unit), and a coherent bus
27
Parallel Computing Not New
- Researchers have been using parallel computing for decades, mostly in computational science and engineering
  - Problems too large to solve on one computer; use 100s or 1000s
- Many companies in the ’80s/’90s “bet” on parallel computing and failed
  - Computers got faster too quickly for there to be a large market
29
Why Parallelism?
- All major processor vendors are producing multicores
  - Every machine will soon be a parallel machine
  - All programmers will be parallel programmers???
- New software model
  - Want a new feature? Hide the “cost” by speeding up the code first
  - All programmers will be performance programmers???
- Some of it may eventually be hidden in libraries, compilers, and high-level languages, but a lot of work is needed to get there
- Big open questions: what will be the killer apps for multicore machines? How should the chips be designed and programmed?
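A minimal sketch of what “every programmer becomes a parallel programmer” means in practice, using Python’s standard multiprocessing module. The prime-counting workload is a hypothetical illustration, not from the slides: the sequential loop occupies one core no matter how many the chip has, while the process pool can spread the same work across all of them.

```python
from multiprocessing import Pool

def count_primes(limit: int) -> int:
    """Deliberately naive CPU-bound work: count primes below `limit`."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    chunks = [5_000] * 4

    # Sequential: one core does everything, regardless of how many exist.
    seq = sum(count_primes(c) for c in chunks)

    # Parallel: the same chunks spread across worker processes,
    # one per available core by default.
    with Pool() as pool:
        par = sum(pool.map(count_primes, chunks))

    assert seq == par   # identical result; only the core usage differs
```

On a 64-core machine the sequential version still uses 1/64 of the chip; the pooled version is what lets the program track core counts as they double.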
A View from Intel
- What is a tumor? Is there a tumor here? What if the tumor progresses?
- It is all about dealing efficiently with complex multimodal datasets
- Recognition, Mining, Synthesis
Source: Pradeep K Dubey, Intel
[Figure: the RMS loop. Visual input streams feed Recognition (“What is …?”), Mining (“Is it …?”), and Synthesis (“What if …?”); Recognition finds an existing model instance, Synthesis creates a new model instance from the Model, and synthesized visuals are produced via Learning & Modeling, Graphics Rendering + Physical Simulation, Computer Vision, and Reality Augmentation]
- Most RMS apps are about enabling an interactive (real-time) RMS Loop, or iRMS
Source: Pradeep K Dubey, Intel
Emerging Workload Focus: iRMS
34
Re-inventing Client/Server
- “The Datacenter is the Computer”: building-sized computers (Google, MS, …)
- “The Laptop/Handheld is the Computer”
  - ’08: Dell # laptops > # desktops?
  - 1B cell phones/yr, increasing in function
  - Will desktops disappear? Laptops?
- Laptop/handheld as future client, datacenter or “cloud” as future server
- But, energy budgets could soon dominate facility costs
35
Summary of Trends
- Power is the dominant concern for future systems
- Primary “knob” to reduce power is to lower clock rates and increase parallelism
- The memory wall (latency and bandwidth) will continue to get worse
- Memory capacity will be increasingly limited/costly
- The entire spectrum of computing will need to address parallelism; performance is a software problem
  - Handheld devices: to conserve battery power
  - Laptops/desktops: each new “feature” requires saving time elsewhere
  - High-end computing facilities and data centers: to reduce energy costs
36
Quiz
You might have used computers with a dual-core CPU, yet you did not have to write parallel programs: your sequential programs still run on such computers. So, what is the point of writing parallel programs? Please give your thoughts on this. [Hint: what happens if you have a computer with 64 cores?]