Parallel Processing
Prof. Chung-Ta King, Department of Computer Science, National Tsing Hua University
CS1103 Electrical Engineering and Computer Science Practicum
(Contents from Saman Amarasinghe, Katherine Yelick, N. Guydosh)
2
The “Software Crisis”
“To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem.”
-- E. Dijkstra, 1972 Turing Award Lecture
3
The First Software Crisis
- Time frame: the ’60s and ’70s
- Problem: assembly language programming
  - Computers could handle larger, more complex programs
- Needed to get abstraction and portability without losing performance
4
How Was the Crisis Solved?
- High-level languages for von Neumann machines: FORTRAN and C
  - Provided a “common machine language” for uniprocessors
- Common properties: single flow of control, single memory image
- Differences: register file, ISA, functional units
5
The Second Software Crisis
- Time frame: the ’80s and ’90s
- Problem: inability to build and maintain complex, robust applications requiring multi-million lines of code developed by hundreds of programmers
  - Computers could handle larger, more complex programs
- Needed to get composability, flexibility, and maintainability
  - High performance was not an issue; it was left to Moore’s Law
6
How Was the Crisis Solved?
- Object-oriented programming: C++, C#, and Java
- Also…
  - Better tools: component libraries
  - Better software engineering methodology: design patterns, specification, testing, code reviews
7
Today: Programmers are Oblivious to Processors
- Solid boundary between hardware and software: programmers don’t have to know the processor
- High-level languages abstract away the processors, e.g., Java bytecode is machine independent
- Moore’s Law allows programmers to get good speedup without knowing anything about the processors
- Programs are oblivious of the processor: they work on any processor
  - A program written in the ’70s still works, and is much faster today
- This abstraction provides a lot of freedom for the programmers
- But, not anymore!
8
Moore’s Law Dominates
- 2X transistors/chip every 1.5 years (called “Moore’s Law”)
- Microprocessors have become smaller, denser, and more powerful
- Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of IC chips would double roughly every 18 months
- Effects of shrinking sizes …
Slide source: Jack Dongarra
9
Transistors and Clock Rate
[Figure: growth in transistors per chip, 1,000 to 100,000,000 over 1970–2005 (i4004, i8080, i8086, i80286, i80386, Pentium, R2000, R3000, R10000)]
[Figure: increase in clock rate (MHz), 0.1 to 1000, 1970–2000]
Performance not as expected? Just wait a year or two…
11
Why? Hidden Parallelism Tapped Out
- The ’80s: superscalar expansion
  - 50% per year improvement in performance
  - Transistors applied to implicit parallelism: multiple instruction issue; dynamic scheduling (hardware discovers parallelism between instructions); speculative execution (look past predicted branches); non-blocking caches (multiple outstanding memory ops)
  - Pipelined superscalar processors (10 CPI → 1 CPI)
- The ’90s: era of diminishing returns
  - 2-way to 6-way issue, out-of-order issue, branch prediction (1 CPI → 0.5 CPI)
12
The Origins of a Third Crisis
- Time frame: 2005 to 20??
- Problem: we are running out of innovative ideas for hidden parallelism, and sequential performance is left behind by Moore’s Law
- Need continuous and reasonable performance improvements: to support new features, to support larger datasets
- While sustaining portability, flexibility, and maintainability without increasing the complexity faced by the programmer; critical to keep up with the current rate of evolution in software
13
Other Forces for a Change
- General-purpose unicores have stopped historic performance scaling
  - Diminishing returns of more instruction-level parallelism
  - Power consumption
  - Chip yield
  - Wire delays
  - DRAM access latency
- → Go for multicore and parallel programming
14
Limit #1: Power Density
[Figure: power density (W/cm²), 1 to 10,000 over 1970–2010, for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium®, and P6, compared against a hot plate, a nuclear reactor, a rocket nozzle, and the Sun’s surface. Source: Patrick Gelsinger, Intel]
- Scaling clock speed (business as usual) will not work
- Power Wall: we can put more transistors on a chip than we can afford to turn on
16
Multicores Save Power
- Multicores with simple cores decrease frequency and power
  - Dynamic power is proportional to V²fC
- Example: uniprocessor with power budget N
  - Increase frequency by 20%: power increases substantially (by more than 50%), but performance only increases by 13%
  - Decrease frequency by 20% (e.g., by simplifying the core): power decreases by 50%; can now add another simple core; power budget stays at N with increased performance
Source: John Cavazos
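The slide’s arithmetic can be checked against the dynamic-power relation P ∝ V²fC. One simplifying assumption here (mine, not the slide’s) is that supply voltage scales roughly linearly with frequency, so relative power goes as the cube of the frequency scale; a minimal sketch:

```python
def relative_power(freq_scale: float) -> float:
    """Relative dynamic power (P ~ V^2 * f * C) when frequency is scaled
    by freq_scale, assuming voltage scales linearly with frequency."""
    return freq_scale ** 3

# Overclock one core by 20%: roughly 73% more power (the slide says
# "more than 50%") for at best 20% more peak performance.
print(relative_power(1.2))   # ~1.73

# Underclock by 20% instead: power roughly halves ...
print(relative_power(0.8))   # ~0.51

# ... so two such simplified cores still fit the original budget N,
# with up to 2 * 0.8 = 1.6x the aggregate throughput.
print(2 * relative_power(0.8))   # ~1.02, about the same power as before
```

The exact figures depend on how much voltage must actually move with frequency, but the shape of the trade-off matches the slide: frequency cuts buy back power superlinearly, which is what makes the extra core affordable.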
17
Why Parallelism Lowers Power
- Highly concurrent systems are more power efficient
- Hidden concurrency burns power: speculation, dynamic dependence checking, etc.
- Push parallelism discovery to software (compilers and application programmers) to save power
- Challenge: can you double the concurrency in your algorithms every 2 years?
18
Limit #2: Chip Yield
- Manufacturing costs and yield limit use of density
  - Moore’s (Rock’s) 2nd law: fabrication costs go up
  - Yield (% usable chips) drops
- Parallelism can help
  - More smaller, simpler processors are easier to design and validate
  - Can use partially working chips, e.g., the Cell processor (PS3) is sold with 7 out of 8 cores “on” to improve yield
19
Limit #3: Speed of Light
- Consider a 1 Tflop/s sequential machine
  - Data must travel some distance, r, to get from memory to CPU
  - To get 1 data element per cycle, this means 10¹² trips per second at the speed of light, c = 3×10⁸ m/s. Thus r < c/10¹² = 0.3 mm
- Now put 1 Tbyte of storage in 0.3 mm × 0.3 mm
  - Each bit occupies about 1 square Angstrom, the size of a small atom
- No choice but parallelism
[Figure: 1 Tflop/s, 1 Tbyte sequential machine; r = 0.3 mm]
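The slide’s back-of-the-envelope argument can be reproduced numerically. One assumption of mine: 1 Tbyte is counted as 8×10¹² bits, which lands almost exactly on the slide’s “about 1 square Angstrom per bit.”

```python
# Limit #3 as arithmetic: a 1 Tflop/s machine fetching one operand per cycle.
c = 3.0e8        # speed of light, m/s
cycles = 1.0e12  # 10^12 memory->CPU trips per second

r = c / cycles   # farthest the data can sit from the CPU
print(r)         # 0.0003 m, i.e., 0.3 mm

# Pack 1 Tbyte (8e12 bits) of storage into that r x r square:
bits = 8.0e12
area_per_bit_m2 = (r * r) / bits
angstrom2 = 1e-20                   # 1 square Angstrom in m^2
print(area_per_bit_m2 / angstrom2)  # ~1.1 square Angstrom: about one atom per bit
```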
20
Limit #4: Memory Access
- Growing processor-memory performance gap
- Total chip performance still growing with Moore’s Law
- Bandwidth rather than latency will be the growing concern
21
Revolution is Happening Now
- Chip density is continuing to increase 2x every 2 years
- Clock speed is not; the number of processor cores may double instead
- There is little or no hidden parallelism (ILP) left to be found
- Parallelism must be exposed to and managed by software
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
22
Multicore in Products
- All processor companies switch to MP (2X CPUs / 2 yrs)
- And at the same time:
  - The STI Cell processor (PS3) has 8 cores
  - The latest NVidia Graphics Processing Unit (GPU) has 128 cores
  - Intel has demonstrated an 80-core research chip

  Manufacturer/Year    AMD/’05   Intel/’06   IBM/’04   Sun/’07
  Processors/chip          2         2           2         8
  Threads/processor        1         2           2        16
  Threads/chip             2         4           4       128
24
IBM Cell Processor
- Consists of a 64-bit Power core, augmented with 8 specialized SIMD co-processors (SPU, Synergistic Processor Unit), and a coherent bus
27
Parallel Computing Not New
- Researchers have been using parallel computing for decades, mostly in computational science and engineering
  - Problems too large to solve on one computer; use 100s or 1000s
- Many companies in the ’80s/’90s “bet” on parallel computing and failed
  - Computers got faster too quickly for there to be a large market
29
Why Parallelism?
- All major processor vendors are producing multicores
  - Every machine will soon be a parallel machine
  - All programmers will be parallel programmers???
- New software model
  - Want a new feature? Hide the “cost” by speeding up the code first
  - All programmers will be performance programmers???
- Some of it may eventually be hidden in libraries, compilers, and high-level languages, but a lot of work is needed to get there
- Big open questions: what will be the killer apps for multicore machines? How should the chips be designed and programmed?
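A minimal sketch of what “every programmer becomes a parallel programmer” means in practice, using Python’s standard multiprocessing module. The prime-counting workload is a hypothetical illustration, not from the slides: the sequential loop occupies one core no matter how many the chip has, while the process pool can spread the same work across all of them.

```python
from multiprocessing import Pool

def count_primes(limit: int) -> int:
    """Deliberately naive CPU-bound work: count primes below `limit`."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    chunks = [5_000] * 4

    # Sequential: one core does everything, regardless of how many exist.
    seq = sum(count_primes(c) for c in chunks)

    # Parallel: the same chunks spread across worker processes,
    # one per available core by default.
    with Pool() as pool:
        par = sum(pool.map(count_primes, chunks))

    assert seq == par   # identical result; only the core usage differs
```

On a 64-core machine the sequential version still uses 1/64 of the chip; the pooled version is what lets the program track core counts as they double.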
A View from Intel
- What is a tumor? Is there a tumor here? What if the tumor progresses?
- It is all about dealing efficiently with complex multimodal datasets
- Recognition, Mining, Synthesis
Source: Pradeep K Dubey, Intel
[Figure: the RMS loop. Visual input streams feed Recognition (“What is …?”), Mining (“Is it …?”), and Synthesis (“What if …?”); Recognition finds an existing model instance, Synthesis creates a new model instance from the Model, and synthesized visuals are produced via Learning & Modeling, Graphics Rendering + Physical Simulation, Computer Vision, and Reality Augmentation]
- Most RMS apps are about enabling an interactive (real-time) RMS Loop, or iRMS
Source: Pradeep K Dubey, Intel
Emerging Workload Focus: iRMS
34
Re-inventing Client/Server
- “The Datacenter is the Computer”: building-sized computers (Google, MS, …)
- “The Laptop/Handheld is the Computer”
  - ’08: Dell # laptops > # desktops?
  - 1B cell phones/yr, increasing in function
  - Will desktops disappear? Laptops?
- Laptop/handheld as future client, datacenter or “cloud” as future server
- But, energy budgets could soon dominate facility costs
35
Summary of Trends
- Power is the dominant concern for future systems
- Primary “knob” to reduce power is to lower clock rates and increase parallelism
- The memory wall (latency and bandwidth) will continue to get worse
- Memory capacity will be increasingly limited/costly
- The entire spectrum of computing will need to address parallelism; performance is a software problem
  - Handheld devices: to conserve battery power
  - Laptops/desktops: each new “feature” requires saving time elsewhere
  - High-end computing facilities and data centers: to reduce energy costs
36
Quiz
You might have used computers with a dual-core CPU, yet you did not have to write parallel programs: your sequential programs still run on such computers. So, what is the point of writing parallel programs? Please give your thoughts on this. [Hint: what happens if you have a computer with 64 cores?]