EECE571R -- Harnessing Massively Parallel Processors
http://www.ece.ubc.ca/~matei/EECE571/
Lecture 1: Introduction
Matei Ripeanu, matei at ece dot ubc dot ca
Acknowledgement: some slides borrowed from presentations by Jim Demmel, Horst Simon, Mark D. Hill, David Patterson, Saman Amarasinghe, Richard Brunner, Luddy Harrison, Jack Dongarra
Outline
• Why powerful computers must be parallel processors (all of them, including your laptops and handhelds)
• Science problems require powerful computers (other problems too)
• Why writing (fast) parallel programs is hard (but things are improving)
• Why hybrid architectures (e.g., GPU-based platforms)?
• Structure of this course
Units of Measure
• High Performance Computing (HPC) units are:
  - Flop: floating point operation
  - Flop/s: floating point operations per second
  - Bytes: size of data (a double-precision floating point number is 8 bytes)
• Typical sizes are millions, billions, trillions…
  Mega   Mflop/s = 10^6 flop/sec    Mbyte = 2^20 = 1,048,576 ~ 10^6 bytes
  Giga   Gflop/s = 10^9 flop/sec    Gbyte = 2^30 ~ 10^9 bytes
  Tera   Tflop/s = 10^12 flop/sec   Tbyte = 2^40 ~ 10^12 bytes
  Peta   Pflop/s = 10^15 flop/sec   Pbyte = 2^50 ~ 10^15 bytes
  Exa    Eflop/s = 10^18 flop/sec   Ebyte = 2^60 ~ 10^18 bytes
  Zetta  Zflop/s = 10^21 flop/sec   Zbyte = 2^70 ~ 10^21 bytes
  Yotta  Yflop/s = 10^24 flop/sec   Ybyte = 2^80 ~ 10^24 bytes
• Current fastest (public) machine ~ 2.6 Pflop/s
  - Up-to-date list at www.top500.org
Performance evolution – Top500 List
Why powerful computers are parallel
circa 1991-2006 (since 2007: all of them)
Tunnel Vision by Experts
• “I think there is a world market for maybe five computers.”
  - Thomas Watson, chairman of IBM, 1943
• “There is no reason for any individual to have a computer in their home.”
  - Ken Olson, president and founder of Digital Equipment Corporation, 1977
• “640K [of memory] ought to be enough for anybody.”
  - Bill Gates, chairman of Microsoft, 1981
• “On several recent occasions, I have been asked whether parallel computing will soon be relegated to the trash heap reserved for promising technologies that never quite make it.”
  - Ken Kennedy, CRPC Director, 1994
Slide source: Warfield et al.
Technology Trends: Microprocessor Capacity
2X transistors per chip every 1.5 years, called “Moore’s Law”
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months. Microprocessors have become smaller, denser, and more powerful.
Slide source: Jack Dongarra
Microprocessor Transistors per Chip
[Figure: transistors per chip, from the i4004, i8080, i8086, i80286, and i80386 through the Pentium, R2000/R3000, and R10000; 1970-2005, log scale from 1,000 to 100,000,000]
• Growth in transistors per chip
• Increase in clock rate
[Figure: clock rate in MHz, 0.1 to 1000, 1970-2000]
Impact of Device Shrinkage
• What happens when the feature size (transistor size) shrinks by a factor of x?
• Clock rate goes up by x because wires are shorter and gate capacitance is smaller
  - actually less than x, because of power consumption
• Transistors per unit area go up by x^2
• Die size also tends to increase
  - typically by another factor of ~x
• Raw computing power of the chip goes up by ~x^4!
  - typically x^3 of this is devoted to either on-chip
    - parallelism: hidden parallelism such as ILP
    - locality: caches
• So most programs ran significantly faster, without changing them
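To make the arithmetic concrete, here is a minimal sketch of the slide's idealized scaling model (the exponents come from the slide; the shrink factor x = 2 is just an example):

```python
# Idealized scaling when the feature size shrinks by a factor x.
def shrink_effects(x):
    return {
        "clock rate":           x,     # shorter wires, smaller gate capacitance
        "transistors per area": x**2,
        "transistors per chip": x**3,  # x^2 from density, ~x from die growth
        "raw compute power":    x**4,  # transistors per chip times clock rate
    }

for quantity, gain in shrink_effects(2).items():
    print(f"{quantity}: ~{gain}x")
# clock rate: ~2x ... raw compute power: ~16x
```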
But there are limiting forces
• Cost
  - Moore’s 2nd law (Rock’s law): manufacturing costs go up (the capital cost of a fab roughly doubles every four years)
• Yield
  - The percentage of the chips that are usable
  - E.g., the Cell processor (PS3) is sold with only 7 of its 8 SPEs “on” to improve yield
Manufacturing costs and yield problems limit use of density
Power Density Limits Serial Performance
Revolution is Happening Now
• Chip density continues to increase ~2x every 2 years
  - Clock speed is not increasing!
  - The number of processor cores may double instead
• There is little or no more hidden parallelism (ILP) to be found
• Parallelism must be exposed to and managed by software
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
Parallelism in 2011?
• These arguments are no longer theoretical
• All major processor vendors are producing multicore chips
  - Every machine will soon be a parallel machine
  - To keep doubling performance, parallelism must double
• Which commercial applications can use this parallelism?
  - Do they have to be rewritten from scratch?
• Will all programmers have to be parallel programmers?
  - A new software model is needed
  - Try to hide complexity from most programmers (eventually)
  - In the meantime, we need to understand it
• The computer industry is betting on this change
  - But it does not have all the answers
More Limits: How fast can a serial computer be?
• Consider a 1 Tflop/s sequential machine:
  - Data must travel some distance, r, to get from memory to processor.
  - To get 1 data element per cycle, this means 10^12 round trips per second at the speed of light, c = 3x10^8 m/s. Thus r < c/10^12 = 0.3 mm.
• Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area:
  - Each bit occupies about 1 square Angstrom, the size of a small atom.
• No choice but parallelism
[Figure: r = 0.3 mm; a 1 Tflop/s, 1 Tbyte sequential machine]
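The bound is easy to check numerically; a minimal sketch, with the numbers taken from the slide:

```python
# Speed-of-light limit for a 1 Tflop/s, 1 Tbyte sequential machine.
c = 3e8        # speed of light, m/s
cycles = 1e12  # at 1 Tflop/s, one memory access per cycle -> 10^12 cycles/s

r = c / cycles  # farthest data can sit from the processor
print(f"r = {r * 1e3:.1f} mm")  # 0.3 mm

bits = 8e12                    # 1 Tbyte of storage
area_per_bit = (r * r) / bits  # packed into an r x r square
print(f"{area_per_bit / 1e-20:.1f} square Angstrom per bit")  # ~1.1, a small atom
```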
Performance Development
[Figure: Top500 performance development, 1996-2020 (projected), log scale from 100 Mflop/s to 1 Eflop/s; lines for SUM, N=1, and N=500; the fastest machine reached 2.6 Pflop/s in 2010]
Concurrency Levels
[Figure: minimum, average, and maximum number of processors per system, log scale from 1 to 1,000,000, with a notebook computer marked for comparison]
Moore’s Law reinterpreted
• Number of cores per chip will double every two years
• Clock speed will not increase (possibly decrease)
• Need to deal with systems with a large number of concurrent threads
  - Millions for HPC systems
- Thousands for servers
- Hundreds for workstations and notebooks
• Need to deal with inter-chip parallelism as well as intra-chip parallelism
Outline
• Why powerful computers must be parallel processors (all of them, including your laptops and handhelds)
• Science problems require powerful computers (commercial problems too)
• Why writing (fast) parallel programs is hard (but things are improving)
• Why hybrid architectures (e.g., GPU-based platforms)?
• Structure of the course
Computational Science
Nature, March 23, 2006
“An important development in sciences is occurring at the intersection of computer science and the sciences that has the potential to have a profound impact on science. It is a leap from the application of computing … to the integration of computer science concepts, tools, and theorems into the very fabric of science.” -Science 2020 Report, March 2006
Drivers for Change
• Continued exponential increase in computational power
  -> simulation is becoming the third pillar of science, complementing theory and experiment
• Continued exponential increase in experimental data
  -> techniques and technology in data analysis, visualization, analytics, networking, and collaboration tools are becoming essential in all data-rich scientific applications
Simulation: The Third Pillar of Science
• Traditional scientific and engineering method:
  (1) Do theory or paper design
  (2) Perform experiments or build system
• Limitations:
  - Too difficult: build large wind tunnels
  - Too expensive: build a throw-away passenger jet
  - Too slow: wait for climate or galactic evolution
  - Too dangerous: weapons, drug design, climate experimentation
• Computational science and engineering paradigm:
  (3) Use high performance computer systems to simulate and analyze the phenomenon
  - Based on known physical laws and efficient numerical methods
  - Analyze simulation results with computational tools and methods beyond what is used traditionally for experimental data analysis
[Diagram: Simulation as the third pillar, alongside Theory and Experiment]
What Supercomputers Do
One example: simulation replacing an experiment that is too dangerous
Global Climate Modeling Problem
• Problem is to compute:
  “weather” = f(latitude, longitude, elevation, time)
            = (temperature, pressure, humidity, wind velocity)
• Approach:
  - Discretize the domain, e.g., a measurement point every 10 km
  - Devise an algorithm to predict weather at time t+Δt given t
• Uses:
  - Predict major events, e.g., El Niño
  - Use in setting air emissions standards
  - Evaluate global warming scenarios
Source: http://www.epm.ornl.gov/chammp/chammp.html
Global Climate Modeling Computation
• One piece is modeling the fluid flow in the atmosphere
  - Solve the Navier-Stokes equations
  - Roughly 100 Flops per grid point with a 1-minute timestep
• Computational requirements:
  - To match real time, need 5 x 10^11 flops in 60 seconds = 8 Gflop/s
  - Weather prediction (7 days in 24 hours): 56 Gflop/s
  - Climate prediction (50 years in 30 days): 4.8 Tflop/s
  - To use in policy negotiations (50 years in 12 hours): 288 Tflop/s
• To double the grid resolution / time resolution, computation goes up 8x to 16x
• State-of-the-art models require integration of atmosphere, clouds, ocean, sea-ice, and land models, plus possibly carbon cycle, geochemistry and more
• Current models are coarser than this
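These rates all follow from the 5 x 10^11 flops-per-simulated-minute figure; a minimal sketch that reproduces them (the slide rounds the base rate down to 8 Gflop/s, so its numbers come out slightly lower):

```python
# Flop/s needed to simulate `sim` seconds of climate in `wall` seconds,
# at 5e11 flops per simulated minute (one timestep per simulated minute).
FLOPS_PER_SIM_MINUTE = 5e11

def required_rate(sim, wall):
    return (sim / 60) * FLOPS_PER_SIM_MINUTE / wall

DAY, YEAR = 86_400, 365 * 86_400
print(f"real time:  {required_rate(60, 60) / 1e9:.1f} Gflop/s")                # 8.3
print(f"7d in 24h:  {required_rate(7 * DAY, DAY) / 1e9:.1f} Gflop/s")          # 58.3
print(f"50y in 30d: {required_rate(50 * YEAR, 30 * DAY) / 1e12:.1f} Tflop/s")  # 5.1
print(f"50y in 12h: {required_rate(50 * YEAR, DAY / 2) / 1e12:.0f} Tflop/s")   # 304
```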
High Resolution Climate Modeling on NERSC-3 (P. Duffy et al., LLNL)
Which commercial applications require parallelism?
[Table: 13 computational motifs vs. application domains (Embedded, SPEC, DB, Games, ML, HPC); shading in the original indicates which domains exercise each motif]
  1. Finite State Machine
  2. Combinational
  3. Graph Traversal
  4. Structured Grid
  5. Dense Matrix
  6. Sparse Matrix
  7. Spectral (FFT)
  8. Dynamic Programming
  9. N-Body
  10. MapReduce
  11. Backtrack / Branch & Bound
  12. Graphical Models
  13. Unstructured Grid
• Claim: parallel architecture, language, compiler … must do at least these well to run future parallel apps well
• Note: MapReduce is embarrassingly parallel; FSM embarrassingly sequential?
Analyzed in detail in the “Berkeley View” report: www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
Outline
• Why powerful computers must be parallel processors (all of them, including your laptops and handhelds)
• Science problems require powerful computers (commercial problems too)
• Why writing (fast) parallel programs is hard (but things are improving)
• Principles of parallel computing performance
• Structure of the course
Principles of Parallel Computing
• Finding enough parallelism (Amdahl’s Law)
• Granularity
• Locality
• Load balance
• Coordination and synchronization
• Performance modeling
All of these things make parallel programming even harder than sequential programming.
“Automatic” Parallelism in Modern Machines
• Bit-level parallelism
  - within floating point operations, etc.
• Instruction-level parallelism (ILP)
  - multiple instructions execute per clock cycle
• Memory system parallelism
  - overlap of memory operations with computation
• OS parallelism
  - multiple jobs run in parallel on commodity SMPs
There are limits to all of these: for very high performance, the user must identify, schedule, and coordinate the parallel tasks.
Amdahl’s Law
• Simple software assumption
  - Fraction F of execution time is perfectly parallelizable
  - No overhead for scheduling, synchronization, communication, etc.
  - Fraction 1 - F is completely serial
• Time on 1 core = (1 - F)/1 + F/1 = 1
• Time on N cores = (1 - F)/1 + F/N
Finding Enough Parallelism
• Implications:
  - Attack the common case when introducing parallelization: when F is small, optimizations will have little effect.
  - Even if the parallel part speeds up perfectly, performance is limited by the sequential part.
  - As N approaches infinity, speedup is bounded by 1/(1 - F).
  - The aspects you ignore will limit speedup.
• Discussion: Can you ever obtain super-linear speedups?
Amdahl’s Speedup = 1 / ( (1 - F) + F/N )
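The formula is one line of code; a minimal sketch showing how quickly the serial fraction dominates:

```python
def amdahl_speedup(F, N):
    """Speedup on N cores when a fraction F of the time is perfectly parallel."""
    return 1.0 / ((1.0 - F) + F / N)

for F in (0.5, 0.9, 0.99):
    print(f"F={F}: 100 cores -> {amdahl_speedup(F, 100):.1f}x, "
          f"limit 1/(1-F) = {1 / (1 - F):.1f}x")
# F=0.5:  100 cores -> 2.0x,  limit 2.0x
# F=0.9:  100 cores -> 9.2x,  limit 10.0x
# F=0.99: 100 cores -> 50.3x, limit 100.0x
```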
Overhead of Parallelism
• Given enough parallel work, this is the biggest barrier to getting desired speedup
• Parallelism overheads include:
  - cost of starting a thread or process
  - cost of communicating shared data
  - cost of synchronizing
  - extra (redundant) computation
• Each of these can be in the range of milliseconds (= millions of flops) on some systems
• Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work (see the sketch below)
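A toy model of that tradeoff (my own illustration, not from the slides; W, N, h, and the task counts are made-up values): split work W into k equal tasks on N cores with a fixed overhead h per task, and both extremes lose.

```python
import math

def runtime(W, N, k, h):
    """Toy granularity model: work W in k equal tasks on N cores, overhead h each."""
    compute = W / min(k, N)           # fewer tasks than cores leaves cores idle
    overhead = math.ceil(k / N) * h   # each core pays h for every task it runs
    return compute + overhead

W, N, h = 1e6, 16, 1e3
for k in (4, 16, 128, 1024):
    print(f"k = {k:4d}: time = {runtime(W, N, k, h):9,.0f}")
# k=4 underuses the machine; k=1024 is dominated by per-task overhead;
# the sweet spot here is near k = 16.
```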
Locality and Parallelism
• Large memories are slow, fast memories are small
• Storage hierarchies are large and fast on average
• Parallel processors, collectively, have large, fast caches
  - the slow accesses to “remote” data we call “communication”
• Algorithm should do most work on local data
[Diagram: conventional storage hierarchy (Proc + cache, L2 cache, L3 cache, memory), replicated per processor and joined by potential interconnects]
Processor-DRAM Gap (latency)
[Figure: performance vs. time (“Moore’s Law”), 1980-2000, log scale 1 to 1000; µProc improves 60%/yr while DRAM improves 7%/yr, so the processor-memory performance gap grows ~50%/year]
Goal: find algorithms that minimize communication, not necessarily arithmetic
Load Imbalance
• Load imbalance is the time that some processors in the system are idle due to
- insufficient parallelism (during that phase)
- unequal size tasks
• Examples of the latter
  - adapting to “interesting parts of a domain”
- tree-structured computations
- fundamentally unstructured problems
• Algorithm needs to balance load (see the sketch below)
  - Sometimes one can determine the work load and divide it up evenly before starting: “static load balancing”
  - Sometimes the work load changes dynamically and needs to be rebalanced dynamically: “dynamic load balancing”
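A minimal sketch contrasting the two (my own illustration; the task costs are hypothetical): with unequal task sizes, assigning each task to the currently least-loaded worker beats a static even split.

```python
import heapq

tasks = [9, 1, 1, 1, 8, 1, 1, 2, 7, 1, 1, 3]  # unequal task costs (made up)
N = 4

# Static: contiguous even split, decided before starting.
chunk = len(tasks) // N
static = [sum(tasks[i * chunk:(i + 1) * chunk]) for i in range(N)]

# Dynamic: longest-processing-time greedy; each task goes to the
# currently least-loaded worker.
heap = [(0, w) for w in range(N)]
loads = [0] * N
for t in sorted(tasks, reverse=True):
    load, w = heapq.heappop(heap)
    loads[w] = load + t
    heapq.heappush(heap, (loads[w], w))

print("static makespan: ", max(static))  # 11
print("dynamic makespan:", max(loads))   # 9 (perfectly balanced here)
```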
Building Parallel Software
• 2 types of programmers -> 2 layers
• Efficiency Layer (10% of today’s programmers)
  - Expert programmers build libraries implementing motifs, “frameworks”, OS, …
  - Highest fraction of peak performance possible
• Productivity Layer (90% of today’s programmers)
  - Domain experts / naïve programmers productively build parallel applications by composing frameworks & libraries
  - Hide as many details of the machine and parallelism as possible
  - Willing to sacrifice some performance for productive programming
• You may want to work at either level
  - Goal of this course: understand enough of the efficiency layer to use parallelism effectively
Improving Real Performance
[Figure: peak vs. real performance in Teraflops, 1996-2004, log scale 0.1 to 1,000; the “performance gap” between them widens]
Peak performance grows exponentially, a la Moore’s Law.
But efficiency (the performance relative to the hardware peak) has declined:
  - it was 40-50% on the vector supercomputers of the 1990s
  - it can be as little as 5-10% on the parallel supercomputers of today
Close the gap through...
  - mathematical methods and algorithms that achieve high performance on a single processor and scale to thousands of processors
  - more efficient programming models and tools for massively parallel supercomputers
Outline
• Why powerful computers must be parallel processors (all of them, including your laptops and handhelds)
• Science problems require powerful computers (commercial problems too)
• Why writing (fast) parallel programs is hard (but things are improving)
• Why hybrid architectures (e.g., GPU-based platforms)?
• Structure of the course
Hybrid architectures in Top 500 sites
Amdahl’s Law [1967]
Amdahl’s Speedup = 1 / ( (1 - F) + F/N )
Designing a multicore platform
• Designers must confront single-core design options
  - Instruction fetch, wakeup, select
  - Execution unit configuration & operand bypass
  - Load/store queue(s) & data cache
  - Checkpoint, log, runahead, commit
• As well as additional degrees of freedom
  - How many cores? How big is each?
  - Shared caches: how many levels? How many banks?
  - On-chip interconnect: bus or switched?
Want Simple Multicore Hardware Model
To Complement Amdahl’s Simple Software Model
(1) Chip hardware roughly partitioned into:
  - (I) Multiple cores (with L1 caches)
  - (II) The rest (L2/L3 cache banks, interconnect, pads, etc.)
  - Assume: changing core size/number does NOT change the rest
(2) Resources for multiple cores bounded:
  - Bound of N resources per chip for cores
  - Due to area, power, cost ($$$), or multiple factors
  - Bound = power? (but the pictures here use area)
Simple Multicore Hardware Model (cont.)
(3) Architects can improve single-core performance using more of the bounded resource
• A simple base core
  - Consumes 1 Base Core Equivalent (BCE) of resources
  - Provides performance normalized to 1
• An enhanced core (in the same process generation)
  - Consumes R BCEs
  - Performance as a function of R: Perf(R)
• What does the function Perf(R) look like?
More on Enhanced Cores
• (Performance Perf(R), consuming R BCEs of resources)
• If Perf(R) > R, always enhance the core
  - It cost-effectively speeds up both the sequential & parallel phases
• Therefore, the equations assume Perf(R) < R
• Graphs assume Perf(R) = square root of R
  - 2x performance for 4 BCEs, 3x for 9 BCEs, etc.
  - Why? Models diminishing returns with “no coefficients”
• Two options: symmetric / asymmetric multicore chips
Q1: How Many (Symmetric) Cores per Chip?
• Each Chip Bounded to N BCEs (for all cores)
• Each Core consumes R BCEs
• Assume Symmetric Multicore = All Cores Identical
• Therefore, N/R cores per chip ((N/R) * R = N)
• For an N = 16 BCE chip:
  [Figure: sixteen 1-BCE cores | four 4-BCE cores | one 16-BCE core]
Performance of Symmetric Multicore Chips
• Serial Fraction 1-F uses 1 core at rate Perf(R)
• Serial time = (1 – F) / Perf(R)
• Parallel fraction F uses N/R cores at rate Perf(R) each
• Parallel time = F / (Perf(R) * (N/R)) = F*R / (Perf(R)*N)
• Therefore, w.r.t. one base core:
Symmetric Speedup = 1 / ( (1 - F)/Perf(R) + F*R/(Perf(R)*N) )
(Enhanced cores speed up both the serial and parallel phases.)
• Implications?
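Under the slide's assumption Perf(R) = sqrt(R), the model is a few lines of Python; a minimal sketch that reproduces the data points on the following slides:

```python
import math

def perf(R):
    """The slides' assumed enhanced-core performance: sqrt(R) for R BCEs."""
    return math.sqrt(R)

def symmetric_speedup(F, N, R):
    """Hill/Marty symmetric chip: N/R identical R-BCE cores."""
    return 1.0 / ((1.0 - F) / perf(R) + F * R / (perf(R) * N))

print(f"{symmetric_speedup(0.5, 16, 16):.1f}")  # 4.0: one 16-BCE core is optimal
print(f"{symmetric_speedup(0.9, 16, 2):.1f}")   # 6.7: eight 2-BCE cores win
```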
Symmetric Multicore Chip, N = 16 BCEs
F=0.5: optimal speedup S = 4 = 1/(0.5/4 + 0.5*16/(4*16)), reached at R=16, Cores=1
Need more parallelism to make multicore optimal!
[Figure: speedup vs. core size, for chips of 16, 8, 4, 2, and 1 cores]
Symmetric Multicore Chip, N = 16 BCEs
At F=0.9, multicore is optimal, but the speedup is limited
Need to obtain even more parallelism!
  F=0.5: R=16, Cores=1, Speedup=4
  F=0.9: R=2, Cores=8, Speedup=6.7
Symmetric Multicore Chip, N = 16 BCEs
F matters: Amdahl’s Law applies to multicore chips
Researchers should target parallelism F first
  F=0.999: R=1, Cores=16, Speedup≈16
Symmetric Multicore Chip, N = 16 BCEs
As Moore’s Law enables N to go from 16 to 256 BCEs,
more core enhancements? More cores? Or both?
(Recall: F=0.9, R=2, Cores=8, Speedup=6.7)
Symmetric Multicore Chip, N = 256 BCEs
As Moore’s Law increases N, often need enhanced core designs
Researchers should target single-core performance too
  F=0.9: R=28 (vs. 2), Cores=9 (vs. 8), Speedup=26.7 (vs. 6.7): core enhancements!
  F=0.999: R=1 (vs. 1), Cores=256 (vs. 16), Speedup=204 (vs. 16): more cores!
  F=0.99: R=3 (vs. 1), Cores=85 (vs. 16), Speedup=80 (vs. 13.9): core enhancements & more cores!
Outline
• Symmetric Multicore Chips
• Asymmetric Multicore Chips
Asymmetric (Heterogeneous) Multicore Chips
• Symmetric Multicore Required All Cores Equal
• Why Not Enhance Some (But Not All) Cores?
• For Amdahl’s simple software assumptions
  - One enhanced core
  - The others are base cores
• How does this affect our hardware model?
How Many Cores per Asymmetric Chip?
• Each Chip Bounded to N BCEs (for all cores)
• One R-BCE Core leaves N-R BCEs
• Use N-R BCEs for N-R Base Cores
• Therefore, 1 + N - R Cores per Chip
• For an N = 16 BCE Chip:
[Figure: symmetric: four 4-BCE cores vs. asymmetric: one 4-BCE core & twelve 1-BCE base cores]
Performance of Asymmetric Multicore Chips
• Serial fraction 1-F is the same, so serial time = (1 - F) / Perf(R)
• Parallel fraction F:
  - One core at rate Perf(R)
- N-R cores at rate 1
- Parallel time = F / (Perf(R) + N - R)
• Therefore, w.r.t. one base core:
Asymmetric Speedup = 1 / ( (1 - F)/Perf(R) + F/(Perf(R) + N - R) )
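Continuing the earlier sketch (it reuses perf and symmetric_speedup from there; F = 0.975 is my own illustrative choice): for the same budget, the asymmetric design can beat the best symmetric one by a wide margin.

```python
def asymmetric_speedup(F, N, R):
    """Hill/Marty asymmetric chip: one R-BCE core plus N - R base cores."""
    return 1.0 / ((1.0 - F) / perf(R) + F / (perf(R) + N - R))

F, N = 0.975, 256
for R in (1, 4, 16, 64, 256):
    print(f"R = {R:3d}: symmetric {symmetric_speedup(F, N, R):5.1f}x, "
          f"asymmetric {asymmetric_speedup(F, N, R):5.1f}x")
# The asymmetric chip peaks near R = 64 at ~125x; the best symmetric
# configuration (over all R) reaches only ~51x.
```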
Asymmetric Multicore Chip, N = 256 BCEs
Number of Cores = 1 (Enhanced) + 256 – R (Base)
How do Asymmetric & Symmetric speedups compare?
[Figure: speedup vs. R, for chips of 256, 253, 241, 193, and 1 cores]
Outline
• Why powerful computers must be parallel processors (all of them, including your laptops and handhelds)
• Science problems require powerful computers (commercial problems too)
• Why writing (fast) parallel programs is hard (but things are improving)
• Why hybrid architectures (e.g., GPU-based platforms)?
• Structure of the course
Course Mechanics
Course page: http://www.ece.ubc.ca/~matei/EECE571/
Office hours: after each class or by appointment (email me)
Email: matei @ ece.ubc.ca
Office: KAIS 4033
EECE 571R: Course Goals
• Primary
  - Gain understanding of fundamental issues that affect the design of parallel applications for massively multicore processors
  - Survey main current research themes
• Secondary
  - By studying a set of outstanding papers, build knowledge of how to do & present research
  - Learn how to read papers & evaluate ideas

What I’ll Assume You Know
• You are familiar with:
  - C programming; programming for traditional multicore machines: threading, synchronization
  - Traditional CPU architectures: caching, pipelines
• If there are things that don’t make sense, ask!
This class
• Weekly schedule (tentative)
• Note: Dates are tentative
  - But always on Mondays 5-8pm, KAIS 4018
Class structure:
• Lectures – 1h
  - Technology
  - Research topic
• Paper discussion – 1h
• Projects – 1h
Administravia: Grading
• Paper reviews: 25%
• Class participation: 15%
• Discussion leading: 10%
• Project: 50%
Administravia: Paper Reviewing (1)
• Goals:
  - Think about what you read
  - Expand your knowledge beyond the papers that are assigned
  - Get used to writing paper reviews
• Reviews are due by midnight the day before the class
• Have an eye on the writing style / be professional in your writing
  - Clarity
  - Beware of traps: learn to use them in writing and detect them in reading
  - Detect (and stay away from) trivial claims
Administravia: Paper Reviewing (2)
Follow the format provided when relevant.
• Summarize the main contribution of the paper
• Critique the main contribution:
  - Significance: rate the significance of the paper on a scale of 5 (breakthrough), 4 (significant contribution), 3 (modest contribution), 2 (incremental contribution), 1 (no contribution or negative contribution).
- More importantly: Explain your rating in a sentence or two.
- Rate how convincing the methodology is.
- Do the claims and conclusions follow from the experiments?
- Are the assumptions realistic?
- Are the experiments well designed?
- Are there different experiments that would be more convincing?
- Are there other alternatives the authors should have considered?
- (And, of course, is the paper free of methodological errors?)
- What is the most important limitation of the approach?
Administravia: Paper Reviewing (3)
• What are the two strongest and/or most interesting ideas in the paper?
• What are the two most striking weaknesses in the paper?
• Name two questions that you would like to ask the authors.
• Detail an interesting extension to the work not mentioned in the future work section.
• Optional comments on the paper that you’d like to see discussed in class.
Administravia: Discussions Leading
• Come prepared!
  - Prepare a 3-5 minute summary of the paper
  - Prepare a discussion outline
- Prepare questions:
- “What if”s
- Unclear aspects of the solution proposed
- …
- Similar ideas in different contexts
- Initiate short brainstorming sessions
• Leaders do NOT need to submit paper reviews
• Main goals:
  - Keep discussion flowing
- Keep discussion relevant
- Engage everybody (I’ll have an eye on this, too)
Administravia: Projects
• It’s just a class, not your PhD.
• Aim high!
  - Goal: with one or two extra months of work your project should be ready to be submitted to a decent conference / publication
  - It is doable!
• Combine with your research if relevant to this course
• Get informal approval from all instructors if you overlap final projects:
- Don’t sell the same piece of work twice
- You can get more than twice as many results with less than twice as much work
Administravia: Project timeline (tentative)
• 3rd week: 3-slide idea presentation: “What is the (research) question you aim to answer?”
• 5th week: 2-page project proposal
--------------------- 1-week spring reading period ---------------------
• 8th week: 4-page midterm project report due
  - Have a clear image of what’s possible/doable
  - Report preliminary results
  - Include related work
• Final week [see schedule]: in-class project presentation
  - Presentation
  - Demo, if appropriate
  - 8-page write-up
Next Class (Mon, 17/01)
• Note room change: KAIS 4018
• Discussion of - Project ideas
- Papers
To do:
• Subscribe to mailing list
• Volunteers for discussion leaders for class next week
What you should get out of the course
In-depth understanding of:
• When is parallel computing useful?
• Overview of programming models (software) and tools for massively parallel processors
• Some important parallel applications and their algorithms
• Performance analysis and tuning
• Exposure to various open research questions
Questions?
Extra slides
Open research questions (Dongarra)
Open research questions (Hwu)
• Accelerate computations that have no good parallel structure today
  - E.g., graph analysis
• Data distributions that cause catastrophic load imbalances in parallel algorithms
  - scale-free graphs, MRI spiral scan, astronomy
• Computations with no data reuse
  - E.g., matrix-vector multiplication
• Algorithm optimizations that are hard to do today
  - how do I extract locality and regularity of access?
Applications
• Applications:
  1. What are the apps?
  2. What are the kernels of apps?
• Hardware:
  3. What are the HW building blocks?
  4. How to connect them?
• Programming Model / Systems Software:
5. How to describe apps and kernels?
6. How to program the HW?
• Evaluation: 7. How to measure success?
(Inspired by a view of the Golden Gate Bridge from Berkeley)
CS267 focus is here
Par Lab Research Overview
Goal: easy to write correct programs that run efficiently on manycore
[Diagram: the Par Lab software stack]
• Applications: personal health, image retrieval, hearing/music, speech, parallel browser
• Motifs/Dwarfs
• Productivity Layer: Composition & Coordination Language (C&CL), parallel libraries, parallel frameworks, C&CL compiler/interpreter
• Efficiency Layer: sketching, autotuners, schedulers, communication & synchronization primitives, efficiency languages, type systems, efficiency language compilers, legacy code
• Correctness: static verification, dynamic checking, debugging with replay, directed testing
• OS & Architecture: legacy OS, OS libraries & services, hypervisor, multicore/GPGPU, RAMP manycore