Post on 19-Dec-2015
![Page 1: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/1.jpg)
Zvika Guz¹, Oved Itzhak¹, Idit Keidar¹, Avinoam Kolodny¹, Avi Mendelson², and Uri C. Weiser¹
Threads vs. Caches: Modeling the Behavior of Parallel Workloads
¹Technion – Israel Institute of Technology, ²Microsoft Corporation
![Page 2: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/2.jpg)
Chip-Multiprocessor Era
Challenges:
- Single-core performance trend is gloomy
- Exploit chip-multiprocessors with multithreaded applications
- The memory gap is paramount: latency, bandwidth, power
[Figure: Hennessy and Patterson, Computer Architecture: A Quantitative Approach]
Two basic remedies:
- Cache: reduce the number of out-of-die memory accesses
- Multi-threading: hide memory accesses behind threads' execution
How do they play together? How do we make the most out of them?
![Page 3: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/3.jpg)
Outline
- The many-core span: Cache-Machines ↔ MT-Machines
- A high-level analytical model
- Performance curves study
- A few examples
- Summary
![Page 4: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/4.jpg)
![Page 5: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/5.jpg)
Cache-Machines vs. MT-Machines
Many-Core: CMP with many, simple cores; tens to hundreds of Processing Elements (PEs)
[Figure: design space spanned by # of Threads and Cache/Thread. Marked regions: Uni-Processor Region, Multi-Core Region, Cache Architecture Region, MT Architecture Region. Example machines: Intel's Larrabee, Nvidia's Fermi, Nvidia's GT200. Each machine is drawn as cores with thread contexts and cache.]
- What are the basic tradeoffs?
- How will workloads behave across the range?
- Predicting performance
![Page 6: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/6.jpg)
![Page 7: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/7.jpg)
A Unified Machine Model
- Use both cache and many threads to shield memory access
- The uniform framework renders the comparison meaningful
- We derive simple, parameterized equations for performance, power, BW, ...
[Figure: unified machine. Labels: grid of C blocks, Threads Architectural States, Cache, To Memory]
![Page 8: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/8.jpg)
Cache Machines
Many cores (each may have its private L1) behind a shared cache
[Figure: grid of C blocks behind a shared Cache, connected To Memory]
[Plot: Performance vs. # Threads, marking the Cache Non Effective point (CNE)]
![Page 9: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/9.jpg)
Multi-Thread Machines
Memory latency shielded by multiple thread execution
[Figure: grid of C blocks with Threads Architectural States, connected To Memory; legend: execution, Memory access]
[Plot: Performance vs. # Threads, marking Max performance and Bandwidth Limitations]
![Page 10: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/10.jpg)
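Analysis (1/3)
Given a ratio of memory access instructions $r_m$ ($0 \le r_m \le 1$):
- One in every $1/r_m$ instructions accesses memory
- A thread executes $1/r_m$ instructions, which takes $\frac{1}{r_m}\,CPI_{exe}$ cycles
- It then stalls for $t_{avg}$ cycles
- $t_{avg}$ = Average Memory Access Time (AMAT) [cycles]
[Figure: timeline in t [cycles] for one thread context with a cache: $\frac{1}{r_m}\,CPI_{exe}$ cycles of execution between loads (ld), each load followed by a stall of $t_{avg}$]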
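The per-thread timeline above can be put into numbers with a small sketch. The function names and the sample values are illustrative assumptions, not part of the slides; the busy fraction is simply the execution burst divided by burst plus stall.

```python
def burst_cycles(r_m: float, cpi_exe: float) -> float:
    """Cycles one thread executes between two consecutive memory accesses."""
    if not 0.0 < r_m <= 1.0:
        raise ValueError("r_m must be in (0, 1] for a thread that touches memory")
    return cpi_exe / r_m


def single_thread_busy_fraction(r_m: float, cpi_exe: float, t_avg: float) -> float:
    """Fraction of time a PE is busy when only this one thread runs on it."""
    busy = burst_cycles(r_m, cpi_exe)
    return busy / (busy + t_avg)


# Illustrative values (assumptions): 20% memory instructions, CPI_exe = 1,
# and an average memory access time of 20 cycles.
print(burst_cycles(r_m=0.2, cpi_exe=1.0))
print(single_thread_busy_fraction(r_m=0.2, cpi_exe=1.0, t_avg=20.0))
```

With a single thread, the PE sits idle for most of each period, which is what motivates filling the stall with other threads.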
![Page 11: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/11.jpg)
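Analysis (2/3)
- PE stays idle unless filled with instructions from other threads
- Each thread occupies the PE for an additional $\frac{1}{r_m}\,CPI_{exe}$ cycles
- $1 + \dfrac{t_{avg}}{CPI_{exe}/r_m}$ threads are needed to fully utilize each PE
[Figure: timeline in t [cycles] for one PE: the $t_{avg}$ stall after a load (ld) is filled with $\frac{1}{r_m}\,CPI_{exe}$-cycle bursts from other thread contexts]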
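As a quick check of the thread count above, here is a minimal sketch. The function name and the sample numbers are assumptions chosen for illustration.

```python
def threads_to_fill_pe(r_m: float, cpi_exe: float, t_avg: float) -> float:
    """Threads needed to keep one PE busy: 1 + t_avg / (CPI_exe / r_m)."""
    if not 0.0 < r_m <= 1.0:
        raise ValueError("r_m must be in (0, 1]")
    return 1.0 + t_avg * r_m / cpi_exe


# Illustrative values (assumptions): each burst lasts 5 cycles, so a
# 20-cycle stall needs 4 other threads to cover it, 5 threads in total.
print(threads_to_fill_pe(r_m=0.2, cpi_exe=1.0, t_avg=20.0))
```

The result need not be an integer; the model treats it as a continuous quantity.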
![Page 12: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/12.jpg)
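Analysis (3/3)
Machine utilization:
$$\eta = \min\left(1,\ \frac{n_{threads}}{N_{PE}\left(1 + t_{avg}\,\dfrac{r_m}{CPI_{exe}}\right)}\right)$$
where $n_{threads}$ is the number of available threads and $1 + t_{avg}\,\frac{r_m}{CPI_{exe}}$ is the number of threads needed to utilize a single PE.
Performance in Operations Per Second [OPS]:
$$Performance = N_{PE}\,\frac{f}{CPI_{exe}}\,\eta \quad [OPS]$$
where $N_{PE}\,\frac{f}{CPI_{exe}}$ is the peak performance.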
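The utilization and performance expressions can be evaluated directly. The sketch below uses hypothetical function names and sample machine values that are assumptions, not figures from the talk.

```python
def utilization(n_threads: float, n_pe: int, r_m: float,
                cpi_exe: float, t_avg: float) -> float:
    """Machine utilization: available threads over threads needed, capped at 1."""
    threads_needed = n_pe * (1.0 + t_avg * r_m / cpi_exe)
    return min(1.0, n_threads / threads_needed)


def performance_ops(n_threads: float, n_pe: int, freq_hz: float, r_m: float,
                    cpi_exe: float, t_avg: float) -> float:
    """Performance [OPS] = peak performance * utilization."""
    peak = n_pe * freq_hz / cpi_exe
    return peak * utilization(n_threads, n_pe, r_m, cpi_exe, t_avg)


# Illustrative machine (assumption): 1024 PEs at 1 GHz, CPI_exe = 1,
# r_m = 0.2, t_avg = 20 cycles, so 5120 threads saturate the machine.
print(utilization(2560, 1024, 0.2, 1.0, 20.0))
print(performance_ops(2560, 1024, 1e9, 0.2, 1.0, 20.0))
```

Here $t_{avg}$ is passed in as a fixed number; in the full model it depends on the cache hit rate and therefore on the thread count.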
![Page 13: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/13.jpg)
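Performance Model
Performance, bounded by PE utilization and by off-chip BW:
$$Performance = \min\left(N_{PE}\,\frac{f}{CPI_{exe}}\,\eta,\ \frac{BW_{max}}{r_m\, b_{reg}\,\bigl(1 - P_{hit}(S_\$, n_{threads})\bigr)}\right) \quad [OPS]$$
Machine utilization:
$$\eta = \min\left(1,\ \frac{n_{threads}}{N_{PE}\left(1 + t_{avg}\,\dfrac{r_m}{CPI_{exe}}\right)}\right)$$
Average memory access time:
$$t_{avg} = AMAT = P_{hit}(S_\$, n_{threads})\, t_\$ + \bigl(1 - P_{hit}(S_\$, n_{threads})\bigr)\, t_m \quad [cycles]$$
Power:
$$e_{ex} + r_m\Bigl[P_{hit}(S_\$, n)\, e_\$ + \bigl(1 - P_{hit}(S_\$, n)\bigr)\, e_{mem}\Bigr]$$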
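The performance equations of this slide can be combined into one function. This is a minimal sketch under stated assumptions: the `Machine` class, its field names, the demo values, and the constant hit-rate lambdas are all hypothetical, and $BW_{max}$ is taken in bytes per second rather than GB/sec. The hit-rate function is workload-specific, so it is passed in as a callable.

```python
from dataclasses import dataclass
from math import inf
from typing import Callable


@dataclass(frozen=True)
class Machine:
    n_pe: int            # N_PE, number of processing elements
    cache_bytes: float   # S_$, cache size [bytes]
    cpi_exe: float       # CPI_exe with a perfect memory system
    freq_hz: float       # f [Hz]
    t_cache: float       # t_$ [cycles]
    t_mem: float         # t_m [cycles]
    bw_max: float        # BW_max [bytes/second] (unit is an assumption)
    b_reg: float         # operand size [bytes]


def performance_ops(m: Machine, r_m: float, n_threads: float,
                    hit_rate: Callable[[float, float], float]) -> float:
    """min(utilization-bound performance, bandwidth-bound performance) in OPS."""
    p_hit = hit_rate(m.cache_bytes, n_threads)
    t_avg = p_hit * m.t_cache + (1.0 - p_hit) * m.t_mem          # AMAT
    eta = min(1.0, n_threads / (m.n_pe * (1.0 + t_avg * r_m / m.cpi_exe)))
    compute_bound = m.n_pe * m.freq_hz / m.cpi_exe * eta
    bytes_per_op = r_m * m.b_reg * (1.0 - p_hit)                  # off-chip traffic
    bw_bound = inf if bytes_per_op == 0.0 else m.bw_max / bytes_per_op
    return min(compute_bound, bw_bound)


# Illustrative machine (all values are assumptions).
demo = Machine(n_pe=1024, cache_bytes=16 * 2**20, cpi_exe=1.0, freq_hz=1e9,
               t_cache=1.0, t_mem=200.0, bw_max=8e9, b_reg=4.0)
print(performance_ops(demo, 0.2, 2048, lambda s, n: 1.0))  # every access hits
print(performance_ops(demo, 0.2, 2048, lambda s, n: 0.0))  # every access misses
```

Passing the hit rate as a function keeps the machine description separate from the workload, which mirrors the split between machine and workload parameters in the model.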
![Page 14: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/14.jpg)
![Page 15: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/15.jpg)
Unified Machine Performance
3 regions: Cache efficiency region, The Valley, MT efficiency region
[Plot: Performance vs. # Threads, with the Cache region, The Valley, and the MT region marked]
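A thread sweep makes the three regions visible. The sketch below is an illustration only: the hit-rate function (each thread has a fixed 16 KB working set and the hit rate is the fraction of the total working set that fits in the cache) and all machine values are assumptions, and the bandwidth limit is ignored.

```python
def assumed_hit_rate(cache_bytes: float, n_threads: float,
                     ws_bytes: float = 16 * 1024) -> float:
    """Hypothetical P_hit: share of the combined working set that fits in cache."""
    if n_threads <= 0:
        return 1.0
    return min(1.0, cache_bytes / (n_threads * ws_bytes))


def performance_ops(n_threads: float, n_pe: int = 1024, freq_hz: float = 1e9,
                    cpi_exe: float = 1.0, r_m: float = 0.2,
                    t_cache: float = 1.0, t_mem: float = 200.0,
                    cache_bytes: float = 16 * 2**20) -> float:
    """Utilization-bound performance [OPS]; off-chip bandwidth is not modeled."""
    p_hit = assumed_hit_rate(cache_bytes, n_threads)
    t_avg = p_hit * t_cache + (1.0 - p_hit) * t_mem
    eta = min(1.0, n_threads / (n_pe * (1.0 + t_avg * r_m / cpi_exe)))
    return n_pe * freq_hz / cpi_exe * eta


# Cache region, valley, and MT region for the assumed workload.
for n in (256, 1024, 2048, 8192, 40000):
    print(n, round(performance_ops(n) / 1e9, 1), "GOPS")
```

Under these assumptions performance rises while the working sets fit in the cache, drops once they spill, and recovers only when enough threads are available to hide the memory latency.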
![Page 16: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/16.jpg)
Cache Size Impact
- Increasing the cache size lets the cache suffice for more in-flight threads, which extends the $ region
- It is also valuable in the MT region: caches reduce off-chip bandwidth, which delays the BW saturation point
[Plot: Performance for Different Cache Sizes (Limited BW). GOPS vs. Number Of Threads; curves: no $, 16M, 32M, 64M, 128M, perfect $; annotation: Increase in cache size]
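The bandwidth effect of a larger cache can be sketched with the bandwidth bound $BW_{max} / (r_m\, b_{reg}\,(1 - P_{hit}))$. The hit-rate function, the 64 GB/s limit, and the other values below are assumptions for illustration, not numbers from the plot.

```python
def assumed_hit_rate(cache_bytes: float, n_threads: float,
                     ws_bytes: float = 16 * 1024) -> float:
    """Hypothetical P_hit: share of the combined working set that fits in cache."""
    if n_threads <= 0:
        return 1.0
    return min(1.0, cache_bytes / (n_threads * ws_bytes))


def bw_bound_ops(cache_bytes: float, n_threads: float,
                 bw_max_bytes_per_s: float = 64e9,
                 r_m: float = 0.2, b_reg: float = 4.0) -> float:
    """Highest performance [OPS] the off-chip bandwidth can sustain."""
    miss_rate = 1.0 - assumed_hit_rate(cache_bytes, n_threads)
    bytes_per_op = r_m * b_reg * miss_rate
    return float("inf") if bytes_per_op == 0.0 else bw_max_bytes_per_s / bytes_per_op


MB = 2**20
sizes_mb = (16, 32, 64, 128)
bounds = [bw_bound_ops(size * MB, n_threads=20000) for size in sizes_mb]
for size, bound in zip(sizes_mb, bounds):
    print(f"{size}M cache: bandwidth bound {bound / 1e9:.1f} GOPS")
```

Deep in the MT region the hit rate is low but not zero, so every doubling of the cache removes some off-chip traffic and raises the ceiling at which bandwidth saturates.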
![Page 17: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/17.jpg)
Hit Rate Function Impact
Simulation results from the PARSEC workloads kit
Swaptions: perfect Valley
[Plot: Swaptions. Performance (GOPS) and Cache Hit Rate (%) vs. Number Of Threads; curves: Analytical Model, Simulation, Cache Hit Rate]
![Page 18: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/18.jpg)
Hit Rate Function Impact
Simulation results from the PARSEC workloads kit
Raytrace: monotonically increasing performance
[Plot: Raytrace. Performance (GOPS) and Cache Hit Rate (%) vs. Number Of Threads; curves: Analytical Model, Simulation, Cache Hit Rate]
![Page 19: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/19.jpg)
Hit Rate Dependency – 3 Classes
Three application families, based on how the cache miss rate depends on the number of threads:
- A "strong" function of the number of threads: $f(N^q)$ with $q > 1$
- A "weak" function of the number of threads: $f(N^q)$ with $q \le 1$
- Not a function of the number of threads
[Plots: Performance vs. # Threads]
![Page 20: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/20.jpg)
Workload Parallelism Impact
Simulation results from the PARSEC workloads kit
Canneal: not enough parallelism available
[Plot: Canneal. Performance (GOPS) and Cache Hit Rate (%) vs. Number Of Threads; curves: Simulation, Analytical Model, Cache Hit Rate]
![Page 21: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/21.jpg)
![Page 22: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/22.jpg)
Summary
- A high-level model for many-core engines
- A unified framework for machines and workloads from across the range
- A vehicle to derive intuition
- Qualitative study of the tradeoffs
- A tool to understand the impact of parameters
- Identifies new behaviors and the applications that exhibit them
- Enables reasoning about complex phenomena
- First step towards escaping the valley
Thank you
[email protected]
![Page 23: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/23.jpg)
Backup
![Page 24: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/24.jpg)
![Page 25: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/25.jpg)
Model Parameters
Machine parameters:

| Parameter | Description |
|---|---|
| $N_{PE}$ | Number of PEs (in-order processing elements) |
| $S_\$$ | Cache size [Bytes] |
| $N_{max}$ | Maximal number of thread contexts in the register file |
| $CPI_{exe}$ | Average number of cycles required to execute an instruction assuming a perfect (zero-latency) memory system [cycles] |
| $f$ | Processor frequency [Hz] |
| $t_\$$ | Cache latency [cycles] |
| $t_m$ | Memory latency [cycles] |
| $BW_{max}$ | Maximal off-chip bandwidth [GB/sec] |
| $b_{reg}$ | Operands size [Bytes] |
![Page 26: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/26.jpg)
Model Parameters
Workload parameters:

| Parameter | Description |
|---|---|
| $n$ | Number of threads that execute or are in ready state (not blocked) concurrently |
| $r_m$ | Fraction of instructions accessing memory out of the total number of instructions [$0 \le r_m \le 1$] |
| $P_{hit}(s, n)$ | Cache hit rate for each thread, when $n$ threads are using a cache of size $s$ |
![Page 27: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/27.jpg)
Model Parameters
Power parameters:

| Parameter | Description |
|---|---|
| $e_{ex}$ | Energy per operation [J] |
| $e_\$$ | Energy per cache access [J] |
| $e_{mem}$ | Energy per memory access [J] |
| $Power_{leakage}$ | Leakage power [W] |
![Page 28: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/28.jpg)
PARSEC Workloads
![Page 29: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/29.jpg)
Model Validation, PARSEC Workloads
[Plots: six panels (Raytrace, Dedup, Canneal, Bodytrack, Swaptions, Blackscholes), each showing Performance (GOPS) and Cache Hit Rate (%) vs. Number Of Threads, with curves for Analytical Model, Simulation, and Cache Hit Rate]
![Page 30: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/30.jpg)
![Page 31: Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel](https://reader031.vdocuments.net/reader031/viewer/2022032309/56649d2f5503460f94a0685d/html5/thumbnails/31.jpg)
Related Work
Similar approach of using high-level models:
- Morad et al., CA-Letters 2005
- Hill and Marty, IEEE Computer 2008
- Eyerman and Eeckhout, ISCA-2010
Other related work:
- Agrawal, TPDS-1992
- Saavedra-Barrera and Culler, Berkeley 1991
- Sorin et al., ISCA-1998
- Hong and Kim, ISCA-2009
- Baghsorkhi et al., PPoPP-2010