![Page 1: A Multi-Level Adaptive Loop Scheduler for Power 5 Architecture Yun Zhang, Michael J. Voss University of Toronto Guansong Zhang, Raul Silvera IBM Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d3a5503460f94a14e3b/html5/thumbnails/1.jpg)
A Multi-Level Adaptive Loop Scheduler for Power 5 Architecture
Yun Zhang, Michael J. Voss
University of Toronto
Guansong Zhang, Raul Silvera
IBM Toronto Lab
Apr 18, 2023
Agenda
Background
Motivation
Previous Work
Adaptive Schedulers
IBM Power 5 Architecture
A Multi-Level Hierarchical Scheduler
Evaluation
Future Work
Simultaneous Multi-Threading Architecture
Several threads per physical processor
Threads share: caches, registers, functional units
Power 5 SMT
[Figure: three panels showing how the execution resources (Resource 0 … Resource n) are occupied by Thread 0 and Thread 1 over clock cycles]
OpenMP
OpenMP: a standard API for shared memory programming
Add directives for parallel regions
Standard loop schedulers: Static, Dynamic, Guided, Runtime
OpenMP API

    #pragma omp parallel for shared(a, b) private(i, j) schedule(runtime)
    for (i = 0; i < 100; i++) {
        for (j = 0; j < 100; j++) {
            a[i][j] = a[i][j] + b[i][j];
        }
    }

An example of a parallel loop in C code. (Similar in Fortran)
[Figure: the i × j iteration space divided among threads T0 … Tn]
Motivation
OpenMP applications: designed for SMP systems; not aware of HT technology.
Understanding and controlling the performance of OpenMP applications on SMT processors is not trivial.
Important performance issues on SMP systems with SMT nodes: inter-thread data locality, instruction mix, SMT-related load balance.
Scaling (Spec & NAS)
[Figure: speedup (0.00 to 4.00) versus number of threads (1 to 8) for ammp, apsi, art, equake, mgrid, swim, wupwise, BT, CG, EP, FT, MG, and SP, with regions marked "1 Thread per Processor" and "1-2 Threads per Processor"; measured on 4 Intel Xeon processors with Hyperthreading]
Why do they scale poorly?
Inter-thread data locality: cache misses.
Instruction mix: functional units are shared; the benefit gained this way may outweigh cache misses.
SMT-related load balance: we should balance workloads well among processors and among threads running on the same physical processor.
Previous Work: Runtime Adaptive Scheduler
Hierarchical scheduling: upper level scheduler, lower level scheduler
Select the scheduler and the number of threads to run at runtime: one thread per physical processor, or two threads per physical processor
Two-Level Hierarchical Scheduler
Traditional Scheduling
[Figure: the i × j iteration space divided among threads T0 … Tn under static scheduling, and among threads such as T0, Tn, Ti, Tk under dynamic scheduling]
Hierarchical Scheduling
[Figure: the i × j iteration space is first split among processors P0 … Pi by static scheduling; each processor's share is then split among its threads (T00, T01, … Ti0, Ti1) by dynamic scheduling]
Why can we benefit from runtime scheduler selection?
Many parallel loops in OpenMP applications are executed again and again.

Example

    for (k = 1; k < 100; k++) {
        …
        calculate();
        …
    }

    void calculate() {
    #pragma omp parallel for schedule(runtime)
        for (i = 1; i < 100; i++) {
            …; // calculation
        }
    }

# of calls vs. execution time

| Benchmark | < 10 times | 10 – 40 times | > 40 times |
|---|---|---|---|
| ammp | 0% | 0% | 84.20% |
| apsi | 0% | 0% | 82.55% |
| art | 100% | 0% | 0% |
| equake | 0.05% | 0% | 98.23% |
| mgrid | 0% | 0.11% | 95.95% |
| swim | 0.09% | 0% | 99.25% |
| wupwise | 0.12% | 0% | 99.49% |
| BT | 0% | 0% | 100% |
| CG | 0.92% | 3.5% | 92.57% |
| EP | 100% | 0% | 0% |
| MG | 12.73% | 12.87% | 71.91% |
| SP | 1.02% | 0% | 92.71% |
Adaptive Schedulers
Region Based Scheduler: selects loop schedulers at runtime; all parallel loops in one parallel region have to use the same scheduler, which may not be the best.
Loop Based Scheduler: higher runtime overhead; a more accurate loop scheduler for each parallel loop.
Sample from NAS2004

    !$omp parallel default(shared) private(i,j,k)
    !$omp do schedule(runtime)
          do j=1,lastrow-firstrow+1
             do k=rowstr(j),rowstr(j+1)-1
                colidx(k) = colidx(k) - firstcol + 1
             enddo
          enddo
    !$omp end do nowait
    !$omp do schedule(runtime)
          do i = 1, na+1
             x(i) = 1.0D0
          enddo
    !$omp end do nowait
    !$omp do schedule(runtime)
          do j=1, lastcol-firstcol+1
             q(j) = 0.0d0
             z(j) = 0.0d0
             r(j) = 0.0d0
             p(j) = 0.0d0
          enddo
    !$omp end do nowait
    !$omp end parallel

Region based scheduler: picks one scheduler that applies to all three loops.
Loop based scheduler: picks a scheduler separately for each of the three loops.
Runtime Loop Scheduler Selection
Phase 1: try the upper level schedulers, running with 4 threads
[Figure, repeated on three slides: machine M1 with processors P0 to P3, each running one thread (T0 to T3); the scheduler tried is, in turn, the Static Scheduler, the Dynamic Scheduler, and the Affinity Scheduler]
Runtime Loop Scheduler Selection
Phase 2: a decision has been made on the upper level scheduler; now try the lower level schedulers, running with 8 threads
[Figure: processor tree under M1 with 8 threads, T0 to T7; upper level: Affinity Scheduler, lower level: Static]
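The trial phases can be sketched as a small selector that runs each candidate scheduler once on successive executions of the same loop and then keeps the fastest one. This is only a sketch under stated assumptions: the slides do not say how the winner is chosen, so picking by measured wall time is an assumption, and `Selector`, `selector_next`, and `selector_report` are hypothetical names, not part of the authors' runtime.

```c
#include <assert.h>

/* Three candidates, mirroring the slides: 0 = static, 1 = dynamic, 2 = affinity. */
enum { N_CAND = 3 };

typedef struct {
    double time[N_CAND]; /* measured time of each candidate tried so far */
    int tried;           /* number of candidates already tried */
    int best;            /* index of the fastest candidate so far */
} Selector;

void selector_init(Selector *s) {
    s->tried = 0;
    s->best = 0;
}

/* Scheduler to use for the next execution of the loop:
 * the next untried candidate, or the winner once all were tried. */
int selector_next(const Selector *s) {
    return s->tried < N_CAND ? s->tried : s->best;
}

/* Record how long the loop took with candidate `cand`. */
void selector_report(Selector *s, int cand, double seconds) {
    if (s->tried >= N_CAND)
        return;                      /* the decision is final (assumption) */
    assert(cand == s->tried);        /* candidates are tried in order */
    s->time[cand] = seconds;
    if (seconds < s->time[s->best])
        s->best = cand;
    s->tried++;
}
```

Once the upper level choice is fixed, a second selector of the same shape could be used for the lower level schedulers with 8 threads.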
Sample from NAS2004
The same parallel region with three schedule(runtime) loops as in the earlier NAS2004 sample, now annotated with the scheduler selected for each loop:
Static-Static, 8 threads
TSS, 4 threads
TSS, 4 threads
Hardware Counter Scheduler
Motivation
The region based (RBS) and loop based (LBS) schedulers have runtime overhead. They would work even better if that overhead were reduced as much as possible.
Algorithm
1. Try different schedulers on the parallel loops of a subset of the benchmarks, using training data.
2. Use each loop's characteristics (cache misses, number of floating point operations, number of micro-ops, load imbalance) together with the best scheduler for that loop as input.
3. Feed the above data to classification software (we use C4.5) to build a decision tree.
4. Apply this decision tree to a loop at runtime: feed in the hardware counter data collected at runtime as input, and get the result, a scheduler, as output.
Speedup on 4 Threads
[Figure: speedup (1.00 to 4.50) per benchmark (ammp, apsi, art, equake, mgrid, swim, wupwise, BT(W), CG, EP, MG, SP(W), Average) for the static, dynamic, guided, afs, tss, original, RBS, LBS, and HCS schedulers; measured on 4 Intel Xeon processors with Hyperthreading]
Speedup on 8 Threads
[Figure: speedup (1.00 to 4.00) per benchmark (ammp, apsi, art, equake, mgrid, swim, wupwise, BT(W), CG, EP, MG, SP(W), Average) for the static, dynamic, guided, afs, tss, original, RBS, LBS, and HCS schedulers; measured on 4 Intel Xeon processors with Hyperthreading]
IBM Power 5
Technology: 130nm
Dual processor core
8-way superscalar
Simultaneous Multi-Threaded (SMT) core: up to 2 virtual processors
24% area growth per core for SMT
Natural extension to Power 4 design
Single Thread
Single-thread mode has an advantage when running execution-unit-limited applications, i.e. floating point or fixed point intensive workloads.
The extra resources necessary for SMT provide a higher performance benefit when dedicated to a single thread.
For some applications, data locality on one SMT core is better with a single thread.
Power 5 Multi-Chip Module (MCM)
Or Multi-Chipped Monster
4 processor chips
2 processors per chip
4 L3 cache chips
Power5 64-way Plane Topology
Each MCM has 4 inter-connected processor chips
Each processor chip has two processors on chip
Each processor has SMT technology, so two threads can execute on it simultaneously
Multi-Level Scheduler
[Figure: the loop iterations are split by the 1st level scheduler into iterations for Module 1 … Module i … Module n; within each module, a 2nd level scheduler splits them into iterations for Processor 1 … Processor m; within each processor, a 3rd level scheduler splits them into iterations for Thread 1 … Thread k]
OpenMP Implementation
Outlining technique: new subroutines are created from the body of each parallel construct.
Runtime routines receive the address of the outlined procedure as a parameter.
Source Code:

    #pragma omp parallel for shared(a,b) private(i)
    for (i = 0; i < 100; i++) {
        a = a + b;
    }

Outlined Functions:

    long main {
        _xlsmpParallelDoSetup_TPO(…)
    }

    void main@OL@1 ( … ) {
        do {
            loop body;
        } while (loop end condition meets);
        return;
    }

Runtime Library:
1. Initialize work items and work shares
2. Call _xlsmp_DynamicChunkCall(…)

    while (still iterations left, go to get some iterations for this thread) {
        …
        call main@OL@1(...);
        …
    }
With the hierarchical scheduler, the source code, the call to _xlsmpParallelDoSetup_TPO(…), and the outlined function main@OL@1 are the same as on the previous slide. Only the loop in the runtime library changes, and it now asks hier_sched for iterations:

    while (hier_sched(…)) {
        …
        call main@OL@1(...);
        …
    }
1. Look up the parent's iteration list to see whether any iterations are available; if so, take some iterations from the 2nd level scheduler and return.
2. Otherwise, look one level up, grab the lock for its group, and seek more iterations from the upper level using the upper level loop scheduler (a recursive function call), until it gets some iterations or the whole loop ends.
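[Figure: scheduling tree with a Root, modules M0 and M1, processors P0 and P1 under each module, threads T0 to T3 under M0 and T4 to T7 under M1; the levels are labeled Guided and Static Cyclic]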
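The two steps can be sketched for a two-level tree as follows. This is a minimal illustration, not the authors' runtime: only the name `hier_sched` comes from the slides, while `Root`, `Group`, `root_take`, the spin lock, and the fixed lower-level chunk size are assumptions. The lower level here simply hands out fixed-size chunks in arrival order, standing in for the static cyclic split, and with only two levels the recursion collapses into a single call to the upper level.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Upper level: iterations [next, end) of the whole loop not yet handed out. */
typedef struct {
    long next, end;
    int n_groups;        /* number of groups fed by this root */
    atomic_flag lock;
} Root;

/* Lower level: iterations [next, end) currently owned by one group. */
typedef struct {
    Root *root;
    long next, end;
    long chunk;          /* iterations handed to a thread per call (assumption) */
    atomic_flag lock;
} Group;

static void spin_lock(atomic_flag *f)   { while (atomic_flag_test_and_set(f)) { } }
static void spin_unlock(atomic_flag *f) { atomic_flag_clear(f); }

void root_init(Root *r, long n_iters, int n_groups) {
    assert(n_iters >= 0 && n_groups > 0);
    r->next = 0; r->end = n_iters; r->n_groups = n_groups;
    atomic_flag_clear(&r->lock);
}

void group_init(Group *g, Root *r, long chunk) {
    assert(chunk > 0);
    g->root = r; g->next = 0; g->end = 0; g->chunk = chunk;
    atomic_flag_clear(&g->lock);
}

/* Guided-style refill: take remaining / n_groups iterations, at least one. */
static bool root_take(Root *r, long *lo, long *hi) {
    spin_lock(&r->lock);
    long left = r->end - r->next;
    bool ok = left > 0;
    if (ok) {
        long n = left / r->n_groups;
        if (n < 1) n = 1;
        *lo = r->next;
        *hi = r->next + n;
        r->next = *hi;
    }
    spin_unlock(&r->lock);
    return ok;
}

/* Returns true and sets [*lo, *hi) if the calling thread got work. */
bool hier_sched(Group *g, long *lo, long *hi) {
    bool ok = true;
    spin_lock(&g->lock);                 /* lock for this thread's group */
    if (g->next >= g->end)               /* step 2: list empty, go one level up */
        ok = root_take(g->root, &g->next, &g->end);
    if (ok) {                            /* step 1: take from the group's list */
        *lo = g->next;
        *hi = g->next + g->chunk;
        if (*hi > g->end) *hi = g->end;
        g->next = *hi;
    }
    spin_unlock(&g->lock);
    return ok;
}
```

A worker thread would call `hier_sched` in a loop and run the outlined loop body on each returned range until the function returns false.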
Hierarchical Scheduler
Guided as the 1st level scheduler: balance workloads among processors; reduce runtime overhead
Static Cyclic as the 2nd level scheduler: improve cache locality; reduce runtime overhead
[Figure: the iteration space divided between threads T0 and T1, once using standard static scheduling and once using static cyclic scheduling, where chunks alternate between T0 and T1]
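The difference between the two divisions can be made concrete with a short sketch that computes which thread owns a given iteration. The function names and the `chunk` parameter are illustrative assumptions; the slides do not give the chunk size used by the static cyclic scheduler.

```c
#include <assert.h>

/* Standard static: each thread owns one contiguous block of iterations. */
int static_block_owner(long iter, long n_iters, int n_threads) {
    assert(n_threads > 0 && iter >= 0 && iter < n_iters);
    long block = (n_iters + n_threads - 1) / n_threads; /* ceiling division */
    return (int)(iter / block);
}

/* Static cyclic: chunks of `chunk` iterations are dealt out round-robin. */
int static_cyclic_owner(long iter, long chunk, int n_threads) {
    assert(n_threads > 0 && chunk > 0 && iter >= 0);
    return (int)((iter / chunk) % n_threads);
}
```

With 8 iterations and 2 threads, the block split gives T0 iterations 0 to 3 and T1 iterations 4 to 7, while the cyclic split with a chunk of 1 alternates T0, T1, T0, T1, and so on, so the two threads sharing a core work on neighbouring iterations at about the same time.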
Evaluation
IBM Power 5 system: 4 Power 5 1904 MHz SMT processors; 31872 MB of memory
Operating system: AIX 5.3
Compiler: IBM XL C/C++, XL Fortran compiler
Benchmark: SpecOMP2001
Scalability of IBM Power 5 SMT Processors, 1 through 8 threads
[Figure: speedup (0.00 to 6.00) versus number of threads (1 to 8) for ammp, applu, apsi, art, equake, gafort, mgrid, swim, and wupwise]
Evaluation on Power 5
Execution Time Normalized to Default (Static) Scheduler
[Figure: execution time normalized to static (0.80 to 1.10) per benchmark (ammp, applu, apsi, art, equake, gafort, mgrid, swim, wupwise) for the static, dynamic, guided, and Hier schedulers]
Conclusion
Standard schedulers are not aware of SMT technology.
Adaptive hierarchical schedulers take SMT-specific characteristics into account, which could make the OpenMP API (software) and SMT technology (hardware) work better together.
OpenMP parallel applications running on the Power 5 architecture with SMT have the same problem.
The multi-level hierarchical scheduler designed for IBM Power 5 achieves an average improvement of 3% over the default loop scheduler on SPEC OMP2001, with large improvements of 7% and 11% on some benchmarks.
On average it improves over all other standard OpenMP loop schedulers by at least 2%.
Future Work
Evaluate multi-level hierarchical scheduler on a larger system with 32 SMT processors (with MCM)
Explore performance on auto-parallelized benchmarks (SPEC CPU FP)
Examine mechanisms for determining best scheduler configuration at compile-time
Explore the use of helper threads on Power 5: cache prefetching
Thank You~
(A cache miss comparison chart will be shown here if we find a way to calculate the overall L2 load/store misses. If not, we will show the overhead of this optimization from the tprof data.)
Schedulers' Speedup on 4 Threads
[Figure: speedup (1.00 to 4.00) per benchmark for the static, dynamic, guided, afs, tss, and original schedulers]
Schedulers' Speedup on 8 Threads
[Figure: speedup (1.00 to 4.00) per benchmark (ammp, apsi, art, equake, mgrid, swim, wupwise, BT(W), CG, EP, MG, SP(W), Average) for the static, dynamic, guided, afs, tss, and original schedulers]
Decision Tree
Only one decision tree is built, offline, before executing the program.
Apply that decision tree to loops at runtime without changing the tree.
Make a decision on which scheduler to use with only one run of each loop, which greatly reduces runtime scheduling overhead.

    uops <= 3.62885e+08 :
    |   cachemiss <= 111979 :
    |   |   uops > 748339 : static-4
    |   |   uops <= 748339 :
    |   |   |   l/s <= 167693 : static-4
    |   |   |   l/s > 167693 : static-static
    |   cachemiss > 111979 :
    |   |   floatpoint <= 1.52397e+07 :
    |   |   |   cachemiss <= 384690 :
    |   |   |   |   uops <= 2.06431e+07 : static-static
    |   |   |   |   uops > 2.06431e+07 :
    |   |   |   |   |   imbalance <= 1330 : afs-static
    |   |   |   |   |   imbalance > 1330 :
    |   |   |   |   |   |   cachemiss <= 301582 : afs-4
    |   |   |   |   |   |   cachemiss > 301582 : guided-static
    …………………………….
    uops > 3.62885e+08 :
    |   l/s > 7.22489e+08 : static-4
    |   l/s <= 7.22489e+08 :
    |   |   imbalance <= 32236 : static-4
    |   |   imbalance > 32236 :
    |   |   |   floatpoint <= 5.34465e+07 : static-4
    |   |   |   floatpoint > 5.34465e+07 :
    |   |   |   |   floatpoint <= 1.20539e+08 : tss-4
    |   |   |   |   floatpoint > 1.20539e+08 :
    |   |   |   |   |   floatpoint <= 1.45588e+08 : static-4
    |   |   |   |   |   floatpoint > 1.45588e+08 : tss-4
    END hardware-counter scheduling
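Applying such a tree at runtime amounts to a chain of threshold tests on the collected counters. The sketch below hard-codes only the `uops > 3.62885e+08` subtree, which is the one shown in full on the slide; the other branch is partly elided there, so the function reports it as not shown rather than guessing. The type and function names (`LoopCounters`, `Sched`, `classify_loop`) are hypothetical, and the thresholds are copied from the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Leaves that occur in the fully shown subtree; names mirror the slide labels. */
typedef enum { SCHED_NOT_SHOWN, SCHED_STATIC_4, SCHED_TSS_4 } Sched;

/* Hardware counter values collected for one run of a loop. */
typedef struct {
    double uops;        /* micro-ops */
    double ls;          /* "l/s" on the slide */
    double imbalance;   /* load imbalance */
    double floatpoint;  /* floating point operations */
} LoopCounters;

Sched classify_loop(const LoopCounters *c) {
    assert(c != NULL);
    if (c->uops <= 3.62885e+08)
        return SCHED_NOT_SHOWN;     /* this branch is partly elided on the slide */
    if (c->ls > 7.22489e+08)
        return SCHED_STATIC_4;
    if (c->imbalance <= 32236)
        return SCHED_STATIC_4;
    if (c->floatpoint <= 5.34465e+07)
        return SCHED_STATIC_4;
    if (c->floatpoint <= 1.20539e+08)
        return SCHED_TSS_4;
    if (c->floatpoint <= 1.45588e+08)
        return SCHED_STATIC_4;
    return SCHED_TSS_4;
}
```

Because the tree is fixed before the program runs, the per-loop cost at runtime is a handful of comparisons after a single instrumented execution of the loop.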
(Load imbalance comparison chart will be shown here)
Generating……..