TRANSCRIPT
Amphisbaena: Modeling Two Orthogonal Ways to Hunt on Heterogeneous Many-cores
(an analytical performance model for boosting performance)
Jun Ma, Guihai Yan, Yinhe Han and Xiaowei Li
State Key Laboratory of Computer Architecture, Institute of Computing Technology, C.A.S.
Univ. of Chinese Academy of Sciences
Trends in Cloud Computing
The increasing computing demands:
– More massive
– More diverse
– High service level agreements (response time, throughput)
The computing platform to meet these demands:
– Multicore to manycore
– Homogeneous to heterogeneous
Two Orthogonal Ways to Boost Performance
– Scale-out speedup: explore many cores for higher thread-level parallelism.
– Scale-up speedup: explore heterogeneous cores for optimal application-core mapping.
Quantifying Scale-out and Scale-up Speedup
The overall performance speedup combines the two.
Type    Issue Width  ROB Size
Core-A  4            64
Core-B  6            96
Core-C  8            128
Indicates how to improve the overall performance of each application.
How to figure out the application-specific scale-out and scale-up speedup?
Amphisbaena: an Analytical Approach to Model Performance
Amphisbaena, or Phi for short: modeling the overall performance speedup coming from the two orthogonal ways.
– Scale-up speedup (Dup): the ratio of performance on target cores to that on current cores under the same multithreading configuration.
– Scale-out speedup (Dout): the ratio of performance under the target multithreading configuration to that under the current configuration on the same type of cores.
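These two ratios, and the way Phi composes them, can be sketched in a few lines. This is a minimal illustration; the function names are mine, and the product form follows the orthogonality assumption the talk validates later:

```python
def scale_up_speedup(perf_target_cores, perf_current_cores):
    """D_up: performance ratio across core types, threading held fixed."""
    return perf_target_cores / perf_current_cores

def scale_out_speedup(perf_target_threads, perf_current_threads):
    """D_out: performance ratio across thread counts, core type held fixed."""
    return perf_target_threads / perf_current_threads

def overall_speedup(d_out, d_up):
    """Phi: the two speedups are treated as orthogonal, so they multiply."""
    return d_out * d_up
```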
Experimental Setup
cluster-based layout; distributed, banked LLC
directory-based MOESI protocol
Scale-out Speedup
– the serial part.
– the parallelizable part.
– the multithreading penalty.
Observation
– a1: modulating constant.
– k1: synchronization waiting cycles per kilo-instructions (SPKI).
– n: thread number.
– a2: modulating constant.
– k2: miss waiting cycles per kilo-instructions (MPKI).
– n²: thread number squared.
The Details of Multithreading Penalty
Coefficients  Value        Implementations
a0            1.837e-003   constant
a1            0.05312      constant
a2            -2.025e-005  constant
k0            bias         redundant computations
k1            SPKI         bottleneck-identifying instructions
k2            MPKI         built-in performance counters
offline: a0–a2 (fitted once)
online: k0–k2 (measured at runtime)
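Putting the observation and the coefficient table together, a plausible sketch of the scale-out model is an Amdahl-style denominator plus the multithreading penalty. The exact composition below is my reading of the slides, not the authors' code; only the coefficient values and the meaning of each term come from the table:

```python
# Coefficients fitted offline (values from the slide's table)
A0, A1, A2 = 1.837e-3, 0.05312, -2.025e-5

def multithreading_penalty(n, k0, spki, mpki):
    """Multithreading penalty per the slide's observation:
    k0:   bias from redundant computations (measured online)
    spki: synchronization waiting cycles per kilo-instructions, linear in n
    mpki: miss waiting cycles per kilo-instructions, quadratic in n
    """
    return A0 * k0 + A1 * spki * n + A2 * mpki * n * n

def scale_out_speedup(n, serial, parallel, k0, spki, mpki):
    """D_out(n): Amdahl-style speedup with the penalty added to the
    denominator (assumed composition)."""
    return 1.0 / (serial + parallel / n
                  + multithreading_penalty(n, k0, spki, mpki))
```

With a zero penalty this degenerates to plain Amdahl's Law, which matches the slide's claim that the penalty term is what lets the Alpha model beat Amdahl's Law in accuracy.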
Alpha Model Accuracy
benchmarks: 12
phases: 50
threads: 33 (1, 2, 4, 6 … 64)
total space: 633600
samples: 600
Our average error is under 5%, outperforming Amdahl's Law, whose error is 11.4%.
Scale-up Speedup
the frontend: issue width • W ∈ [Big, Small]
the backend: ROB size • R ∈ [Big, Small]
How to predict the CPI on various type of cores?
[Diagram: four clusters C0–C3 composed of Big (B) and Small (S) cores.]
Observation
– this trend is well approximated by a power law.
– this trend fits an exponential function well.
The Details of CPI Model
Coefficients  Value                   Implementations
b0            0.2837                  constant
b1            1.1675                  constant
b2            1.8427                  constant
r             bias                    b0 × CPIbase
s             memory intensity        CPImem / CPI
t             computing intensity     CPIbase / CPI
CPImem        penalty with stalls     CPI stack calculation
CPIbase       penalty without stalls  CPI stack calculation
– memory intensity.
– computing intensity.
– bias.
offline: b0–b2 (fitted once)
online: the remaining terms (computed at runtime from the CPI stack)
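A hedged sketch of the Beta (CPI) model, using the coefficient values and term definitions from the slide's table. How the power-law and exponential trends attach to issue width and ROB size, and how the terms combine, are my assumptions:

```python
import math

# Coefficients fitted offline (values from the slide's table)
B0, B1, B2 = 0.2837, 1.1675, 1.8427

def predict_cpi(cpi_total, cpi_base, cpi_mem, width_ratio, rob_ratio):
    """Sketch of the Beta model: predict CPI on a target core from a
    CPI stack measured on the current core.
    cpi_base:    CPI without stalls (from the CPI stack)
    cpi_mem:     CPI spent in memory stalls (from the CPI stack)
    width_ratio: target issue width / current issue width
    rob_ratio:   target ROB size / current ROB size
    """
    r = B0 * cpi_base            # bias
    t = cpi_base / cpi_total     # computing intensity
    s = cpi_mem / cpi_total      # memory intensity
    # Assumed: the compute part follows a power law in issue width,
    # the memory part an exponential in ROB size (per the observation).
    compute = t * cpi_total * width_ratio ** (-B1)
    memory = s * cpi_total * math.exp(-B2 * (rob_ratio - 1.0))
    return r + compute + memory
```

A larger target core (width_ratio > 1, rob_ratio > 1) yields a lower predicted CPI, which is the qualitative behavior the observation slide describes.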
Beta Model Accuracy
benchmarks: 12
phases: 50
core types: 6
total space: 18000
samples: 600
Our average error stays below 8%, outperforming PIE, whose error is 12.2%.
Phi Model Accuracy
benchmarks: 12
phases: 50
threads: 33 (1, 2, 4, 6 … 64)
core types: 6
total space: 633600 × 18000
samples: 1080
The prediction error of the overall performance is below 12% on average.
Orthogonality Validation
Orthogonality check: the measured overall speedup is compared against the product of the separately measured scale-out and scale-up speedups.
benchmarks: 12
phases: 50
threads: 33 (1, 2, 4, 6 … 64)
core types: 6
total space: 633600 × 18000
measured: 2268
The comparison uses three measured values: the scale-out, the scale-up, and the overall speedup.
For most applications, the orthogonality error is below 5% on average.
Application of the Phi Model
Using Phi for runtime management:
– Predict the performance speedup coming from scale-out and scale-up on any other target configuration online.
– Invoke the scheduling algorithm to figure out the optimal configuration in terms of maximizing performance.
– The operating system enables the specified multithreading and application-core mapping.
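The three steps above can be sketched as one iteration of a runtime loop (the function names are hypothetical; the slide lists the steps, not code):

```python
def runtime_management_step(apps, configs, predict_phi, schedule, os_apply):
    """One iteration of Phi-based runtime management (sketch).
    predict_phi(app, cfg): predicted overall speedup on a target config
    schedule(scores):      picks the performance-maximizing config
    os_apply(app, cfg):    OS enacts the threading and core mapping
    """
    for app in apps:
        # 1. Predict the scale-out x scale-up speedup for every config.
        scores = {cfg: predict_phi(app, cfg) for cfg in configs}
        # 2. Scheduling picks the optimal configuration.
        best = schedule(scores)
        # 3. The OS enables the chosen multithreading and mapping.
        os_apply(app, best)
```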
Phi Scheduling
[Diagram: Dout and Dup feed into Phi.]
“An application with a higher scale-out speedup should spawn more threads.”
“The application with the largest scale-up speedup is allocated the fastest type of cores.”
“decide the thread number to spawn for each application.”
“decide the cores to map for each application.”
“Phi scheduling uses a heuristic algorithm to maximize performance.”
function
policy
algorithm
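The two policies can be turned into a simple greedy heuristic. This is a hypothetical illustration of the stated policies, not the actual Phi algorithm (which the slide does not give):

```python
def phi_schedule(apps, core_pools, total_threads):
    """Greedy sketch of Phi scheduling.
    apps:          list of (name, d_out, d_up) predicted speedups
    core_pools:    list of (core_type, capacity), fastest type first
    total_threads: thread budget to split across applications
    """
    # Policy 1: apps with higher scale-out speedup spawn more threads.
    out_sum = sum(d_out for _, d_out, _ in apps)
    threads = {name: max(1, round(total_threads * d_out / out_sum))
               for name, d_out, _ in apps}
    # Policy 2: the app with the largest scale-up speedup is allocated
    # the fastest core type that still has capacity.
    mapping = {}
    pools = [[ctype, cap] for ctype, cap in core_pools]
    for name, _, d_up in sorted(apps, key=lambda a: -a[2]):
        for pool in pools:
            if pool[1] > 0:
                mapping[name] = pool[0]
                pool[1] -= 1
                break
    return threads, mapping
```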
Performance Comparison
Baselines  Scale-out            Scale-up
Bias       Dout                 memory-related samples
PIE        Dout                 PIE model
Static     fixed thread number  Dup
Phi        Dout                 Dup
On average, Phi outperforms the other three baselines by 12.2% (Static), 13.3% (Bias), and 12.9% (PIE).
Related Works
Performance prediction and optimization, performed periodically
Works that only decided the number of threads/active cores:
• CPR: Composable Performance Regression for Scalable Multiprocessor Models – [Benjamin C. Lee et al., MICRO 2008]
• FDT: Feedback-Driven Threading: Power-Efficient and High-Performance Execution of Multi-threaded Workloads on CMPs – [M. Aater Suleman et al., ASPLOS 2008]
Works that only decided the type of heterogeneous cores:
• Single-ISA Heterogeneous Multi-core Architectures for Multithreaded Workload Performance – [Rakesh Kumar et al., ISCA 2004]
• Scheduling Heterogeneous Multi-cores Through Performance Impact Estimation (PIE) – [Kenzo Van Craeynest et al., ISCA 2012]
Conclusion
Analytical model for performance prediction:
– Scale-out speedup
– Scale-up speedup
– Overall performance
Phi scheduling:
– Applied for runtime management
– Returns optimal performance
Thanks for Your Attention
Q&A