jun ma, guihai yan, yinhe han and xiaowei li state key laboratory of computer architecture

Amphisbaena: Modeling Two Orthogonal Ways to Hunt on Heterogeneous Many-cores

An analytical performance model for boosting performance

Jun Ma, Guihai Yan, Yinhe Han and Xiaowei Li

State Key Laboratory of Computer Architecture

Institute of Computing Technology, C.A.S.

Univ. of Chinese Academy of Sciences

Trends in Cloud Computing

The increasing computing demands
• More massive
• More diverse
• High service level agreement (response time, throughput)

The computing platform to meet these demands
• Multicore to manycore
• Homogeneous to heterogeneous

Two Orthogonal Ways to Boost Performance

• Scale-out speedup: exploit many cores for higher thread-level parallelism
• Scale-up speedup: exploit heterogeneous cores for optimal application-core mapping

Quantifying Scale-out and Scale-up Speedup

The overall performance

Type     Issue Width   ROB Size
Core-A   4             64
Core-B   6             96
Core-C   8             128

Indicate how to improve overall performance of each application.

How to figure out the application-specific scale-out and scale-up speedup?

Amphisbaena: an Analytical Approach to Model Performance

Amphisbaena, or Φ for short: modeling the overall performance speedup coming from two orthogonal ways

Scale-up speedup (β): the ratio of performance on target cores to current cores under the same multithreading configuration.

Scale-out speedup (α): the ratio of performance on the target multithreading configuration to the current configuration on the same type of cores.
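The two ratios can be computed directly from measured performance. The sketch below is only an illustration: the `perf` table, its keys, and its numbers are hypothetical, and the names `alpha` (scale-out) and `beta` (scale-up) are chosen to match the Alpha and Beta model names used later in the talk.

```python
# Hypothetical measurements: performance (e.g. instructions per second)
# keyed by (core_type, thread_count). The numbers are made up.
perf = {
    ("small", 4): 1.0,
    ("small", 8): 1.6,
    ("big", 4): 1.5,
}


def scale_up_speedup(perf, current_core, target_core, threads):
    """Target cores vs. current cores, same multithreading configuration."""
    return perf[(target_core, threads)] / perf[(current_core, threads)]


def scale_out_speedup(perf, core, current_threads, target_threads):
    """Target thread count vs. current thread count, same type of cores."""
    return perf[(core, target_threads)] / perf[(core, current_threads)]


beta = scale_up_speedup(perf, "small", "big", threads=4)
alpha = scale_out_speedup(perf, "small", current_threads=4, target_threads=8)
print(f"scale-out = {alpha:.2f}, scale-up = {beta:.2f}")
```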

Experimental Setup

• cluster-based layout
• distributed, banked LLC
• directory-based MOESI protocol

Scale-out Speedup

– the serial part.
– the parallelizable part.
– the multithreading penalty.
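The equation that combined these three terms did not survive on the slide. As a rough illustration only, the sketch below shows classic Amdahl's Law next to one plausible way of adding a penalty term to its denominator. The extended form and the function names are assumptions, not the authors' formula.

```python
def amdahl_speedup(serial, parallel, threads):
    """Classic Amdahl's Law: only the parallelizable part shrinks with threads."""
    return (serial + parallel) / (serial + parallel / threads)


def scale_out_with_penalty(serial, parallel, threads, penalty):
    """ASSUMED extension: a multithreading penalty (in the same time units
    as serial/parallel) is added to the multithreaded execution time."""
    return (serial + parallel) / (serial + parallel / threads + penalty)


# Hypothetical workload: 10% serial, 90% parallelizable, 8 threads.
print(amdahl_speedup(0.1, 0.9, 8))
print(scale_out_with_penalty(0.1, 0.9, 8, penalty=0.05))
```

With any positive penalty the predicted speedup falls below the Amdahl bound, which is the qualitative behaviour the penalty term is meant to capture.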

Observation

– modulating constant.
– synchronization waiting cycles per kilo-instructions (SPKI).
– thread number.

– modulating constant.
– miss waiting cycles per kilo-instructions (MPKI).
– thread number squared.

The Details of Multithreading Penalty

Coefficients   Value          Implementations
a0             1.837e-003     constant
a1             0.05312        constant
a2             -2.025e-005    constant
k0             bias           redundant computations
k1             SPKI           bottleneck-identifying instructions
k2             MPKI           built-in performance counters

offline: a0, a1, a2

online: k0, k1, k2

Alpha Model Accuracy

benchmarks: 12
phases: 50
threads: 33 (1, 2, 4, 6, …, 64)
total space: 633600
samples: 600

Our error is under 5% on average, compared with an error of 11.4% for Amdahl’s Law.
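The slide does not say which error metric is used. A common choice for this kind of comparison is the mean absolute relative error between predicted and measured speedups, and the sketch below assumes that metric. The sample values are made up.

```python
def mean_abs_pct_error(predicted, measured):
    """Mean absolute relative error, in percent (assumed metric)."""
    errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical predicted vs. measured scale-out speedups for two samples.
predicted = [1.1, 1.8]
measured = [1.0, 2.0]
print(f"{mean_abs_pct_error(predicted, measured):.1f}%")
```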

Scale-up Speedup

the frontend: issue width

• W [Big, Small]

the backend: ROB size

• R[Big, Small]

How to predict the CPI on various types of cores?

[Figure: core types C0–C3, formed by combining Big (B) and Small (S) issue width and ROB size]

Observation

This trend is well approximated by a power law.

This trend fits an exponential function well.
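Both kinds of trend can be checked with a simple least-squares fit after a log transform. The sketch below is generic: the slide does not say which core parameter follows which trend, and the data points are synthetic, so the variable names and values are assumptions.

```python
import math


def _linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_power_law(xs, ys):
    """Fit y = c * x**k by regressing log(y) on log(x). Returns (c, k)."""
    k, log_c = _linear_fit([math.log(x) for x in xs], [math.log(y) for y in ys])
    return math.exp(log_c), k


def fit_exponential(xs, ys):
    """Fit y = c * exp(k * x) by regressing log(y) on x. Returns (c, k)."""
    k, log_c = _linear_fit(list(xs), [math.log(y) for y in ys])
    return math.exp(log_c), k


# Synthetic data generated from y = 3 * x**-0.5 (not measured values).
sizes = [32, 64, 96, 128]
cpis = [3.0 * x ** -0.5 for x in sizes]
print(fit_power_law(sizes, cpis))
```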

The Details of CPI Model

Coefficients   Value                   Implementations
b0             0.2837                  constant
b1             1.1675                  constant
b2             1.8427                  constant
r              bias                    b0×CPIbase
s              memory intensity        CPImem/CPI
t              computing intensity     CPIbase/CPI
CPImem         penalty with stalls     CPI stack calculation
CPIbase        penalty without stalls  CPI stack calculation

offline: b0, b1, b2

online: r, s, t

online: CPImem, CPIbase
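The online inputs listed in the table follow directly from a CPI stack. The sketch below computes only r, s and t as the table defines them; it does not reproduce the full CPI model, whose equation is not on the slide. The function name and sample numbers are hypothetical.

```python
B0 = 0.2837  # offline constant b0 from the table


def cpi_model_inputs(cpi, cpi_base, cpi_mem):
    """Derive the online inputs of the CPI model from a CPI stack.

    cpi      -- total measured CPI
    cpi_base -- CPI component without stalls
    cpi_mem  -- CPI component with (memory) stalls
    """
    r = B0 * cpi_base   # bias
    s = cpi_mem / cpi   # memory intensity
    t = cpi_base / cpi  # computing intensity
    return r, s, t


# Hypothetical CPI stack for one application phase.
r, s, t = cpi_model_inputs(cpi=2.0, cpi_base=1.0, cpi_mem=1.0)
print(r, s, t)
```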

Beta Model Accuracy

benchmarks: 12
phases: 50
core types: 6
total space: 18000
samples: 600

Our error is kept below 8% on average, compared with an error of 12.2% for PIE.

Phi Model Accuracy

benchmarks: 12
phases: 50
threads: 33 (1, 2, 4, 6, …, 64)
core types: 6
total space: 633600×18000
samples: 1080

The prediction error of overall performance is kept below 12% on average.
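The two total-space figures are consistent with counting, for every benchmark phase, each ordered pair of distinct current and target configurations. That reading is an inference from the numbers rather than something the slides state; the arithmetic can be checked as follows.

```python
BENCHMARKS = 12
PHASES = 50

# Thread counts 1, 2, 4, 6, ..., 64 as listed on the slide.
thread_counts = [1] + list(range(2, 65, 2))
core_types = 6


def ordered_pairs(n):
    """Number of (current, target) pairs with current != target."""
    return n * (n - 1)


scale_out_space = BENCHMARKS * PHASES * ordered_pairs(len(thread_counts))
scale_up_space = BENCHMARKS * PHASES * ordered_pairs(core_types)
print(len(thread_counts), scale_out_space, scale_up_space)
```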

Orthogonality Validation

[Equation: orthogonality condition relating the three measured values]

benchmarks: 12
phases: 50
threads: 33 (1, 2, 4, 6, …, 64)
core types: 6
total space: 633600×18000
measured: 2268

Φm, αm, βm: three measured values.

For most applications, the orthogonality error is below 5% on average.
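One natural way to quantify orthogonality is to compare the measured overall speedup with the product of the measured scale-out and scale-up speedups. The equation on the slide is not recoverable, so the definition below is an assumption. The function name and the sample values are hypothetical.

```python
def orthogonality_error(phi_m, alpha_m, beta_m):
    """Relative gap between the measured overall speedup and the product
    of the measured scale-out and scale-up speedups (assumed definition)."""
    return abs(phi_m - alpha_m * beta_m) / phi_m


# Hypothetical measurements for one application phase.
err = orthogonality_error(phi_m=2.85, alpha_m=2.0, beta_m=1.5)
print(f"{100 * err:.1f}%")
```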

Application of Phi Model

Using Phi for runtime management

Predict the performance speedup coming from scale-out and scale-up on any other target configurations online.

Invoke scheduling algorithm to figure out the optimal configuration in terms of maximizing performance.

The operating system enables the specified multithreading and application-core mapping.
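For a single application, the predict-then-select steps can be sketched as an exhaustive search over candidate configurations. The predictor below is a toy stand-in with made-up numbers, not the Phi model itself. `best_configuration` and `toy_predict` are hypothetical names.

```python
from itertools import product


def best_configuration(predict, core_types, thread_counts):
    """Return the (core_type, threads) pair with the highest predicted speedup."""
    return max(product(core_types, thread_counts), key=lambda cfg: predict(*cfg))


# Toy predictor (assumption): a per-core scale-up factor multiplied by a
# scale-out curve that saturates and then degrades as threads increase.
TOY_SCALE_UP = {"A": 1.0, "B": 1.3, "C": 1.5}


def toy_predict(core_type, threads):
    scale_out = threads / (1.0 + 0.01 * threads * threads)
    return TOY_SCALE_UP[core_type] * scale_out


print(best_configuration(toy_predict, ["A", "B", "C"], [1, 2, 4, 8, 16]))
```

The chosen pair would then be handed to the operating system, which applies the thread count and the application-core mapping.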

Phi Scheduling

Dout
• function: decide the thread number to spawn for each application.
• policy: “application with higher scale-out speedup should spawn more threads.”

Dup
• function: decide the cores to map for each application.
• policy: “application with the largest scale-up speedup is allocated the fastest type of cores.”

Phi
• algorithm: “Phi scheduling uses a heuristic algorithm to maximize performance.”
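The two policies can be illustrated with a simple greedy sketch. The proportional split in `dout` and the rank-wise pairing in `dup` are assumptions made for illustration; the slides do not give the actual heuristic. The application names and speedups are made up.

```python
def dout(scale_out, total_threads):
    """Give applications with higher predicted scale-out speedup more threads.

    ASSUMPTION: threads are split proportionally to the speedups, with at
    least one thread per application. Rounding an application up to one
    thread can exceed the budget slightly; a real scheduler would rebalance.
    """
    total = sum(scale_out.values())
    return {app: max(1, int(total_threads * s / total))
            for app, s in scale_out.items()}


def dup(scale_up, core_types_fastest_first):
    """Pair applications, ranked by scale-up speedup, with core types
    ranked from fastest to slowest (assumes one core type per application)."""
    ranked = sorted(scale_up, key=scale_up.get, reverse=True)
    return dict(zip(ranked, core_types_fastest_first))


# Hypothetical predicted speedups for two applications.
threads = dout({"x": 3.0, "y": 1.0}, total_threads=16)
cores = dup({"x": 1.2, "y": 1.8}, ["Core-C", "Core-B"])
print(threads, cores)
```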

Performance Comparison

Baselines   Scale-out             Scale-up
Bias        Dout                  memory-related samples
PIE         Dout                  PIE model
Static      fixed thread number   Dup
Phi         Dout                  Dup

On average, Phi outperforms the three baselines by 12.2% (Static), 13.3% (Bias) and 12.9% (PIE).

Related Works

Performance prediction and optimization periodically

Only decided the number of threads/active cores
• CPR: Composable Performance Regression for Scalable Multiprocessor Models – [Benjamin C. Lee et al., MICRO 2008]
• FDT: Feedback-Driven Threading: Power-Efficient and High-Performance Execution of Multi-threaded Workloads on CMPs – [M. Aater Suleman et al., ASPLOS 2008]

Only decided the type of heterogeneous cores
• Single-ISA Heterogeneous Multi-core Architectures for Multithreaded Workload Performance – [Rakesh Kumar et al., ISCA 2004]
• Scheduling Heterogeneous Multi-cores Through Performance Impact Estimation (PIE) – [Kenzo Van Craeynest et al., ISCA 2012]

Conclusion

Analytical model for performance prediction
• Scale-out speedup
• Scale-up speedup
• Overall performance

Phi scheduling
• Apply for runtime management
• Return optimal performance

Thanks for Your Attention

Q&A
