
Post on 24-Feb-2016


Page 1: Jun  Ma,  Guihai  Yan,  Yinhe Han  and  Xiaowei Li State Key Laboratory of Computer Architecture

Amphisbaena: Modeling Two Orthogonal Ways to Hunt on Heterogeneous Many-cores
An analytical performance model for boosting performance

Jun Ma, Guihai Yan, Yinhe Han and Xiaowei Li

State Key Laboratory of Computer Architecture, Institute of Computing Technology, C.A.S.

Univ. of Chinese Academy of Sciences

Trends in Cloud Computing

The increasing computing demands:
• More massive
• More diverse
• High service level agreement (response time, throughput)

The computing platform to meet these demands:
• Multicore to manycore
• Homogeneous to heterogeneous

Two Orthogonal Ways to Boost Performance

• Scale-out speedup: exploit many cores for higher thread-level parallelism
• Scale-up speedup: exploit heterogeneous cores for optimal application-core mapping

Quantifying Scale-out and Scale-up Speedup

The overall performance

Type     Issue Width   ROB Size
Core-A   4             64
Core-B   6             96
Core-C   8             128

Indicate how to improve overall performance of each application.

How to figure out the application-specific scale-out and scale-up speedup?

Amphisbaena: an Analytical Approach to Model Performance

Amphisbaena, or Φ (Phi) for short: modeling the overall performance speedup coming from two orthogonal ways.

• Scale-up speedup (the Beta model): the ratio of performance on the target cores to performance on the current cores under the same multithreading configuration.
• Scale-out speedup (the Alpha model): the ratio of performance under the target multithreading configuration to performance under the current configuration on the same type of cores.
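The two ratios defined above can be sketched in a few lines of Python. This is only an illustration: the function names, the symbols alpha and beta (borrowed from the Alpha and Beta model names used in this deck), and the throughput numbers are assumptions, and multiplying the two ratios to get the overall speedup is an assumed reading of "orthogonal", not a formula that survives in the transcript.

```python
def scale_out_speedup(perf_target_threads: float, perf_current_threads: float) -> float:
    """Performance ratio between two thread counts on the same core type."""
    return perf_target_threads / perf_current_threads


def scale_up_speedup(perf_target_core: float, perf_current_core: float) -> float:
    """Performance ratio between two core types at the same thread count."""
    return perf_target_core / perf_current_core


def overall_speedup(alpha: float, beta: float) -> float:
    """Assumed composition: orthogonal speedups multiply."""
    return alpha * beta


# Hypothetical throughput measurements (arbitrary units).
current = 100.0
alpha = scale_out_speedup(perf_target_threads=250.0, perf_current_threads=current)
beta = scale_up_speedup(perf_target_core=130.0, perf_current_core=current)
phi = overall_speedup(alpha, beta)
print(f"alpha={alpha:.2f} beta={beta:.2f} phi={phi:.2f}")
```

Keeping the two ratios as separate functions mirrors the idea that each one can be predicted independently of the other.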

Experimental Setup

• Cluster-based layout
• Distributed, banked LLC
• Directory-based MOESI protocol

Scale-out Speedup

– the serial part.
– the parallelizable part.
– the multithreading penalty.
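The slide's formula did not survive extraction, so the following is a hedged sketch of one common way to combine the three listed terms: Amdahl's Law with an extra penalty added to the parallel execution time. The function name, the normalization (single-thread time equal to serial + parallel) and the example numbers are all assumptions, not the authors' exact model.

```python
def speedup_with_penalty(serial: float, parallel: float, threads: int,
                         penalty: float = 0.0) -> float:
    """Amdahl-style speedup with an additive multithreading penalty.

    serial   : time share of the serial part on one thread
    parallel : time share of the parallelizable part on one thread
    penalty  : extra time caused by multithreading (assumed additive)
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    single_thread_time = serial + parallel
    multi_thread_time = serial + parallel / threads + penalty
    return single_thread_time / multi_thread_time


# Hypothetical application: 10% serial, 90% parallelizable, 16 threads.
amdahl = speedup_with_penalty(0.1, 0.9, 16)
with_penalty = speedup_with_penalty(0.1, 0.9, 16, penalty=0.05)
print(f"Amdahl only: {amdahl:.2f}, with penalty: {with_penalty:.2f}")
```

With a zero penalty the function reduces to plain Amdahl's Law, which is the baseline the deck compares against.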

Observation

– modulating constant.
– synchronization waiting cycles per kilo-instructions (SPKI).
– thread number.

– modulating constant.
– misses waiting cycles per kilo-instructions (MPKI).
– thread number squared.

The Details of Multithreading Penalty

Coefficient   Value         Implementation                         Obtained
a0            1.837e-003    constant                               offline
a1            0.05312       constant                               offline
a2            -2.025e-005   constant                               offline
k0            bias          redundant computations                 online
k1            SPKI          bottleneck-identifying instructions    online
k2            MPKI          built-in performance counters          online
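As a rough sketch of how the offline constants and online inputs might be combined, the snippet below pairs each constant with one input: a0 with the bias k0, a1 with SPKI times the thread number, and a2 with MPKI times the thread number squared. That pairing, the additive form, and the function name are assumptions; the transcript does not preserve the actual equation or how the result is normalized, so the output should not be read as a calibrated penalty.

```python
# Offline-fitted constants as listed on the slide.
A0 = 1.837e-3
A1 = 0.05312
A2 = -2.025e-5


def multithreading_penalty(k0: float, spki: float, mpki: float, threads: int) -> float:
    """Assumed additive form: a0*k0 + a1*SPKI*n + a2*MPKI*n^2.

    k0   : bias term (redundant computations), measured online
    spki : synchronization waiting cycles per kilo-instructions
    mpki : misses waiting cycles per kilo-instructions
    """
    return A0 * k0 + A1 * spki * threads + A2 * mpki * threads ** 2


# Hypothetical online readings for one phase at 8 threads.
penalty = multithreading_penalty(k0=1.0, spki=0.5, mpki=2.0, threads=8)
print(f"penalty term: {penalty:.5f}")
```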

Alpha Model Accuracy

benchmarks    12
phases        50
threads       33 (1, 2, 4, 6, …, 64)
total space   633600
samples       600

Our error is under 5% on average, compared with 11.4% for Amdahl’s Law.

Scale-up Speedup

• The frontend: issue width, W ∈ {Big, Small}
• The backend: ROB size, R ∈ {Big, Small}

How to predict the CPI on various types of cores?

[Figure: core types C0, C1, C2 and C3, one for each combination of Big (B) or Small (S) issue width and ROB size]

Observation

This trend is well approximated by a power law.

This trend fits an exponential function well.

The Details of CPI Model

Coefficient   Value                    Implementation          Obtained
b0            0.2837                   constant                offline
b1            1.1675                   constant                offline
b2            1.8427                   constant                offline
r             bias                     b0×CPIbase              online
s             memory intensity         CPImem/CPI              online
t             computing intensity      CPIbase/CPI             online
CPImem        penalty with stalls      CPI stack calculation   online
CPIbase       penalty without stalls   CPI stack calculation   online
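The online inputs r, s and t follow directly from the definitions in the table, so they can be sketched as below. The class and function names are hypothetical, and treating the total CPI as CPIbase + CPImem is an assumption about the CPI stack; the sketch stops at the inputs because the prediction formula that uses b1 and b2 is not recoverable from the transcript.

```python
from dataclasses import dataclass

B0 = 0.2837  # offline-fitted constant from the slide


@dataclass(frozen=True)
class CpiStack:
    cpi_base: float  # cycles per instruction without stalls
    cpi_mem: float   # cycles per instruction spent in memory stalls

    @property
    def cpi(self) -> float:
        # Assumption: the stack has only these two components.
        return self.cpi_base + self.cpi_mem


def model_inputs(stack: CpiStack) -> dict:
    """Compute the online inputs r (bias), s (memory) and t (computing)."""
    return {
        "r": B0 * stack.cpi_base,
        "s": stack.cpi_mem / stack.cpi,
        "t": stack.cpi_base / stack.cpi,
    }


# Hypothetical CPI stack measured on the current core.
stack = CpiStack(cpi_base=0.8, cpi_mem=0.2)
inputs = model_inputs(stack)
print(inputs)
```

Under the two-component assumption, s and t always sum to 1, so an application is characterized by a single point between memory-bound and compute-bound.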

Beta Model Accuracy

benchmarks    12
phases        50
core types    6
total space   18000
samples       600

Our error is kept below 8% on average, compared with 12.2% for PIE.
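The "total space" figures in the Alpha and Beta accuracy tables are not explained in the transcript. One reading that reproduces both numbers is that each point is a (benchmark, phase, current configuration, target configuration) tuple with distinct current and target configurations. The sketch below checks that arithmetic; the interpretation itself is an assumption.

```python
def ordered_pairs(n: int) -> int:
    """Number of (current, target) pairs with current != target."""
    return n * (n - 1)


BENCHMARKS = 12
PHASES = 50
CORE_TYPES = 6

# Thread counts 1, 2, 4, 6, ..., 64 give 33 configurations.
thread_counts = [1] + list(range(2, 65, 2))

alpha_space = BENCHMARKS * PHASES * ordered_pairs(len(thread_counts))
beta_space = BENCHMARKS * PHASES * ordered_pairs(CORE_TYPES)

print(len(thread_counts), alpha_space, beta_space)
```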

Phi Model Accuracy

benchmarks    12
phases        50
threads       33 (1, 2, 4, 6, …, 64)
core types    6
total space   633600×18000
samples       1080

The prediction error of overall performance is kept below 12% on average.

Orthogonality Validation

Orthogonality: $\Phi_m - \alpha_m \cdot \beta_m = 0$

benchmarks    12
phases        50
threads       33 (1, 2, 4, 6, …, 64)
core types    6
total space   633600×18000
measured      2268

$\Phi_m$, $\alpha_m$, $\beta_m$: the three measured values (overall, scale-out and scale-up speedup).

For most applications, the orthogonality error is below 5% on average.

Application of Phi Model

Using Phi for runtime management:

• Predict online the performance speedup coming from scale-out and scale-up on any other target configuration.
• Invoke the scheduling algorithm to figure out the optimal configuration in terms of maximizing performance.
• The operating system enables the specified multithreading and application-core mapping.

Phi Scheduling

Dout
• Function: “decide the number of threads to spawn for each application.”
• Policy: “an application with a higher scale-out speedup should spawn more threads.”

Dup
• Function: “decide the cores to map for each application.”
• Policy: “the application with the largest scale-up speedup is allocated the fastest type of cores.”

Phi
• Algorithm: “Phi scheduling uses a heuristic algorithm to maximize performance.”
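The two policies can be illustrated with a small greedy sketch. This is not the authors' heuristic, which the transcript does not spell out: the proportional thread split, the one-core-type-per-application ranking, the names `App`, `d_out` and `d_up`, and the assumption that Core-C is the fastest type are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class App:
    name: str
    scale_out: float  # predicted scale-out speedup
    scale_up: float   # predicted scale-up speedup


def d_out(apps, total_threads):
    """Give every app one thread, then split the rest by scale-out speedup."""
    if total_threads < len(apps):
        raise ValueError("need at least one thread per application")
    total = sum(a.scale_out for a in apps)
    spare = total_threads - len(apps)
    alloc = {a.name: 1 + int(spare * a.scale_out / total) for a in apps}
    # Threads lost to rounding go to the apps with the highest scale-out speedup.
    leftover = total_threads - sum(alloc.values())
    ranked = sorted(apps, key=lambda a: a.scale_out, reverse=True)
    for a in ranked[:leftover]:
        alloc[a.name] += 1
    return alloc


def d_up(apps, core_types_fast_to_slow):
    """Map apps to core types in decreasing order of scale-up speedup."""
    ranked = sorted(apps, key=lambda a: a.scale_up, reverse=True)
    last = len(core_types_fast_to_slow) - 1
    return {a.name: core_types_fast_to_slow[min(i, last)]
            for i, a in enumerate(ranked)}


# Hypothetical predictions for three applications.
apps = [App("A", 4.0, 1.1), App("B", 2.0, 1.6), App("C", 2.0, 1.3)]
threads = d_out(apps, total_threads=12)
cores = d_up(apps, ["Core-C", "Core-B", "Core-A"])  # assumed fastest first
print(threads, cores)
```

Here application A, with the highest scale-out speedup, receives the most threads, while application B, with the largest scale-up speedup, is mapped to the core type assumed to be fastest.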

Performance Comparison

Baseline   Scale-out             Scale-up
Bias       Dout                  memory-related samples
PIE        Dout                  PIE model
Static     fixed thread number   Dup
Phi        Dout                  Dup

On average, Phi outperforms the three baselines by 12.2% (Static), 13.3% (Bias) and 12.9% (PIE).

Related Works

Periodic performance prediction and optimization.

Only decided the number of threads/active cores:
• CPR: Composable Performance Regression for Scalable Multiprocessor Models [Benjamin C. Lee et al., MICRO 2008]
• FDT: Feedback-Driven Threading: Power-Efficient and High-Performance Execution of Multi-threaded Workloads on CMPs [M. Aater Suleman et al., ASPLOS 2008]

Only decided the type of heterogeneous cores:
• Single-ISA Heterogeneous Multi-core Architectures for Multithreaded Workload Performance [Rakesh Kumar et al., ISCA 2004]
• Scheduling Heterogeneous Multi-cores Through Performance Impact Estimation (PIE) [Kenzo Van Craeynest et al., ISCA 2012]

Conclusion

Analytical model for performance prediction:
• Scale-out speedup
• Scale-up speedup
• Overall performance

Phi scheduling:
• Applies to runtime management
• Returns optimal performance

Thanks for Your Attention

Q&A