revisiting loop tiling for datacenters: live and let...

47
Revisiting loop tiling for datacenters: Live and Let Live Jiacheng Zhao, Huimin Cui, Yalin Zhang, Jingling Xue, Xiaobing Feng Institute of Computing Technology, Chinese Academy of Sciences Work in conjunction with Prof. Jingling Xue, UNSW, Australia ICS’18, Beijing, China, Jun 15, 2018

Upload: others

Post on 16-Jul-2020

25 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Revisiting loop tiling for datacenters:Live and Let Live

Jiacheng Zhao, Huimin Cui, Yalin Zhang, Jingling Xue, Xiaobing FengInstitute of Computing Technology, Chinese Academy of SciencesWork in conjunction with Prof. Jingling Xue, UNSW, AustraliaICS’18, Beijing, China, Jun 15, 2018

Page 2: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Compiler optimization meets datacenter• Compiler optimization

Try to utilize hardware resources efficiently– Multi-level parallelism for ILP and TLP– Instruction scheduling for pipeline– Loop tiling for cache

Common property:– Exclusively enjoy all resources of a target machine– Especially architectural shared resources

• Datacenter era: Mix workloads to attain high utilization

2018/6/15 ICS’2018, Beijing, China 2

Page 3: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Compiler optimization meets datacenter• Loop (cache) tiling:

Locality optimization for matrix computations Tile data for different levels of cache Always tile for entire cache

2018/6/15 ICS’2018, Beijing, China 3

Tiled dataUncontended

Tiled dataCo-located Apps

Page 4: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Compiler optimization meets datacenter• Loop (cache) tiling:

Locality optimization for matrix computations Tile data for target levels of cache Always tile for entire cache

2018/6/15 ICS’2018, Beijing, China 4

Tiled application suffered from co-location

Page 5: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Compiler optimization meets co-location• Loop (cache) tiling:

Locality optimization for matrix computations Tile data for target levels of cache Always tile for entire cache

2018/6/15 ICS’2018, Beijing, China 5

Tile size really matters

0.6

0.8

1.0

1.2

1.4

0 1 2 3 4 5 6 7 8 9 10Normalize

d LLC Miss C

ount

Workload ID (0 for Solo Run)

Static Best Tile Size A

Page 6: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Compiler optimization meets co-location• Loop (cache) tiling:

Locality optimization for matrix computations Tile data for target levels of cache Always tile for entire cache

• Summary: Static best tile size can not deliver optimal performance

when co-running No single tile size can always perform best with different

co-runners• Problem:

How to select tile size in a co-location scenario?2018/6/15 ICS’2018, Beijing, China 6

Page 7: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Our Goals• Constructing a peer-aware analytical TSS model

Previous TSS model:– Cache– Reuse

TSS Model– Take co-runners into consideration

Work for real machines Without special os/hardware support

2018/6/15 ICS’2018, Beijing, China 7

Cache Capacity

Reuse Patterns

TSS Model

Hardware Program

Co-running Peers

Page 8: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

A tiled GEMM:

2018/6/15 ICS’2018, Beijing, China 8

• Two level 3-D tiling Inner level: for private cache (IB), fixed in our work Outer level: for shared cache (OB)

Page 9: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

A tiled GEMM:

2018/6/15 ICS’2018, Beijing, China 9

jjj

iii

ii

jj kkk

iii

ii

kk jjj

kkk

kk

jj

C A B

RB

RARC

• Three reuse patterns and reused data 𝑅 , 𝑅 , 𝑅 , Descending reuse distance

Page 10: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Our Intuition

• Tiled application (𝑇 ) + Cache flusher (F) 𝑇 : Multiple reuse patterns F: Tunable pressure on shared

cache• Consider the two consecutive

access of 𝑅

2018/6/15 ICS’2018, Beijing, China 10

𝑅 𝑅 𝑅 𝑅 𝑅

Shared Cache

Core 0 Core 0 Core 0 Core 0

𝑇 𝑭• What will happen as F gradually increases its speed of

fetching data from shared cache?

Page 11: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Our Intuition

2018/6/15 ICS’2018, Beijing, China 11

Physical Time

𝑅 𝑅 𝑅 𝑅

Uncontended 𝑅 𝑅 𝑅 𝑅

𝑟𝑑 𝑅 C

𝑅 𝑅

𝑟𝑑 𝑅 C

Cache

Page 12: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Our Intuition

2018/6/15 ICS’2018, Beijing, China 12

Physical Time

Uncontended 𝑅 𝑅 𝑅 𝑅

𝑟𝑑 𝑅 C

𝑅 𝑅

𝑟𝑑 𝑅 C

Content with FVery slow fetching speed 𝑅 𝑅 𝑅 𝑅

𝑟𝑑 𝑅 C

𝑅 𝑅

𝑟𝑑 𝑅 C

𝑅 𝑅 𝑅 𝑅

Data fetched by F

Cache

Page 13: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Our Intuition

2018/6/15 ICS’2018, Beijing, China 13

Physical Time

Content with FVery slow fetching speed 𝑅 𝑅 𝑅 𝑅

𝑟𝑑 𝑅 C

𝑅 𝑅

𝑟𝑑 𝑅 C

Uncontended 𝑅 𝑅 𝑅 𝑅

𝑟𝑑 𝑅 C

𝑅 𝑅

𝑟𝑑 𝑅 C

Data fetched by F

Content with FFaster fetching speed 𝑅 𝑅 𝑅 𝑅

𝑟𝑑 𝑅 C

𝑅 𝑅

𝑟𝑑 𝑅 C

𝑅 𝑅 𝑅

Reuse of 𝑅disappeared

Cache

Page 14: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Our Intuition

2018/6/15 ICS’2018, Beijing, China 14

Physical Time

Uncontended 𝑅 𝑅 𝑅 𝑅

𝑟𝑑 𝑅 C

𝑅 𝑅

𝑟𝑑 𝑅 C

Data fetched by F

Content with FFastest fetching speed

𝑅 𝑅 𝑅 𝑅

𝑟𝑑 𝑅 C Reuse of 𝑅disappeared

Content with FFaster fetching speed 𝑅 𝑅 𝑅 𝑅

𝑟𝑑 𝑅 C

𝑅 𝑅

𝑟𝑑 𝑅 C

Reuse of 𝑅disappeared

𝑅 Cache

Page 15: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Key Observation• Cache miss as a function of data fetching speed• Ideal Model: A step function

2018/6/15 ICS’2018, Beijing, China 15

Page 16: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Key Observation

• Reuse blocks of TA will be evicted from shared cache in decreasing order of their reuse distances, as F increases its data-fetching speed 𝑹𝑪, 𝑹𝑩, 𝑹𝑨 in order

2018/6/15 ICS’2018, Beijing, China 16

Page 17: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Key Observation

• If 𝑅 is going to be evicted, all other reuse blocks with a larger reuse distance have already been evicted

2018/6/15 ICS’2018, Beijing, China 17

Page 18: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Reuse-pattern-centric approach

• Focusing on each reuse pattern• Two important model parameters:

The number of cache misses if one reuse pattern disappears in shared cache

The breakpoint “speed” after which a reuse pattern disappears in shared cache

2018/6/15 ICS’2018, Beijing, China 18

Page 19: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Outline

Introduction and Backgrounds

Our Key Observation

Our Approach: Peer-aware tiling

Experimental Results

Conclusion

2018/6/15 ICS’2018, Beijing, China 19

Page 20: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Our Approach: Peer-aware tiling• Modeling each individual reuse pattern

Co-runners modeled as aggregate cache pressure, p– No need to analyze all co-running peers

Decouple cache misses into two parts:– Cache miss in solo run: 𝑚𝑖𝑠𝑠– Extra cache miss when co-running: Δ 𝑅

» Reuse data are evicted as their reuse distance is larger than cache size

» Modeled as a function of cache pressure

• Aggregate all reuse patterns Simply add Δ 𝑅 from all reuse patterns

2018/6/15 ICS’2018, Beijing, China 20

Page 21: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Our Approach: Reuse-pattern-centric modeling • Modeling individual reuse pattern:

Profiling + Compiler analysis approach Model parameters:

– For TA:ℎ𝑖𝑡 ,𝑚𝑖𝑠𝑠 , 𝑡» Obtained by offline profiling

– For each reuse pattern: 𝑟𝑑 𝑅 , 𝑛 𝑅 , 𝑟𝑐 𝑅» Obtained through compiler analysis

– Platform constants:» Memory/LLC access latency» Shared cache association» Cache size (𝐶)

– Cache flusher: p» L2LinesInRate» Cache Pressure

2018/6/15 ICS’2018, Beijing, China 21

Tile size 𝑡𝑖

Page 22: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Reuse-pattern-centric modeling: Overall

Basically, we haveΔ 𝑅 ℎ𝑖𝑡 𝑅 ∗ 𝜎 𝑅, 𝐹, 𝐶

• where

𝜎 𝑅, 𝐹, 𝐶 10 𝑓𝑝 𝑅 𝑓𝑝 𝐹 𝐶

𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒

2018/6/15 ICS’2018, Beijing, China 22

ℎ𝑖𝑡 𝑅 ℎ𝑖𝑡 𝑅

ℎ𝑖𝑡 𝑅

Page 23: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Estimating

• Ideally, ℎ𝑖𝑡 𝑅 can be calculated from n R ∗ 𝑟𝑐 𝑅• A more precise approach: distribute hits in solo run

over all reuse patterns

ℎ𝑖𝑡 𝑅𝑛 𝑅 ∗ 𝑟𝑐 𝑅

∑ 𝑛 𝑅 ∗ 𝑟𝑐 𝑅 ∗ ℎ𝑖𝑡

2018/6/15 ICS’2018, Beijing, China 23

ℎ𝑖𝑡 𝑅 ℎ𝑖𝑡 𝑅

ℎ𝑖𝑡 𝑅

Page 24: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Estimating ∗

𝜎 𝑅, 𝐹, 𝐶 10 𝑓𝑝 𝑅 𝑓𝑝 𝐹 𝐶

𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒 We have:

𝑓𝑝 𝑅 𝑟𝑑 𝑅𝑓𝑝 𝐹 𝑝 ∗ 𝑡

𝑡 is the time duration of two consecutive accesses of R

2018/6/15 ICS’2018, Beijing, China 24

𝑝∗ 𝑅

𝑝∗ 𝑅𝑝∗ 𝑅

Physical time, not logical time

Page 25: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

2018/6/15 ICS’2018, Beijing, China 25

Physical Time

Content with FVery slow fetching speed 𝑅 𝑅 𝑅 𝑅

𝑟𝑑 𝑅 C

𝑅 𝑅

𝑟𝑑 𝑅 C

Uncontended 𝑅 𝑅 𝑅 𝑅

𝑟𝑑 𝑅 C

𝑅 𝑅

𝑟𝑑 𝑅 C

Data fetched by F

Content with FFaster fetching speed 𝑅 𝑅 𝑅 𝑅

𝑟𝑑 𝑅 C

𝑅 𝑅

𝑟𝑑 𝑅 C

𝑅 𝑅 𝑅 𝑅

Reuse of 𝑅disappeared

Cache

𝑡

Page 26: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Estimating ∗

𝑓𝑝 𝐹 𝑝 ∗ 𝑡• For the outer-most reuse pattern, e.g. 𝑅

𝑡 can be estimated using 𝑡• For other reuse patterns, e.g. 𝑅 , 𝑅

Estimating execution time when all reuse blocks with larger reuse distance have been evicted, details in paper

2018/6/15 ICS’2018, Beijing, China 26

𝑝∗ 𝑅

𝑝∗ 𝑅𝑝∗ 𝑅

Page 27: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Estimating ∗

• We can obtain:

𝑝∗ 𝐶 𝑟𝑑 𝑅𝑡

2018/6/15 ICS’2018, Beijing, China 27

𝑝∗ 𝑅

𝑝∗ 𝑅𝑝∗ 𝑅

Page 28: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Adjustment for real machines

2018/6/15 ICS’2018, Beijing, China 28

Ideal

Real

Page 29: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Adjustment for real machines• Cache is not fully-associative

Cache is “smaller”: – Effective cache size (𝑒𝐶)

𝑒𝐶#

# 2

Based on [Sen+, SIGMETRICS ’13]• Step function to continuous function• A lower bound: 𝑝∗

Corresponding to 𝑒𝐶 𝑝∗ as upper bound

• A tanh function for simulation Inspired by machine learning

2018/6/15 ICS’2018, Beijing, China 29

𝐶𝑒𝐶

Page 30: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Put it all together• Modeling individual reuse pattern:

Extra cache miss for a reuse pattern R when co-running with a cache flusher of pressure p

𝜟 𝑹

𝟎

𝒉𝒊𝒕 𝑹 ∗𝟏𝟐 𝒕𝒂𝒏𝒉 𝒑

𝒑∗ 𝒑∗

𝟐

𝒉𝒊𝒕 𝑹

𝒑 𝒑∗

𝒑∗ 𝒑 𝒑∗

𝒑 𝒑∗

• Aggregate all reuse patterns:

𝒎𝒊𝒔𝒔 𝑻𝑨 𝚫 𝑹𝒊 𝒎𝒊𝒔𝒔𝒔𝒐𝒍𝒐 𝑻𝑨𝒊

2018/6/15 ICS’2018, Beijing, China 30

Page 31: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Our Approach: A Peer-aware tiling framework• Leverage our model to support two optimization

objectives: Efficiency: Live

– Reduce performance slowdown suffered by TA

– Tile size minimizing shared cache miss count of TA

𝒎𝒊𝒔𝒔 𝑻𝑨 𝚫 𝑹𝒊 𝒎𝒊𝒔𝒔𝒔𝒐𝒍𝒐 𝑻𝑨𝒊

2018/6/15 ICS’2018, Beijing, China 31

Page 32: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Our Approach: A Peer-aware tiling framework• Leverage our model to support two optimization

objectives: Efficiency: Live

– Reduce performance slowdown suffered of TA

– Tile size minimizing shared cache miss count of TA

Niceness: Let Live– Reduce performance slowdown incurred by TA

– Tile size minimizing shared cache miss frequency of TA» Actually minimizing memory bandwidth consumption

𝑚𝑖𝑠𝑠_𝑓𝑒𝑞 𝑇 𝑚𝑖𝑠𝑠 𝑇𝑡 _

2018/6/15 ICS’2018, Beijing, China 32

Page 33: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Our Approach: Tiling for different obectives• Leverage our model to support two optimization

objectives: Efficiency: Live

– Reduce performance slowdown suffered of TA

– Tile size minimizing shared cache miss count of TA

Niceness: Let Live– Reduce performance slowdown incurred by TA

– Tile size minimizing shared cache miss frequency of TA» Actually minimizing memory bandwidth consumption

We focus on TSS model Parametric tiling by PrimeTile [Hartono+, ICS’09]

2018/6/15 ICS’2018, Beijing, China 33

Page 34: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Outline

Introduction and Backgrounds

Our Key Observation

Our Approach: Peer-aware loop tiling

Experimental Results

Conclusion

2018/6/15 ICS’2018, Beijing, China 34

Page 35: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Evaluation Setup: Methodology• Evaluating efficiency: Live

Optimizing performance of tiled matrix computations Reducing performance slowdown suffered LLC miss count reduction/prediction Accuracy of our prediction model Key observation of optimal tile size Improve performance of library routines

• Evaluation niceness: Let live Optimizing performance of co-located applications Reducing performance slowdown incurred

2018/6/15 ICS’2018, Beijing, China 35

Page 36: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Evaluation Setup: Apps & Platforms• Applications:

Matmul (GEMM) & tmm from Pluto conv from Caffe memcached as a latency-sensitive application with mutilate

being load generator and latency tester SPEC and STREAM as co-runners

• Platforms: Westmere-based Intel six-core Xeon E5645 (main)

• Loop tiling: Two-level loop tiling

– Inner tile size: fixed as 56– Outer tile size: 840 (static best, found by hand-tuning)

» Search space for our approach: [280, 840] with a step of 56 (inner tile size)

2018/6/15 ICS’2018, Beijing, China 36

Page 37: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Efficiency: Overall Performance of GEMM• Optimizing performance of GEMM when co-locating

• Performance slowdown suffered is reduced by 34.1%, from 22.9% to 15.1% (14.8% for optimal tiling)

2018/6/15 ICS’2018, Beijing, China 37

Page 38: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Efficiency: LLC Miss Reduction• Benefits come from cache miss reduced

• LLC miss count is reduced by 24%, from 8.24x to 6.24x

2018/6/15 ICS’2018, Beijing, China 38

Page 39: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Efficiency: LLC Miss Prediction• Precise prediction LLC Miss of GEMM when co-locating

• Prediction error: average 3.9%

2018/6/15 ICS’2018, Beijing, China 39

Page 40: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Efficiency: Accuracy of Prediction Model• Prediction models for three representative tile sizes

2018/6/15 ICS’2018, Beijing, China 40

Page 41: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Efficiency: Optimal Tile Size• Optimal tile size is a non-monotone function of co-

runner’s cache pressure

• Aggregate behavior of different reuse patterns

2018/6/15 ICS’2018, Beijing, China 41

Page 42: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Efficiency: Optimizing Library Routines• Improving performance of ATLAS when co-running

Three configurations– ATLAS: default implementation– Static: ATLAS + hand-tuning outer tiling (fixed tile size)– Peer-aware: Static + our approach (tile sized determined at

runtime) One application

– cblas_dgemm

2018/6/15 ICS’2018, Beijing, China 42

Page 43: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Efficiency: Optimizing Library Routines

• 8.9% improvement over ATLAS• 7.1% improvement over Static

2018/6/15 ICS’2018, Beijing, China 43

Page 44: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Niceness: Optimizing co-running peers• Scenario:

memcached and GEMM co-located together GEMM’s optimization objective is set to “niceness”

95th latency: Solo (228 us)– Static (292 us), 28% slowdown– Peer-aware (271 us), 18.8% slowdown, a reduction factor of 32.9%

2018/6/15 ICS’2018, Beijing, China 44

Page 45: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Outline

Introduction and Backgrounds

Our Key Observation

Our Approach: Peer-aware tiling

Experimental Results

Conclusion

2018/6/15 ICS’2018, Beijing, China 45

Page 46: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Conclusion• A reuse-pattern-centric approach for modeling co-

running cache behavior of tiled matrix computations• An approach for analytically building above model• A peer-aware tiling framework supporting two

optimization objectives: efficiency and niceness, based on our analytical model

2018/6/15 ICS’2018, Beijing, China 46

Page 47: Revisiting loop tiling for datacenters: Live and Let Liveics2018.ict.ac.cn/slides/ICS18-paper72-zhao-slides.pdf · Compiler optimization meets datacenter • Compiler optimization

Thank you.

2018/6/15 ICS’2018, Beijing, China 47