Addressing End-to-End Memory Access Latency in NoC Based Multicores

Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das

Department of Computer Science and Engineering, The Pennsylvania State University


Page 1:

Department of Computer Science and Engineering, The Pennsylvania State University

Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das

Addressing End-to-End Memory Access Latency in NoC Based Multicores

Page 2: Outline

Introduction and Motivation

Details of the Proposed Schemes

Implementation

Experimental Setup and Results

Page 3: Target System

Tiled multicore architecture

Mesh NoC

Shared, banked L2 cache (S-NUCA)

Memory controllers (MCs)

[Figure: tiled multicore layout. Each node contains a core, an L1 cache, an L2 bank, and a router; nodes are connected by communication links, with the MCs at the mesh edges.]

Page 4: Components of Memory Latency

Many components add to end-to-end memory access latency

[Figure: path of a memory access across the mesh. A request message travels (1) from L1 to the L2 bank and (2) from the L2 bank to a memory controller; (3) the request is serviced at memory; the response message travels (4) from the MC back to the L2 bank and (5) from L2 to L1.]

Page 5: End-to-end Memory Latency Distribution

Significant contribution from the network

Higher network contribution for longer latencies

Motivation: reduce the contribution from the network and make delays more uniform

[Figure: left, end-to-end delay broken into components (L1 to L2, L2 to Mem, Mem, Mem to L2, L2 to L1) for delay ranges from 150-200 to 650-700 cycles; right, fraction of total accesses versus delay (100-900 cycles), with the average marked.]

Page 6: Out-of-Order Execution and MLP

OoO execution: many memory requests in flight

The instruction window advances as the oldest instruction commits; a memory access with a long delay can block the instruction window and degrade performance

[Figure: instruction window holding Load-A (L1 hit), Load-B (L2 hit), and Load-C and Load-D (L2 misses that traverse the network to memory); commit stalls until the oldest outstanding load completes.]

Page 7: Memory Bank Utilization

Large idle times; more banks means more idle time

Variation in queue length: some queues occupied, some queues empty

Motivation: utilize banks better and improve memory performance

[Figure: measured idleness of 16 banks (roughly 0.70-0.90) and per-bank request queues at the MCs with uneven occupancy.]

Page 8: Proposed Schemes

Scheme 1: Identify and expedite "late" memory response messages, reducing the NoC latency component and providing more uniform end-to-end memory latency

Scheme 2: Identify and expedite memory request messages targeting idle memory banks, improving memory bank utilization and memory performance

Page 9: Scheme 1

Based on the first motivation: messages with high latency can be problematic, and the NoC is a significant contributor, so expedite them on the network

Prioritization: give higher priority to "late" messages, on the response (return) path only. Why responses? For request messages there is not enough information yet; response messages are easier to classify as late

Bypassing the pipeline: merge stages in the router and reduce latency

Page 10: Scheme 1: Calculating Age

Age = "so-far" delay of a message

12 bits, carried in the 128-bit header flit; no extra flit needed (assuming 12 bits are available)

Updated locally at each router and MC; no global clock needed

Frequency is taken into account, so DVFS at routers/nodes is supported:

age = age + (cycles_current - cycles_message_entry) × FREQ_MULT / local_frequency
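A minimal Python sketch of this per-hop update (the function name, frequency unit, and saturation behavior are illustrative assumptions, not from the slides):

```python
def update_age(age, entry_cycle, current_cycle, local_freq, FREQ_MULT=1000):
    """Add the cycles a message spent at this router/MC to its 12-bit age
    field, scaled by FREQ_MULT / local_frequency so that nodes running at
    different DVFS frequencies accumulate comparable, time-based ages."""
    elapsed = current_cycle - entry_cycle        # cycles spent at this node
    age += (elapsed * FREQ_MULT) // local_freq   # frequency-normalized delta
    return min(age, (1 << 12) - 1)               # saturate: age is 12 bits
```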

Page 11: Scheme 1: Example

MC1 receives a request from core-1; R0 is the response message

MC1 updates the age field, adding memory queuing/service time

MC1 uses the age to decide whether R0 is "late"; if so, it marks R0 as high-priority and injects it into the network as high-priority

[Figure: mesh with four MCs; response R0 carries an age field and a "late" flag on its way back to core-1.]

Page 12: Scheme 1: Late Threshold Calculation

Cores: continuously calculate the average round-trip delay, convert it into an average "so-far" delay and then into a lateness threshold, and periodically send the threshold to the MCs

MCs: record the threshold values and use them to decide whether a response is "late"

Each application is treated independently: the goal is not uniform latency across the whole system, but uniform latency for each core/application

[Figure: round-trip and so-far delay distributions with Delay_avg, Delay_so-far-avg, and the threshold marked.]
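A hedged sketch of this flow (the running-average form, the so-far fraction, and the margin value are assumptions for illustration, not the paper's exact mechanism):

```python
class LateClassifier:
    def __init__(self, margin=1.2, alpha=0.1):
        self.margin = margin       # threshold = margin * average so-far delay
        self.alpha = alpha         # smoothing factor for the running average
        self.sofar_avg = 0.0

    def observe_round_trip(self, round_trip_delay, sofar_fraction=0.5):
        # Core side: each completed access updates the average "so-far" delay,
        # i.e. the delay a typical response has accumulated on reaching the MC.
        sofar = round_trip_delay * sofar_fraction
        self.sofar_avg += self.alpha * (sofar - self.sofar_avg)

    def is_late(self, age):
        # MC side: a response whose age already exceeds the threshold is
        # marked high-priority for its return trip through the network.
        return age > self.margin * self.sofar_avg
```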

Page 13: Scheme 2: Improving Bank Utilization

Based on the second motivation: high idleness at memory banks and uneven utilization

Improve bank utilization using network prioritization. Problem: no global information is available. Approach: prioritize at the routers using router-local information

Bank History Table per router: the number of requests sent to each bank in the last T cycles; if a message targets an idle bank, prioritize it

Route a diverse set of requests to all banks and keep all banks busy (a sketch of the table follows)
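A minimal Python sketch of such a per-router table, assuming a sliding window over the last T cycles (the method names are hypothetical):

```python
from collections import deque

class BankHistoryTable:
    def __init__(self, num_banks, T=200):
        self.T = T
        self.counts = [0] * num_banks
        self.events = deque()            # (cycle, bank) of recent requests

    def record(self, cycle, bank):
        # Called when this router forwards a request toward a bank.
        self.events.append((cycle, bank))
        self.counts[bank] += 1

    def _expire(self, cycle):
        # Drop requests that fell out of the T-cycle window.
        while self.events and self.events[0][0] <= cycle - self.T:
            _, bank = self.events.popleft()
            self.counts[bank] -= 1

    def should_prioritize(self, cycle, bank):
        # A request heading to a bank this router has not fed recently is
        # likely targeting an idle bank; expedite it.
        self._expire(cycle)
        return self.counts[bank] == 0
```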

Page 14: Network Prioritization Implementation

Routing: 5-stage router pipeline; flit-buffered, virtual channel (VC), credit-based flow control; messages split into several flits; wormhole routing

Our method: a message to expedite gets higher priority in the VC and switch (SW) arbitrations, and employs pipeline bypassing [Kumar07] to use fewer stages, packing 4 cycles into 1

[Figure: baseline pipeline BW | RC | VA | SA | ST versus bypassing setup | ST.]
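As a toy illustration (not the paper's arbiter), the priority rule in the VC/switch allocation could order competing flits by the expedite flag first, then by age so older traffic is not starved:

```python
def arbitrate(requests):
    """requests: list of (expedited: bool, age: int, flit_id) tuples.
    Returns the winner of this arbitration round: expedited flits first,
    ties broken by the largest accumulated age."""
    return max(requests, key=lambda r: (r[0], r[1]))

# Example: the expedited flit wins even though another flit is older.
winner = arbitrate([(False, 300, "a"), (True, 120, "b"), (False, 50, "c")])
assert winner[2] == "b"
```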

Page 15: Experimental Setup

Simulator: GEMS (Simics + Opal + Ruby)

Cores: 32 OoO cores, 128-entry instruction window, 64-entry LSQ

L1: 32KB, 64B/line, 3-cycle latency

L2: 512KB/bank, 64B/line, 10-cycle latency, 1 bank/node (32 banks total)

NoC: 4x8 mesh; 5-stage routers, 128-bit flits, 6-flit buffers, 4 VCs per port, X-Y routing

Memory: DDR-800, bus multiplier = 5, bank busy time = 22 cycles, rank delay = 2 cycles, read-write delay = 3 cycles, memory controller latency = 20 cycles, 16 banks per MC, 4 MCs

Page 16: Experimental Setup

Benchmarks from SPEC CPU2006, categorized by memory intensity (L2 MPKI): high memory intensity vs. low memory intensity [Henning06]

18 multiprogrammed workloads, 32 applications each. Workload categories:

WL 1-6: Mixed (50% high intensity, 50% low intensity)

WL 7-12: Memory intensive (100% high intensity)

WL 13-18: Memory non-intensive (100% low intensity)

1-1 application-to-core mapping

Metric:

Weighted Speedup = WS = sum over i of IPC_i(together) / IPC_i(alone)

Normalized WS = WS(optimized) / WS(baseline)
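For concreteness, the metric as a small Python sketch (the IPC values in the usage example are made up):

```python
def weighted_speedup(ipc_together, ipc_alone):
    """Both arguments: lists of per-application IPC, indexed identically."""
    return sum(t / a for t, a in zip(ipc_together, ipc_alone))

def normalized_ws(ws_optimized, ws_baseline):
    return ws_optimized / ws_baseline

# Example with three co-running applications:
ws = weighted_speedup([0.8, 1.1, 0.5], [1.0, 1.2, 0.9])
```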

Page 17: Experimental Results

[Figure: normalized WS of Scheme-1 and Scheme-1 + Scheme-2 for mixed (w-1 to w-6), high-intensity (w-7 to w-12), and low-intensity (w-13 to w-18) workloads, with per-category average improvements annotated (between 6% and 15%; see the Summary for exact figures).]

Higher-intensity benchmarks benefit more from Scheme 1: more traffic means more "late" messages

w-2 and w-9 degrade: prioritizing some messages hurts some other messages

Page 18: Experimental Results

Cumulative distribution of latencies: for 8 threads of WL-1, the 90% point of the delay distribution is reduced from ~700 cycles to ~600 cycles

Probability density: accesses move from region 1 (high delay) to region 2 (lower delay); not all can be moved

[Figure: CDFs of total delay before and after Scheme-1, and the delay histogram showing fewer accesses with high delays under the new distribution.]

Page 19: Experimental Results

Reduction in the idleness of banks

Dynamic behavior: Scheme-2 reduces idleness consistently over time

[Figure: per-bank idleness (default vs. Scheme-2) and average idleness over 100k-cycle intervals.]

Page 20: Sensitivity Analysis

System parameters: lateness threshold, Bank History Table history length, number of memory controllers, number of cores, number of router stages

Analyze the sensitivity of results to these parameters, experimenting with different values

Page 21: Sensitivity Analysis - "Late" Threshold

Scheme 1: the threshold determines whether a message is late (a multiple of Delay_avg)

Reduced threshold: more messages considered late; prioritizing too many messages can hurt other messages

Increased threshold: fewer messages considered late; can miss opportunities

[Figure: normalized WS for mixed workloads with thresholds of 1.1x, 1.2x, and 1.4x Delay_avg.]

Page 22: Sensitivity Analysis - History Length

Scheme 2: history kept at the routers for the past T cycles; default value T=200 cycles

Shorter history (T=100 cycles): cannot identify idle banks precisely

Longer history (T=400 cycles): fewer requests prioritized

[Figure: normalized WS for mixed workloads with T=100, T=200, and T=400.]

Page 23: Sensitivity Analysis - Number of MCs

Fewer MCs means more pressure on each MC and higher queuing latency

More late requests: more room for Scheme 1

Less idleness at banks: less room for Scheme 2

[Figure: normalized WS for mixed workloads with 4 MCs vs. 2 MCs; slightly higher improvements with 2 MCs.]

Page 24: Sensitivity Analysis - 16 Cores

Scheme-1 + Scheme-2 achieves 8%, 10%, and 5% speedup, about 5% less than with 32 cores

Improvements are proportional to the number of cores: more cores bring higher network latency, and more room for our optimizations

[Figure: normalized WS of Scheme-1 and Scheme-1 + Scheme-2 for mixed, high-intensity, and low-intensity workloads on 16 cores.]

Page 25: Sensitivity Analysis - Router Stages

NoC latency depends on the number of stages in the routers

5-stage vs. 2-stage routers: Scheme 1+2 speedup is ~7% on average (for mixed workloads)

[Figure: normalized weighted speedup for mixed workloads with 5-stage and 2-stage pipelines.]

Page 26: Summary

Identified: some memory accesses suffer long network delays and block the cores; bank utilization is low and uneven

Proposed two schemes: (1) network prioritization and pipeline bypassing of "late" memory response messages to expedite them; (2) network prioritization of memory request messages to improve bank utilization

Demonstrated: Scheme 1 achieves 6%, 10%, and 11% speedup; Scheme 1+2 achieves 10%, 13%, and 15% speedup

Page 27: Questions?

Department of Computer Science and Engineering, The Pennsylvania State University

Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das

Addressing End-to-End Memory Access Latency in NoC Based Multicores

Thank you for attending this presentation.

Page 28: References

[Kumar07] A. Kumar, L.-S. Peh, P. Kundu, and N. K. Jha, "Express Virtual Channels: Towards the Ideal Interconnection Fabric", in ISCA, 2007.

[Henning06] J. L. Henning, "SPEC CPU2006 Benchmark Descriptions", SIGARCH Comput. Archit. News, 2006.