TRANSCRIPT
HAT: Heterogeneous Adaptive Throttling for On-Chip Networks
Kevin Kai-Wei Chang Rachata Ausavarungnirun
Chris Fallin Onur Mutlu
Executive Summary
Problem: Packets contend in on-chip networks (NoCs), causing congestion, thus reducing system performance
Approach: Source throttling (temporarily delaying packet injections) to reduce congestion
1) Which applications to throttle?
Observation: Throttling network-intensive applications leads to higher system performance
⇒ Key idea 1: Application-aware source throttling
2) How much to throttle?
Observation: There is no single throttling rate that works well for every application workload
⇒ Key idea 2: Dynamic throttling rate adjustment
Result: Improves both system performance and energy efficiency over state-of-the-art source throttling policies
2
Outline
• Background and Motivation
• Mechanism
• Comparison Points
• Results
• Conclusions
3
On-Chip Networks
4
• Connect cores, caches, memory controllers, etc.
– Buses and crossbars are not scalable
[Figure: processing elements (PEs: cores, L2 banks, memory controllers, etc.) connected by an interconnect]
On-Chip Networks
5
• Connect cores, caches, memory controllers, etc.
– Buses and crossbars are not scalable
• Packet switched
• 2D mesh: most commonly used topology
• Primarily serve cache misses and memory requests
[Figure: 3x3 mesh of routers (R), each attached to a processing element (PE: cores, L2 banks, memory controllers, etc.)]
Network Congestion Reduces Performance
6
Network congestion: ↓ system performance
[Figure: mesh of routers (R) and processing elements (PE) with packets (P) piling up at a congested router]
Limited shared resources (buffers and links)
• Due to design constraints: power, chip area, and timing
Goal
• Improve system performance in a highly congested network
• Observation: Reducing network load (the number of packets in the network) decreases network congestion and hence improves system performance
• Approach: Source throttling (temporarily delaying new traffic injection) to reduce network load
7
Source Throttling
8
[Figure: processing elements (PE) injecting packets (P) into the network through injection ports (I); the network fills with packets]
Long network latency when the network is congested
Source Throttling
9
[Figure: the same system with source throttling applied at the injection ports, delaying some packets before they enter the network]
• Throttling makes some packets wait longer to inject
• Average network throughput increases, hence higher system performance
Key Questions of Source Throttling
• Every cycle when a node has a packet to inject, source throttling blocks the packet with probability P
– We call P the "throttling rate" (it ranges from 0 to 1; see the sketch below)
• The throttling rate can be set independently on every node
Two key questions:
1. Which applications to throttle?
2. How much to throttle?
Naïve mechanism: Throttle every single node with a constant throttling rate
10
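As a rough illustration (not part of the talk), per-node probabilistic injection gating can be sketched in Python; throttling_rate stands in for the node's rate register and random.random for a hardware pseudo-random source:

    import random

    def can_inject(throttling_rate):
        # Block injection with probability P = throttling_rate (0.0 to 1.0);
        # a blocked packet stays queued and retries the next cycle.
        return random.random() >= throttling_rate

    # Each cycle, a node with a pending packet injects only if allowed, e.g.:
    # if can_inject(node.throttling_rate): network.inject(node.next_packet)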
Key Observation #1
11
Throttling network-intensive applications leads to higher system performance
Configuration: 16-node system, 4x4 mesh network, 8 gromacs (network-non-intensive) and 8 mcf (network-intensive)
[Figure: gromacs and mcf nodes injecting packets; throttling applied to the gromacs nodes]
Throttling gromacs decreases system performance by 2% due to minimal network load reduction
Throttling mcf increases system performance by 9% (gromacs: +14%, mcf: +5%) due to reduced congestion
Key Observation #1
12
Throttling network-intensive applications leads to higher system performance
Configuration: 16-node system, 4x4 mesh network, 8 gromacs (network-non-intensive) and 8 mcf (network-intensive)
[Figure: the same system; throttling now applied to the mcf nodes]
Key Observation #1
13
Throttling network-intensive applications leads to higher system performance
Configuration: 16-node system, 4x4 mesh network, 8 gromacs (network-non-intensive) and 8 mcf (network-intensive)
[Figure: throttling the mcf nodes leaves fewer packets in the network]
• Throttling mcf reduces congestion
• gromacs benefits more from the reduced network latency
• Throttling network-intensive applications reduces congestion
• Benefits both network-non-intensive and network-intensive applications
Key Observation #2
There is no single throttling rate that works well for every application workload
The network runs best at or below a certain network load
Dynamically adjust the throttling rate to avoid overload and under-utilization
14
[Figure: nodes running gromacs and mcf all throttled at the same fixed rate of 0.8]
Outline
• Background and Motivation
• Mechanism
• Comparison Points
• Results
• Conclusions
15
Heterogeneous Adaptive Throttling (HAT)
1. Application-aware throttling:
Throttle network-intensive applications that interfere with network-non-intensive applications
2. Network-load-aware throttling rate adjustment:
Dynamically adjust the throttling rate to adapt to different workloads and program phases
16
Heterogeneous Adaptive Throttling (HAT)
1. Application-aware throttling:
Throttle network-intensive applications that interfere with network-non-intensive applications
2. Network-load-aware throttling rate adjustment:
Dynamically adjust the throttling rate to adapt to different workloads and program phases
17
Application-Aware Throttling
1. Measure applications' network intensity
2. Throttle network-intensive applications
18
Use L1 MPKI (misses per thousand instructions) to estimate network intensity
How to select unthrottled applications?
• Leaving too many applications unthrottled overloads the network
⇒ Select unthrottled applications so that their total network intensity is less than the total network capacity (Σ MPKI < Threshold); a selection sketch follows the figure
[Figure: applications sorted by L1 MPKI; the lowest-intensity applications stay unthrottled (network-non-intensive) while the rest are throttled (network-intensive)]
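The selection can be sketched as follows; this is a minimal reading of the slide, assuming each app exposes an mpki field and that mpki_threshold stands in for the total-network-capacity threshold:

    def classify(apps, mpki_threshold):
        # Admit applications into the unthrottled set in order of
        # increasing L1 MPKI until their cumulative MPKI would reach
        # the threshold; everything else is throttled.
        unthrottled, throttled = [], []
        total_mpki = 0.0
        for app in sorted(apps, key=lambda a: a.mpki):
            if total_mpki + app.mpki < mpki_threshold:
                unthrottled.append(app)
                total_mpki += app.mpki
            else:
                throttled.append(app)
        return unthrottled, throttled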
Heterogeneous Adaptive Throttling (HAT)
1. Application-aware throttling:
Throttle network-intensive applications that interfere with network-non-intensive applications
2. Network-load-aware throttling rate adjustment:
Dynamically adjust the throttling rate to adapt to different workloads and program phases
19
Dynamic Throttling Rate Adjustment
20
• Different workloads require different throttling rates to avoid overloading the network
• But network load (the fraction of occupied buffers/links) is an accurate indicator of congestion
• Key idea: Measure the current network load and dynamically adjust the throttling rate based on it (a fuller sketch follows):

if network load > target:
    increase throttling rate    (if the network is congested, throttle more)
else:
    decrease throttling rate    (if the network is not congested, avoid unnecessary throttling)
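One way to flesh this out, as a sketch rather than the paper's exact mechanism (the target load and step size here are illustrative values; the backup slides discuss throttling rate steps and a target between 50% and 65% utilization):

    def adjust_throttling_rate(rate, network_load, target=0.6, step=0.1):
        # One epoch's update: raise the rate when the measured network
        # load exceeds the target, lower it otherwise, clamped to [0, 1].
        if network_load > target:
            return min(1.0, rate + step)
        return max(0.0, rate - step)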
Heterogeneous Adaptive Throttling (HAT)
1. Application-aware throttling:
Throttle network-intensive applications that interfere with network-non-intensive applications
2. Network-load-aware throttling rate adjustment:
Dynamically adjust the throttling rate to adapt to different workloads and program phases
21
Epoch-Based Operation
• Application classification and throttling rate adjustment are expensive if done every cycle
• Solution: recompute at epoch granularity
22
[Timeline: current epoch (100K cycles), then next epoch (100K cycles)]
During an epoch, every node:
1) Measures L1 MPKI
2) Measures network load
At the beginning of an epoch, all nodes send the measured info to a central controller, which:
1) Classifies applications
2) Adjusts the throttling rate
3) Sends the new classification and throttling rate to each node
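Putting the pieces together, the controller's per-epoch step might look like the following sketch, reusing the hypothetical classify and adjust_throttling_rate helpers above (the node and throttling_rate fields are likewise illustrative):

    def on_epoch_boundary(apps, rate, network_load, mpki_threshold):
        # Runs once per 100K-cycle epoch at the central controller.
        unthrottled, throttled = classify(apps, mpki_threshold)
        rate = adjust_throttling_rate(rate, network_load)
        # Broadcast the decisions: unthrottled nodes inject freely,
        # throttled nodes use the new rate.
        for app in unthrottled:
            app.node.throttling_rate = 0.0
        for app in throttled:
            app.node.throttling_rate = rate
        return rate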
Putting It Together: Key Contributions
1. Application-aware throttling
– Throttle network-intensive applications based on the applications' network intensities
2. Network-load-aware throttling rate adjustment
– Dynamically adjust the throttling rate based on network load to avoid overloading the network
23
HAT is the first work to combine application-aware throttling and network-load-aware rate adjustment
Outline
• Background and Motivation
• Mechanism
• Comparison Points
• Results
• Conclusions
24
Comparison Points
• Source throttling for bufferless NoCs [Nychis+ HotNets'10, SIGCOMM'12]
– Throttles network-intensive applications when other applications cannot inject
– Does not take network load into account
– We call this "Heterogeneous Throttling"
• Source throttling for buffered networks [Thottethodi+ HPCA'01]
– Throttles every application when the network load exceeds a dynamically tuned threshold
– Not application-aware
– Fully blocks packet injections while throttling
– We call this "Self-Tuned Throttling"
25
Outline
• Background and Motivation
• Mechanism
• Comparison Points
• Results
• Conclusions
26
Methodology
• Chip Multiprocessor Simulator
– 64-node multi-core systems with a 2D-mesh topology
– Closed-loop core/cache/NoC cycle-level model
– 64KB L1, perfect L2 (always hits, to stress the NoC)
• Router Designs
– Virtual-channel buffered router: 4 VCs, 4 flits/VC [Dally+ IEEE TPDS'92]
• Input buffers hold contending packets
– Bufferless deflection router: BLESS [Moscibroda+ ISCA'09]
• Misroutes (deflects) contending packets
27
Methodology
• Workloads
– 60 multi-core workloads of SPEC CPU2006 benchmarks
– 4 workload categories based on the network intensity of the applications (Low/Medium/High)
• Metrics
– System performance: Weighted Speedup = Σ_i (IPC_i^shared / IPC_i^alone)
– Fairness: Maximum Slowdown = max_i (IPC_i^alone / IPC_i^shared)
– Energy efficiency: Perf per Watt = Weighted Speedup / Power
28
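For concreteness, these three metrics reduce to simple arithmetic over per-application IPC measurements, as in this sketch (argument names are illustrative):

    def metrics(ipc_alone, ipc_shared, power):
        # ipc_alone[i] / ipc_shared[i]: IPC of application i running
        # alone vs. sharing the system with the other applications.
        weighted_speedup = sum(s / a for s, a in zip(ipc_shared, ipc_alone))
        max_slowdown = max(a / s for a, s in zip(ipc_alone, ipc_shared))
        perf_per_watt = weighted_speedup / power
        return weighted_speedup, max_slowdown, perf_per_watt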
Performance: Bufferless NoC (BLESS)
29
[Bar chart: Weighted Speedup for workload categories HL, HML, HM, H, and AVG under No Throttling, Heterogeneous, and HAT; HAT's annotated gains are +7.4%, +9.4%, +11.5%, +11.3%, and +8.6%]
1. HAT provides better performance improvement than state-of-the-art throttling approaches
2. Highest improvement on heterogeneous workloads
– L and M applications are more sensitive to network latency
Performance: Buffered NoC
30
[Bar chart: Weighted Speedup for workload categories HL, HML, HM, H, and AVG under No Throttling, Self-Tuned, Heterogeneous, and HAT; HAT's annotated gain is +3.5%]
HAT provides better performance improvement than prior approaches
Application Fairness
31
[Bar charts: Normalized Maximum Slowdown (lower is better) on BLESS and Buffered networks under No Throttling, Self-Tuned, Heterogeneous, and HAT; HAT's annotated reductions are 15% and 5%]
HAT provides better fairness than prior works
Network Energy Efficiency
32
[Bar chart: Normalized Perf. per Watt on BLESS and Buffered networks under No Throttling and HAT; HAT's annotated gains are +8.5% (BLESS) and +5% (Buffered)]
HAT increases energy efficiency by reducing network load by 4.7% (BLESS) or 0.5% (buffered)
Other Results in Paper
• Performance on CHIPPER [Fallin+ HPCA'11]
– HAT improves system performance
• Performance on multithreaded workloads
– HAT is not designed for multithreaded workloads, but it slightly improves system performance
• Parameter sensitivity sweep of HAT
– HAT provides consistent system performance improvement across different network sizes
33
Conclusions
Problem: Packets contend in on-chip networks (NoCs), causing congestion, thus reducing system performance
Approach: Source throttling (temporarily delaying packet injections) to reduce congestion
1) Which applications to throttle?
Observation: Throttling network-intensive applications leads to higher system performance
⇒ Key idea 1: Application-aware source throttling
2) How much to throttle?
Observation: There is no single throttling rate that works well for every application workload
⇒ Key idea 2: Dynamic throttling rate adjustment
Result: Improves both system performance and energy efficiency over state-of-the-art source throttling policies
34
HAT: Heterogeneous Adaptive Throttling for On-Chip Networks
Kevin Kai-Wei Chang Rachata Ausavarungnirun
Chris Fallin Onur Mutlu
Throttling Rate Steps
36
Network Sizes
37
[Charts: Weighted Speedup (higher is better) and Unfairness (lower is better) vs. network size (16, 64, 144 nodes) for VC-Buffered, BLESS, and CHIPPER routers, each with and without HAT; annotated HAT improvements include 1.3%, 3.9%, 1.9%, 11.1%, 4.9%, 3.7% (VC-Buffered); 5.8%, 5.9%, 3.3%, 6.0%, 20.0%, 10.4%; and 7.2%, 10.4%, 6.2%, 10.8%, 17.2%, 9.7% (BLESS)]
Performance on CHIPPER/BLESS
38
[Charts: Weighted Speedup (higher is better) and Unfairness (lower is better) for workload categories HL, HML, HM, H, and AVG, comparing CHIPPER, CHIPPER-HETERO., and CHIPPER-HAT as well as BLESS, BLESS-HETERO., and BLESS-HAT; annotated improvements include 20.0%, 12.3%, 25.6%, 19.2%, 20.7%; 5.9%, 5.9%, 6.7%, 5.8%, 5.4%; 17.2%, 15.4%, 24.4%, 16.9%, 14.2%; and 10.4%, 8.6%, 11.3%, 11.5%, 9.4%]
Injection rate (number of injected packets / cycle): +8.5% (BLESS) or +4.8% (CHIPPER)
Multithreaded Workloads
39
Motivation
40
[Chart: Weighted Speedup vs. throttling rate (80% to 100%) for Workload 1 and Workload 2; the two workloads peak at different throttling rates, 90% and 94%]
Sensitivity to Other Parameters
• Network load target: WS peaks between 50% and 65% network utilization
– Drops by more than 10% beyond that
• Epoch length: WS varies by less than 1% between 20K and 1M cycles
• Low-intensity workloads: HAT does not impact system performance
• Unthrottled network intensity threshold:
41
Implementation
• L1 MPKI: One L1 miss hardware counter and one instruction hardware counter
• Network load: One hardware counter to monitor the number of flits routed to neighboring nodes
• Computation: Done at one central CPU node
– At most several thousand cycles (a few percent overhead for a 100K-cycle epoch)
42
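As a back-of-the-envelope illustration (counter and argument names are hypothetical, and the network-load formula below is an approximation, not the paper's exact definition), the per-epoch statistics reduce to two divisions per node:

    def epoch_stats(l1_misses, instructions, flits_routed,
                    epoch_cycles, links_per_node):
        # L1 MPKI: misses per thousand instructions over the epoch.
        mpki = 1000.0 * l1_misses / instructions
        # Rough network load estimate: flits routed per link per cycle,
        # approximating the fraction of occupied links.
        network_load = flits_routed / (epoch_cycles * links_per_node)
        return mpki, network_load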