ayŞe tosun 2008800072 cmpe511 computer architecture 25.12.2008 power efficiency and voltage scaling...

AYŞE TOSUN 2008800072

CMPE511 COMPUTER ARCHITECTURE

25.12.2008

Power Efficiency and Voltage Scaling on CMP

OUTLINE

“Compiler Directed Proactive Power Management for Networks”

“Reducing Energy Consumption of On-Chip Network Through a Hybrid Compiler-Runtime Approach”

• Proposed Approach

• Results: Energy Consumption, Performance Penalty

F. LI, G. CHEN, M. KANDEMIR, M.J. IRWIN

Compiler Directed Proactive Power Management for Networks

Introduction

Parallel computation platforms (on-chip and off-chip) make power/energy optimization an important issue.

The most common prior approaches are hardware-based and reactive:

• They manage the power consumption of the network as a dynamic response to message traffic.

• They control network power by observing past network traffic activity and estimating future activity.

Disadvantages:

• Missing important opportunities for saving power.

• Incurring performance penalties due to inaccuracies in predicting idle and active times.

Motivation

Potential power penalty:

• Unnecessarily waiting to ensure the link is idle.

• Necessary transition time of the link from the power-down to the power-on state when the next request comes.

The goal is to eliminate these problems of reactive power management schemes.

A PROACTIVE and COMPILER-BASED approach for LOOP INTENSIVE applications, whose data access and communication patterns can be captured at COMPILE TIME.

Proposed Approach

The compiler algorithm decides:

1. Which links of the network can be turned off

2. When they can be turned off

3. When to reactivate the turned-off links

Experiments on embedded on-chip networks:

• Effective in reducing network energy

• More power savings

• No observable performance penalty

Network Architecture and Hardware Support

M x N mesh architecture

pi = ith node

If pi and pj are adjacent to each other, they are connected by the links (i,j) and (j,i).

Assumption: The system runs a single embedded application, a set of parallel processes, at a given time.

• Each node executes at most one process.

• The set of links involved in a connection between two non-adjacent nodes is determined by a specific routing algorithm (X-Y routing algorithm).

• A message is first passed in the x-dimension and then in the y-dimension.
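The X-Y routing rule above can be sketched in a few lines. This is only an illustrative sketch: the encoding of a node as an (x, y) coordinate pair and the function name `xy_route` are assumptions, not taken from the paper.

```python
def xy_route(src, dst):
    """Return the directed links on the X-Y route from src to dst.

    Nodes are (x, y) mesh coordinates (an assumed encoding); each link is
    a (node, next_node) pair. The message moves along x first, then y.
    """
    (x, y), (dx, dy) = src, dst
    links = []
    while x != dx:                      # travel in the x-dimension first
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dy:                      # then travel in the y-dimension
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


route = xy_route((0, 0), (2, 1))
```

Because the route depends only on source and destination, the set of links of a connection is known statically, which is what the compiler analysis relies on.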

Network Architecture

Hardware support for link shutdown: link utilization is monitored in the sender node.

• At each clock tick, the time-out counter is decremented.

• When it reaches zero, Tx and Rx are turned off to conserve energy.

Figure annotations:

• When 0, link is turned off

• When 1, link is shared

• When 1: link won’t be used for a long period of time

• When 1: sender will re-use link

Network Architecture

Using control flags:

1. Controls the states of the links.

2. Keeps track of routes from source to destination when they are not adjacent to each other.

LAST: the program can turn off idle links earlier than the pure time-out hw mechanism.

HOLD and SHARED: the program can prevent nodes from turning off links that are still needed, which reduces potential power/energy penalties.

High-Level Power Parameters

To manage link power at compiler level:

Compiler turns off a link if it predicts that link will be idle for a time period that is longer than threshold, T.
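A minimal sketch of this threshold test, assuming the compiler already knows the cycles at which a link is used. The function name `turnoff_intervals` and the cycle-list representation are hypothetical and only illustrate the rule "turn off if the predicted idle period is longer than T".

```python
def turnoff_intervals(use_cycles, threshold):
    """Return the idle gaps (last_use, next_use) that are longer than T.

    use_cycles: sorted cycle numbers at which the link is used (assumed to
    be known at compile time). threshold: the threshold T in cycles.
    """
    gaps = []
    for prev, nxt in zip(use_cycles, use_cycles[1:]):
        if nxt - prev > threshold:      # idle period longer than T
            gaps.append((prev, nxt))
    return gaps


gaps = turnoff_intervals([0, 5, 100, 110, 400], threshold=50)
```

Gaps shorter than T are left alone, since turning the link off and on again would cost more than it saves.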

Compiler Support: Setting Link Control Flags for Link Turnoffs

To compute iteration sets H and G: Presburger formulas.

Finding optimal H and G is very hard, if not impossible.

Connection(i,j): set of links used.

Targets(i,I): set of nodes to which node i sends messages at iteration I.

Links(i,I): links used by i at iteration I. Computed using Connection and Targets.

Use(i,I,q): set of links used by i between iterations I and I+q.
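The relation between these sets can be sketched as follows. This is an illustrative sketch under assumptions: `targets` and `connection` are passed in as plain functions, the range I..I+q is taken as inclusive, and the toy data at the bottom is made up for the example.

```python
def links(i, I, targets, connection):
    """Links(i, I): union of Connection(i, j) over every j in Targets(i, I)."""
    used = set()
    for j in targets(i, I):
        used |= connection(i, j)
    return used


def use(i, I, q, targets, connection):
    """Use(i, I, q): links used by node i in iterations I..I+q (inclusive, assumed)."""
    used = set()
    for it in range(I, I + q + 1):
        used |= links(i, it, targets, connection)
    return used


# Hypothetical toy data: node 0 sends to node 1 on even iterations and to
# node 2 on odd iterations; no other node sends anything.
def toy_targets(i, I):
    if i != 0:
        return set()
    return {1} if I % 2 == 0 else {2}


TOY_CONNECTION = {(0, 1): {"0->1"}, (0, 2): {"0->1", "1->2"}}


def toy_connection(i, j):
    return set(TOY_CONNECTION[(i, j)])


used_soon = use(0, 0, 1, toy_targets, toy_connection)
```

A link that does not appear in Use(i,I,q) for a sufficiently large q is a candidate for being turned off after iteration I.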

Compiler Support: Inserting Pre-activation Instructions

Pre-activation: we can turn on a link that is currently turned off a certain number of cycles before the link is actually used.

In pre-activation, a communication link is activated before it is needed, to avoid incurring the associated reactivation latency.
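The timing of a pre-activation can be sketched as simple arithmetic. The function name, its parameters, and the cycle numbers below are hypothetical; the sketch only shows that the activation is issued one reactivation latency ahead of the first use.

```python
def preactivation_cycle(first_use_cycle, reactivation_latency, turn_off_cycle=0):
    """Cycle at which a (hypothetical) link-on instruction should be issued.

    The link is switched on reactivation_latency cycles before its first
    use, but never earlier than the cycle at which it was turned off.
    """
    return max(turn_off_cycle, first_use_cycle - reactivation_latency)


issue_at = preactivation_cycle(first_use_cycle=1000, reactivation_latency=30)
```

With these assumed numbers the link is back in the power-on state exactly when the message arrives, so the sender does not wait for reactivation.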

Example

Discussion

Their approach requires the following two conditions to be satisfied:

1. Message routing in the network must be static. The set of links used to transfer a message from one node to another must be determined at compiler level.

2. The message passing behavior of the embedded application must be predictable at compilation time. Many parallel applications satisfy this requirement, as they are array/loop intensive codes with rare conditional flow of execution.

Although they apply the system to a single embedded application, this is not strictly necessary for their approach to be applicable.

Experimental Results

They introduced a link power model and power simulator to compare their approach with hw-based approach.


Power-on: link energy consumed in link power-on states

Reactivation: energy penalty for reactivation

Results of SW are normalized w.r.t. HW results.

Compiler-based saves more energy, since it shuts down a link proactively.

When it decides to turn off a link, link does not need to wait for some time to turn off.

Since it is proactive instead of reactive, it assesses the benefits of link turnoff more accurately.

However, it incurs extra energy overheads in processors and switches due to extra instructions and logic circuits.

But they are negligible when compared to others.


Avg. 6.6% latency with HW

Since no adaptive routing algorithm

Avg. network latency penalty of 3.5% due to reactivation delay.

Sensitivity analysis:

• Change mesh size

• Change data size

G. CHEN, F. LI, M. KANDEMIR

Reducing Energy Consumption of On-Chip Network Through A Hybrid Compiler-Runtime Approach

Introduction

Compiler-runtime approach for reducing power consumption in the context of the NoC based chip multiprocessor (CMP) architectures.

Observation: Same communication patterns across the nodes of a mesh based CMP repeat themselves in successive iterations of a loop nest.

NoC:

• Communication between different blocks

• Expandable, reconfigurable to handle different communication patterns

• Easily responds to fault conditions where one or more connections are disabled.

Critical issue: power consumption of the NoC

• The NoC alone is responsible for up to 36% of the overall power consumption of a SoC.

Introduction

This paper investigates automated compiler support in reducing power consumption of an NoC based two-dimensional mesh architecture that uses a static routing algorithm.

The NoC is exposed to the compiler through an interface.

Goal: Let the compiler modify the application source code.

Goal: Manage power consumption of communication links through voltage scaling.

Two Stages

StartUp Phase: Gather link usage statistics during execution of the first few iterations of a given loop nest using hw support.

Stable Phase: Use the collected statistics with the compiler to reduce link voltage levels (admissible delays) without affecting communication latency.

Proposed Approach

                        HW based    Hybrid

Energy savings          24.9%       38.1%

Performance overhead    8.3%        2.1%

Link Voltage Scaling Options: On-Chip Mesh Network

Each node consists of a processor, a switch and a small memory module.

Switch: 5 incoming and 5 outgoing ports

• Each incoming port contains a queue to buffer messages.

• When the queue is full, the outgoing port is blocked.

• The switch is used to read/set the state of the in/out ports.

Link Voltage Scaling Options: Scaling Link Voltage

Link(i,j) refers to the link between sender Ni and receiver Nj.

A parallel program may use multiple connections (sequences of links between non-adjacent nodes) during its execution.

These connections may share some communication links.

• Packets transferred have to contend for the shared links.

• Such contention may increase the transmission time of packets.

To take advantage of this observation: calculate the voltage of communication links using link throughput-based and link slack-based scaling methods to select a lower voltage level.

Link Throughput Based Voltage Scaling

Data rate of a link (λij): maximum number of data packets that can be transferred from the outgoing port to the incoming port of this link during a unit of time.

• Depends on the voltage of the outgoing and incoming ports.

Throughput of a link (μij): number of packets forwarded from the incoming port of this link to other links during a unit period of time.

• The throughput of a link is limited by the throughput of the links to which it is connected.

• Under heavy traffic, contention at bottleneck links can be severe.

• A link forwarding packets to a bottleneck link can become congested, because the bottleneck link cannot accept packets fast enough.


If a link is congested:

• Its input queue fills up.

• No more packets can flow into this link until at least one packet in the queue is forwarded to another link.

• Its throughput can be much lower than its data rate.

Therefore, when congestion happens:

• The bandwidth of bottleneck links is fully utilized.

• The bandwidth of congested links is underutilized.

Reduce the voltage of congested links to conserve energy without significantly degrading overall performance.

A communication link is considered congested during a period if the queue associated with this link never becomes empty during this period.


The strategy is as follows: if, during a given period, the queue associated with link(i,j) never becomes empty and μij < λij, then reduce the voltage of the link to the lowest level v such that

f(v) > μij, where f(v) is the maximum data rate that a communication link can provide at voltage v.

Reducing the data rate of a link that is not congested may hurt performance, so they apply this strategy only to congested links with the specified properties.
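The selection rule can be sketched as below. This is a sketch under assumptions: the available voltage levels are given as a list of (voltage, max data rate) pairs standing in for f(v), the numbers are invented, and returning None to mean "leave the voltage unchanged" is a choice made for this example.

```python
def throughput_based_level(levels, queue_never_empty, mu, lam):
    """Pick the lowest voltage v with f(v) > mu for a congested link.

    levels: list of (voltage, max_data_rate) pairs, an assumed tabulation
    of f(v). Returns None when the link is not congested (queue became
    empty, or mu is not below lam) or when no level satisfies f(v) > mu.
    """
    if not (queue_never_empty and mu < lam):
        return None                     # only congested links are scaled
    candidates = [v for v, rate in levels if rate > mu]
    return min(candidates) if candidates else None


# Hypothetical levels: (volts, packets per unit time).
LEVELS = [(0.8, 100), (1.0, 200), (1.2, 300)]
chosen = throughput_based_level(LEVELS, queue_never_empty=True, mu=150, lam=300)
```

Here the link only forwards 150 packets per unit time although it could accept 300, so the middle level is enough to sustain the observed throughput.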

Link Slack Based Voltage Scaling

For links whose queues may become empty during a given period of time, there are still opportunities to scale down the voltage without significantly degrading system performance.

Hybrid Approach: Hardware Support

Structure of a link from switch i to j

Control voltage of the circuit

Control voltage of outgoing port

Registers to count slack, cycles, emptiness of incoming port queues, etc.

Hybrid Approach: Compiler Support

Take a message passing based parallel code as input

Partition each loop nest in each parallel process code into set of voltage scaling regions

Set link voltages upon entering a voltage region

Start up: link usage info is collected individually.

Stable: set suitable link voltage levels for different voltage regions.

Determining Voltage Scaling Regions

Voltage scaling region

• Loops or loop nests such that a communication pattern repeats itself at every iteration

Partitioning algorithm

• Put as many loops or loop nests as possible in the same voltage scaling region

• Minimize the number of link voltage changes (overheads)

Communication pattern

• If two loop nests have different communication patterns, they are not likely to exercise communication links in the same way


First compute communication patterns for loop nest L

if this pattern is not ε, treat entire loop as single region.

Otherwise, partition L into its inner loops, which may have different comm. patterns.

When partitioning, compute comm. pattern for each inner loop

Put adjacent inner loops with same pattern into same region.
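The partitioning steps above can be sketched as follows. Several details are assumptions: ε is modeled as `None`, a loop is a small dict with a name, a pattern and a list of inner loops, and inner loops whose own pattern is ε are partitioned recursively, which the slides do not state explicitly.

```python
from itertools import groupby

EPSILON = None  # assumed marker for "no single repeating comm. pattern"


def partition(loop):
    """Split a loop nest into voltage scaling regions (lists of loop names).

    loop: dict with 'name', 'pattern' and 'inner' (list of inner loops),
    a hypothetical representation used only for this sketch.
    """
    if loop["pattern"] is not EPSILON:
        return [[loop["name"]]]         # whole loop nest is one region
    regions = []
    # groupby merges only adjacent inner loops with an equal pattern.
    for pattern, group in groupby(loop["inner"], key=lambda l: l["pattern"]):
        group = list(group)
        if pattern is EPSILON:
            for inner in group:         # assumed: recurse into such loops
                regions.extend(partition(inner))
        else:
            regions.append([l["name"] for l in group])
    return regions


# Hypothetical nest: patterns are (number of sends, number of receives).
nest = {"name": "L1", "pattern": EPSILON, "inner": [
    {"name": "L2", "pattern": (1, 1), "inner": []},
    {"name": "L3", "pattern": (1, 1), "inner": []},
    {"name": "L4", "pattern": (2, 0), "inner": []},
]}
regions = partition(nest)
```

Merging adjacent loops with the same pattern keeps the number of regions, and therefore the number of link voltage changes, small.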


3, 5, 6, 7, 8 are inner loops.

Comm. patterns are calculated by counting the number of send and receive statements in their bodies.

Assume 3, 5, 6, 7 have the same pattern.

2 encloses the same

3 encloses the same (adjacent to 2)

1 and 4 have different

Code Transformation For Voltage Scaling

Start up

• A data structure called sampling context is created.

• Before the backward jump of each loop: collect link usage info and store it in the sampling context.

• Calculate suitable voltage levels.

Stable

• For each region: set voltage levels.
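A rough sketch of how a transformed loop could behave across the two phases. Everything named here is hypothetical: the `SamplingContext` class, the callbacks `read_link_usage`, `choose_level` and `set_link_voltage`, and the use of the peak observed usage per link as the basis for the level. In the paper these steps are carried out by compiler-inserted code together with hardware registers.

```python
class SamplingContext:
    """Holds link usage samples gathered during the start-up phase."""

    def __init__(self, startup_iterations):
        self.startup_iterations = startup_iterations
        self.samples = []
        self.levels = None              # filled in when start-up ends

    def in_startup(self):
        return len(self.samples) < self.startup_iterations


def run_region(iterations, read_link_usage, choose_level, set_link_voltage,
               startup_iterations=2):
    ctx = SamplingContext(startup_iterations)
    for it in range(iterations):
        # ... loop body (computation and communication) would run here ...
        if ctx.in_startup():
            # Before the backward jump: collect and store link usage info.
            ctx.samples.append(dict(read_link_usage(it)))
            if not ctx.in_startup():
                # Start-up just ended: compute a level per link (assumed
                # here to depend on the peak usage seen) and apply it.
                peak = {}
                for sample in ctx.samples:
                    for link, usage in sample.items():
                        peak[link] = max(peak.get(link, 0), usage)
                ctx.levels = {l: choose_level(u) for l, u in peak.items()}
                for link, level in ctx.levels.items():
                    set_link_voltage(link, level)
    return ctx


applied = []
ctx = run_region(
    5,
    read_link_usage=lambda it: {"a": it + 1, "b": 1},
    choose_level=lambda usage: "low" if usage < 2 else "high",
    set_link_voltage=lambda link, level: applied.append((link, level)),
)
```

After the first two iterations no further sampling takes place, so the remaining iterations run with the chosen levels and without the sampling overhead.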

Experiments

Code level msg optimizations

Additional compilation time overhead was below 30% for all benchmark codes

# voltage levels detected varied btw 3 and 28.

Static code size increase was less than 10%.

The increase in dynamic instruction count is negligible.

Custom network simulator

Experiments

Network energy savings: 38.1% (hybrid) vs 24.9% (hw).

Hybrid approach:

• Sets voltage levels proactively.

• Changes the voltage to a suitable level directly.

Voltage scaling has costs.

Avg. performance degradation: 2.1% vs 8.3%, respectively.

Hybrid scales voltage based on:

• Max. throughput

• Number of slack cycles for each link

Experiments

Due to modifications in application codes: performance and energy penalties in mesh nodes.

First table:

• Energy overheads: all dynamic and leakage overheads that occur in CPUs and memory components.

• Avg. overhead 1.13%

Second table:

• Normalized total energy consumption

• Reduced by 10.7% vs 4.29%.

• The hybrid scheme performs better when all overheads are accounted for.

Experiments

Comparison with compiler-based approach and optimal scheme

Hybrid takes runtime communication behavior into account and catches the opportunities for link reuse.

• Achieves higher energy savings

Hybrid combined with compiler-based link shutdown scheme

• Close to optimal

• Extra network energy in startup

• Selection of sub-optimal voltage levels

QUESTIONS

Thank you.
