
Thermal Modeling, Characterization and Management of On-chip Networks

Li Shang, Li-Shiuan Peh, Amit Kumar, Niraj K. Jha

Department of Electrical Engineering, Princeton University, Princeton, NJ 08544

    Abstract

Due to the wire delay constraints in deep submicron technology and increasing demand for on-chip bandwidth, networks are becoming the pervasive interconnect fabric to connect processing elements on chip. With ever-increasing power density and cooling costs, the thermal impact of on-chip networks needs to be urgently addressed.

In this work, we first characterize the thermal profile of the MIT Raw chip. Our study shows that the networks have a thermal impact comparable to that of the processing elements and contribute significantly to overall chip temperature, thus motivating the need for network thermal management. The characterization is based on an architectural thermal model we developed for on-chip networks that takes into account the thermal correlation between routers across the chip and factors in the thermal contribution of on-chip interconnects. Our thermal model is validated against finite-element based simulators, including for an actual chip with associated power measurements, with an average error of 5.3%.

We next propose ThermalHerd, a distributed, collaborative run-time thermal management scheme for on-chip networks that uses distributed throttling and thermal-correlation based routing to tackle thermal emergencies. Our simulations show that ThermalHerd effectively ensures thermal safety with little performance impact. With Raw as our platform, we further show how our work can be extended to the analysis and management of entire on-chip systems, jointly considering both processors and networks.

    1 Introduction

Chip reliability and performance are increasingly impacted by thermal issues. Circuit reliability depends exponentially upon operating temperature. Thus, temperature variations and hotspots account for over 50% of electronic failures [29]. Thermal variations can also lead to significant timing uncertainty, prompting wider timing margins and poorer performance. Traditionally, cooling system design is based on worst-case analysis. As heat density and system complexity scale further, worst-case design becomes increasingly difficult and infeasible. This has led to recent processor designs, such as the Intel Pentium4-M [1] and the IBM Power5 [6], moving to average-case thermal design and employing run-time thermal management schemes upon occurrence of thermal emergencies.

As we move towards application-specific multiprocessor systems-on-a-chip (SoCs) [2, 7, 18] and general-purpose chip-multiprocessors (CMPs) [16, 26], where a chip is composed of multiple processing elements interconnected with a network fabric, system chip temperature becomes an accumulated effect of both processing and communication components. With networks consuming a significant portion of the chip power budget [2, 10, 27], and hence having a substantial thermal impact, understanding the joint thermal behavior of all on-chip components, both processors and networks, is the key to achieving efficient thermal design.

Researchers [3, 22, 24] have started addressing thermal issues in microprocessors. Unlike centralized microprocessors, networks are distributed in nature, which imposes unique requirements on thermal modeling and management. Moreover, multiprocessor on-chip systems share the distributed nature of on-chip networks. Therefore, we see addressing on-chip network thermal issues as providing valuable insights for managing entire on-chip systems. For networks, recent studies mainly focus on power issues, including modeling and characterization [10, 15, 28], design [27], and management [11, 20]. Even though both power and temperature are a function of the communication traffic, the network thermal and power consumption profiles exhibit different run-time behavior. Power-oriented optimization techniques cannot fully address network thermal issues; we need to tackle the thermal impact of on-chip networks directly.

In this paper, we first construct an architecture-level thermal model for on-chip networks. We then build a network simulation platform that enables rapid evaluation of network performance, power consumption, and thermal profile within a single run-time environment. Using this model, we characterize the thermal behavior of the on-chip networks in the MIT Raw CMP. We then propose ThermalHerd, a distributed run-time mechanism that dynamically regulates network temperature. ThermalHerd consists of three on-line mechanisms: dynamic traffic prediction, distributed temperature-aware traffic throttling, and proactive/reactive thermal-correlation based routing. The performance of ThermalHerd is evaluated using on-chip network traffic traces from the UT-Austin TRIPS CMP [16]. Our results demonstrate that ThermalHerd can effectively regulate network temperature and eliminate thermal emergencies. Furthermore, ThermalHerd proactively adjusts and balances the network thermal profile to achieve a lower junction temperature, thus minimizing the need for throttling and its impact on network performance. Finally, using Raw as a case study, we extend our work to address the thermal issues of on-chip systems, incorporating on-chip processing elements. A brief summary of our contributions is as follows.

• First work targeting the run-time thermal impact of on-chip networks.

• Characterization of the relative thermal impact of computation and communication resources in Raw.

• An architectural thermal model and integrated simulation platform that facilitates trace-driven performance, power, and temperature analysis of networks.

• A run-time thermal management technique for on-chip networks that effectively regulates temperature with little performance degradation.

• Extension of network thermal characterization and management to entire on-chip systems, jointly considering both processors and networks.

The rest of this paper is organized as follows. In Section 2, we use Raw as a motivating case study to demonstrate the thermal impact of on-chip networks. In Section 3, we describe the thermal modeling methods for on-chip networks. In Section 4, we give details of ThermalHerd, and we evaluate its performance in Section 5. In Section 6, we discuss extending our work from on-chip networks to on-chip systems. In Section 7, we present prior related work and discuss related research issues, and Section 8 concludes the paper.

2 Motivating Case Study: On-chip Network Thermal Impact in Raw

We first motivate the need for run-time thermal management of on-chip networks by studying the thermal impact of the on-chip networks in the MIT Raw CMP.

The Raw chip accommodates 16 tiles connected in a 4×4 mesh topology, clocked at 425 MHz with a nominal voltage of 1.8 V. The die size is 18.2×18.2 mm². In each 4×4 mm² tile, the processor is a single-issue MIPS-style processing element consisting of an eight-stage pipeline equipped with 32KB data and 32KB instruction caches. Tiles are connected by four 32-bit on-chip networks, two statically scheduled by the compiler through a 64KB switch instruction memory and two dynamically routed. The chip is manufactured using the IBM CMOS7SF Cu6 0.18µm technology. As shown in Figure 1, it uses the IBM ceramic column grid array (CCGA) package with direct lid attach (DLA) thermal enhancement. The thermal resistance of the packaging ranges from 9°C/W to 5°C/W under different air-cooling conditions.

In Raw, the thermal impact of the on-chip network is governed by various design issues. Performance requirements govern the network complexity, and hence the peak power consumption. The chip layout affects the thermal interaction between the network and other computation/storage tiles. The chip temperature is also directly affected by cooling solutions, which vary under different application scenarios and cooling budgets. In this section, we focus on analyzing the thermal impact of just the on-chip network. Complete thermal characterization for the entire chip will be discussed in Section 6.

Figure 1. MIT Raw CMP with cooling package: silicon die with thermal interface material, DLA thermal enhancement, multi-layer ceramic carrier, and air flow over the package.

In Raw, the two static networks contain the following components: switch instruction memory, switch pipeline logic, crossbars, FIFOs, and wires. The two dynamic networks are composed of crossbars, FIFOs, wires, and routing control logic. We use Raw Beetle, a validated cycle-accurate simulator [26], to capture the run-time activities of the different network components. Combining the simulation results with the capacitances estimated using an IBM placement tool by Kim [9], we derive the network run-time power consumption. Based on the chip layout and packaging information, we construct an architectural thermal model for the on-chip network using the model described in Section 3. Thermal characteristics (thermal conductivities and capacitances), geometric information of the package materials (interface material, DLA, ceramic carrier, etc.), and air-cooling conditions are provided by the packaging manual [25]. The ambient temperature is set to 25°C.¹ Figure 2 shows the chip peak temperature contributed by typical-case network power dissipation under different air-cooling conditions (linear feet per minute, or lfpm), where, as explained in [9, 10], the typical-case power assumes an activity factor of 0.25 in each of the four on-chip networks. With poor air cooling, the network power alone can push the chip temperature up to about 90°C. Even at a typical air cooling of 300 lfpm, the network temperature alone can reach a significant 77°C.
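As a concrete illustration of this flow, the sketch below combines the standard dynamic-power expression P = a·C·V²·f with the lumped package resistance quoted above (9 °C/W to 5 °C/W) to estimate the chip temperature rise due to network power alone. The supply voltage, clock frequency, activity factor, and package resistances are the values stated in this section; the switched capacitance is a hypothetical placeholder and not a figure from the Raw design, so the printed temperatures are only indicative of how package resistance maps network power to the temperatures plotted in Figure 2.

# Hedged sketch: chip temperature rise from network power alone.
# C_NETWORK is an assumed total switched capacitance for the four networks.
V_DD = 1.8            # V, nominal supply (stated above)
F_CLK = 425e6         # Hz, clock frequency (stated above)
ACTIVITY = 0.25       # typical-case activity factor per network (stated above)
C_NETWORK = 20e-9     # F, assumed placeholder for total switched capacitance
T_AMBIENT = 25.0      # deg C

def network_dynamic_power(alpha=ACTIVITY, c=C_NETWORK, v=V_DD, f=F_CLK):
    """Classical alpha*C*V^2*f dynamic power estimate (W)."""
    return alpha * c * v * v * f

def chip_temperature(power_w, r_package_c_per_w):
    """Lumped steady-state estimate: ambient plus package resistance times power."""
    return T_AMBIENT + r_package_c_per_w * power_w

if __name__ == "__main__":
    p = network_dynamic_power()
    for r in (9.0, 7.0, 5.0):   # deg C/W across the quoted air-cooling range
        print(f"R = {r} C/W -> T = {chip_temperature(p, r):.1f} C")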

Next, we characterize the thermal impact of the on-chip network using three classes of benchmarks provided by the Raw binary distribution. Details of these benchmarks are available in [26]. fir and stream are stream benchmarks; 8b_10b enc. and 802.11a enc. are bit-level computation benchmarks; gzip and mpeg are ILP computation benchmarks. Each study is based on the typical (300 lfpm) and best (600 lfpm) air-cooling conditions. Detailed benchmark explanations and complete characterization results are given in Section 6. As shown in Figure 3, the two stream and two bit-level computation benchmarks lead to a high peak temperature. This is due to the heavy utilization of static network resources, which have a higher capacitance than the dynamic networks in Raw as they are used as the primary communication fabric in the chip. gzip and mpeg result in a lower peak temperature for two reasons. First, the limited parallelism in these benchmarks leads to low resource utilization. Second, limited by the available compiler, the traffic is mainly relayed by the dynamic networks.

The above studies demonstrate that on-chip networks can have a significant thermal impact on overall chip temperature. As will be discussed in Section 6, in Raw, the thermal contributions of both processors and networks are comparable, and can be substantial, depending on application characteristics. Based on the benchmarks we used with typical air-cooling conditions, networks or processors alone can reach peak temperatures of 68.6°C and 77.9°C, respectively. When networks and processors are jointly considered, certain stream and bit-level computation benchmarks can push the chip peak temperature up to 104.7°C under typical air-cooling conditions. Therefore, more sophisticated cooling solutions and thermal management techniques are required to guarantee safe on-line operation. On the other hand, the SPEC and Mediabench benchmarks result in low power and thermal impact on both networks and processors due to underutilized on-chip resources.

¹ Ambient temperature varies across different systems. Relative trends are preserved for other typical settings.

Figure 2. On-chip network thermal analysis I: maximum chip temperature (°C) vs. air flow velocity (100-600 lfpm).

Figure 3. On-chip network thermal analysis II: chip peak temperature (°C) for fir, stream, 8b_10b enc., 802.11a enc., gzip, and mpeg under typical-case (300 lfpm) and best-case (600 lfpm) air cooling.

    3 Thermal Modeling of On-chip Networks

The cooling package of today's high-performance on-chip systems is complex due to high power density and strict cooling requirements. To facilitate characterization of chip architectures, we need thermal models that can accurately capture the characteristics of these sophisticated thermal packages with only the limited architectural input parameters available early in the design process.

Skadron et al. first proposed an architectural thermal model for microprocessors, HotSpot [22], which constructs a multi-layer lumped thermal RC network to model the heat dissipation path from the silicon die through the cooling package to the ambient. In HotSpot, the silicon die is partitioned into functional blocks based on the floorplan of the microprocessor, with a thermal RC network connecting the various blocks. HotSpot can be readily used to model on-chip network routers – with each router within the chip floorplan modeled as a block, and a thermal RC network constructed in the same fashion as when the blocks were functional units within a microprocessor. However, certain characteristics of on-chip networks are not accurately captured by HotSpot. First, lateral thermal correlation is modeled using lateral thermal resistors connecting adjacent blocks, and solved using a closed-form thermal equation [23]. This equation was originally proposed to model the spreading thermal resistance for a large symmetric slab area. As the geometric and thermal boundary conditions of a silicon die do not match the above scenario, thermal correlation tends to be underestimated. The temperature of an on-chip router is affected by its own power consumption, and by that of its neighborhood and also remote routers. In on-chip networks, the power and thermal impact of each individual router is limited. Inter-router thermal correlation plays an important role in shaping the overall chip temperature profile. Hence, accurately characterizing the spatial thermal correlation is critical for understanding network thermal behavior.

Second, the current release of HotSpot does not model the thermal impact of metal interconnects. Due to the high thermal resistance of the silicon dioxide and low dielectric constant (low-k) materials, the contribution of the interconnects to overall chip temperature is of increasing concern [5] and needs to be considered.

The above issues prompted us to develop an architectural thermal model that handles the thermal characteristics of on-chip networks – one that aptly captures inter-router thermal correlation and models on-chip link circuitry.

Figure 4. Heat spreading angle and inter-router thermal correlation: heat sources Q1 and Q2 with individual path resistances R1 and R2 and a shared path resistance R3; the heat dissipation path passes through R-silicon, R-spreader, R-sink, and R-ambient.

    3.1 Inter-router Thermal Correlation Modeling

Our model is based on the notion of the heat spreading angle – the angle at which heat dissipates through the different layers of packaging. In microelectronic packages, the heat flow from the silicon surface to the ambient is three-dimensional – heat dissipates in the vertical direction and also spreads along the horizontal directions, which can be described by Fourier's law. The heat spreading angle forms the basis of the calculations of thermal resistances and thermal correlations of the thermal RC network (which is constructed in the same fashion as in HotSpot).

Heat dissipation path: First, let us analyze the thermal resistance of the heat dissipation path of each on-chip router. The heat spreading angle, θ (see Figure 4), can be estimated based upon the ratio of the thermal conductivities of the current packaging layer's material, k_1, and the underlying packaging layer's material, k_2 [13]:

\theta = \tan^{-1}(k_1 / k_2) \qquad (1)

Then, as shown in Figure 4, the thermal resistance, R, of a rectangular heat source on a carrier, including the thermal spreading effect, is [12]:

R = \frac{1}{2k\tan\theta\,(x - y)}\,\ln\!\left[\frac{y + 2L\tan\theta}{x + 2L\tan\theta}\cdot\frac{x}{y}\right] \qquad (2)

where x and y are the length and width of the heat source, L is the height of the carrier, and k is the thermal conductivity.
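The two formulas above translate directly into code. The following is a minimal sketch of Equations (1) and (2); the numeric values in the usage example (a 4 mm × 2 mm source on a 0.6 mm silicon die over a copper spreader) are illustrative assumptions, not parameters of any specific design, and the closed form of Equation (2) as written assumes a rectangular source with x ≠ y.

import math

def spreading_angle(k_upper, k_lower):
    # Equation (1): heat spreading angle between the current layer (k_upper)
    # and the underlying layer (k_lower).
    return math.atan(k_upper / k_lower)

def spreading_resistance(x, y, L, k, theta):
    # Equation (2): thermal resistance of an x-by-y rectangular heat source on a
    # carrier of thickness L and conductivity k, including the spreading effect.
    # Assumes x != y (perturb one side slightly for a square source).
    t = math.tan(theta)
    return (1.0 / (2.0 * k * t * (x - y))) * math.log(
        ((y + 2.0 * L * t) / (x + 2.0 * L * t)) * (x / y))

# Illustrative usage (all values assumed): 4 mm x 2 mm source on a 0.6 mm silicon
# die (k = 100 W/mK) sitting on a copper spreader (k = 400 W/mK).
theta_si = spreading_angle(100.0, 400.0)
R_si = spreading_resistance(0.004, 0.002, 0.0006, 100.0, theta_si)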

For each on-chip router i, the thermal resistance, R_i, is the summation of the thermal resistance of each component along the heat dissipation path, as follows:

R_i = R_{i,\mathrm{silicon}} + R_{i,\mathrm{spreader}} + R_{i,\mathrm{sink}} + R_{i,\mathrm{ambient}} \qquad (3)

where R_{i,silicon}, R_{i,spreader}, R_{i,sink}, and R_{i,ambient} are the thermal resistances of the on-chip router through the silicon die, heat spreader, heat sink, and ambient, respectively.

Both R_{i,silicon} and R_{i,spreader} can be obtained using Equations (1) and (2). For R_{i,silicon}, the area of the heat source is equal to the size of the on-chip router, and the heat spreading angle in the silicon die can be determined by the thermal conductivity ratio of the silicon die and the heat spreader.² For R_{i,spreader}, the area of the heat source is the original router area plus the area expansion due to heat spreading in the silicon die, which equals (x + 2L_{silicon} tan θ_{silicon})(y + 2L_{silicon} tan θ_{silicon}). The heat spreading angle is affected by the thermal conductivity ratio of the heat spreader and heat sink.

² If an interface material is used, the heat spreading angle can be determined similarly, and Equation (3) should also take the thermal resistance of the interface material into account.

For R_{i,sink}, the other side of the heat sink is attached to a cooling fan, and heat dissipation is based on heat convection. Previous work has proposed a closed-form thermal equation to address the heat spreading issue in heat sinks. The thermal resistance of a heat sink can be determined by the following equation [23]:

R_{i,\mathrm{sink}} = \frac{1}{\pi k a}\left(\epsilon\tau + (1-\epsilon)\,\frac{\tanh(\lambda\tau) + \lambda/Bi}{1 + (\lambda/Bi)\tanh(\lambda\tau)}\right) \qquad (4)

where k is the thermal conductivity of the heat sink, a is the source radius, ε, τ, and λ are dimensionless parameters defined in [23], and Bi is the Biot number.

The thermal resistance along the heat convection path can be estimated as follows [17]:

R_{i,\mathrm{ambient}} = \frac{1}{h_c A_s} \qquad (5)

where h_c is the convection heat transfer coefficient and A_s is the effective surface area.
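Putting Equations (1)-(3) and (5) together, the sketch below stacks the per-layer resistances along one router's heat dissipation path. The heat-sink term is left as an input (it would come from Equation (4) or a package datasheet), the two helpers from the previous sketch are repeated so the block is self-contained, and all layer thicknesses, conductivities, and convection parameters in the example are assumed values for illustration only.

import math

def spreading_angle(k_upper, k_lower):
    return math.atan(k_upper / k_lower)                      # Equation (1)

def layer_resistance(x, y, L, k, theta):
    t = math.tan(theta)                                      # Equation (2)
    return (1.0 / (2.0 * k * t * (x - y))) * math.log(
        ((y + 2.0 * L * t) / (x + 2.0 * L * t)) * (x / y))

def router_path_resistance(x, y,
                           L_si, k_si, L_sp, k_sp, k_sink,
                           R_sink, h_c, A_s):
    """Equation (3): R_i = R_silicon + R_spreader + R_sink + R_ambient."""
    theta_si = spreading_angle(k_si, k_sp)                   # die over spreader
    theta_sp = spreading_angle(k_sp, k_sink)                 # spreader over sink
    R_silicon = layer_resistance(x, y, L_si, k_si, theta_si)
    # The footprint grows by the spreading in the silicon die before the heat
    # enters the spreader, as described in the text above.
    grow = 2.0 * L_si * math.tan(theta_si)
    R_spreader = layer_resistance(x + grow, y + grow, L_sp, k_sp, theta_sp)
    R_ambient = 1.0 / (h_c * A_s)                            # Equation (5)
    return R_silicon + R_spreader + R_sink + R_ambient

# Illustrative, assumed numbers: 4 mm x 2 mm router, 0.6 mm silicon die,
# 1 mm copper spreader, a datasheet-style sink resistance, forced-air convection.
R_i = router_path_resistance(0.004, 0.002,
                             0.0006, 100.0, 0.001, 400.0, 400.0,
                             R_sink=0.5, h_c=50.0, A_s=0.02)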

Inter-block thermal correlation: In general, the steady-state temperature of each location across the silicon die is a function of the power consumption of all the on-chip heat sources:

T(x, y) = \sum_{i=1}^{N} C_i P_i \qquad (6)

where T(x, y) is the temperature at location (x, y) on the silicon die, N is the total number of on-chip heat sources, P_i is the power consumption of heat source i, and C_i is the thermal correlation (thermal resistance) between heat source i and location (x, y).

The thermal correlation among on-chip blocks (routers) can be characterized based on the cooling structure, the heat spreading angle in each cooling package layer, and the inter-router distance. There exists a duality between heat transfer and electrical phenomena, so the linearized superposition principle in electrical circuits can be extended to thermal circuits to analyze the inter-router thermal effect. As shown in Figure 4, Q1 and Q2 denote the power consumption of two on-chip routers. The corresponding heat dissipation paths of these two routers are initially separate but finally merge into one path due to the heat spreading effect. Using the linearized superposition principle, the heat dissipation paths of these two routers can be divided into two parts at the merge point – before this point, the heat dissipation paths are modeled with two separate thermal resistors, R1 and R2; after this point, they are modeled with a shared thermal resistor, R3. The thermal correlation between the two heat sources is determined by the value of R3, which is the thermal resistance of the shared heat dissipation path from the point where the two paths merge to the ambient environment. The position of the merge point can be estimated based on the heat spreading angle in each packaging material and the inter-router distance. The junction temperature of each router, T1 and T2, including inter-router thermal correlation, can then be estimated using thermal superposition:

T_1 = Q_1 R_1 + (Q_1 + Q_2) R_3, \qquad T_2 = Q_2 R_2 + (Q_1 + Q_2) R_3 \qquad (7)
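The superposition of Equations (6) and (7) amounts to a matrix-vector product between a thermal-correlation matrix and the vector of router power values. A minimal sketch, assuming the correlation coefficients have already been derived from the spreading angles and cooling structure as described above (the ambient offset is made explicit here, although Equation (6) leaves it implicit, and the numeric values are assumed):

def junction_temperatures(C, P, T_ambient=25.0):
    """Equation (6): T_i = sum_j C[i][j] * P[j], plus the ambient reference.
    C[i][j] is the thermal correlation (shared-path resistance) between heat
    sources i and j; C[i][i] is router i's own path resistance."""
    return [T_ambient + sum(c_ij * p_j for c_ij, p_j in zip(row, P))
            for row in C]

# Two-router case of Equation (7): T1 = Q1*R1 + (Q1+Q2)*R3 and
# T2 = Q2*R2 + (Q1+Q2)*R3, i.e. C = [[R1+R3, R3], [R3, R2+R3]].
R1, R2, R3 = 1.2, 1.2, 0.8          # assumed resistances (K/W)
Q1, Q2 = 2.5, 2.5                    # assumed router power (W)
T1, T2 = junction_temperatures([[R1 + R3, R3], [R3, R2 + R3]], [Q1, Q2])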

Figure 5. Thermal model validation against FEMLAB: (a) estimation error (%) for the single-heat-source scenario; (b) estimation error (%) for the three-heat-source scenario.

Figure 6. Thermal model validation against the IBM in-house finite-element based simulator: (a) chip thermal profile (°C); (b) estimation error (%).

    3.1.1 Validation

In this section, we discuss the validation of our thermal models.

FEMLAB, a finite-element based simulator: To evaluate the accuracy of our thermal model, we first use a commercial finite-element based simulator, FEMLAB [8]. We assume the silicon die's dimensions are 18 mm × 18 mm × 0.6 mm and its thermal conductivity is 100 W/mK, the heat spreader's dimensions are 30 mm × 30 mm × 1 mm, and the heat sink's dimensions are 60 mm × 60 mm × 6.8 mm. Both the heat spreader and heat sink are assumed to be made of copper, whose thermal conductivity is 400 W/mK. The silicon die is partitioned evenly into 9 × 9 blocks. We set the ambient temperature to 25°C.

In the first scenario, we assume the central block, (5,5), is the only heat source and has a 2.5 W power consumption. We use this setup to evaluate the accuracy of modeling a single heat source. Using our model, the temperature distribution on the silicon surface varies from 32.6°C to 28.2°C. Block (5,5) is the hottest spot, and the four boundary blocks have the lowest temperature. Figure 5(a) shows the modeling error of our thermal model against FEMLAB. The error is consistently less than 5% (the average being 2.9%). To evaluate the accuracy of our model in capturing inter-router thermal correlation, we assume three heat sources situated at (5,5), (3,5) and (7,5), each with a power consumption of 2.5 W. Using our model, the temperature varies from 39.8°C to 34.7°C. The estimation error of our thermal model against FEMLAB is shown in Figure 5(b), where the maximum error is less than 5% (the average being 1.0%).

Actual chip with power measurements: Next, we evaluate our thermal model against an actual chip design from IBM [4]. In this design, a 13 mm × 13 mm × 0.625 mm chip is soldered to a multilayer ceramic carrier. This chip is attached to a heat sink and placed inside a wind tunnel. The power consumption profile across the silicon die is based on physical measurements. We use our thermal model to evaluate the temperature profile of this IBM design. Figure 6(a) shows the temperature simulation result using our thermal model. Comparing it against an in-house finite-element based thermal simulator at IBM, Figure 6(b) shows the estimation errors in the on-chip thermal profile. As shown in this figure, the maximum error is less than 10%, and the average error is 5.3%. Using our thermal model, the junction temperature of the silicon die varies within [70.2, 85.4]°C. The IBM thermal simulator shows the junction temperature varying within [73.2, 87.8]°C.

Both validation studies demonstrate the importance of accurately modeling the spatial variance of temperature across the silicon surface. While our model and the finite-element based simulators show the spatial thermal variance to be around 15°C, HotSpot [22] reports only 8°C.

    3.2 Thermal Modeling of Links

The power consumed in the on-chip link circuitry affects the temperature of both the silicon and metal layers. As on-chip links are typically fairly long, buffers are inserted to reduce signal propagation delay. The inserted buffers not only affect network performance and power consumption, but also the link's temperature. Buffers split on-chip links into multiple segments, with each segment connected to silicon through two vias. In copper processes, vias have much better thermal conductivity than the dielectrics and thus serve as efficient heat dissipation paths. To model on-chip link temperature effectively, buffer insertion effects need to be considered. Previous research has calculated the optimal interconnect length at which to insert repeaters as [14]:

l_{\mathrm{opt}} = \mathrm{const}\sqrt{\frac{r_0\,(c_0 + c_p)}{r\,c}} \qquad (8)

where r_0 and c_0 are the effective driver resistance and input capacitance of a minimum-sized driver, c_p is the output parasitic capacitance, and r and c are the interconnect resistance and capacitance per unit length.

Given the length of the link segments, the temperature along each segment can be calculated using the following equation [5]:

T(x) = T_0 + \frac{j^2 \rho L_H^2}{k_M}\left(1 - \frac{\cosh(x/L_H)}{\cosh\!\left(L/(2L_H)\right)}\right) \qquad (9)

where T_0 is the temperature of the underlying layer, j is the current density through the link segment, and ρ, L, and k_M are the resistivity, length, and thermal conductivity of the link segment, respectively. L_H denotes the thermal characteristic length.
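The two link equations are straightforward to evaluate; the sketch below implements Equations (8) and (9). The prefactor in Equation (8) is left as a parameter (assumed 1.0 by default), since the section does not fix it, and the interpretation of x as measured from the segment midpoint is inferred from the cosh form, under which the ends at x = ±L/2 sit at T_0.

import math

def optimal_segment_length(r0, c0, cp, r, c, const=1.0):
    # Equation (8): repeater spacing; 'const' is the technology-dependent
    # prefactor, assumed 1.0 here.
    return const * math.sqrt(r0 * (c0 + cp) / (r * c))

def segment_temperature(x, T0, j, rho, L, k_M, L_H):
    # Equation (9): temperature at position x along a buffered link segment of
    # length L, relative to the underlying layer temperature T0; x is taken
    # from the segment midpoint, so T(+/- L/2) = T0 at the vias.
    return T0 + (j ** 2) * rho * (L_H ** 2) / k_M * (
        1.0 - math.cosh(x / L_H) / math.cosh(L / (2.0 * L_H)))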

Based on our analysis, the major thermal contribution of the link circuitry lies in the silicon (the buffers). Due to the limited self-heating power, when the secondary heat dissipation path from the top silicon dioxide and low-k material layers to the printed circuit board is considered, thermal hotspots are located in the silicon layer instead of the metal layers.

3.3 Sirius: Network Thermal Simulation Environment

The thermal model described above is built into a network simulation environment, called Sirius, which provides an architecture-level platform for rapid exploration of the performance, power consumption, and thermal profile of on-chip networks.

Network model: We use a flit-level on-chip network model, which specifies the topology and resource configuration of the network. Currently, two-dimensional direct-network topologies are supported. Each router is specified with a pipeline model. Various routing schemes, including deterministic, oblivious, and adaptive, are integrated.

Power model: This model is adapted from Orion [28], an architecture-level network power (dynamic/leakage) model. Network power consumption is determined by the network architecture, implementation technology, and traffic activity. The first two parameters are defined in the network model; the last is captured during run-time simulation.

Thermal model: This is the model presented in this section. It is created, based on the network architecture and cooling structure, during compilation.

Timing-driven simulator: This is built on top of a timing-driven simulation engine. During on-line simulation, traffic activities are gathered and fed into the power model to estimate network power consumption. This is then fed into the thermal model periodically to estimate the network temperature profile. Timing information is also gathered to monitor network latency and throughput.
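A structural sketch of how such a timing-driven loop can be organized is shown below. The object interfaces (power_model, thermal_model, the per-cycle event format, and the clock constant) are illustrative assumptions, not the actual Sirius implementation.

from collections import defaultdict

CLOCK_HZ = 425e6          # assumed network clock for converting energy to power

def run_simulation(trace, power_model, thermal_model, thermal_period_cycles):
    """Minimal sketch of a Sirius-style loop: flit-level events feed the power
    model every cycle; the thermal model is updated once per thermal period."""
    energy = defaultdict(float)                    # Joules per router this period
    for cycle, events in enumerate(trace):         # events: {router_id: flits}
        for router, flits in events.items():
            energy[router] += flits * power_model.energy_per_flit(router)
        if (cycle + 1) % thermal_period_cycles == 0:
            period_s = thermal_period_cycles / CLOCK_HZ
            power = {r: e / period_s for r, e in energy.items()}   # W per router
            thermal_model.step(power, period_s)    # advance the lumped RC network
            energy.clear()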

4 ThermalHerd: Distributed, Collaborative Run-time Thermal Management

The thermal behavior of on-chip networks is inherently distributed in nature. Thermal emergencies can occur in different locations and change dynamically. In addition, on-chip networks are heavily performance-driven. Here, we explore run-time management of network temperature using ThermalHerd, a distributed scheme in which routers collaboratively regulate the network temperature profile and work towards averting thermal emergencies while minimizing performance impact.

    4.1 ThermalHerd: Overall Architecture

The microarchitecture of a ThermalHerd router consists of five key components:

• Temperature monitoring: At run-time, temperature monitors, such as thermal sensors or on-line thermal models, periodically report the local temperature to each router, triggering an emergency mode when the local temperature exceeds a thermal threshold.

• Traffic monitoring and prediction: ThermalHerd's throttling and routing rely on traffic activity counters embedded in each router that facilitate prediction of future network workload.

• Distributed throttling: Upon a thermal emergency, the routers at the hotspot throttle incoming traffic in a distributed way, reducing power consumption in the region and thus cooling the hotspot.

• Reactive routing: In addition, each router reacts by adapting its routing policy to direct traffic away from the hotspots, relying on the thermal correlation information that is exchanged between routers periodically.

• Proactive routing: During normal operation, routers proactively shape their routing decisions to reduce traffic to potential thermal hotspots, based on thermal correlation information.

4.2 Run-time Thermal Monitoring

There are two different mechanisms that can be used to monitor network temperature: thermal sensors and on-line thermal estimation. Thermal sensors have been widely used in high-performance systems. For instance, Power5 [6] employs 24 digital temperature sensors to obtain the chip temperature profile. Another approach is dynamic thermal modeling and estimation. In [22], an architecture-level thermal model is used to estimate and monitor the temperature profile of microprocessors. Since thermal transitions are fairly slow, the computation overhead introduced by on-line thermal estimation is tolerable.

In this work, we focus on dynamic thermal management. Either of the above temperature monitoring techniques can be used. For thermal sensors, if one sensor per router is not affordable for large on-chip networks, the network can be partitioned into regions and multiple adjacent routers within a region can share the same sensor. For the on-line thermal modeling based approach, on-line power and temperature models are required. An efficient on-line power estimation mechanism for networks is proposed in [20]. This can then be fed into thermal models implemented with dedicated hardware or executed on on-chip processing elements for temperature estimation.
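For example, if routers within a region share one sensor, the mapping and the emergency check can be as simple as the following sketch; the region size and threshold value are assumptions made only for illustration.

def region_of(router_xy, region_size=2):
    """Map a router coordinate to the sensor region it shares (assumed 2x2 regions)."""
    x, y = router_xy
    return (x // region_size, y // region_size)

def routers_in_emergency(router_list, sensor_readings_c, trigger_c=85.0):
    """Return the routers whose shared sensor exceeds the thermal trigger threshold."""
    return [r for r in router_list
            if sensor_readings_c[region_of(r)] > trigger_c]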

    4.3 Dynamic Traffic Estimation and Prediction

In ThermalHerd, each router is equipped with two sets of traffic counters to dynamically estimate the traffic workload. Traffic counter cnt_local is integrated with the injection buffer to monitor locally-generated traffic. Traffic counter cnt_neighbor is used to monitor the traffic arriving from the neighborhood. Two other counters, cnt_his_local and cnt_his_neighbor, are used for traffic bookkeeping. Both traffic and bookkeeping counters are updated based on a predefined timing window, T_traff. Within each T_traff, the traffic counters cnt_local and cnt_neighbor count the total number of incoming flits from the injection and input ports, respectively. At the end of each T_traff, the traffic information in the traffic and bookkeeping counters is combined using weighted-average filtering to eliminate transient traffic fluctuations.
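The counter bookkeeping described above can be sketched as follows; the filter coefficient is an assumption, since the section specifies weighted-average filtering without giving its weight.

class TrafficEstimator:
    """Per-router traffic estimation and prediction (Section 4.3 sketch)."""

    def __init__(self, weight=0.5):
        self.weight = weight          # assumed weighted-average filter coefficient
        self.cnt_local = 0            # flits injected locally in the current T_traff
        self.cnt_neighbor = 0         # flits from neighbors in the current T_traff
        self.cnt_his_local = 0.0      # filtered history of local traffic
        self.cnt_his_neighbor = 0.0   # filtered history of neighbor traffic

    def flit_injected(self):
        self.cnt_local += 1

    def flit_arrived(self):
        self.cnt_neighbor += 1

    def end_of_window(self):
        # Weighted-average filtering to damp transient traffic fluctuations.
        w = self.weight
        self.cnt_his_local = w * self.cnt_local + (1 - w) * self.cnt_his_local
        self.cnt_his_neighbor = w * self.cnt_neighbor + (1 - w) * self.cnt_his_neighbor
        self.cnt_local = 0
        self.cnt_neighbor = 0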

    4.4 Distributed Traffic Throttling

The key challenge faced in designing a distributed traffic throttling mechanism is that it has to effectively regulate overall network temperature, averting thermal emergencies, while minimizing performance impact.

Exploring the design space for distributed throttling: As shown in Equation (6), the temperature contributed by each heat source is affected by the thermal correlation, which is determined by the cooling structure and distance. Neighboring heat sources have more thermal impact than remote ones. Therefore, to effectively alleviate a thermal emergency, traffic throttling around the hotspot locations requires less throughput reduction, and hence incurs less performance penalty.

Traffic throttling within a router also affects the power consumption of neighboring routers – throttling the router injection port decreases the traffic through neighboring routers; throttling the router input and output ports decreases the available network bandwidth. We study the power throttling effect on a 9×9 on-chip network using uniform random traffic. Router R4,4 is chosen to be the only traffic regulation point. Figure 7 shows the power savings of router R4,4 and of the whole network as router R4,4 throttles an increasing percentage of incoming traffic (we call this the traffic throttling ratio). The black and gray bars show the power savings of router R4,4 and the whole network, respectively. The line shows the ratio of the power savings of router R4,4 to that of the total network. It shows that when R4,4 is throttling only a small percentage of its arriving traffic, most of the power reduction comes from the throttled router, and the power reduction in the remaining part of the network is negligible. As R4,4 throttles more and more of its arriving traffic, the power savings of the local router increase, while at the same time, the throttling effect spreads from the local router to its neighborhood, much like traffic congestion on a local street spreading beyond its initial location to other roads feeding into this street. Thus, power reduction at other routers also increases and begins to dominate the total power savings, with neighboring routers seeing a sharper power reduction than remote ones.

Figure 7. Power impact of localized throttling: power savings (W) of the throttled router and of the whole network, and their ratio, vs. traffic throttling ratio.

Distributed throttling in ThermalHerd: Based on the above observations, we propose the following distributed throttling mechanism in ThermalHerd.

When a thermal hotspot is detected at the local router R_i, this router begins to decrease the local workload by throttling the input traffic. The policy of traffic throttling is controlled by an exponential factor k and the local traffic estimation, as follows:

\mathrm{Quota}_i = k \times (N_{\mathrm{his\_local}} + N_{\mathrm{his\_neighbor}}), \quad k \le 1 \qquad (10)

where Quota_i is the traffic quota used to control the total amount of workload that is allowed to pass through this router. At the beginning of each thermal timing window, a new temperature is reported. If the temperature still increases, the traffic quota is further multiplied by k in the next timing window; otherwise, the throttling ratio is kept the same. Thus, the overall traffic throttling ratio, K, equals k^n, where n is the number of thermal timing windows in which the temperature has continuously increased. This procedure continues till the thermal emergency is removed. Furthermore, each router uses the following policy to split the quota, Quota_i, between the traffic injected locally, Quota_local, and the traffic arriving from the neighborhood, Quota_neighbor:

\mathrm{Quota}_{\mathrm{local}} = \begin{cases} N_{\mathrm{his\_local}} & \text{if } N_{\mathrm{his\_local}} \le \mathrm{Quota}_i \\ \mathrm{Quota}_i & \text{if } N_{\mathrm{his\_local}} > \mathrm{Quota}_i \end{cases}

\mathrm{Quota}_{\mathrm{neighbor}} = \begin{cases} \mathrm{Quota}_i - N_{\mathrm{his\_local}} & \text{if } \mathrm{Quota}_i - N_{\mathrm{his\_local}} > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (11)

This policy is biased towards providing enough traffic quota to the traffic generated locally. The rationale behind the policy is as follows. Without enough traffic quota, the locally-generated traffic will be blocked in the injection buffer, which increases network latency. Traffic from neighboring nodes, on the other hand, can be redirected through other paths, thus avoiding a performance penalty. Hence, the above-mentioned bias towards local traffic reduces the performance penalty.
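Equations (10) and (11), together with the multiplicative tightening described above, reduce to a few lines of arithmetic. A sketch, with the predicted traffic taken from the counters of Section 4.3 and the example numbers purely assumed:

def throttle_quotas(k, n, N_his_local, N_his_neighbor):
    """Equations (10)-(11): traffic quota after n consecutive thermal windows
    in which the local temperature kept rising (overall ratio K = k**n)."""
    K = k ** n
    quota_total = K * (N_his_local + N_his_neighbor)          # Equation (10)
    quota_local = min(N_his_local, quota_total)               # favor local injection
    quota_neighbor = max(quota_total - N_his_local, 0.0)      # remainder for transit traffic
    return quota_local, quota_neighbor

# Example: k = 0.9, temperature rose for 3 windows, history of 40 local and
# 160 neighbor flits per window (all values assumed).
local_q, neighbor_q = throttle_quotas(0.9, 3, 40, 160)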

4.5 Thermal-Correlation Based Routing Algorithm

Traffic throttling achieves a lower junction temperature by reducing network traffic and power consumption. While distributed throttling is efficient, it is not without a performance penalty. Since network thermal hotspots are often a result of imbalanced traffic, thermal-aware routing algorithms that redirect traffic away from throttled routers to minimize the performance penalty, and that shape network traffic patterns suitably, can potentially balance the network temperature profile and avoid or reduce traffic throttling.

Our routing protocols rely on the thermal information exchanged within the network. At run-time, when thermal hotspots are detected, routers located in hotspot regions are marked as hotspots and send special packets across the network. Since temperature transition is a very slow process, a noticeable temperature variation takes at least hundreds of microseconds. The power consumption and delay overhead introduced by these messages are thus negligible. For instance, in a 4×4 network, encoding the location of each hotspot takes four bits. Using a 100 µs temperature report interval, the communication overhead for each hotspot is about 1 bit/µs. When no thermal emergency occurs, the location of the highest chip temperature is relayed through the network.

We discuss both proactive and reactive routing protocols next.

Proactive routing protocol: The proactive routing scheme continuously monitors the network temperature profile. When the maximum chip temperature is below the thermal emergency limit, it dynamically adjusts traffic to balance the network temperature profile and reduce the peak temperature.

Reactive routing protocol: Upon receiving thermal emergency information, the reactive routing protocol replaces the proactive routing protocol. It tries to steer packets away from throttled regions to minimize the performance penalty due to throttling. In addition, reactive routing tries to balance the network temperature profile to reduce the hotspot temperature, and hence the need for throttling.

Both routing schemes use traffic redirection to achieve their goals. Since non-minimal path routing results in more hops and links being traversed, and thus higher power consumption and hence potentially higher junction temperatures, we use minimal-path adaptive routing functions.

For both routing schemes, in order to balance the network temperature profile, the chosen routing paths should have minimal thermal correlation with the regions of highest temperature. Since the inter-router thermal effect depends on distance, among all of the minimal-path routing candidates, the routing path with the farthest thermal distance should be chosen. However, this approach significantly reduces path diversity and pushes the traffic workload towards the coolest boundaries, which can result in an unbalanced traffic distribution and may also generate new thermal hotspots in the future.

Detailed thermal analysis shows that inter-router thermal correlation decreases dramatically with increasing inter-router distance. Furthermore, the thermal correlation with remote routers is very small. We thus use a thermal correlation threshold, α, to select routing candidates. The thermal correlation of each routing candidate is determined as in Equation (7). Intuitively, ThermalHerd's routing protocol jointly considers thermal correlation and traffic balancing. It tries to eliminate routing paths with a high thermal correlation while leaving enough alternative routing candidates to balance the network traffic. It does so by picking paths where the thermal correlation between the source and every hop along the path is below the thermal correlation threshold α. Since we use minimal-path routing, routing candidates satisfying such a thermal correlation threshold criterion may not be available. If so, the candidate routing path with the least thermal correlation is chosen.
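The candidate-selection rule just described can be sketched as follows; the data structures (the correlation table and the set of minimal-path output candidates) are assumed to be provided by the router, and α is the threshold defined above.

def thermally_safe_candidates(candidates, corr_to_hotspot, alpha):
    """Filter minimal-path routing candidates by thermal correlation.
    corr_to_hotspot[c] is the thermal correlation (shared-path resistance,
    as in Equation (7)) between candidate c's path and the hottest region."""
    safe = [c for c in candidates if corr_to_hotspot[c] < alpha]
    if safe:
        return safe                       # keep several options for traffic balancing
    # No candidate satisfies the threshold: fall back to the least-correlated path.
    best = min(candidates, key=lambda c: corr_to_hotspot[c])
    return [best]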

To strike a good balance between temperature and performance, our implementation uses different thermal correlation thresholds for the proactive and reactive routing policies. When the maximum network temperature is below the thermal emergency limit, to minimize the performance penalty, a less aggressive traffic redirection policy is used for proactive routing (in this work, we set the thermal correlation threshold to be equivalent to 2L, where L is the physical distance between neighboring routers). When a thermal emergency occurs, we set the threshold to be equivalent to 4L for reactive routing, since at that time reducing the chip temperature is the first-order issue, and also most of the performance penalty is due to traffic throttling.

5 ThermalHerd Evaluation

In this section, we evaluate the performance of ThermalHerd using CMP traffic traces generated from the TRIPS CMP. We use Sirius as the simulation platform, which was described in Section 3.3. Performance evaluation focuses on the following two major design metrics.

Effectiveness of run-time thermal management: First and foremost, as an on-line thermal management scheme, ThermalHerd should effectively alleviate thermal emergencies and ensure safe on-line operation.

Two temperature-related parameters are introduced here – the network thermal emergency threshold and the network thermal trigger threshold. The former is a hard temperature upper bound: the maximum allowable network temperature, which depends on various factors such as the thermal budget, cooling solutions, and timing/reliability issues. For different systems, the chip temperature limit typically varies from 60°C to 95°C. In mobile applications or application scenarios with tight cooling budgets and space, the thermal budget could reach 105°C. In other applications, such as supercomputing, where cooling cost is not critical and the mean time to failure of hundreds of parallel computation nodes needs to be maximized, the thermal budget could be lower than 60°C. Here, we set the thermal emergency threshold across a wide temperature range. The latter parameter, the thermal trigger threshold, is used to activate ThermalHerd. When the chip peak temperature exceeds the thermal trigger threshold, distributed traffic throttling is enabled and the proactive routing scheme is replaced by the reactive routing scheme. We set the thermal trigger threshold to 1°C below the thermal emergency threshold.

Network performance impact: On-chip networks have very tight performance requirements in terms of latency and throughput. Hence, the performance impact of thermal management should be minimal.

The network performance is evaluated as follows. Latency spans the creation of the first flit of a packet to the ejection of its last flit at the destination router, including source queuing time and assuming immediate ejection. Network throughput is the total number of packets relayed through the network per unit time.

For comparison purposes, we also introduce the following alternative run-time thermal management techniques.

• GlobalThermal: When the temperature exceeds the thermal trigger threshold, all the routers throttle the same percentage of incoming traffic. GlobalThermal is an approximation of the chip-level thermal management technique in the Pentium4-M [1], in which, as the processor temperature reaches the temperature limit, the thermal control circuit throttles the processor clock.

• DistrThermal: This scheme is equipped with the same distributed traffic throttling technique as ThermalHerd. However, the thermal-correlation based routing algorithms are not enabled. This scheme is used here to differentiate the performance impact of the throttling and routing techniques. Throttling individual functional units has also been proposed for microprocessors – Power5 uses a dual-stage thermal management scheme [6], where in its second stage, temperature reduction is achieved by throttling the processor throughput via individual functions.

• PowerHerd: This is the only available architecture-level run-time power management scheme targeting network peak power [20]. Unlike ThermalHerd, PowerHerd is power-aware instead of thermal-aware. It controls network peak power based on a global power budget that mimics the thermal threshold.

5.1 Evaluation Using TRIPS On-chip CMP Traffic Traces

We evaluate the performance of ThermalHerd using traffic traces extracted from the on-chip operand networks of the TRIPS CMP by running a suite of 16 SPEC and Mediabench benchmarks. Since thermal transition is a slow process, to study the accumulated impact of ThermalHerd on both performance and chip temperature, the traffic traces generated by the different benchmarks are concatenated together, forming a 200 ms test trace.

Using Raw [26], TRIPS [16], and the on-chip network proposed in [7], we define a 5×5 on-chip mesh network, as the TRIPS network architecture is currently in the design stage. We assume a typical cooling solution here – the silicon die uses flip-chip packaging and is attached to a 15 mm × 15 mm heat spreader and a 30 mm × 30 mm heat sink. The ambient temperature is 25°C. The initial temperature is determined by the average network power consumption over the whole simulation trace. Figure 8 shows the network peak temperature without any thermal management. As different benchmarks have different network workloads and traffic patterns, the network peak temperature varies from 70.1°C to 94.0°C along the 200 ms simulation.

Figure 8. Network peak temperature (°C) over time (ms) using the TRIPS traffic trace.

Table 1. Temperature management.
T_T (°C): 89   87   85   83   82   79   77   75
T_A (°C): 86.2 86.1 84.1 82.5 81.3 78.8 76.5 74.7

To evaluate the effectiveness of ThermalHerd in maintaining the chip temperature below the emergency point, we choose eight thermal emergency thresholds, T_T. The simulation results are shown in Table 1. In this table, the first row shows the eight thermal emergency thresholds and the second row shows the actual network peak temperature, T_A, regulated by ThermalHerd. Comparing these two rows, we can see that with ThermalHerd, the network run-time temperature is always below the corresponding thermal constraint, which means ThermalHerd can guarantee safe on-line operation.

As shown in Table 1, when T_T is at or below 87°C, the difference between T_T and T_A is always less than 1°C, which implies that the network peak temperature has exceeded the corresponding thermal trigger threshold. Hence, both distributed throttling and reactive routing are enabled. At T_T = 89°C, the network peak temperature stays below the thermal trigger threshold; therefore, only proactive routing is enabled in this case. Compared to the peak temperature (94.0°C) when ThermalHerd is disabled, proactive routing alone reduces the peak temperature by 7.8°C by balancing the chip thermal profile.

Figure 9 shows the network throughput degradation introduced by ThermalHerd under different thermal constraints. The throughput degradation, Thr_deg, is defined as follows:

\mathrm{Thr}_{\mathrm{deg}}(t) = \frac{\mathrm{Thr}_{\mathrm{init}}(t) - \mathrm{Thr}_{\mathrm{Thermal}}(t)}{\mathrm{Thr}_{\mathrm{init}}(t)} \qquad (12)

where Thr_Thermal(t) and Thr_init(t) are the network run-time throughputs with and without ThermalHerd. From this figure, we make two observations. First, when the thermal threshold is higher than 84.0°C, the throughput degradation introduced by ThermalHerd is less than 1%. Therefore, compared to the 94.0°C network peak temperature, ThermalHerd reduces the network peak temperature by 10°C with negligible performance penalty. This significant improvement is achieved by effective proactive and reactive routing schemes in collaboration with efficient distributed traffic throttling. Second, as the thermal threshold decreases, throughput degradation increases. This indicates that proactive routing alone cannot sufficiently avert the need for throttling, which impacts performance.

As we can see, the proactive routing scheme is an effective mechanism to balance the network temperature profile and reduce the network peak temperature, which reduces the need for thermally-induced throttling and hence network throughput degradation. As discussed in Section 4.5, this routing scheme affects network latency. Here, the latency overhead, Lat_ovd, is defined as follows:

\mathrm{Lat}_{\mathrm{ovd}}(t) = \frac{\mathrm{Lat}_{\mathrm{Thermal}}(t) - \mathrm{Lat}_{\mathrm{init}}(t)}{\mathrm{Lat}_{\mathrm{init}}(t)} \qquad (13)

where Lat_Thermal(t) and Lat_init(t) are the network run-time latencies with and without ThermalHerd.
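Both metrics are computed per sampling interval from the managed run and an unmanaged baseline run; a minimal sketch:

def throughput_degradation(thr_init, thr_thermal):
    # Equation (12): relative throughput loss with ThermalHerd enabled.
    return (thr_init - thr_thermal) / thr_init

def latency_overhead(lat_init, lat_thermal):
    # Equation (13): relative latency increase with ThermalHerd enabled.
    return (lat_thermal - lat_init) / lat_init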

Figure 10 shows the run-time latency impact of ThermalHerd. The latency overhead introduced by the proactive routing scheme is consistently less than 1.2%.

5.2 Comparison of ThermalHerd Against Alternative Thermal Management Schemes

Next, we seek to isolate the impact of the various features of ThermalHerd – distributed throttling, reactive routing, and proactive routing. Figure 11 shows network throughput degradation under different peak thermal constraints. We compare four schemes: GlobalThermal, DistrThermal (ThermalHerd with only distributed throttling), DistrThermal plus reactive routing, and ThermalHerd (distributed throttling, reactive, and proactive routing). The results show that under the same peak thermal constraints, DistrThermal is much more efficient than GlobalThermal because it selectively reduces the traffic with a high thermal contribution to thermal hotspots. With reactive routing on top of distributed throttling, the temperature profile is further smoothed and throughput degradation is reduced by more than 2X. Finally, with proactive routing, throughput degradation is negligible even at a thermal emergency threshold of 84°C.

We next compare ThermalHerd with PowerHerd. PowerHerd is power-aware instead of thermal-aware. It controls network power consumption based on a global power budget that mimics the thermal threshold. However, under the same global power budget, different traffic can result in very different network temperature profiles. In order to guarantee safe on-line operation, PowerHerd has to assume the worst-case power distribution that results in the maximum chip temperature. To obtain such a worst-case power distribution, we partition the TRIPS network trace into slots of 10 µs and calculate the average network power consumption for each time slot. With that, we derive the worst-case average power distribution that results in the maximum chip temperature. Figure 12 shows the maximum temperature under the worst-case power distribution as compared to the actual network peak temperature. We can see that the worst-case estimation overestimates the network peak temperature by about 5°C, which implies that PowerHerd will begin to throttle network traffic when the network peak temperature is 5°C lower than the thermal trigger threshold. Therefore, by targeting temperature directly, ThermalHerd is much more efficient than PowerHerd.

6 From On-chip Networks to Entire On-chip Systems

On-chip systems, such as SoCs and CMPs, consist of computation and storage elements interconnected by on-chip networks. Therefore, the chip temperature is an accumulated effect of the thermal interactions across all on-chip components.

Figure 9. Throughput degradation (%) over time (0–160 ms) for thermal thresholds of 80°C, 82°C, 84°C, and 86°C.

Figure 10. Latency overhead (%) over time (0–160 ms).

Figure 11. Performance impact of throttling: throughput degradation (%) vs. thermal emergency threshold (84–94°C) for GlobalThermal, DistrThermal, DistrThermal + reactive routing, and ThermalHerd.

Figure 12. Worst-case temperature estimation by PowerHerd vs. actual temperature profile: temperature (°C) over time (0–160 ms).

The relative thermal contribution of the different components varies depending on the chip architecture as well as the application scenarios.

The chip architecture determines the complexity of the processing vs. storage vs. communication elements and thus the peak power consumption of these elements. A chip with complex processing elements (e.g., wide-issue, multi-threaded) will require larger storage elements (e.g., large multi-level caches, register files) as well as sophisticated communication elements (e.g., multi-level, wide buses, networks with wide link channels, deeply-pipelined routers and significant router buffering). At the other extreme, there are chip architectures in which processing elements are single ALUs serviced by a few registers at the ALU input/output ports, interconnected with simple single-stage routers with little buffering.

Application characteristics dictate how the above elements are used, thus influencing the power and thermal profile of the chip. Essentially, the amount of computation and communication per data bit affects the relative power and thermal contribution of the processing, memory and network elements. Here, we use the MIT Raw chip as a platform for analyzing the absolute and relative thermal impact of all components of a chip. The Raw chip, with its single-issue processing elements, 32KB caches and registers per tile, and networks with fairly narrow 32-bit channels, 8-stage pipelines, and limited router buffering, sits in the middle of two possible architectural extremes – a chip where processing elements are fat multi-threaded cores vs. one where processing elements are single ALUs, such as TRIPS cores.

    6.1 Thermal Characterization of the Raw Chip

In Raw, the chip temperature is affected by both the on-chip processors and the networks. In each tile, the processor power is mainly due to the instruction and data caches, ALU, register file, fetch, and control logic. For each of these components, the capacitance is obtained from the estimates of an IBM placement tool [9]. Their run-time activities are captured using the Raw Beetle simulator. Based on the chip layout, we extend our network thermal model to on-chip processors by representing each functional component as a thermal block and construct a lumped thermal RC network covering all the power-consuming components in the Raw chip. In Raw, switch memories and instruction/data caches are synchronous SRAM modules that initiate read operations every clock cycle. Raw proposed a power-aware technique allowing individual SRAM lines to be disabled when not in use. Here, we characterize the chip temperature in both the power-aware and non-power-aware modes.
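As an illustration, such a lumped thermal RC network can be time-stepped with a simple forward-Euler update. The three-block topology and all conductance, capacitance, power, and boundary values below are placeholders, not the calibrated Raw parameters.

```python
# Sketch: forward-Euler time-stepping of a lumped thermal RC network in which
# each power-consuming block (ALU, cache, router, ...) is one thermal node.
# Topology, R/C values, and power inputs are illustrative placeholders.
import numpy as np

def step_temperatures(T, P, G, C, g_amb, T_amb, dt):
    """One Euler step of C*dT/dt = P + sum_j G_ij*(T_j - T_i) - g_amb*(T - T_amb).

    T: node temperatures (K), P: node power (W),
    G: symmetric node-to-node thermal conductance matrix (W/K, zero diagonal),
    C: node thermal capacitances (J/K), g_amb: conductance to ambient (W/K).
    """
    lateral = G @ T - G.sum(axis=1) * T          # sum_j G_ij * (T_j - T_i)
    dTdt = (P + lateral - g_amb * (T - T_amb)) / C
    return T + dt * dTdt

# Toy 3-node example: processor core, data cache, router of one tile.
G = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 0.5],
              [1.0, 0.5, 0.0]])        # W/K between blocks (placeholder)
C = np.array([0.02, 0.03, 0.01])       # J/K (placeholder)
g_amb = np.array([0.8, 0.8, 0.4])      # W/K to package/heat sink (placeholder)
P = np.array([1.5, 0.8, 0.6])          # W dissipated per block (placeholder)
T = np.full(3, 318.0)                  # start near 45 °C
for _ in range(10000):                 # 1 s of simulated time at dt = 0.1 ms
    T = step_temperatures(T, P, G, C, g_amb, 318.0, dt=1e-4)
print(T - 273.15)                      # approximate steady-state estimate in °C
```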

As shown in Table 2, the current Raw binary distribution contains three sets of benchmarks – SPEC and Mediabench, stream computations, and bit-level computations.

We choose benchmarks from all three categories and study the thermal behavior of Raw.³ In order to reveal the thermal impact of the run-time activities of the benchmarks themselves, we first assume that the power-aware feature is supported – switch memories are only active when the static networks are used; instruction caches are only active during processor execution; data caches are accessed by both load and store operations. Figure 13 shows the thermal characterization of Raw using these benchmarks. For each benchmark, we consider both typical (300 lfpm) and best (600 lfpm) air-cooling conditions. The chip peak temperature is further characterized under three different power dissipation scenarios – processor power only, network power only, and processor plus network power. These results highlight the following observations:

• First and foremost, the chip temperature is the joint contribution of all on-chip components. As we can see, among all the benchmarks, the processors or networks alone may not tip the chip temperature over the thermal emergency point. However, together, the networks and processors can push the chip peak temperature higher than 100°C.

• The chip temperature is the result of thermal correlations between all on-chip components. For each on-chip component, its thermal contribution is affected by its own power consumption as well as its thermal correlation with other components. The former varies with the architecture and applications. The latter is determined by the cooling package and the physical distance.

• Different benchmarks demonstrate different thermal behavior. Among the three sets of benchmarks, both the stream and bit-level computation benchmarks exhibit excellent scalability – they can effectively utilize the on-chip parallel computation and communication resources, leading to a high chip temperature. Due to the limitations of the available compiler (rgcc) and the available ILP in the programs, the ILP in the SPEC and Mediabench benchmarks cannot be exploited efficiently. Therefore, these benchmarks leave the on-chip resources underutilized, and thus have a lower thermal impact. The thermal impact of the networks vs. the processors also varies across the different benchmarks. For instance, in 802.11a enc., 8b_10b enc., and fir, due to the high utilization of the static networks and the low access rate of the on-chip data caches, the networks alone result in a comparable or higher temperature than the processors. On the other hand, stream results in high utilization of its processing resources; fft only uses the dynamic network partially. In these two cases, the processor thermal impact is more significant.

³As thermal behavior is similar within a benchmark group, we arbitrarily picked just 2–3 benchmarks in each group to highlight differences across groups.

Table 2. Benchmarks provided by the Raw binary distribution.

  Benchmark set            Description of benchmarks
  SPEC & Mediabench        Targets instruction-level parallelism (ILP)
  Stream                   Targets real-time I/O
  Bit-level computations   Targets comparisons with FPGAs and ASICs

Figure 14 shows the chip peak temperature without power-awareness in the SRAM modules. Here, the on-chip memories result in a significant power and thermal overhead.

6.2 Extending ThermalHerd to Thermal Management of Entire On-Chip Systems

The observations from the previous section highlight that coordinating and regulating the behavior of all on-chip components is the key to achieving effective thermal management for the entire chip. Since on-chip systems are distributed in nature, the distributed, collaborative nature of ThermalHerd lends itself readily to the thermal management of the entire chip. We discuss the extension of the two key features of ThermalHerd next – distributed throttling and thermal-correlation based routing.

Distributed joint throttling: For Raw, effective thermal regulation requires jointly considering both the networks and the processors. We extend the distributed traffic throttling mechanism in ThermalHerd to distributed joint throttling of the processing, storage and network elements in Raw. When the temperature monitors flag a thermal emergency, the hotspot tile begins to throttle both the processor (processing and memory elements) and the network by controlling the crossbar grant signals, the instruction issue logic, and the memory disable signals. A simple control-loop sketch of this mechanism is given below.

Thermal-correlation based placement: Thermal emergencies are often the result of an unbalanced temperature profile. Therefore, workload migration can potentially balance the temperature profile and reduce the peak temperature. The thermal-correlation based routing scheme proposed in Section 4.5 targets network traffic migration.
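The control-loop sketch referenced above is shown here. The thresholds, duty factor, and per-tile interface are illustrative assumptions rather than the actual Raw/ThermalHerd implementation.

```python
# Sketch: per-tile distributed joint throttling. When a tile's temperature
# monitor flags an emergency, it throttles its own processor issue logic,
# memory enables, and router crossbar grants together; throttling is released
# with hysteresis once the tile cools down. All names/thresholds are assumed.

T_EMERGENCY = 84.0   # °C, thermal trigger threshold (illustrative)
T_RELEASE   = 82.0   # °C, hysteresis release threshold (illustrative)

class TileThrottler:
    def __init__(self, tile):
        self.tile = tile          # hypothetical interface: read_temp(), set_*()
        self.throttled = False

    def control_step(self):
        temp = self.tile.read_temp()
        if not self.throttled and temp >= T_EMERGENCY:
            self.throttled = True
        elif self.throttled and temp <= T_RELEASE:
            self.throttled = False
        # Jointly gate processing, storage and network activity in this tile.
        duty = 0.5 if self.throttled else 1.0   # fraction of cycles enabled
        self.tile.set_issue_throttle(duty)      # instruction issue logic
        self.tile.set_memory_enable(duty)       # SRAM line enables
        self.tile.set_crossbar_grant(duty)      # router crossbar grants
```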

Figure 13. Thermal characterization of Raw: peak temperature (°C) per benchmark (fir, fft, stream, 8b_10b enc., 802.11a enc., vpr, gzip, adpcm, mpeg) for processor+network, processor alone, and network alone; (I): 300 lfpm, (II): 600 lfpm.

Figure 14. Peak temperature of Raw (°C) for various benchmarks, without power-awareness in SRAMs; (I): 300 lfpm, (II): 600 lfpm.

For processors, previous work has explored computation migration in CMPs to improve performance, with a migration interval in the range of tens of microseconds [21], which matches the thermal time constant of on-chip thermal hotspots. Therefore, computation migration can be used to track and balance run-time thermal variations for on-chip computation resources – the scheduler/OS dispatches computation jobs based on the thermal correlation matrix of the processing elements to balance the run-time chip thermal profile.

Preliminary investigations: Concurrent tasks running on a parallel architecture, such as Raw, are more or less logically correlated. This inter-tile program correlation has a significant impact on run-time thermal management. First, distributed throttling minimizes performance degradation by only throttling those tiles that have the highest thermal impact locally. However, the inter-tile program correlation spreads the localized throttling effect to other tiles, thus degrading the throttling efficiency and overall performance. Second, to balance the chip thermal profile, task placement needs to avoid adjacent tiles in order to minimize the inter-tile thermal impact. This increases not only the latency but also the power and thermal overhead of supporting data communication among correlated tasks.

We explore the advantages and limitations of the above thermal management techniques. We design synthetic benchmarks to emulate two typical CMP applications:

• Benchmark I: Tasks running on different tiles are independent, which emulates server-like workloads – on-chip resources support multiple independent applications for different users.

• Benchmark II: All tiles form a tightly coupled computation pipeline stream, which emulates multimedia streaming applications.

As we can see, these two benchmarks are the two extreme cases in terms of inter-tile program correlation.

First, we evaluate the performance of distributed joint throttling (DJT). We compare it with global throttling (GT), in which, when a thermal emergency occurs, all the tiles are throttled in the same fashion. Figure 15 shows the performance degradation for these two techniques. For GT, each tile throttles 5% to 30% of the system throughput, which is defined as the overall finished workload. The x-axis shows the corresponding chip peak temperature reduction. As shown in the figure, for Benchmark I, to achieve the same temperature reduction, DJT results in much lower performance degradation by more efficiently throttling traffic on those tiles that have a higher thermal impact on thermal hotspots. As the required temperature reduction increases further, more tiles need to be throttled in order to achieve sufficient temperature reduction; hence, the gain of DJT diminishes. For Benchmark II, since all the tiles are tightly correlated, throttling any individual tile results in the same performance impact on all the other tiles. Therefore, DJT has the same performance impact as GT.

To evaluate the impact of thermal-correlation based placement on inter-tile program correlation, for each benchmark we reduce the number of parallel tasks to four. As shown in Figure 16, for both benchmarks, the four tasks are initially placed in the four center tiles of Raw. To balance the thermal profile, these four tasks are moved to the corner tiles. For Benchmark I, task reallocation effectively balances the chip thermal profile and reduces the chip peak temperature by 13.8%. For Benchmark II, task reallocation also reduces the chip thermal gradient. However, the extra communication power results in a significant power and thermal overhead and increases the overall chip temperature. As a result, the chip peak temperature increases by 6.9%. In this case, placing the tasks as in Figure 16 achieves a temperature reduction of 3.5% by reducing the inter-task thermal correlation and avoiding the extra communication overhead.
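As an illustration, the placement trade-off observed above can be captured by a simple greedy heuristic that weighs thermal correlation against communication cost. The heuristic and its inputs are illustrative assumptions, not the scheduler used in these experiments.

```python
# Sketch: greedy thermal-correlation based task placement. Tasks are assigned
# one at a time to the free tile with the smallest combined score of
# (i) thermal coupling to already-occupied tiles, from a thermal correlation
# matrix, and (ii) an optional communication penalty. Inputs are illustrative.

def place_tasks(num_tasks, corr, comm_cost=None, comm_weight=0.0):
    """corr[i][j]: thermal correlation between tiles i and j (higher means
    stronger mutual heating); comm_cost[i][j]: e.g., hop count between tiles."""
    n = len(corr)
    occupied = []
    for _ in range(num_tasks):
        best_tile, best_score = None, float("inf")
        for tile in range(n):
            if tile in occupied:
                continue
            heat = sum(corr[tile][o] for o in occupied)
            comm = sum(comm_cost[tile][o] for o in occupied) if comm_cost else 0.0
            score = heat + comm_weight * comm
            if score < best_score:
                best_tile, best_score = tile, score
        occupied.append(best_tile)
    return occupied
```

For independent tasks (as in Benchmark I), comm_weight can be set to zero, which spreads tasks toward weakly correlated tiles such as the corners; for tightly coupled streams (as in Benchmark II), a nonzero comm_weight trades thermal spreading against the communication power overhead noted above.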

7 Discussion and Related Work

We next present discussions and related work.

Interconnection networks: Substantial research has explored power consumption issues in interconnection networks. Patel et al. [15] first developed power modeling techniques for routers and links. Wang et al. [28] developed an architecture-level power model, called Orion, for interconnection networks. In our work, we have integrated Orion into our network evaluation platform. For design optimization, most prior works used circuit-level techniques to improve the power efficiency of the link circuitry, such as low-swing on-chip links. Recently, power-efficient on-chip network design has also been addressed at the microarchitecture level [27]. Power-aware techniques, such as dynamic voltage scaling and dynamic power management, have been proposed [11, 19] to minimize the power consumption in the link circuitry. All the above techniques target average power, not peak power.

Recently, a dynamic power management scheme, called PowerHerd [20], has been proposed to address network peak power issues. However, PowerHerd is power-aware instead of being thermal-aware. Even though power and temperature are correlated, they are still fundamentally different in nature. Thus, as demonstrated in the experimental results section, PowerHerd cannot address network thermal issues efficiently.

Microprocessors: Power and thermal concerns in modern processors have led to significant research efforts in power-aware and temperature-aware computing. Brooks et al. [3] first proposed a dynamic thermal management scheme. Skadron et al. [22] further explored control-theoretic techniques for this purpose. They also developed an architecture-level thermal model, called HotSpot. In our work, we constructed a thermal model for on-chip networks, which is an improvement over HotSpot. Recently, Srinivasan et al. [24] used a predictive dynamic thermal management scheme targeting multimedia applications.

Figure 15. Distributed joint throttling vs. global throttling: throughput degradation (%) vs. peak temperature reduction (3.6–21.6°C) for Benchmark I (GT), Benchmark I (DJT), Benchmark II (GT), and Benchmark II (DJT).

Figure 16. Thermal-correlation based placement: (a) Benchmark I(A), (b) Benchmark I(B), (c) Benchmark II(A), (d) Benchmark II(B), (e) Benchmark II(C).

8 Conclusions

Power and cooling costs are critical constraints in high-performance on-chip systems. With networks replacing on-chip buses and becoming the pervasive on-chip interconnection fabric in SoCs and CMPs, we need to seriously address network thermal issues to guide on-chip network design and improve its thermal efficiency.

In this work, we built an architecture-level thermal model and constructed an architecture-level platform for jointly evaluating the network performance, power, and thermal profile. We then characterized the computation and communication thermal impact in the MIT Raw CMP and revealed the importance of jointly considering both networks and processors for efficient thermal design of on-chip systems. To overcome the deficiencies of designing for the worst-case thermal signature, we proposed a distributed on-line thermal management scheme, called ThermalHerd, which can dynamically regulate the network temperature profile and guarantee safe on-line operation. Finally, using Raw as a testbed, we explored extensions of our work to address thermal issues for entire on-chip systems.

This work is the first study addressing thermal issues in on-chip networks. We hope this work will lead to thermally-efficient network design, enabling network designers to build temperature-aware, high-performance network microarchitectures. In this paper, we also illustrated how the distributed, collaborative nature of on-chip networks leads to distinct similarities with distributed on-chip systems and showed how the thermal management of on-chip networks can be extended to entire on-chip systems. We see this work forming the foundation for studies of complete networked on-chip processing systems in the future.

    Acknowledgments

The authors would like to thank Kevin Skadron and Wei Huang of the University of Virginia for their help in our understanding of HotSpot. We wish to thank the MIT Raw group, especially Michael B. Taylor, for the help with the Raw simulation platform as well as for providing us with critical packaging and power information of their chip. We also wish to thank Doug Burger of the University of Texas at Austin for supplying us with TRIPS network traffic traces. In addition, we wish to thank Howard Chen of IBM for helping us validate our thermal model. This work was supported in part by Princeton University's Wallace Memorial Honorific, the National Science Foundation under Grant Nos. CCR-0237540 (CAREER) and CCR-0324891, as well as the MARCO Gigascale Systems Research Center.

    References

[1] Mobile Intel Pentium 4 processor – M datasheet. http://www.intel.com.

[2] L. Benini and G. De Micheli. Networks on chips: A new SoC paradigm. IEEE Computer, 35(1):70–78, Jan. 2002.

[3] D. Brooks and M. Martonosi. Dynamic thermal management for high-performance microprocessors. In Proc. Int. Symp. High Performance Computer Architecture, pages 171–182, Jan. 2001.

[4] H. Chen. IBM, personal communication.

[5] T.-Y. Chiang, K. Banerjee, and K. C. Saraswat. Analytical thermal model for multilevel VLSI interconnects incorporating via effect. IEEE Electron Device Letters, 23(1):31–33, Jan. 2002.

[6] J. Clabes et al. Design and implementation of the POWER5 microprocessor. In Proc. Int. Solid-State Circuits Conf., pages 56–57, Jan. 2004.

[7] W. J. Dally and B. Towles. Route packets, not wires: On-chip interconnection networks. In Proc. Design Automation Conf., pages 684–689, June 2001.

[8] FEMLAB User Manual. http://www.femlab.com.

[9] J. S. Kim. Power consumption of Raw networks. Technical report, CSAIL/LCS, MIT, 2003.

[10] J. S. Kim, M. B. Taylor, J. Miller, and D. Wentzlaff. Energy characterization of a tiled architecture processor with on-chip networks. In Proc. Int. Symp. Low Power Electronics and Design, pages 424–427, Aug. 2003.

[11] E. J. Kim et al. Energy optimization techniques in cluster interconnects. In Proc. Int. Symp. Low Power Electronics and Design, pages 459–464, Aug. 2003.

[12] D. Meeks. Fundamentals of heat transfer in a multilayer system. Microwave Journal, 1(1):165–172, Jan. 1992.

[13] N. B. Nguyen. Properly implementing thermal spreading will cut cost while improving device reliability. In Proc. Int. Symp. Microelectronics, pages 385–386, Oct. 1996.

[14] R. H. Otten and R. K. Brayton. Planning for performance. In Proc. Design Automation Conf., pages 122–127, June 1998.

[15] C. Patel, S. Chai, S. Yalamanchili, and D. Schimmel. Power-constrained design of multiprocessor interconnection networks. In Proc. Int. Conf. Computer Design, pages 408–416, Oct. 1997.

[16] K. Sankaralingam et al. Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture. In Proc. Int. Symp. Computer Architecture, pages 422–433, June 2003.

[17] J. E. Sergent and A. Krum. Thermal Management Handbook for Electronic Assemblies. McGraw-Hill Press, 1998.

[18] M. Sgroi et al. Addressing the system-on-a-chip interconnect woes through communication-based design. In Proc. Design Automation Conf., pages 667–772, June 2001.

[19] L. Shang, L.-S. Peh, and N. K. Jha. Dynamic voltage scaling with links for power optimization of interconnection networks. In Proc. Int. Symp. High Performance Computer Architecture, pages 79–90, Feb. 2003.

[20] L. Shang, L.-S. Peh, and N. K. Jha. PowerHerd: Dynamically satisfying peak power constraints in interconnection networks. In Proc. Int. Conf. Supercomputing, pages 98–108, June 2003.

[21] K. A. Shaw and W. J. Dally. Migration in single chip multiprocessors. Computer Architecture Letters, 1, 2002.

[22] K. Skadron et al. Temperature-aware microarchitecture. In Proc. Int. Symp. Computer Architecture, pages 1–12, June 2003.

[23] S. Song, S. Lee, and V. Au. Closed-form equation for thermal constriction/spreading resistances with variable resistance boundary condition. In Proc. Int. Electronics Packaging Conf., pages 111–121, May 1994.

[24] J. Srinivasan and S. V. Adve. Predictive dynamic thermal management for multimedia applications. In Proc. Int. Conf. Supercomputing, pages 109–120, June 2003.

[25] M. B. Taylor. MIT, personal communication.

[26] M. B. Taylor et al. Evaluation of the Raw microprocessor: An exposed-wire-delay architecture for ILP and streams. In Proc. Int. Symp. Computer Architecture, June 2004.

[27] H.-S. Wang, L.-S. Peh, and S. Malik. Power-driven design of router microarchitectures in on-chip networks. In Proc. Int. Symp. Microarchitecture, pages 105–116, Nov. 2003.

[28] H.-S. Wang, X.-P. Zhu, L.-S. Peh, and S. Malik. Orion: A power-performance simulator for interconnection networks. In Proc. Int. Symp. Microarchitecture, pages 294–305, Nov. 2002.

[29] L.-T. Yeh and R. C. Chu. Thermal Management of Microelectronic Equipment: Heat Transfer Theory, Analysis Methods, and Design Practices. ASME Press, New York, NY, 2002.