lucid infrared thermography of thermally-constrained processors · 2019-12-03 · infrared camera...

6
Lucid Infrared Thermography of Thermally-Constrained Processors Hussam Amrouch and J¨ org Henkel Karlsruhe Institute of Technology (KIT), Chair for Embedded Systems (CES), Karlsruhe, Germany {amrouch, henkel}@kit.edu Abstract—Thermal analysis is a prerequisite for developing reliability increasing techniques for thermally-constrained pro- cessors, i.e. processors with a high power density. For that purpose, infrared (IR) camera measurement setups have been deployed with the purpose to provide direct feedback of the impact that thermal mitigation techniques have. To obtain lucid IR images 1 , the IR-opaque cooling must be removed and hence, an alternative IR-transparent cooling needs to be provided to protect the chip. To this end, the majority of state-of-the-art employs an IR coolant liquid to prevent the chip from overheating. The problem is that several aspects like thermal convection may interfere with the measured IR radiations resulting in equivocal IR images. Thus, they decrease the accuracy in a way that leads to incorrectly estimating reliability. Solving this prominent problem, we introduce an IR-transparent cooling that cools the chip from its rear side allowing the camera to perspicuously capture the IR emissions as no additional layer in between impedes the radiation. It maintains the on-chip temperatures within a safe range equivalent to the original heat sink-based cooling. We demonstrate how state-of-the-art inaccurate thermal analysis results in incorrectly estimating reliability. Our setup is the most accurate, least intrusive one that has been both proposed and actually applied to state-of-the-art multi-cores (Intel 45 nm dual-core and 22 nm octa-core). I. I NTRODUCTION With technology scaling, power densities have steadily increased making current and upcoming chips thermally con- strained due to the deleterious impact of elevated temperatures on reliability. For instance, spatial thermal gradients across the chip can seriously jeopardize its reliability due to elec- tromigration and logic failures [6]. Importantly, high on-chip temperatures contribute to faster device aging [1]. This trend demands an accurate thermal analysis as it is a prerequisite for evaluating the efficiency of any applied reliability-increasing technique. Thermal Analysis: To investigate the thermal behavior of a chip, thermal simulations have traditionally been employed. Unfortunately, the estimated thermal maps may suffer from inaccuracy due to the lack of accurate information about the runtime power consumption [2]. Instead, direct thermal measurements can be obtained using the on-chip thermal diode sensors but the total number of sensors is restricted for power and cost reasons. This limitation may lead to a failure in capturing the peak temperature – particularly when the runtime hotspot is located too far from the sensor’s location, which is fixed at design time, as Fig 1 demonstrates. In [5], authors proposed to attach thermal sensors onto the center of proces- sor packaging to measure its temperature. However, similar limitations still exist. Modern chips (e.g., the Intel SCC [9]) distribute many Ring Oscillators (ROs) across the die, that 1 Images captured by a thermal camera under minimal interference with the emitted IR radiation from the measured object. may be utilized as thermal sensors after careful calibrations. However, their susceptibility to voltage noise causes instability as it shown in Fig 2, where jumps in the RO frequency correlate to voltage levels. To tackle the shortcomings of the aforementioned tech- niques, IR thermography setups that employ an IR camera have been deployed to capture high resolution IR images of the die of FPGA-based systems (e.g.,[2]) and ASIC processors (e.g., [3][4]) providing designers with an accurate baseline to compare against. Additionally, it enables them to accurately analyze the potential spatial/temporal thermal gradients that may occur during operation. Challenge behind IR Thermography: It is noteworthy that building such an IR thermography-based measurement setup requires removing the cooling and packaging of the chip to allow the IR emissions to reach the camera’s lens. Unlike FPGA-based processors that can still properly operate after exposing its bare silicon [2], due to their relatively low operating frequency, measuring the temperature of ASIC- based processors presents a more challenging problem as it necessitates building an alternative IR-transparent cooling to allow the IR radiation emitted from the chip to reach the 116 °C 110 °C 105 °C 100 °C 95°C 90°C 85°C 80°C A thermal variation of ~30°C between the generated hotspot and temperature where the thermal diode sensor is located within the studied Virtex-5 FPGA Fig. 1: The thermal image shows that a single fixed thermal diode sensor may not capture a thermal hotspot 1.042e+09 1.044e+9 1.046e+9 1.048e+9 1.05e+9 1.052e+9 1.054e+9 0 50 100 150 200 RO Frequency (Hz/2) RO Frequency 0.992 0.996 1 0 50 100 150 200 Voltage (V) Time (s) Voltage Fig. 2: Voltage dependent RO-based sensor frequency with correspond voltage changes. The measurements show a sensitivity to voltage fluctuations of 1.6 MHz/mV

Upload: others

Post on 27-Jul-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Lucid Infrared Thermography of Thermally-Constrained Processors · 2019-12-03 · Infrared camera Fig. 5: Our proposed RAMA setup for lucid IR thermography of processors Our main

Lucid Infrared Thermography ofThermally-Constrained Processors

Hussam Amrouch and Jorg HenkelKarlsruhe Institute of Technology (KIT), Chair for Embedded Systems (CES), Karlsruhe, Germany

{amrouch, henkel}@kit.edu

Abstract—Thermal analysis is a prerequisite for developingreliability increasing techniques for thermally-constrained pro-cessors, i.e. processors with a high power density. For thatpurpose, infrared (IR) camera measurement setups have beendeployed with the purpose to provide direct feedback of theimpact that thermal mitigation techniques have. To obtain lucidIR images1, the IR-opaque cooling must be removed and hence, analternative IR-transparent cooling needs to be provided to protectthe chip. To this end, the majority of state-of-the-art employsan IR coolant liquid to prevent the chip from overheating. Theproblem is that several aspects like thermal convection mayinterfere with the measured IR radiations resulting in equivocalIR images. Thus, they decrease the accuracy in a way thatleads to incorrectly estimating reliability. Solving this prominentproblem, we introduce an IR-transparent cooling that cools thechip from its rear side allowing the camera to perspicuouslycapture the IR emissions as no additional layer in betweenimpedes the radiation. It maintains the on-chip temperatureswithin a safe range equivalent to the original heat sink-basedcooling. We demonstrate how state-of-the-art inaccurate thermalanalysis results in incorrectly estimating reliability. Our setup isthe most accurate, least intrusive one that has been both proposedand actually applied to state-of-the-art multi-cores (Intel 45 nmdual-core and 22 nm octa-core).

I. INTRODUCTION

With technology scaling, power densities have steadilyincreased making current and upcoming chips thermally con-strained due to the deleterious impact of elevated temperatureson reliability. For instance, spatial thermal gradients acrossthe chip can seriously jeopardize its reliability due to elec-tromigration and logic failures [6]. Importantly, high on-chiptemperatures contribute to faster device aging [1]. This trenddemands an accurate thermal analysis as it is a prerequisite forevaluating the efficiency of any applied reliability-increasingtechnique.

Thermal Analysis: To investigate the thermal behavior of achip, thermal simulations have traditionally been employed.Unfortunately, the estimated thermal maps may suffer frominaccuracy due to the lack of accurate information aboutthe runtime power consumption [2]. Instead, direct thermalmeasurements can be obtained using the on-chip thermal diodesensors but the total number of sensors is restricted for powerand cost reasons. This limitation may lead to a failure incapturing the peak temperature – particularly when the runtimehotspot is located too far from the sensor’s location, which isfixed at design time, as Fig 1 demonstrates. In [5], authorsproposed to attach thermal sensors onto the center of proces-sor packaging to measure its temperature. However, similarlimitations still exist. Modern chips (e.g., the Intel SCC [9])distribute many Ring Oscillators (ROs) across the die, that

1Images captured by a thermal camera under minimal interference with theemitted IR radiation from the measured object.

may be utilized as thermal sensors after careful calibrations.However, their susceptibility to voltage noise causes instabilityas it shown in Fig 2, where jumps in the RO frequencycorrelate to voltage levels.

To tackle the shortcomings of the aforementioned tech-niques, IR thermography setups that employ an IR camerahave been deployed to capture high resolution IR images ofthe die of FPGA-based systems (e.g.,[2]) and ASIC processors(e.g., [3][4]) providing designers with an accurate baseline tocompare against. Additionally, it enables them to accuratelyanalyze the potential spatial/temporal thermal gradients thatmay occur during operation.

Challenge behind IR Thermography: It is noteworthy thatbuilding such an IR thermography-based measurement setuprequires removing the cooling and packaging of the chip toallow the IR emissions to reach the camera’s lens. UnlikeFPGA-based processors that can still properly operate afterexposing its bare silicon [2], due to their relatively lowoperating frequency, measuring the temperature of ASIC-based processors presents a more challenging problem as itnecessitates building an alternative IR-transparent cooling toallow the IR radiation emitted from the chip to reach the

116 °C

110 °C

105 °C

100 °C

95°C

90°C

85°C

80°C

A therm

al variation o

f ~

30°C

betw

een t

he g

enera

ted h

ots

pot

and t

em

pera

ture

where

the

therm

al dio

de s

ensor

is locate

d

within

the s

tudie

d V

irte

x-5

FP

GA

Fig. 1: The thermal image shows that a single fixed thermal diodesensor may not capture a thermal hotspot

1.042e+09

1.044e+9

1.046e+9

1.048e+9

1.05e+9

1.052e+9

1.054e+9

0 50 100 150 200

RO

Freq

uenc

y(H

z/2)

RO Frequency

0.992

0.996

1

0 50 100 150 200

Volta

ge(V

)

Time (s)

Voltage

Fig. 2: Voltage dependent RO-based sensor frequency with correspondvoltage changes. The measurements show a sensitivity to voltagefluctuations of 1.6MHz/mV

Page 2: Lucid Infrared Thermography of Thermally-Constrained Processors · 2019-12-03 · Infrared camera Fig. 5: Our proposed RAMA setup for lucid IR thermography of processors Our main

IR Camera

Liquid

Reservoir

Inlet

Liquid

Outlet

Liquid

Pump Cooling

Radiator

Infrared Window (e.g. Sapphire) Infrared

Radiation Liquid Flow

Measured Chip

Two additional layers on

top of the measured chip

Bare Silicon

Thermal Camera

Fig. 3: Liquid-based thermal measurement setup applying an IRcoolant oil on top of the measured chip to prevent overheating

Without Oil Oil layer of 1mm Oil layer of 3mm

(a) Impact of oil thickness on the lucidity of the captured IR images.

The best lucidity of the captured thermal image when the IR camera

directly measures the object without a layer of oil on top of it

72°C

55°C

40°C

30°C

Oil layer of 3mm Oil layer of 3mm

(b) Impact of thermal convection over time on thermal measurements

Heating the oil up results in thermal convection that can interfere with the IR radiation

Oil layer of 3mm

60°C

46°C

40°C

31°C

Time [s] 15 10 5

127Ω resistor

connected to 4V

Fig. 4: Thermal measurement experiments demonstrating the impactof IR mineral oil applied to a hot object during measurements

thermal camera and concurrently counteract the very rapidincrease in temperature due to the excessive power densities.

An intuitive solution may be to reduce the ambient temper-ature, as the chip’s temperature is almost linearly proportionalto it, through putting the entire setup inside a thermal chamber.However, the chip, after removing its cooling, generates heatsignificantly faster than what the surrounding cold air candissipate as the latter has a very low thermal conductivity.We examined based on the thermal HotSpot simulator, theimpact of removing the cooling on the Alpha CPU temperatureand figured out that the peak temperature increases from90◦C (when the cooling is applied) to > 200◦C (when bothpackaging and heat sink are removed). Thus, the ambienttemperature needs to be tremendously reduced to prevent thechip from overheating which may, anyway, be unsustainablefor the board/chip.

To tackle this challenge, state-of-the-art techniques [3][4]apply a layer of IR-transparent liquid on top of the chip.Through a continuous stream of coolant liquid, the setupprevents the chip from damage. Besides the complexity thatcomes with building such a setup, having additional layers ontop of the chip is disadvantageous because of interference withthe IR radiation.

II. STATE-OF-THE-ART IN INFRARED THERMOGRAPHY

Building an IR-transparent heat sink using a coolant liquidflowing directly on top of the bare silicon of the chip necessi-tates choosing a liquid that must be IR transparent to enable theIR emissions to obtrusively reach the camera. An example ofthe used liquid for this purpose is a mineral oil [3][4] specifi-cally designed for IR spectroscopy. By continuously circulatingthe coolant oil, the generated heat from the processor chipcan be dissipated and hence protect it from overheating. Anexternal pump is then needed for the purpose of controllingthe flow speed. To constraint the flow, a window on top of itis also required [4]. Such a window must be an IR windowto allow energy at a specific electromagnetic wavelength topass. In Fig 3, we illustrate the liquid-based thermal setupshowing how additional layers may interfere with the emittedIR radiation.

A. Obstacles behind Capturing Lucid Images

The key drawback of the aforementioned setup is thatseveral aspects interfere with the measured IR radiation whichmay result in equivocal (i.e. non lucid) IR images.

Compatibility: The deployed materials to build the setupparts have diverse IR spectroscopy properties that complicatescompatibility. As a matter of fact, each IR camera operatesat a specific wavelength that covers a corresponding spectrumof electromagnetic radiation. For instance, short wavelengthIR cameras cover a wavelength of 1-3µm and long wave IRcameras cover a wavelength of 8-14µm [10]. The transmissionrange indicates to energy in the region of the electromagneticradiation spectrum. When new layers (e.g., coolant liquid andIR window) are added on top of the chip, their transmissionranges need to be precisely chosen to maintain compatibility. Asapphire IR window, for instance, has a transmission range of0.17 to 5.5µm [10] which makes it compatible only with shortwavelength cameras whereas long wavelength cameras cannotcollect IR data through it because a sapphire window does nottransmit beyond 5.5µm [10]. Similarly, the transmission rangeof the selected oil needs to be compatible with the transmissionranges of the used IR window as well as camera.

Thermal Convection: The IR camera obtains the chip’s ther-mal image by capturing the electromagnetic waves that radiatefrom the chip and directly transport IR radiation through air.When a coolant liquid is applied to the surface of the hot chip,the thermal convection phenomenon starts to appear due to thetransfer of heat from one region to another by the movement ofliquid. This diminishes the IR images lucidity of the measuredobject as Fig 4 clarifies, where similar IR oil to is used incurrent setups is examined.

Complexity: The thickness of the oil layer plays an essentialrole in determining the lucidity of the captured IR images.This can be seen in Fig 4 (a) where thicker oil layer made theimage less lucid. Furthermore, the properties of the appliedoil are additional sources that increase the setup complexity,e.g., its viscosity and flow speed, which both influence thethermal convection. It is also necessary to keep the liquid freefrom pollutants over time that compromise its transparency.It is noteworthy that calibrating the state-of-the-art equivocalIR images to close the inaccuracy gap necessitates, anyway,having the corresponding lucid IR images (i.e. captured imageswithout any layer on top of the chip) as an accurate baselineto calibrate against.

In Summary: State-of-the-art in IR thermography is chal-lenging to build as it is subject to multiple aspects thatinterfere with the emitted IR radiation. That, in turn, negativelyinfluences the accuracy of the measurements.

Page 3: Lucid Infrared Thermography of Thermally-Constrained Processors · 2019-12-03 · Infrared camera Fig. 5: Our proposed RAMA setup for lucid IR thermography of processors Our main

Measured real-time

thermal image

Measured real-time

thermal image

Temperature

sensors

Temperature

sensors

Voltage regulator to

control the cooling

Voltage regulator to

control the cooling

An in-house milled metal plate

for the thermal conductivity Rear-side-based cooling

Max ΔT 95 °K

V max 15.5 V

I max 5 A

Max. heating

power 17 W

Dimensions

(A x B x H)

40 x 40 x 7.5

mm

Thermoelectric Device

specifications

Intel octa-core Processor

Chip without packaging

under Measurement

PCB

Thermoelectric

(Peltier) device

Water heat sink

cooling the hot side

of the Peltier device

Water cooling unit 12V pump

Triple-radiator

3x “12cm” fans

Water cooling unit 12V pump

Triple-radiator

3x “12cm” fans

Infrared

camera

Infrared

camera

Fig. 5: Our proposed RAMA setup for lucid IR thermography of processors

Our main contributions within this paper are:(1) We introduce a setup that cools the measured chip from itsrear side to keep its operating temperature within a safe rangeequivalent to the original cooling.(2) Our proposed Rear-side Thermoelectric-based IR Thermog-raphy (RAMA) setup does not require the addition of any extralayer on top of the measured chip allowing the IR camera toperspicuously capture the IR emissions and thereby providinglucid thermal images. We also demonstrate that if the thermalsetup would not be as accurate as the one we built, reliabilitywould be incorrectly estimated.

III. OUR RAMA SETUP FOR LUCID IR IMAGES

To keep the measured chip in operational conditions afterremoving its cooling, we continuously cool it from the rearside, i.e. through the PCB to which the chip is attached asshown in Fig 6. Thermoelectric device is employed as itprovides a controlled source of cooling.

The basic concept behind it is the Peltier effect, which isa phenomenon that occurs when an electrical current flowsthrough two dissimilar semiconductor materials (i.e. N-typeand P-type) [11] resulting in a thermal variance. It heats upone of the thermoelectric sides of the device and cools the otherdue to conduction of heat from one side where it is absorbedto the other side where the heat is released. The generatedheat must be rapidly dissipated from the thermoelectric devicehot side to avoid overheating and to allow for a continuedproper operation. To this end, we use a water-cooling unit.The temperature difference between both thermoelectric devicesides can be controlled by regulating the applied voltage.

IR Camera

Opened chip without

packaging (i.e. bare silicon)

PCB

IR radiation

P

N

P

N

Peltier‘s cold side

Peltier’s hot side Heat Dissipater

Released heat

Thermoelectric Device

Top-Side

Rear-Side

No layer on top of the

measured chip

The camera can directly

and perspicuously capture

the IR radiation IR

Ca

me

ra

Fig. 6: The employment of a thermoelectric in our RAMA setup

Our built thermal measurement setup is depicted in Fig 5.We investigate chips with flip-chip technology, as they allowmechanical removal of the packaging as opposed to e.g., awire-bond chips. This means that the silicon wafer of theprocessor will be exposed after removing the chip packaging.On-chip temperatures are then directly measured using aDIAS pyroview 380L compact IR camera capable of capturingtemperatures with an accuracy of ±1◦C and a spatial resolutionof 50µm per pixel [12]. Once the readings have been obtained,the camera sends them at a frame rate of 50 Hz to a host PCthat analyzes them in order to build the thermal images of themeasured chip.It is noteworthy that our setup is not restricted to a specifickind of IR cameras, unlike state-of-the-art that necessitatesemploying an IR camera operating at a specific wavelengthwhich must be compatible with both IR oil and IR window.

Emissivity: A crucial property to consider when performingIR measurements is the emissivity of the material. It indicatesthe object’s ability to emit IR energy and it ranges form0.0 for a perfect reflector to 1.0 for a perfect emitter. Fig 7

Page 4: Lucid Infrared Thermography of Thermally-Constrained Processors · 2019-12-03 · Infrared camera Fig. 5: Our proposed RAMA setup for lucid IR thermography of processors Our main

56°C

52°C

50°C

46°C

42°C

40°C

core2 core1 core2 core1 core2 core1 core2 core1

Ca

ptu

red

IR

im

age

s o

f th

e

ch

ip u

nd

er

an in

ten

se

work

load f

rom

the

beg

innin

g o

f e

xe

cu

tio

n u

ntil

reach

ing t

he

ste

ad

y-s

tate

tem

pera

ture

(O

ur

Se

tup

) core2 core1

9.1mm

9.1

mm

333639424548515457

1

51

10

1

15

1

20

1

25

1

30

1

35

1

40

1

45

1

50

1

55

1

60

1

65

1

70

1

Te

mp

era

ture

[°C

]

Time [s]

HeatSink Our Setup

333639424548515457

1

51

10

1

15

1

20

1

25

1

30

1

35

1

40

1

45

1

50

1

55

1

60

1

65

1

70

1

Te

mp

era

ture

[°C

]

Time [s]

HeatSink Our Setup

333639424548515457

1

51

10

1

15

1

20

1

25

1

30

1

35

1

40

1

45

1

50

1

55

1

60

1

65

1

70

1

Te

mp

era

ture

[°C

]

Time [s]

HeatSink Our Setup

Steady-state temperature

Time [s] 1 8 16 24 32

(a) Comparison of diode sensor

readings of Core2

(b) Comparison of diode sensor

readings of Core1

(c) Comparison of temperature

average of both cores

Steady-state temperature

Steady-state temperature

0

50

10

0

15

0

20

0

25

0

30

0

35

0

400

45

0

50

0

55

0

60

0

65

0

70

0

0

50

10

0

15

0

20

0

25

0

30

0

35

0

40

0

450

50

0

55

0

60

0

65

0

70

0 0

50

10

0

15

0

20

0

25

0

30

0

35

0

40

0

450

50

0

55

0

60

0

65

0

70

0

Fig. 8: Measured IR images of runtime behavior of Intel dual-core processor under an intense workload stress as well as the correspondingtemperatures of individual cores compared to fabricated/original heat sink-based cooling unit

Ma

skin

g ta

pe

(emissivity 0

.92

)

co

ve

rin

g h

alf o

f th

e

ch

ip s

ho

ws a

ctu

al

tem

pe

ratu

re

Pa

int of

logo h

as

hig

h emissivity (

~0

.9)

ap

pe

ars

ho

tte

r

Metal packaging measurements very inaccurate (emissivity

~0.01). Most heat measured is reflection from surroundings

33°C

35°C

37°C

39°C

41°C

43°C

44°C

Fig. 7: Emissivity test of a measured chip

shows our emissivity test of an FPGA chip with its packaging.There, the left half of the chip appears of relatively lowtemperature due to the low emissivity of the metal. The actualtemperature, however, is much closer to the right coated half ofthe chip. To compensate for this shortcoming, a masking tape(with an emissivity of 0.92 [8]) is applied to the material’ssurface increasing its emissivity in the IR spectrum of thethermal camera. The masking tape was chosen due to itsrelatively high emissivity and it can easily be removed. Themeasurements obtained with the masking tape are then usedfor determining the emissivity of the exposed silicon waferthrough a comparison between the camera image and thechip’s thermal diode readings. Since the emissivity is known,the camera software can internally compensate the missingtemperature for a more accurate reading.

IV. EVALUATION, COMPARISON AND ADVANTAGES

To evaluate our thermal setup, we use an Intel 45 nmdual-core processor along with the CPUBurn benchmark,which is designed to load x86 CPUs as heavily as possiblefor the purposes of maximizing heat production from theCPU and putting an intensive stress on the chip itself aswell as the cooling [13]. Towards evaluating our setup underthe highest possible stress, we operate the two cores at themaximum frequency (1.8 GHz) with hyper-threading enabled.As a result, four CPUBurn programs are simultaneously runwhich maximizes the workload on the entire chip. The steady-state temperature of the chip, under such an intense scenario,reaches around 55◦C which is far from the critical temperature

(80◦C as Intel defines). This shows that our setup protects thechip from damage during IR measurements that are requiredfor performing the desired thermal analysis. Some of thecaptured IR images are shown in Fig 8 which demonstratesthe real-time chip thermal images at different points of timefrom the start of program execution.

Steady-State Temperature Equivalent: Since different cool-ing may have different thermal impact on the chip, it isnecessary to examine in how far our proposed cooling isrepresentative to the original setup, i.e. the IR-opaque fab-ricated cooling unit. Therefore, we directly compare. Fora fair comparison, both processors start at the same initialtemperature (i.e. 30◦C). Then, we concurrently run the sameintense workload (i.e. four simultaneous CPUBurn programs)on both for the same interval of time – around 5 minutes whichis sufficient to reach the steady-state temperature. Duringoperation, the thermal diode sensor readings of each core arerecorded to compare the thermal impact of our setup with theoriginal one. As presented in Fig 8(a, b, c), the steady-statetemperature of each core with our cooling in action is verysimilar to the steady-state temperature in case of the originalcooling2. Under identical conditions with respect to initial tem-perature, steady-state temperature and workload, our built IR-transparent cooling setup for thermal measurements achievesa very good representation of the original cooling in terms ofsteady-state temperatures.It noteworthy that analyzing the long-term effects of temper-ature due to aging on the reliability mainly depends on thesteady-state temperature and not on the transient.

Unlike FPGA-based processors, where the thermal hotspotis located at the memory interface [2], we observe in ASIC-based processors that the thermal hotspot is caused by theprocessor cores themselves, as shown in Fig 8. This ismainly due to the significantly higher power densities ofcores. Therefore, thermal managements need to focus on thecores’ temperatures. For instance, task migration-based thermal

2An online adaptation of the applied voltage to the thermoelectric couldadjust the generated cooling in order to result also in a similar behavior interms of the transient temperatures. However, this was not the focus of thispaper and it is a part of our future work.

Page 5: Lucid Infrared Thermography of Thermally-Constrained Processors · 2019-12-03 · Infrared camera Fig. 5: Our proposed RAMA setup for lucid IR thermography of processors Our main

𝐺 𝑐 = 𝑁 𝑤 𝑒𝑖 ∗ 𝐼 𝑐 − 𝐼 𝑒𝑖2

𝑖∈ 𝐸𝑛𝑣𝑖𝑟𝑜𝑛𝑚𝑒𝑛𝑡

𝐺 𝑐 ∶ 𝐺𝑟𝑎𝑑𝑖𝑒𝑛𝑡 𝑜𝑓 𝑃𝑖𝑥𝑒𝑙 𝑐 𝐼 𝑐 ∶ 𝐼𝑡𝑒𝑛𝑠𝑖𝑡𝑦 𝑜𝑓 𝑃𝑖𝑥𝑒𝑙 𝑐 𝑊 𝑒𝑖 : 𝑊𝑒𝑖𝑔ℎ𝑡𝑖𝑛𝑔 𝐹𝑎𝑐𝑡𝑜𝑟 𝑖𝑛𝑣𝑒𝑟𝑠𝑒 𝐸𝑢𝑐𝑙𝑖𝑑𝑒𝑎𝑛 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒 𝑜𝑓 𝑐, 𝑒𝑖

𝑁:𝑁𝑜𝑟𝑚𝑎𝑙𝑖𝑧𝑒𝑑 𝐹𝑎𝑐𝑡𝑜𝑟 = 1

𝑤 𝑒𝑖𝑖

50°C 48°C

46°C

44°C

42°C

40°C

38°C

36°C

34°C

co

re0

co

re1

co

re2

co

re3

co

re4

co

re5

co

re6

co

re7

IR Thermography 10°C/mm

8°C/mm

6°C/mm

4°C/mm

2°C/mm

0°C/mm

Spatial Thermal Gradient Map

Fig. 9: Spatial thermal gradient calculation along with an example of a measured IR image with its calculated thermal gradient map

IR T

he

rmo

gra

phy

2000

4000

8000

0 0 5 10 15 20

[°C/mm]

Sp

atia

l Th

erm

al

Gra

die

nts

His

togra

m

Cou

nt [#

]

43°C

40°C

38°C

35°C

33°C

Th

erm

al C

on

tou

r

2000

4000

6000

8000

0 0 5 10 15 20

[°C/mm]

Cou

nt [#

]

Max

co

re0

co

re1

co

re2

co

re3

co

re4

co

re5

co

re6

co

re7

co

re0

co

re1

co

re4

co

re5

co

re2

co

re3

co

re6

co

re7

43°C

40°C

38°C

35°C

33°C

Oil Impact

State-of-the-art oil-based setup (thin IR oil layer of ~1mm)

45°C

40°C

26°C

36°C

30°C

Our Setup (lucid IR image)

Max

6000

45°C

40°C

26°C

36°C

30°C

Fig. 10: Adding a layer of oil on top of the measured chip losses thecaptured IR images its lucidity, due to the thermal convection, whichjeopardizes the accuracy of spatial thermal gradient analysis

management techniques (e.g., [7]) may be applied. Thus, weinvestigate, in the following, the thermal characteristics whentasks are migrated between cores.

For a more general evaluation, we employ a state-of-the-art Intel 22 nm octa-core processor running at a maximumfrequency of 2.4 GHz. The scenario of running the intenseworkload (i.e. eight CPUBurn programs simultaneously run-ning) has been first examined and it showed that our setupis also able to protect chip from damage during measurementunder such an intense stress (the peak temperature was around75◦C, similar to the steady-state temperature under the originalcooling). Fig 9 presents an example of the measured IR imagesof the chip and the corresponding calculated spatial thermalgradient maps during the scenario of migrating the runningtasks from two cores to others every 10 sec. The gradientmaps have been calculated according to the formula withinFig 9 that provides an integration in order to prohibit wronggradients caused by the pixel resolution of the thermal camera.This is necessary to avoid the noise in pixel values during the

computation of pixelwise difference. The operating conditionsof the employed thermoelectric device in these experiments(Iin = 0.6 A, Vin = 1.5 V) were carefully selected to makeour setup mimic the impact of the original cooling. As it canbe noticed from the specifications in Fig 5, the thermoelectricdevice is not maxing out in its input current and it still hasmore current (around 8.3x) to use. This enables providing awide range of cooling to select the most suitable one for thetargeted chip, e.g. providing a higher cooling to examine moresevere scenarios, that might be generated in other processorswith higher power densities.

Jeopardy of Equivocal IR Images: Fig 10 clarifies the loss inthe IR images lucidity when a thin IR oil layer is added to thechip. Due to the transfer of heat from one region to another bythe movement of liquid, thermal convection interferes with theIR radiation and results in equivocal IR images. Whereas, theIR image captured by our setup is lucid, as the camera perspic-uously captures the IR radiation. Importantly, an equivocal IRimage negatively impacts the spatial thermal gradient analysisas the thermal contour maps3 and histograms establish. Asshown, the oil causes the spatial thermal gradient analysisto appear different from the case of doing the measurementswithout oil. This leads to ambiguous vision for designersregarding the potential reliability degradations that may occurduring lifetime. In this particular scenario within Fig 10, wemeasure a 7◦C variation in terms of the maximum gradient.However, this will be higher when stressing more cores as itcan further heat up the oil.

Impact of Inaccurate Thermal Analysis on Reliability:Spatial thermal gradients result in a temperature variationwithin a small distance. This causes identical circuits within acomponent (e.g., SRAMs of CPU cache) have distinct charac-teristics, which degrades the entire on-chip reliability [6]. Oneof the key reasons behind deploying thermal setups is enablingdesigners to accurately estimate the reliability of on-chipsystems. Therefore, we investigate for the first time under highaccurate gradients maps extracted from lucid thermal imageshow such a variation of 7◦C in estimating the maximum spatialthermal gradient – due to the applied oil – noticeably impactsthe reliability analysis. To this end, the reliability of SRAMs, asa representative example, has been explored due to their totalchip area that may reach more than 70% [14]. Static NoiseMargin (SNM ) that represents the resiliency against noise,Critical Charge (Qcrit) that represents the susceptibility againstradiation, and Read Access Time (RAT ) that represents theability to provide data in time are considered the key reliabilityaspects in SRAMs [1]. We conducted around 1000 simulations

3Thermal contour is a means to represent the thermal gradients where theabsolute temperature difference between two lines is always the same.

Page 6: Lucid Infrared Thermography of Thermally-Constrained Processors · 2019-12-03 · Infrared camera Fig. 5: Our proposed RAMA setup for lucid IR thermography of processors Our main

T = 52°C T = 45°C

Reliability

decrease

0.22 0.24 0.26 0.28 0.30

Static Noise Margin (SNM) [V]

Count

[#]

0

20

60

100

80

40

(a) Noise Susceptibility Analysis

10 15 20 25 30 35 40

Read Access Time (RAT) [ns]

Reliability

decrease

T = 52°C T = 45°C

Count

[#]

0

20

60

80

40

(c) Timing Violations Susceptibility Analysis 100 Critical Charge (Qcrit) [fC]

5.5 6 6.5 7 7.5 8

T = 52°C T = 45°C

Reliability

decrease

20

60

100

80

40

Count

[#]

0

(b) Radiation Susceptibility Analysis

Measurement Error

Critical Charge (Qcrit) [fC]

Fig. 11: Impact of a thermal measurement-error of 7◦C (as the case of current state-of-the-art thermal setups, when a thin layer of oil (1mm)is applied on top of the chip, on jeopardizing the estimation accuracy of the key reliability aspects of 22 nm SRAMs

∆VTH SNM Qcrit RAT0

4

8

12

16

20

6.25

9.0610.75

16.74

Ove

rest

imat

ion

due

toa

mea

sure

men

t-er

ror

of7◦C

[%]

Fig. 12: Impact of state-of-the-art inaccurate thermal analysis onincorrectly estimating aging for a lifetime of 10 years

based upon SPICE for the 22 nm Predictive Technology Model(PTM) for transistor sizing. As shown in Fig 11(a), the temper-ature rise shifts the SNM distribution to the left resulting inlower resiliency against noise. A similar behavior can also beobserved in Fig 11(b) which shows how a temperature increasemakes the SRAMs more susceptible to radiation. In Fig 11(c),the temperature rise shifts the RAT distribution to the rightmaking SRAMs slower. Therefore, thermal analysis basedon equivocal IR images results in incorrectly estimating theSRAMs reliability. When aging analysis comes into play, theaforementioned problem of incorrectly estimating reliability isexacerbated. In Fig 12, we illustrate the jeopardy of a 7◦Cmeasurement-error on overestimating the impact that agingeffects4 may have on the transistors reliability (i.e. ∆VTH ) andthe SRAMs reliability. As shown, inaccurate thermal analysisduring design-time may lead to overestimate the impact ofaging on, for instance, RAT by 16%.

All in all, having lucid thermal images allow designers to ac-curately explore the thermal characteristics of on-chip systemsand thus correctly estimating their reliability.

V. CONCLUSION

We introduced in this work an IR-transparent coolingsetup enabling thermal cameras to perspicuously capture thechip’s thermal images. In addition to the dual-core processorscenario, we explored the octa-core processor which exhibits amore severe scenario due to the higher on-chip power densities.Our RAMA setup overcomes the drawbacks of the state-of-the-art liquid-based cooling setup. In addition to the right thermalsetup, we show in this paper for the first time in how farcritical circuit characteristics for reliability are correlated withtemperature. If the thermal setup would not be as accurate as

4Both Bias Temperature Instability and Hot carrier injection aging phenom-ena have been analyzed based on the presented models in [1]

the one we present, reliability would be incorrectly estimated.As consequence, we believe that we are presenting the mostaccurate simulations of circuit reliability so far, in the scopeof temperature effects, with respect to noise susceptibility,radiation susceptibility and timing violations. In all aspects,capturing lucid thermal images has a significant impact onaccurately estimating reliability.

ACKNOWLEDGMENT

Authors would like to thank Volker Wenzel, Victor vanSanten and Matrin Buchty (CES, KIT) for their valuable help.Experiments in Figs (2, 7) have been conducted in cooperationwith Thomas Ebi (CES, KIT).

REFERENCES

[1] H. Amrouch, V. M. van Santen, T. Ebi, V. Wenzel, and J. Henkel,“Towards interdependencies of aging mechanisms,” in Proceedings ofthe IEEE/ACM International Conference on Computer-Aided Design,ICCAD, 2014, pp. 478–485.

[2] H. Amrouch, T. Ebi, J. Schneider, S. Parameswaran, and J. Henkel,“Analyzing the thermal hotspots in fpga-based embedded systems,” FieldProgrammable Logic and Applications, 23rd International Conferenceon, 2013, pp. 1–4.

[3] F. Mesa-Martinez, M. Brown, J. Nayfach-Battilana, and J. Renau,“Measuring power and temperature from real processors,” Parallel andDistributed Processing, IEEE International Symposium on, 2008, pp. 1–5.

[4] S. Reda, “Thermal and power characterization of real computing de-vices,” Emerging and Selected Topics in Circuits and Systems, IEEEJournal on, vol. 1, no. 2, pp. 76–87, 2011.

[5] Q. Xie, J. Kim, Y. Wang, D. Shin, N. Chang, and M. Pedram, “Dynamicthermal management in mobile devices considering the thermal couplingbetween battery and application processor,” in Proceedings of the Inter-national Conference on Computer-Aided Design, 2013, pp. 242–247.

[6] L. Zhijian, H. Wei, M.R. Stan, K. Skadron, J. Lach, “InterconnectLifetime Prediction for Reliability-Aware Systems,” in Very Large ScaleIntegration (VLSI) Systems, IEEE Transactions on, vol.15, no.2, pp.159-172, 2007.

[7] T. Ebi, H. Amrouch, and J. Henkel, “COOL: control-based optimizationof load-balancing for thermal behavior,” in Proceedings of the 10thInternational conference on Hardware/software codesign and systemsynthesis, 2012, pp. 255–264.

[8] B. Griffith, D. Tuerler, and H. Goudey, “Infrared Thermography,” Ency-clopedia of Imaging Science and Technology, 2002.

[9] “Intel SCC,” www.intel.com/info/scc.[10] “Infrared Windows,” http://iriss.com/.[11] F. J. DiSalvo, “Thermoelectric cooling and power generation,” Science,

pp. 703–706, 1999.[12] “DIAS Infrared Camera,” http://www.dias-infrared.com/.[13] http://manpages.ubuntu.com.[14] “ITRS,” http://www.itrs.net/.