ABSTRACT
A new class of compressible graphite thermal interface
material (TIM) was installed, tested, and qualified for use in
a new series of IBM high performance computing (HPC)
systems. Each system incorporated several high power
graphics processing unit (300W GPU) assemblies in which a
graphite TIM provided direct contact between bare die GPU
devices and either air cooled heatsinks or water cooled cold
plates. GPU hardware was removed from the system and
exposed to a battery of thermal and mechanical stress tests,
and reinstalled for in-system power age and power cycle tests,
to quantify TIM reliability within a 3-year service life. After
the GPUs were subjected to thermal-mechanical tests, which
spanned accelerated thermal cycling (ATC), deep thermal
cycling (DTC), thermal ship shock, temperature/humidity
exposures, and system shock/vibration tests, the components
were periodically reinstalled into systems to monitor power
stability and to assess thermal reliability. For in-system tests,
continuous power and thermal monitors were incorporated for
all power cycle/power age regimens. Control groups of GPUs
mounted with conventional grease-based TIMs were exposed
to the same battery of thermal-mechanical and in-system
server tests. All GPU hardware used for testing shared a
common mounting design for TIM and cooling hardware
attachments that provided a constant spring force clamping
mechanism over the GPU bare die device area while enabling
module flexure under load throughout stress test and system
operating temperature ranges. Thermal effectiveness was
measured by periodically monitoring the power draw of each
GPU module and an internal device (junction) temperature
over the course of the simulated life cycle. At the end of each
evaluation, all GPU assemblies were then disassembled and
assessed for TIM condition, which was then correlated with
the final thermal resistance and power measurements. Based
on comparison of both initial build performance and final test
results, an optimized mounting construction was developed
that incorporates the compressible graphite TIM.
KEY WORDS: Thermal Interface Material, Thermal
Interface Reliability, Contact Resistance, ASTM D5470,
Compressible Graphite TIM
Mark Hoffmeyer is with IBM Corporation, Systems Technology Group,
Rochester, MN 55901, [email protected], 507-253-6686.
Prashanth Subramanian is with Advanced Energy Technologies LLC.,
Lakewood, OH 44107, [email protected],
216-618-9975.
Rick Beyerle is with Advanced Energy Technologies LLC., Lakewood, OH
44107, [email protected], 216-529-3719.
INTRODUCTION
System Application & Hardware Description
Recently announced IBM High Performance Computing
(HPC) systems [1] incorporate NVIDIA GP100 Graphics
Processing Units (GPUs) powered by Pascal architecture [2]
to accelerate deep learning and computation intensive
applications. Unlike PCIe graphics cards, which mount
vertically and contain built-in fans, the NVLink graphics
platform presents a flat surface bare die for the integrator
(IBM) for high power liquid or forced air cooling. More
precisely, the TSMC-based CoWoS® [3,4] GPU architecture
employs a 2.5D multi-chip device area consisting of a large
bare die GPU chip coupled with four, quadruple stacked 3-
dimensional integrated circuit (3DIC) flash memory chips
that are collectively mounted atop a silicon interposer. The
CoWoS® construction is incorporated onto a 55mm organic
ball grid array (BGA) package that is subsequently surface
mount assembled to a GPU card assembly (Figure 1).
Proximity of the memory to the GPU chip creates a high
performance, high power (300W) package with the
communication speeds necessary for optimized
computational acceleration. Prior to system integration, a
cooling solution must be affixed to the bare die device surface
area and GPU card assembly. While IBM’s HPC systems
designs facilitate either air or water cooling, a serviceable low
thermal impedance TIM is critical in the attachment stack up
[5-7]. For air cooled applications, finned heatsinks
customized for both front and rear system installation having
planarized mounting pedestals coupled to high capacity
thermal transport heat pipes are used with a TIM1 solution.
These heatsinks prevent any of the GPUs from thermal
throttling. For water cooled systems, planarized mounting
pedestals on aluminum heat spreaders are used in conjunction
with interconnected cold plates using both TIM1 and TIM2
interfaces between devices, spreaders, and cold plates for an
efficient cooling path.
Phil Mann is with IBM Corporation, Systems Technology Group, Rochester,
MN 55901, [email protected], 507-253-4636.
Advanced Energy Technologies LLC is a subsidiary of GrafTech
International Holdings, Inc.
Novel Graphite-based TIM for High Performance Computing
Mark Hoffmeyer (IBM), Prashanth Subramanian (AET), Rick Beyerle (AET), Phil Mann (IBM)
IBM Systems & Technology Group, Rochester, MN 55901
[email protected], [email protected]
Advanced Energy Technologies LLC, Lakewood, OH 44107
[email protected], [email protected]
978-1-5090-2994-5/$31.00 ©2017 IEEE 243 16th IEEE ITHERM Conference
Fig. 1. Nvidia GP100 GPU Card Assembly with (A) 55 mm BGA package
with device area consisting of a (B) GPU chip and (C) four memory chip
stacks mounted on a silicon interposer.
GPU Application: Thermal Interface Considerations
The bare die GPU packaging assembly described above
presents several significant challenges to ensure creation of
an efficient and reliable bare die thermal interface material
(TIM1) cooling solution. Because the large device surface
areas are not flat, and have a convex surface curvature with
potential non-uniform heat dissipation characteristics, a
preferred TIM1 solution must provide adequate gap filling
capability to accommodate out of flat conditions.
Furthermore, to minimize potential for packaging interaction
reliability issues, the TIM1 solution, including its thermal
performance, compression, and gap filling characteristics
must also be compatible with mechanical load constraints
used to affix the external cooling hardware to the bare die
GPU card assembly.
A specific and limited load range must be carefully selected
and designed into components used to secure the TIM1
interface and affiliated hardware cooling solution. The
specific load range used must prevent shock and vibration
damage, and also minimize the potential for detrimental chip-
package-card assembly interaction issues, including effects
that can undermine stable and reliable TIM1 performance.
Excess loads used to affix an external cooling solution can
drive stresses sufficient to prompt device damage and/or
deformation of the packaging over time, including stresses
that result from materials creep in the overall assembled stack
of the organic GPU BGA package, the GPU board assembly,
and affiliated BGA interconnections. These mechanically
derived stresses can also superimpose with thermal-
mechanical derived stresses that arise from continuous, or
cyclical in service run time and result in complex strains on
the TIM1 that can also prompt thermal performance
deterioration at the device interface. Hardware shape changes
driven by thermal loads and coefficient of thermal expansion
(CTE) mismatch between device, package substrate, and the
GPU card assembly, can drive dynamic strains on the TIM1,
especially with repeated in service system power cycling. An
optimal TIM1 selection for this application must be resilient
to complex, superimposed, static and dynamic chip-package-
heatsink-board assembly interactions to ensure consistent and
reliable function. Given this application description and
various packaging restrictions, the advantages of using
eGRAF® HITHERM™ HT-C3200 thermal interface
material (HT-C3200) [8], as the TIM1, are described in
greater detail as compared to use of other potential TIM1
solutions.
APPLICATION REQUIREMENTS, RESTRICTIONS,
AND TIM SELECTION
A. Compression, Gap Filling Capability, and Thermal
Performance
Testing conducted within IBM using both Instron and
Thermomechanical Analysis (TMA) methods confirm that
HT-C3200 TIM sheets are highly compressible at the low
pressures and load ranges required for the GPU bare die
cooling application. Compression data collected via TMA at
25°C and 75°C (see Figure 2) indicates the material readily
fills a five mil (127 micrometer) gap* at loads ranging from
10-30 psi.
Fig. 2. TMA Compressibility Data for AET HT-C3200. (A) Shows bond line
thickness; (B) shows TIM thickness [9,10]. A compression range and TIM
gap-filling capability in excess of 6 mils (150 micrometers) is observed
independent of applied temperature. *The post-DMA and Instron test sample
thickness measurements indicate a peak compression of the material to
approximately 55-60 micrometers.
Unlike most gel, grease, and phase change TIM alternatives,
which rely on intimate contact between conductive filler
particles that is realized under high pressures and small,
uniform gaps, the HT-C3200 material conducts heat most
effectively under a relatively low range of gaps and pressures. Figure 3 illustrates
this performance behavior as collected from material samples
using two test methods. These methods include ASTM
D5470 [11], and use of a custom built IBM thermal tester to
provide in-situ measurements. The custom tool employs top
and bottom copper column sections with affixed stages that
sandwich the TIM under various fixed loads or fixed gaps.
The top section is heated to a controlled temperature, while
the bottom section is cooled with a thermoelectric cooler.
Thermocouples positioned at regular intervals in both column
sections measure the thermal profile across the entire column.
From these data at thermal equilibrium, a temperature drop
across the TIM material can be calculated and resultant
thermal performance properties of the TIM can be identified.
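The data reduction described above can be sketched as follows. This is a minimal illustration of the general two-column (ASTM D5470 style) approach, not the actual procedure used in the IBM tester: the function names, the thermocouple positions, and the column conductivity of 400 W/m·K are assumptions chosen for the example. Each column's thermocouple readings are fitted with a straight line, the lines are extrapolated to the faces that touch the TIM, and the heat flux follows from Fourier's law.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def tim_thermal_resistance(
    hot_pos_m: Sequence[float], hot_temp_c: Sequence[float],
    cold_pos_m: Sequence[float], cold_temp_c: Sequence[float],
    hot_face_m: float, cold_face_m: float,
    k_column_w_per_mk: float,
) -> Tuple[float, float]:
    """Return (temperature drop across the TIM in K,
    area-specific thermal resistance in K*cm^2/W).

    Positions are measured along the heat flow direction (hot to cold),
    so the fitted slopes are negative at steady state.
    """
    hot_slope, hot_icpt = fit_line(hot_pos_m, hot_temp_c)
    cold_slope, cold_icpt = fit_line(cold_pos_m, cold_temp_c)

    # Extrapolate each column profile to the surface touching the TIM.
    t_hot_face = hot_slope * hot_face_m + hot_icpt
    t_cold_face = cold_slope * cold_face_m + cold_icpt

    # Fourier's law, q = -k * dT/dx, averaged over the two columns (W/m^2).
    flux = -k_column_w_per_mk * 0.5 * (hot_slope + cold_slope)
    if flux <= 0:
        raise ValueError("heat must flow from the hot to the cold column")

    delta_t = t_hot_face - t_cold_face
    return delta_t, (delta_t / flux) * 1.0e4  # K*m^2/W -> K*cm^2/W


# Synthetic readings (hypothetical): 10 W/cm^2 through columns with k = 400 W/m.K
delta_t, r_tim = tim_thermal_resistance(
    hot_pos_m=[0.00, 0.01, 0.02], hot_temp_c=[80.0, 77.5, 75.0],
    cold_pos_m=[0.04, 0.05, 0.06], cold_temp_c=[68.0, 65.5, 63.0],
    hot_face_m=0.03, cold_face_m=0.03,
    k_column_w_per_mk=400.0,
)
```

In this sketch both faces are placed at the same coordinate, which neglects the TIM thickness relative to the column length; a real reduction would also account for thermocouple placement error and heat loss from the column sides.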
Fig. 3. HT-C3200 TIM thermal performance as a function of applied pressure
measured (A) in situ and (B) by ASTM D5470. The in situ thermal resistance
stabilizes at loads of 15 psi or higher.
As shown in Figure 3, stable, low thermal resistance occurs at
pressures as low as 15 psi (103 kPa), with a corresponding
thermal resistance of approximately 0.2 K·cm²/W when the
TIM is compressed by approximately 3 mils (75
micrometers), see Figure 2.
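To give a feel for what these figures mean at the device, the short sketch below converts the quoted units and estimates the temperature drop across the TIM from the area-specific thermal resistance. The 300 W power and the 0.2 K·cm²/W resistance come from the text; the 6 cm² device area is a placeholder assumption (the paper does not state the contact area), and uniform heat flux is assumed even though the text notes that dissipation may be non-uniform.

```python
PSI_TO_KPA = 6.894757   # 1 psi in kPa
MIL_TO_UM = 25.4        # 1 mil (0.001 inch) in micrometers


def psi_to_kpa(pressure_psi: float) -> float:
    """Convert a contact pressure from psi to kPa."""
    return pressure_psi * PSI_TO_KPA


def mil_to_um(length_mil: float) -> float:
    """Convert a gap or thickness from mils to micrometers."""
    return length_mil * MIL_TO_UM


def tim_temperature_rise(r_k_cm2_per_w: float, power_w: float,
                         area_cm2: float) -> float:
    """Temperature drop across the TIM (K) for uniform heat flux:
    delta_T = R'' * P / A."""
    return r_k_cm2_per_w * power_w / area_cm2


ASSUMED_DEVICE_AREA_CM2 = 6.0  # hypothetical value, not from the paper

pressure_kpa = psi_to_kpa(15.0)   # about 103 kPa, matching the text
gap_um = mil_to_um(5.0)           # 127 micrometers
rise_k = tim_temperature_rise(0.2, 300.0, ASSUMED_DEVICE_AREA_CM2)  # about 10 K
```

Because the temperature drop scales inversely with contact area, the result should be recomputed with the measured device area before drawing any conclusion about thermal margin.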
Overall, these material performance data, when coupled
with systems modeling and parts characterization work
conducted within IBM, show that the above attributes provide
sufficient gap filling and thermal performance
for the GPU application, especially when additional
controls are placed on the flatness of external cooling solution
surfaces such that the spreader and heatsink pedestals that
come in contact with the material are flat to within
approximately 25 micrometers.
B. Other Design and Process Considerations
In addition to having adequate gap filling capability and
good thermal performance characteristics, the HT-C3200
material is also well suited for GPU assembly processing.
These attributes include:
1. Ease of assembly with minimal surface preparation: The
material is readily placed onto intended surfaces without
need for additional processing equipment
2. Surface preparation simplicity: use of the HT-C3200
material does not require complex cleaning of thermal
interface surfaces (e.g. plasma). Simple surface cleaning
preparation using IPA in conjunction with a
particle/shed free cloth wipe to remove potential debris
or potential bulk contamination films ensures adequate
and consistent performance.
3. Chemical stability and chemical compatibility in the
presence of other materials: Because the HT-C3200
TIM is pure graphite, it is chemically inert and does not
degrade in the presence of other chemicals, or materials
that might come in contact with it within a GPU
assembly throughout in service operation temperatures.
This is an important feature for the GPU application
because additional TIMs must be used in close proximity
on spreader plates used in the water-cooled GPU
application. Because the material is generally inert in air
to temperatures in excess of 300°C, it will remain stable
when exposed to extreme temperature or humidity.
4. Cost: total installed cost is low when compared to
most TIM solutions, especially when coupled with ease
of use and GPU assembly process considerations.
C. Product Application Development and Proper Usage
of HT-C3200 as a Thermal Solution
Although the HT-C3200 has many thermal advantages for
use as a TIM solution, it must be noted that the material is
inelastic, and recovers only a fraction – about three percent –
of its original shape or thickness profile when a load is
removed. As such, the use of static, fixed displacement
controlled loads for creating TIM bond lines in assemblies is
not recommended, as reduction or inconsistent mechanical
load on the TIM over time may result due to card and
component level hardware relaxation, creep, or from short
term dynamic shape changes that can take place in service.
These collective changes can create TIM instability and
increased TIM thermal interface resistance, as they drive the
potential formation of low contact force or local air gaps
between the inelastic TIM, the device, and cooling solution
surfaces.
Inferior HT-C3200 TIM performance was clearly
observed on early GPU assemblies that had heatsinks attached
using two different types of cooling hardware fasteners, that
when used in combination, created a fixed displacement load
control condition. Initial builds of GPU hardware assemblies
used four corner spring screws to attach the heatsink, and to
provide a constant spring load on the TIM. The initial builds
also incorporated four, optional, non-influencing fasteners
(NIFs) [12,13] as additional attachment points. These NIF
attachment points, shown in Figure 4, lock the heatsink into a
fixed position with respect to the module device surface, and
provide an added measure of protection against physical
damage that may arise on hardware exposed to random but
significant shock and vibration events that can occur during
component shipping, system shipping, or from unusual
hardware usage scenarios.
Fig. 4. Exploded view of a GPU card-heatsink assembly including (A) GPU
card, (B) heatsink, (C) optional NIFs (4X), (D) spring loaded attachment
points (4X), and (E) the HT-C3200 TIM used between the device surface and
the heatsink pedestal surface.
With NIFs engaged, the attached cooling solution became
a fixed displacement load design when coupled with the 4-
corner spring screws originally used to provide a constant
spring load on the TIM. This fixed displacement load
condition created unacceptable thermal performance of the
HT-C3200 TIM when GPU assemblies were installed and run
in HPC systems at required powers and fan speeds. When
NIFs were removed from these assemblies to provide a
constant spring load on the inelastic TIM, a markedly
improved, and acceptable in-system thermal performance was
achieved. Figure 5 illustrates the change in HT-C3200
thermal performance as a function of GPU heatsink load.
Confirmation of shape changes that occur on hardware and
the affiliated root cause for hardware load dependent thermal
performance differences of HT-C3200 TIM was provided
using Moiré interferometry on a GPU assembly. Specifically,
Moiré analysis of GPU hardware taken through a heating and
cooling excursion used to simulate device power on / power
off cycling shows that dynamic GPU shape changes occur and
are directly linked to strains that develop in the hardware
packaging from aggregate CTE mismatch between the
organic GPU board, the GPU BGA package, and the large
2.5D silicon on silicon device area on the BGA module.
These relative CTE mismatch driven shape changes are
shown in Figure 6.
Fig. 5. HT-C3200 thermal performance vs. cooling solution hardware attach
design. Temperatures shown are average GPU device Tj measured on GPUs
installed at front and rear locations within the system.
Given generally known CTE differences between an
organic BGA package, an organic printed wiring board
(PWB) assembly, and a large 2.5D silicon on silicon device
area, dynamic Moiré analysis indicates the GPU device area
flattens by approximately 1.2 micrometers for every degree of
temperature increase, and shows this change is reversible
upon cooling. GPU device surfaces are convex at room
temperature, with the convex shape resulting from an
aggregate CTE mismatch derived elastic tensile and
compressive strain distribution present between the organic
carrier and the stacked silicon device. As temperature is
increased, elastic strains between board, BGA carrier and the
Si device are reduced, and result in the part flattening, while
decreases in temperature prompt additional strain on the
assembly and increased part convexity. Since the spring
actuation and NIF assembly of the GPU thermal solution
attachment all takes place and locks in the relative position of
the heatsink to the device area at room temperature, a binding
or buckling condition is created between the hardware as the
GPU assembly attempts to flatten in response to external
heating or power driven temperature rises and affiliated
changes to CTE mismatch induced strain distributions.
These overall effects drive a loss of sufficient contact with the
inelastic HT-C3200 TIM and result in thermal performance
degradation. However, once the NIFs are released, the
hardware binding condition is eliminated and proper thermal
performance can be restored using a pure, constant spring
load control in the design to ensure the TIM stays in contact
with the device and cooling hardware surfaces during in
service operation.
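The scale of the mismatch between hardware shape change and TIM recovery can be illustrated with a short calculation. The flattening rate (about 1.2 micrometers per degree) and the elastic recovery (about three percent) are taken from the text; the 200 micrometer TIM thickness and the 25°C to 80°C excursion are hypothetical inputs chosen only for illustration. The sign convention follows Fig. 6, where a negative flatness change means less convexity.

```python
FLATTENING_UM_PER_K = 1.2   # from the Moire analysis described in the text
RECOVERY_FRACTION = 0.03    # approximate elastic recovery quoted in the text


def flatness_change_um(t_start_c: float, t_end_c: float) -> float:
    """Change in device-area flatness for a temperature excursion.
    Negative values mean the convex device area has become flatter."""
    return -FLATTENING_UM_PER_K * (t_end_c - t_start_c)


def tim_springback_um(thickness_um: float) -> float:
    """Thickness the inelastic TIM can recover when load is relieved."""
    return RECOVERY_FRACTION * thickness_um


ASSUMED_TIM_THICKNESS_UM = 200.0  # hypothetical, not stated in the paper

shape_change = flatness_change_um(25.0, 80.0)            # about -66 micrometers
springback = tim_springback_um(ASSUMED_TIM_THICKNESS_UM)  # about 6 micrometers
needs_spring_load = abs(shape_change) > springback
```

Under these assumptions the shape change over one power-on excursion is roughly ten times larger than the TIM's own recovery, which is consistent with the observation that a locked, fixed-displacement attachment loses contact while a constant spring load can follow the moving surfaces.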
Note that dynamic CTE mismatch induced shape changes
can be largely responsible for TIM degradation issues when
phase change materials (PCMs), gels, or greases are used [14-
17] as they can drive TIM strain related phenomena such as
grease pumping, gel interface adhesion loss, and related PCM
“healing” instabilities as well.
Based on the above considerations and data analyses,
proper use of HT-C3200 material must be coupled with
constant spring force loads that are incorporated into a given
hardware attachment design. In addition, due to the
inelasticity of the material, reuse is not recommended in the
event that hardware disassembly, rework, or replacement
steps are required. Reused material will not provide gap
filling on hardware with different shape or topographic
profiles.
Fig. 6. Summary of Moiré interferometer analysis of a GPU assembly
illustrating relative CTE driven flatness changes vs. temperature. GPU device
areas are convex at room temperature. A negative change in flatness indicates
less convexity and improved flatness.
PRODUCT QUALIFICATION SUMMARY
Because the general performance characteristics of the
HT-C3200 TIM material, along with its additional processing
advantages, are shown to offer a possible GPU TIM solution,
it was selected as a candidate material for product
qualification test work using the four corner spring screw
heatsink attachment design in the absence of NIF attachments.
An alternate, conventional, high performance TIM grease
solution was also tested in parallel in the same application
product form factor. Details of the testing and test results are
shown below.
A. GPU TIM Product Qualification
The overall TIM qualification took place in two key stages
termed T1 and T2. The T1 stage consisted of early GPU
hardware assembled with a split of hardware samples built
using the HT-C3200 TIM, and samples built using a high
performance thermal grease TIM sandwiched between bare
die GPU devices and pedestals of air cooled heatsinks. All
were assembled using the four-corner spring loaded hardware
in the absence of NIFs, as previously discussed.
These hardware sets were also built with both front and rear
GPU heatsinks as shown in Figure 7, as installed in an IBM
Power™ System 822LC HPC server (Minsky).
A sequence of in-situ accelerated stress tests was run,
simultaneously using all four heatsinks, front and rear, for
functional verification. The test sequence began with System
Level Shock & Vibration (S&V), then proceeded to
Temperature & Humidity (T&H), Thermal Ship Shock (TSS),
Accelerated Thermal Cycling (ATC), and Deep Thermal
Cycling (DTC). Slightly accelerated Power Cycle/Power Age
Tests (PA/PC) were used for additional in-situ system GPU
stressing and continuous functional monitoring of the GPU
hardware.
Fig. 7. Photo of front (F) and rear (R) GPU-heatsink assemblies installed
into an IBM HPC server system.
In all cases, GPU powers and operating junction
temperature (Tj) measurements were collected and monitored
as a figure of merit, with a change in average Tj of 5°C or
more used as the pass/fail criterion, adjusting for slight ambient
temperature fluctuations present in the system operating
environment. A plan view of the IBM 2S2U Minsky HPC
system computer electronics complex is shown in Figure 8
and shows all four GPUs installed (without heatsinks) in front
and rear slot locations.
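A pass/fail check of this kind might be expressed as in the sketch below. The 5°C limit comes from the text; how the ambient adjustment was actually performed is not described, so subtracting the change in ambient temperature from the change in average Tj is an assumption, as are the function names and sample readings. The sketch treats a shift of 5°C or more as a fail, following the wording above (the Fig. 9 column header uses ΔT>5°C).

```python
from statistics import mean
from typing import Sequence

TJ_SHIFT_LIMIT_C = 5.0  # a shift of this size or more counts as a fail


def ambient_adjusted_shift(tj_before_c: Sequence[float], ambient_before_c: float,
                           tj_after_c: Sequence[float], ambient_after_c: float) -> float:
    """Change in average Tj rise above ambient between two checkpoints."""
    rise_before = mean(tj_before_c) - ambient_before_c
    rise_after = mean(tj_after_c) - ambient_after_c
    return rise_after - rise_before


def passes_tj_criterion(tj_before_c: Sequence[float], ambient_before_c: float,
                        tj_after_c: Sequence[float], ambient_after_c: float) -> bool:
    """True when the ambient-adjusted Tj shift stays below the limit."""
    shift = ambient_adjusted_shift(tj_before_c, ambient_before_c,
                                   tj_after_c, ambient_after_c)
    return shift < TJ_SHIFT_LIMIT_C


# Hypothetical readings: average Tj rises 3 C while ambient rises 1 C.
stable_part = passes_tj_criterion([70.0, 71.0, 72.0], 25.0,
                                  [73.0, 74.0, 75.0], 26.0)
# Hypothetical readings: average Tj rises 8 C at constant ambient.
degraded_part = passes_tj_criterion([70.0, 71.0, 72.0], 25.0,
                                    [78.0, 79.0, 80.0], 25.0)
```

Comparing rise above ambient rather than raw Tj keeps a warmer lab day from being counted as TIM degradation; GPU power would also need to be matched between checkpoints for the comparison to be fair.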
It is important to note that power fluctuations were fairly
common on early GPU hardware used in development, so
preliminary functional testing and functional verification
measurements on GPUs within HPC systems and T1
hardware were not necessarily tested at peak powers. Typical
preliminary testing occurred using a range of powers
spanning approximately 180-280W. Neither the heatsinks
nor early GPUs used for T1 tests had optimized flatness, so
additional variability in thermal measurements could arise
from an increased gap filling range required for the TIM
solutions. Despite these shortcomings, the primary goal of
the T1 testing was to identify a single TIM solution to bring
forward for final (T2) product qualification. This T2
qualification would consist of all power stable, production
level GPU hardware capable of running at full speeds and
powers in the range of 280-300 W.
B. T1 GPU Test Summary
A summary of the T1 Test plan, sample sizes used, and
corresponding test results is shown in Figure 9. All parts
built with both the HT-C3200 and grease TIM solutions
passed in-situ PA/PC system testing, but 25% of the parts
built with grease showed some evidence of modest
temperature degradation that neared the pass/fail criterion. An
example of the hot cycle thermal stability resulting from in
situ system PA/PC tests is illustrated in Figure 10 for an
assembly built with the HT-C3200 TIM and tested through
more than twelve hundred full power on/power idle cycles
using a 1-hour cycle consisting of 15 minute ramp and dwell
periods.
Fig. 8. Plan view of the 2S2U HPC system showing GPUs installed at front
(F3, F7) and rear (R2, R6) slot locations. Air flows from right to left in the
diagram.
T1 Product Test Summary
                                                  HT-C3200                  Grease
Stress Test                                   Samples  Fails (∆T>5°C)   Samples  Fails (∆T>5°C)
S&V                                              2          0              2          0
TSS (-40 to 65°C, 10 cycles)                     5          0              5          0
T&H (50/80, 200 hrs)                                        0                         0
ATC (0 to 100°C, 500 cycles)                                0                         4
System PA/PC (35 to 80°C, up to 1350 cycles)     8          0              8          2
ATC (0 to 75°C, 500 cycles)                     n/a        n/a             5          3
DTC (-50 to 100°C, 200 cycles)                   6          0             n/a        n/a
Fig. 9. Summary of T1 product qualification test plan and test results
All parts passed sequential TSS and T&H tests. However,
monitoring of parts subjected to follow-on ATC tests showed
that parts built with the grease TIM suffered
significant thermal performance degradation, with first
failures occurring after 170 cycles of ATC and 80% of parts
failing the criterion after 500 cycles. An example of this
deterioration is shown in Figure 11 for a GPU with grease
TIM1 that ceased to function after exposure to 500 cycles of
0-100°C ATC.
Fig. 10. Example PA/PC temperature and power for a GPU built with the
HT-C3200 TIM. Fluidic diodes on the system fans constrain airflow to drive
higher in-system operating temperatures for stress acceleration. A system fan
speed adjustment was made at approximately 350 cycles into the test to
prevent GPU Tjs from running in excess of 80°C.
Fig. 11. Grease TIM thermal performance deterioration vs. HT-C3200 TIM
thermal stability after exposure to ATC test legs as functionally tested on
example parts in system slot positions R6 (grease) and F3 (HT-C3200). The
example part built with the grease TIM failed to power on after exposure to
a 500+ cycle ATC checkpoint.
In contrast, all parts made with the HT-C3200 TIM
remained stable through 500 cycles of 0-100°C ATC. An
example of this stability is also shown in Figure 11 for
comparison. Because notable thermal degradation in ATC
tests was identified in GPU test cells built with grease,
supplemental ATC test cells on additional parts made using
the grease TIM were also run using an intermediate 0-75°C
ATC stress regimen. These test results, when coupled with
0-100°C ATC test data and results from power cycling work,
were used to form a rough estimate of the grease ATC test
acceleration factor for use in
this and other product applications. As shown in Figure 9,
0-75°C ATC testing of parts built with the grease TIM
resulted in slightly prolonged adequate thermal performance
prior to degradation relative to 0-100°C ATC tests, but 60%
of the parts still failed the thermal performance pass/fail
criteria after parts reached 500 cycles of test.
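The paper does not state which model was used to estimate the grease acceleration factor. As one common illustration only, the sketch below applies a simple Coffin-Manson style power law on temperature swing, AF = (ΔT_test / ΔT_ref)^n. The model choice and the exponent n = 2 are assumptions, not values from this work, and grease pump-out may not follow a fatigue-type law at all; the temperature ranges are those listed for the ATC and PA/PC test legs.

```python
def temperature_swing_c(t_min_c: float, t_max_c: float) -> float:
    """Temperature swing of one thermal cycle."""
    return t_max_c - t_min_c


def coffin_manson_af(delta_t_test_c: float, delta_t_ref_c: float,
                     exponent: float) -> float:
    """Acceleration factor of a test cycle relative to a reference cycle,
    using a power law on temperature swing (illustrative model only)."""
    return (delta_t_test_c / delta_t_ref_c) ** exponent


ASSUMED_EXPONENT = 2.0  # hypothetical; must be fitted to failure data

atc_100 = temperature_swing_c(0.0, 100.0)   # 0-100 C ATC leg
atc_75 = temperature_swing_c(0.0, 75.0)     # 0-75 C ATC leg
pa_pc = temperature_swing_c(35.0, 80.0)     # 35-80 C system PA/PC leg

af_100_vs_75 = coffin_manson_af(atc_100, atc_75, ASSUMED_EXPONENT)    # about 1.8
af_100_vs_pa_pc = coffin_manson_af(atc_100, pa_pc, ASSUMED_EXPONENT)  # about 4.9
```

In practice the exponent would be back-fitted from the observed cycles-to-failure at the two ATC ranges rather than assumed, and the small sample sizes in Figure 9 would leave any such fit with wide uncertainty.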
Another set of parts built with the HT-C3200 TIM was
subjected to -50 to 100°C DTC stress to further assess its overall
robustness in the GPU application. In all cases tested, no fails
were encountered using the HT-C3200 TIM.
C. Post - T1 GPU Testing Failure Analysis
All parts built with grease TIMs were disassembled after
stressing to identify root causes of thermal degradation. Parts
built with the HT-C3200 TIMs were also disassembled to
look for potential changes to the interface material as a result
of the stress test exposures. In this latter case, nothing
remarkable was found. The HT-C3200 appeared to be fully
intact and compressed with no signs of any erosion or
deterioration issues. In fact, test data also shows that
HT-C3200 that is intentionally crumpled -- inflicted with
significant creasing -- prior to use performs equally as well as
pristine material. This is an important attribute to understand
from assembly manufacturing and quality control
perspectives, because the material is somewhat delicate to
handle. As such, the TIM may be subject to minor handling
issues in volume manufacturing that result in the formation of
one or two creases on the material, especially at or near
corners of TIM preforms.
A collection of photos in Figure 12 shows the general
condition of post stress tested HT-C3200 TIM along with
TIM that was intentionally and significantly damaged,
before and after use, while Figure 13 shows a
corresponding thermal performance comparison of the
intentionally damaged vs. undamaged material when used and
tested in the same GPU assembly.
Fig. 12. Photos of HT-C3200 TIM pieces (A) prior to installation, (B) after
assembly and stress tests, (C) prior to installation, intentionally crumpled, (D)
intentionally crumpled piece after assembly, compression, and thermal
performance test. Compression eliminated most of the creases and caused
the transfer of an image of laser scribed device information onto the TIM.
Upon disassembly of hardware built with the grease TIM,
grease pumping with significant depletion of grease away
from the GPU device area was found to be the obvious root
cause mechanism for thermal performance deterioration on
most parts exposed to the ATC stress tests.
An example of significant grease pumping observed on
samples after five hundred 0-100°C ATC cycles is provided
in Figure 14A. For comparison, grease TIM coverage
produced on a typical uncycled part is shown in Figure 14B.
Figure 14A shows the hardware sample that produced the
GPU6 thermal data shown in Figure 11.
Fig. 13. Undamaged vs. intentionally crumpled HT-C3200 thermal
performance test comparison as installed into a common GPU.
Fig. 14. Thermal grease pump-out (A) after 500-cycle ATC test and footprint
(B) as built.
D. T2 In-System GPU Test Summary
Given the T1 test results and physical failure analysis
findings, use of a high performance grease as a TIM was
abandoned from further consideration for final product
qualification activity. Twelve additional production level
parts assembled with HT-C3200 graphite TIM were put into -
50-100°C DTC, while four of these parts were also subjected
to system-level S&V tests as well. These two tests coupled
with prior T1 results defined the final qualification for product
introduction.
In addition to stress test exposures and functional
assessments similar to those described above for T1 testing,
once parts completed the T2 stress tests and final affiliated
thermal measurements, they were also exposed to an
additional 90 days of run time monitoring in IBM Joint
Engineering-Manufacturing Test machines that were also
exposed to extensive and rigorous power cycling, corners
testing, and altitude chamber testing. No issues were
observed with any of the GPU hardware. All device Tjs
remained stable to within 2-4°C throughout all tests and monitoring
exercises. This range of stability also includes ambient
temperature variability on the development lab data center
floor. Excerpt examples of specific T2 test results are shown
in Figures 15 and 16 and include GPU Tj stability
measurements before and after DTC and S&V tests.
Fig. 15. Example of thermal stability from 4 different GPU locations
(shown in Figure 8) built with HT-C3200 TIM, as tested in an HPC system
pre-vs post DTC T2 qualification test. Corresponding GPU power levels
for the parts when tested before and after DTC exposures are also shown.
Fig. 16. Example thermal stability of front and rear positioned GPUs in an
HPC system as measured before and after system level S&V test.
SUMMARY AND CONCLUSIONS
A new, compressible, graphite thermal interface material
has been tested and qualified for use as a TIM1 on large, 2.5D
silicon on silicon bare die BGA packages and affiliated GPU
card assemblies that are integrated into recently announced
IBM high performance computing systems to provide
accelerated processing function. Evaluation, test, and
qualification of this new compressible TIM within and
outside of the GPU application shows it can be successfully
used in high power applications that require thermal gap
filling capability up to 0.005 inches (127 micrometers) when
coupled with cooling hardware attached at loads as low as 15-
25 psi. Because the material is intrinsically inelastic, the
material should be used in conjunction with mechanical
packaging designs that incorporate a constant spring load for
attachment of cooling hardware elements to ensure consistent
and reliable TIM function. Although not discussed, this
interface material has also been tested and integrated into
water cooled versions of the IBM HPC server referenced in
this paper as a TIM1 GPU solution, and for use as a TIM2
cooling solution between GPU heat spreaders and cold plate
assemblies as well.
ACKNOWLEDGEMENT
The authors gratefully acknowledge the support of IBM
development engineering, manufacturing engineering, and
contract engineering support from Sarah Czaplewski,
Timothy Jennings, Eric Campbell, David Braun, David
Barron, Steve Miranda, Matthew Scheckel, Dave Nickel,
Jeffrey Johnson, and Matthew Farmer, for their assistance
with materials properties evaluations, hardware assembly,
stress test support, test support, and system thermal
performance monitoring that took place throughout the course
of this development effort. We also would like to thank
Martin Smalc and Larry Jones of the AET Innovation and
Technology Center for guidance and assistance with using the
D5470 instrument to characterize the compressible graphite
materials. We would also like to thank Andy Reynolds, Jason
Murphy, and Julian Norley for their support and guidance.
REFERENCES
[1] S822LC Server for Big Data (product white paper),
retrieved 2017 February 13, [Online] IBM Power
Systems, www.ibm.com
[2] NVidia Tesla P100 White Paper, 2017 February 13,
WP-08019-001, [Online] Nvidia Corporation,
www.nvidia.com
[3] TSMC CoWoS Foundry Services, 2017 February 13,
[Online], www.tsmc.com
[4] Chip On Wafer On Substrate (CoWoS), 2012,
November 3, [Online], Daniel Payne,
www.semiwiki.com
[5] Norley, Julian. "The Role of Natural Graphite in
Electronics Cooling." Electronics Cooling Magazine 7
(2001): 50-51
[6] Chu, R.C., Ellsworth, M.J. Jr., Simons, R.E.,
2002, “Thermal Spreader and Interface Assembly for
Heat Generating Component of an Electronic Device,”
US Patent 6396700 B1
[7] Marotta, E. E., S. LaFontant, D. McClafferty, S.
Mazzuca, and J. Norley. "The Effect of Interface
Pressure on Thermal Joint Conductance for Flexible
Graphite Materials: Reno, NV." (2002). ITHERM
[8] eGRAF® and HITHERM™ are trademarks of
Advanced Energy Technologies LLC.
[9] Smalc, Martin, Julian Norley, R. Andy Reynolds,
Richard Pachuta, and Dan W. Krassowski. "Advanced
thermal interface materials using natural graphite."
In ASME 2003 International Electronic Packaging
Technical Conference and Exhibition, pp. 253-261.
American Society of Mechanical Engineers, 2003
[10] HITHERM™ HT-C3200 Thermal Interface Material
Technical Data Sheet TDS-319, GrafTech
International/Advanced Energy Technologies LLC
[11] Standard, ASTM D5470-12 Standard Test Method for
Thermal Transmission Properties of Thermally
Conductive Electrical Insulation Materials, West
Conshohocken, PA: ASTM International (2012)
[12] Jeffrey F. Boigenzahn, Darrell E. Bratvold, James M.
Rigotti, Lyle R. Tufty, “Noninfluencing fastener for
disk drives”, United States Patent US4945435 A, issued
July 31, 1990
[13] John Lee Colbert, Eric Alan Eckberg, Roger Duane
Hamilton, Mark Kenneth Hoffmeyer, Amanda Elisa
Ennis Mikhail, Arvind Kumar Sinha, “Mounting a
heatsink in thermal contact with an electronic
component”, United States Patent US7944698 B2,
issued May 17, 2011
[14] Lim, T. Y., and Michelle Velderrain. "Calculated shear
stress produced by silicone and epoxy thermal interface
materials (TIMS) during thermal cycling." Electronics
Packaging Technology Conference, 2007. EPTC 2007.
9th. IEEE, 2007
[15] Methodologies to Mitigate Chip-Package Interaction,
2015, August 5, Chae, Seung-Hyun; Nangia,
Amit, Electronic Design Magazine, [Online]
www.electronicdesign.com
[16] Thermal Strain in Semiconductor Packages, Part II,
2007 November 1, Bruce Guenin, Electronics Cooling
Magazine
[17] Advanced Materials for Thermal Management Solutions
of Electronic Packaging, 2011, Xingcun Colin Tong
Ph.D, Springer Science+Business Media, LLC