
Page 1: Architectural Vulnerability Modeling and Analysis of ...gurumurthi/papers/selse13_AVF.pdf · Architectural Vulnerability Modeling and Analysis of Integrated Graphics Processors Hyeran

Architectural Vulnerability Modeling and Analysis of Integrated Graphics Processors

Hyeran Jeon∗  Mark Wilkening†  Vilas Sridharan‡  Sudhanva Gurumurthi†  Gabriel H. Loh†

∗Ming Hsieh Department of Electrical Engineering, University of Southern California
†AMD Research, Advanced Micro Devices, Inc.
‡RAS Architecture, Advanced Micro Devices, Inc.
[email protected]  {mark.wilkening, vilas.sridharan, sudhanva.gurumurthi, gabriel.loh}@amd.com

Abstract—Thanks to the massive parallel processing power and programmability of general-purpose graphics processing units (GPGPUs), many supercomputing centers as well as servers and high-end mobile devices are increasingly using GPUs for both graphics and general-purpose computation. However, communication costs between host CPUs and GPUs have been a performance bottleneck. Recent industry trends towards accelerated processing units (APUs) that integrate CPUs and GPUs on a single die can significantly lower the communication costs and allow for more seamless use of these processing components. As the number of applications for APUs increases, reliability of the APU becomes paramount.

In this paper, we describe an architectural vulnerability factor (AVF) modeling framework for APUs that we developed, and we present AVF results for several workloads. Our results include AVF characterization of key hardware structures in the GPU component of the APU and the variation in the AVF over time due to the workload execution on both the CPU and GPU sides of the APU. We also examine the impact of APU sizing on the AVF of the workloads.

I. INTRODUCTION

GPUs have undergone a transformation, from devices used mainly for graphics to those that significantly benefit general-purpose computing as GPGPUs. GPGPUs are now used in a variety of computing systems, from high-end servers to smaller form-factor devices such as tablets and smartphones. The advent of languages such as OpenCL™ and Nvidia's CUDA™ has made parallel programming accessible to a larger pool of developers and has given rise to a rich set of applications that can leverage the computing power of GPUs. A recent industry trend has been to integrate CPUs and GPUs on a single die (e.g., AMD's Trinity and Intel's Ivy Bridge). These processors, which we will refer to as accelerated processing units (APUs), reduce the communication overheads between the CPU and GPU portions and also facilitate new architectures that exploit the chip's heterogeneous computing capabilities. One example of this is the Heterogeneous System Architecture (HSA), which is promoted by the HSA Foundation with several industry partners, including AMD, Qualcomm, Samsung, and ARM [9]. The features of HSA include shared page tables between CPUs and GPUs, support for preemption and context switching on GPUs, and cache coherence between CPUs and GPUs. AMD recently announced plans for a server-class APU [4]. Given these industry trends, it is important to assess the reliability of APUs to devise effective RAS strategies.

Soft errors are a key reliability problem for current and future technology nodes. Soft errors are random bit flips that are caused primarily by high-energy neutrons from terrestrial cosmic rays. While soft errors do not cause any permanent damage to the circuit, the spurious bit-flips can affect the correctness of the computation. A commonly used technique to assess the impact of soft errors at the early stages of the design cycle of a processor is architectural vulnerability factor (AVF) analysis [5][10][16]. The AVF of a given hardware structure is the probability that a bit-flip in that structure will manifest itself as an error in the externally visible state of the machine. The failure in time (FIT) rate of the structure can be calculated by multiplying the AVF with the timing vulnerability factor (TVF) [11], the number of bits in the structure, and the technology-dependent raw FIT rate.
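As a quick illustration, the FIT calculation above multiplies out as follows. All numbers here are hypothetical placeholders (in particular the raw per-bit FIT rate, which is technology-dependent), not values from this paper:

```python
def structure_fit(avf, tvf, num_bits, raw_fit_per_bit):
    """FIT rate of a hardware structure, per the formula in the text:
    AVF x TVF x number of bits x technology-dependent raw FIT per bit."""
    return avf * tvf * num_bits * raw_fit_per_bit

# Hypothetical example: a 640 KB register file with 1% AVF, TVF of 1,
# and an assumed raw rate of 1e-4 FIT/bit (not a real technology figure).
fit = structure_fit(avf=0.01, tvf=1.0, num_bits=640 * 1024 * 8,
                    raw_fit_per_bit=1e-4)
```

Because AVF multiplies the other terms directly, halving a structure's AVF halves its contribution to the chip-level FIT rate.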

While there has been prior work on AVF modeling and analysis of CPUs [10] and GPUs [7][14], to our knowledge there is no prior work for APUs. Unlike systems with a CPU and a discrete GPU, on which the reliability assessments of the two processors are done separately (possibly by different vendors), calculating the FIT rate of an APU needs to consider both processing components and their associated interconnect and memory systems. Furthermore, a key choice that vendors must make when architecting an APU is the relative sizing of the CPU and GPU components of the die. It is critical to understand the impact of this decision on the performance, power, and reliability of the target workloads. Finally, programs written for architectures such as HSA and executed on an APU may exhibit different reliability characteristics than when executed on a discrete GPU (for example, due to fine-grained memory sharing between CPU and GPU components of the execution). Therefore, the results of AVF analysis from a non-APU system cannot necessarily be applied to an APU.

The key contributions of this paper are:

• We present an AVF modeling infrastructure that we have developed for APUs. This infrastructure leverages an in-house version of the gem5 architecture simulator that models an HSA-like APU [1].

• We show AVF results for the CPU and GPU register files and the GPU instruction buffer. A key novel result we present is the variation of AVF over time in these hardware structures, which can guide the use of configurable protection mechanisms [6][15].

• We examine the impact of GPU sizing on the AVF. Our results show that increasing the number of compute units in the GPU portion of an APU can either increase or decrease the reliability of the APU.

The rest of the paper is organized as follows. Section II discusses related work. Section III provides background on the design of AMD APUs. Section IV describes the design of our framework for AVF measurement of an APU. Section V describes the detailed instrumentation on the GPU for AVF measurement. Section VI describes our evaluation methodology and presents our results, and we conclude in Section VII.

[Figure 1 omitted: block diagram of an AMD APU. The CPU and GPU share system memory over a high-performance bus; each GPU compute unit (CU) contains a scheduler, instruction buffers grouped into 4 phases, register files (RF), processing elements, and local memory (LM), backed by a cache.]
Fig. 1. AMD APU architecture

II. RELATED WORK

There has been a large body of research on AVF modeling and analysis for the CPU, starting with the work by Mukherjee et al., who coined the term and developed the initial methodology [10]. The authors proposed the concept of ACE and unACE bits and quantified the AVF of the instruction queue and execution units of an Intel Itanium2-like IA64 processor. Later, Weaver et al. proposed an idea to reduce the AVF by selectively squashing instructions when long delays are encountered [16]. Biswas et al. developed a methodology to calculate the AVF of address-based structures such as caches, the TLB, and the store buffer [5]. Sridharan and Kaeli extended the notion of vulnerability to allow independent measurement of vulnerability for each layer in the system stack [13]. For instance, program vulnerability factors (PVF) quantify the architecture-level fault masking inherent in a user program.

There have been some recent publications on AVF modeling for the GPU. Tan et al. measured the AVF of the register file, streaming processor, warp scheduler, and shared memory of an NVIDIA Fermi-like GPGPU by varying the number of threads per compute unit (CU) and applying dynamic warp formation, and identified certain tradeoffs between performance and reliability [14]. Farazmand et al. also showed AVF measurements for an AMD Radeon™ 5870-like GPGPU but used fault injection rather than the ACE/unACE classification methodology proposed by Mukherjee [7].

To our knowledge, this paper is the first attempt to study the AVF of APUs specifically. Because APUs are architecturally distinct from systems with discrete CPUs and GPUs, and workloads may exercise them differently, it is important to model and study this emerging class of processors.

III. APU ARCHITECTURE

The baseline APU we model consists of an x86 CPU and a GPGPU, as shown in Figure 1.

An APU and a system with a CPU and discrete GPGPU differ in several ways. The CPU and GPGPU in an APU are implemented on the same silicon die. The two processing subsystems in an APU communicate via a shared system memory, whereas a discrete GPU system uses the PCIe bus to communicate with the host CPU. AMD's APUs use a high-performance bus to allow both the CPU and GPU to use the same physical memory. However, the system memory is partitioned into two separate regions that are dedicated to either the CPU or the GPU, and the CPU and GPU use separate page tables and TLBs as well. Nevertheless, the data-copy overhead is lower than in a discrete GPU system by orders of magnitude because the CPU and GPU use the same physical memory. Future HSA-enabled APUs are expected to provide a unified memory system between the CPU and GPU cores with full coherence and context switching for the GPU [9].

The GPU schedules work-items onto a CU in groups known as wavefronts. In this paper, we assume a wavefront size of 64 work-items. The work-items within a wavefront execute the same code in a lock-step manner. All the work-items share one program counter (PC) but access different data operands. Such an execution approach is called single-instruction, multiple-thread (SIMT) execution. A CU consists of a scalable number of processing elements (PEs) that are fed by instruction buffers to execute the instructions of each wavefront.
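The grouping described above can be sketched as follows; `to_wavefronts` is a hypothetical helper for illustration, not part of the simulator:

```python
WAVEFRONT_SIZE = 64  # wavefront size assumed in the paper

def to_wavefronts(num_work_items):
    """Group work-item IDs into wavefronts of 64; the work-items in each
    wavefront execute in lock-step, sharing one PC (SIMT execution)."""
    ids = list(range(num_work_items))
    return [ids[i:i + WAVEFRONT_SIZE]
            for i in range(0, num_work_items, WAVEFRONT_SIZE)]

waves = to_wavefronts(150)  # 3 wavefronts: two full (64) and one partial (22)
```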

IV. AVF MODELING METHODOLOGY FOR THE APU

The AVF of a hardware structure is the probability that a fault in the structure will manifest itself in the externally visible state. AVF estimation is done by identifying bits that affect architecturally correct execution (ACE) of the processor while running a benchmark in a performance simulator. The AVF depends on the bandwidth and residence times of the ACE bits that flow through the structure and on the size of the structure. Therefore, the AVF of a structure is a function of both its microarchitecture and the characteristics of the application. The soft error rate of a structure can be calculated by multiplying its AVF by its size and the (technology-dependent) raw failure rate of each bit within the structure. For the purposes of this paper, we assume that TVFs in each structure are set to 1 [12].

The AVF measurement infrastructure is an updated version of the AVF infrastructure used in previous versions of the M5 simulator [13]. The basic AVF measurement is conducted in two phases: an event-tracking phase followed by an analysis phase. The event-tracking phase records the times at which key events that can potentially affect the ACEness of the bits in the structure occur. The analysis phase calculates the AVF by analyzing the collected event times.

While the basic steps to measure AVF are the same in both the CPU and GPU, there are some key differences. First, due to SIMT execution on the GPU, each event needs to be recorded for all threads within a SIMT execution unit. Moreover, the GPU has a much larger number of thread contexts than the CPU, and inter-thread communication on a GPU can happen in the register file or in memory, unlike a CPU, in which all thread communication occurs through memory. Our infrastructure accounts for these cross-thread register dependencies on the GPU.

V. EXPERIMENTAL SETUP AND AVF INSTRUMENTATION

In this section, we describe the AVF instrumentation methodology for the hardware structures in the APU.

A. GPU Vector Register File

The GPU register file consists of three different register types: 64-bit and 32-bit integer registers, and single-bit condition registers. Each work-item has 64 of its own 64-bit integer registers, 120 of its own 32-bit integer registers, and eight single-bit condition registers. We separately measure the AVF for each type of register and combine these to create a single GPU register file AVF.
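One natural way to combine the per-type AVFs into a single register-file AVF is to weight each type by its share of the total bits. The sketch below assumes that scheme (the paper does not spell out its combining rule); only the per-work-item register counts come from the text, and the AVF inputs are hypothetical:

```python
# Per-work-item register counts taken from the text; the AVF values fed in
# are hypothetical placeholders for measured per-type AVFs.
REG_TYPES = {
    "int64": (64, 64),   # (count, width in bits)
    "int32": (120, 32),
    "cond":  (8, 1),
}

def combined_register_avf(avf_by_type):
    """Combine per-type AVFs into a single register-file AVF, weighting
    each register type by its share of the total bits."""
    total_bits = sum(count * width for count, width in REG_TYPES.values())
    weighted = sum(avf_by_type[name] * count * width
                   for name, (count, width) in REG_TYPES.items())
    return weighted / total_bits
```

Note that the 64-bit registers hold 4096 of the 7944 bits per work-item, so their AVF dominates the combined figure under this weighting.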

The bits of each register are considered ACE when the value contained in them is consumed by instructions and thereby contributes to the final result of the program execution. Therefore, the times of write and read events on each register are tracked to measure the AVF. When there is a data dependency among instructions on a register, the period between the write and read events is added to the ACE time. The AVF is then calculated by dividing the aggregated (ACE time × ACE bits) by (total execution time × total bits) at the end of the execution.
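The write-to-read ACE accounting described above can be sketched as follows. This is a simplification of the infrastructure in the text: it treats a whole register as ACE from each write to the last read of that value, ignores partial-register ACEness, and drops the bit counts (which cancel when a single register is considered):

```python
def register_avf(access_trace, total_cycles):
    """AVF of one register from its write/read event times.

    access_trace: chronological list of (cycle, op) pairs, op in
    {"write", "read"}. Bits are ACE from each write until the last read
    of that value; values that are never read are unACE.
    """
    ace_cycles = 0
    last_write = None
    last_read = None
    for cycle, op in access_trace:
        if op == "write":
            # Close out the previous value's ACE interval, if it was read.
            if last_write is not None and last_read is not None:
                ace_cycles += last_read - last_write
            last_write, last_read = cycle, None
        else:  # read
            last_read = cycle
    if last_write is not None and last_read is not None:
        ace_cycles += last_read - last_write
    # AVF = (ACE time x ACE bits) / (total time x total bits); for a
    # single whole register the bit counts cancel, leaving a time ratio.
    return ace_cycles / total_cycles
```

A write that is never followed by a read contributes nothing to the ACE time, which is what makes dead values unACE.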

We separately model a GPU status register, VCC, that is required for correct execution. VCC is a 64 × 32-bit status register such that each 32-bit value indicates if the result from the last vector compare or integer carry-out of the corresponding SIMT lane is zero. Because VCC is one of the most important status registers that direct the program to the correct control flow, a bit-flip in the VCC will likely lead to incorrect execution.

B. Instruction Buffer

We model GPU instruction buffers that are grouped into four phases such that each phase includes individual buffers for ten wavefronts, for a total of 40 instruction buffers per CU. Each instruction buffer for a single wavefront contains twelve 64-bit instruction slots.

The vulnerability of an instruction buffer depends on the time that the ACE bits of instructions spend in the buffer. Typically, instructions are pushed to the instruction buffer after being fetched from the I-Cache and stay in the buffer until they are loaded by the decoder. The GPU we model uses a lazy instruction fetch policy wherein instructions are fetched from the I-Cache only when there are fewer than four instructions remaining to be issued in the instruction buffer. Instruction removal also happens when new instructions are added to the instruction buffer. This lazy instruction fetch effectively decreases the residency of ACE bits within the instruction buffer because instructions that have already been loaded by the decoder but not yet removed from the instruction buffer do not cause incorrect execution. Instructions are flushed from the instruction buffer when there is a taken branch. Flushed instructions that have not been issued are also regarded as unACE bits.
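The resulting ACE residency can be sketched as follows; the slot representation is an assumption for illustration, not the simulator's exact bookkeeping:

```python
def buffer_ace_cycles(slots):
    """ACE residency of instruction-buffer slots, per the policy above.

    slots: list of (insert_cycle, issue_cycle) pairs; issue_cycle is None
    for instructions flushed by a taken branch before being issued.
    Bits are ACE only from insertion until issue: instructions already
    consumed by the decoder, or flushed unissued, are unACE.
    """
    return sum(issue - insert for insert, issue in slots if issue is not None)
```

Under this accounting, lazy fetch helps precisely because it shortens the insert-to-issue window of each slot, and branch flushes zero out the contribution of unissued instructions.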

Name       Total Simulated Instructions   CPU Cycles      GPU Cycles
vectorAdd  6,580,758                      37,052,124      20,170
mm         1,345,845                      7,780,001       105,811
nn         1,476,127                      29,026,235      1,356,414
backprop   194,386,790                    1,296,851,359   1,683,951
hotspot    33,056,163                     175,669,148     21,051
bfs        44,711,546                     248,798,576     74,707

TABLE I. WORKLOADS

C. CPU Register File

The CPU register file is a relatively simple physical register file model that consists of 32 physical registers implementing an x86 architectural register file. The register file contains data values but does not store x86 flag values, because these are stored separately. Calculation of ACE and unACE time proceeds similarly to the calculation for GPU vector registers.

VI. RESULTS

A. Experimental Setup

Our evaluations were carried out using an in-house simulator developed by AMD that models next-generation APUs. The simulator consists of a CPU-side simulator that is a variant of gem5 and a GPU-side simulator that can run GPGPU applications written in CUDA and OpenCL [1]. We model a GPU in which each CU has 40 instruction buffers, a 640KB register file, and 64 PEs. A collection of benchmarks selected from Rodinia and the CUDA SDK are used as the workloads [2][3]. Table I shows the simulation details of each benchmark. Default parameters can be found with the Rodinia benchmark suite [3].

In this paper, we tracked two metrics of interest. First, we measure the average AVF for each structure of interest. This provides insight into the behavior of each structure. Second, we calculate the normalized failure rate for each structure as AVF × structure size. This indicates the relative contribution of each structure to the overall system failure rate.
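The two metrics can be contrasted with a small sketch; the AVF values and sizes below are illustrative stand-ins, not the measured results:

```python
def normalized_failure_rate(avf, structure_bits):
    """Normalized failure rate: AVF x structure size. The raw per-bit
    rate is omitted, so the number is only meaningful relative to other
    structures in the same system."""
    return avf * structure_bits

# Illustrative values only (not the paper's measurements): a small,
# high-AVF CPU register file vs. a large, low-AVF GPU register file.
cpu_rate = normalized_failure_rate(0.30, 32 * 64)          # 32 x 64-bit regs
gpu_rate = normalized_failure_rate(0.01, 640 * 1024 * 8)   # 640 KB per CU
```

Even a tiny AVF can dominate once a structure's size is folded in, which is the qualitative effect the normalized failure rate is meant to expose.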

Finally, we also track the AVF variations across the execution time of each application at regular intervals. This time-varying AVF is useful in visualizing trends in the AVF variation in the CPU and GPU and identifying any potential AVF changes based on the interaction between the two parts of the APU. Time-varying AVF plots provide insights into the quantized AVF of the structure that can aid in the design of AVF predictors and configurable redundancy mechanisms [6][15]. Time-varying AVF plots also aid in understanding the impact of specific workload behaviors on system-level reliability.

B. Average AVF

Figure 2 shows the AVF results for a subset of our workloads. The average AVF on the CPU side is much higher than on the GPU. The reason is that the parallel portion of the program (executed on the GPU) is only a small portion of the overall program's execution.

Figure 3 shows normalized failure rate (FIT) results for the same set of workloads. Despite the low AVF on the GPU side, the GPU registers have a significantly higher normalized FIT than the CPU registers due to their large relative size. This indicates that GPU registers are more important to protect than CPU registers.


[Figure 2 omitted: bar charts of AVF (%) per workload for (a) the CPU register file, (b) the GPU vector register file, (c) the GPU VCC register, and (d) the GPU instruction buffer.]
Fig. 2. AVF results at the end of each workload's execution

[Figure 3 omitted: bar charts of normalized failure rate per workload for (a) the CPU register file, (b) the GPU vector register file, (c) the GPU VCC register, and (d) the GPU instruction buffer.]
Fig. 3. Normalized failure rate at the end of each workload's execution

[Figure 4 omitted: time-varying AVF (%) vs. cycles (10M) for the CPU and GPU register files and for the GPU instruction buffer, for (a) nn, (b) backprop, and (c) MatrixMul.]
Fig. 4. Time-varying AVF of CPU register file, GPU vector register file, and instruction buffer over the entire workload execution


C. Time-varying AVF

Figure 4 shows the time-varying AVF of the CPU and GPU register files and the instruction buffer while running the NN, backprop, and MatrixMul benchmarks. It is immediately obvious that a large portion of each workload's execution time is consumed by CPU-side execution. For example, each of the two AVF spikes in backprop denotes a GPU kernel execution, and the time slots before the first spike, between the two spikes, and after the second spike are periods during which the CPU is executing code. We find that backprop executes two different kernel functions and the code size of the second one is smaller.

As the graphs in Figure 4 show, the time-varying AVF of the GPU instruction buffer is zero while the CPU is executing code and becomes non-zero when a GPU-side kernel function is invoked. The figure also shows that the CPU register AVF is not zero during GPU-side execution.

One interesting aspect of Figure 4 is that the GPU register file's AVF is not zero prior to GPU kernel invocation in MatrixMul and backprop. This is because these benchmarks read values from the GPU vector registers before doing any writes, so any bit-flips that accumulate during the CPU-side execution will be read by the benchmark. This also explains the relatively high average GPU register AVF of MatrixMul, backprop, and hotspot.

Figures 5 and 6 "zoom in" on the AVF for small sections of the GPU execution in NN and MatrixMul. Figure 5 shows clearly that the GPU register file AVF is less than 1.5% even during periods of heavy GPU computation. There are two reasons for this. MatrixMul operates primarily out of the local data store (LDS), a fast scratchpad memory within each CU, and loads values into a register only one instruction prior to consuming the value. Therefore, only a small fraction of the register file is used, and these registers have extremely short data lifetimes. NN exhibits similar behavior, and also exhibits significant thread divergence in which only a portion of the work-items execute on the correct path. Because work-items on the wrong path perform only unACE reads to registers, this serves to further reduce the GPU register file AVF. The MatrixMul code behavior is illustrated in Algorithm 1. The address values are stored in a scalar register and therefore do not contribute to the GPU vector register file's AVF.

Figure 6 shows that the instruction buffer's AVF is correlated to the AVF of the GPU register file. Essentially, entries in the instruction buffer are ACE when the GPU is actively issuing instructions, and many of these instructions use the register file values. This is somewhat different from typical CPU register file behavior, in which register lifetimes often outlive the instructions that produce them. Furthermore, the instruction buffer's AVF is very low even during active computation periods. The delayed instruction fetch reduces the average number of ACE bits that stay in the instruction buffer, resulting in a relatively low instruction buffer AVF for MatrixMul.

D. Impact of APU Sizing

When architecting an APU, a key decision is the number of CUs in the APU. This determines the relative performance and power of the APU. Therefore, it is also important to understand the reliability impact of changing the number of compute units. Our results show that the reliability impact differs by workload and does not lend itself to an intuitive "rule of thumb".

Algorithm 1 MatrixMul pseudocode
 1: mov  f1 ← constant
 2: loop:
 3: load f4 ← lds[r1]
 4: load f5 ← lds[r1 + x]
 5: fma  f6 ← f4, f5, f1
 6: load f7 ← lds[r1]
 7: load f8 ← lds[r1 + x]
 8: fma  f9 ← f7, f8, f1
 9: ...
10: end

For example, Figure 7 plots the GPU register file AVF when varying the number of compute units in the APU from one to eight for the MatrixMul and backprop benchmarks. The figure demonstrates that changing the number of CUs does not have a predictable impact on reliability. In MatrixMul, the AVF of the register file decreases linearly with the number of CUs, offsetting the increase in the number of registers (because there is one register file per CU). Furthermore, increasing the number of CUs also reduces the computation time spent on the GPU, so the overall failure rate from the register files decreases when increasing the number of CUs.

In backprop, on the other hand, the register file's AVF does not change when increasing the number of CUs. Therefore, the overall vulnerability increases as the number of CUs increases, because the number of registers grows more than the runtime decreases.
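The two regimes can be contrasted with a small sketch; every AVF and runtime value below is illustrative, not a measurement from the paper:

```python
def regfile_exposure(avf, num_cus, bits_per_cu, runtime_cycles):
    """Relative soft-error exposure of the GPU register files:
    AVF x total bits x time at risk (raw per-bit rate omitted)."""
    return avf * num_cus * bits_per_cu * runtime_cycles

BITS_PER_CU = 640 * 1024 * 8  # 640 KB register file per CU, as modeled

# MatrixMul-like regime: AVF and runtime both fall roughly as 1/CUs,
# so overall exposure drops as CUs are added.
mm_1cu = regfile_exposure(0.08, 1, BITS_PER_CU, 1000)
mm_8cu = regfile_exposure(0.01, 8, BITS_PER_CU, 125)

# backprop-like regime: AVF stays flat and runtime shrinks more slowly
# than the register count grows, so overall exposure rises.
bp_1cu = regfile_exposure(0.04, 1, BITS_PER_CU, 1000)
bp_8cu = regfile_exposure(0.04, 8, BITS_PER_CU, 500)
```

The sign of the net effect depends on whether the AVF-times-runtime product shrinks faster than the register count grows, which is why no single rule of thumb applies.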

VII. CONCLUSION

AVF analysis facilitates quantifying the reliability of a processor architecture in the early stages of its design, providing the opportunity to incorporate RAS features that are effective while allowing for maximum performance and energy efficiency. As claimed in several studies, reliability should be treated as a first-class concern in the design of APUs [8][14]. This paper presents our research on an AVF modeling framework for APUs and our findings from the AVF characterization of this architecture. We examined the AVF of multiple CPU and GPU structures and showed the AVF behavior over a program's execution. We also examined the effect of APU sizing on AVF. Our key finding is that changing APU size can either increase or decrease reliability, depending on the workload being executed.

REFERENCES

[1] “The gem5 simulator system.” [Online]. Available: http://www.m5sim.org

[2] “NVIDIA CUDA SDK 2.3.” [Online]. Available: http://developer.nvidia.com/cuda-toolkit-23-downloads

[3] “Rodinia: accelerating compute-intensive applications with accelerators.” [Online]. Available: http://lava.cs.virginia.edu/Rodinia/

[4] “AMD changes compute landscape as the first to bridge both x86 and ARM processors for the data center,” in AMD Press Release, October 2012.

[5] A. Biswas, P. Racunas, R. Cheveresan, J. Emer, S. S. Mukherjee, and R. Rangan, “Computing architectural vulnerability factors for address-based structures,” in Proceedings of the 32nd Annual International Symposium on Computer Architecture, June 2005, pp. 532–543.

[6] A. Biswas, N. Soundararajan, S. S. Mukherjee, and S. Gurumurthi, “Quantized AVF: A means of capturing vulnerability variations over small windows of time,” in Proceedings of the 5th Workshop on Silicon Errors in Logic - System Effects, 2009.


[Figure 5 omitted: time-varying AVF (%) of the CPU and GPU register files vs. cycles (100k) for (a) nn and (b) MatrixMul.]
Fig. 5. Time-varying AVF of CPU and GPU register files over the GPU portion of execution.

[Figure 6 omitted: time-varying AVF (%) of the GPU instruction buffer vs. cycles (100k) for (a) nn and (b) MatrixMul.]
Fig. 6. Time-varying AVF of the GPU instruction buffer during the GPU portion of execution.

[Figure 7 omitted: GPU register file AVF (%) vs. cycles (10M) for 1, 2, 4, and 8 CUs, for (a) MatrixMul and (b) backprop.]
Fig. 7. Impact of varying the number of compute units in the APU

[7] N. Farazmand, R. Ubal, and D. R. Kaeli, “Statistical fault injection-based AVF analysis of a GPU architecture,” in Proceedings of the 8th Workshop on Silicon Errors in Logic - System Effects, March 2012.

[8] H. Jeon and M. Annavaram, “Warped-DMR: Light-weight error detection for GPGPU,” in Proceedings of the 45th IEEE/ACM International Symposium on Microarchitecture, December 2012.

[9] G. Kyriazis, “Heterogeneous system architecture: A technical review,” in HSA Foundation Whitepaper, August 2012.

[10] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin, “A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor,” in Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, December 2003, pp. 29–40.

[11] S. Mukherjee, Architecture Design for Soft Errors. San Francisco, Calif., USA: Morgan Kaufmann Publishers Inc., 2008.

[12] N. Seifert and N. Tam, “Timing vulnerability factors of sequentials,” IEEE Transactions on Device and Materials Reliability, vol. 4, no. 3, pp. 516–522, September 2004.

[13] V. Sridharan and D. R. Kaeli, “Using hardware vulnerability factors to enhance AVF analysis,” in Proceedings of the 37th Annual International Symposium on Computer Architecture, June 2010, pp. 461–472.

[14] J. Tan, N. Goswami, T. Li, and X. Fu, “Analyzing soft-error vulnerability on GPGPU microarchitecture,” in Proceedings of the 2011 IEEE International Symposium on Workload Characterization, June 2011, pp. 226–235.

[15] K. R. Walcott, G. Humphreys, and S. Gurumurthi, “Dynamic prediction of architectural vulnerability from microarchitectural state,” in Proceedings of the 34th Annual International Symposium on Computer Architecture, 2007, pp. 516–527.

[16] C. Weaver, J. Emer, S. S. Mukherjee, and S. K. Reinhardt, “Techniques to reduce the soft error rate of a high-performance microprocessor,” in Proceedings of the 31st Annual International Symposium on Computer Architecture, June 2004, pp. 264–275.