Phoenix: Detecting and Recovering from Permanent Processor Design Bugs
with Programmable Hardware
Smruti R. Sarangi, Abhishek Tiwari, Josep Torrellas
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu
Can a Processor have a Design Defect?
No Way!!!
Yes, it is a major challenge.
A Major Challenge???
50-70% effort spent on debugging
1-2 year verification times
Massive computational resources
Some defects still slip through to production silicon
Defects slip through???
1994 Pentium defect costs Intel $475 million
1999 Defect leads to stoppage in shipping Pentium III servers
2004 AMD Opteron defect leads to data loss
2005 A version of Itanium 2 recalled
Does not look like it will stop
Increasing features on chip
Conventional approaches are ineffective
Micro-code patching
Compiler workarounds
OS hacks
Firmware
Vision
Processors include programmable HW for patching design defects
Vendor discovers a new defect
Vendor characterizes the conditions that exercise the defect
Vendor sends a defect signature to processors in the field
Customers patch the HW defect
Additional Advantage: Reduced Time to Market
[Figure: % of defects detected, with an "8 weeks" annotation. Source: Pentium-M, Silas et al., 2003]
Reduced time to market: vital ingredient of profitability
Outline
Analysis and Characterization
Architecture for Hardware Patching
Evaluation
Defects in Deployed Systems
We studied public domain errata documents for 10 current processors
Intel Pentium III, IV, M, and Itanium I and II
AMD K6, Athlon, Athlon 64
IBM G3 (PPC 750 FX), MOT G4 (MPC 7457)
[Figure: % of defects detected, with axis marks at 50 and 100%]
Dissecting a Defect – from Errata doc.
Defect
Module: L1, ALU, Memory, etc.
Type of Error: hang, data corruption, IO failure, wrong data
Condition: A ∪ (B ∩ C ∩ D)
Signal: snoop, L1 hit, IO request, low power mode
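A condition such as A ∪ (B ∩ C ∩ D) is a Boolean function over signals. The sketch below shows one way it could be evaluated, assuming the condition is stored as an OR of AND-terms. The helper `condition_holds`, the signal names, and the mapping of A, B, C, D onto the example signals are all hypothetical and only serve as illustration.

```python
# Hypothetical sketch: evaluate a defect condition written as an OR of AND-terms.
# Signal names are illustrative, not taken from any real errata document.

def condition_holds(terms, signals):
    """Return True if any AND-term has all of its signals asserted.

    terms   -- list of sets of signal names (sum-of-products form)
    signals -- dict mapping signal name to bool for the current cycle
    """
    return any(all(signals.get(name, False) for name in term) for term in terms)


# A ∪ (B ∩ C ∩ D), assuming A = snoop, B = L1 hit, C = IO request,
# D = low power mode (an assumed mapping, for illustration only).
DEFECT_CONDITION = [
    {"snoop"},
    {"l1_hit", "io_request", "low_power_mode"},
]
```

Calling `condition_holds(DEFECT_CONDITION, {"snoop": True})` reports the condition as true through the first term alone, while the second term needs all three of its signals asserted together.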
Types of Defects
Design Defect: Non-Critical or Critical
Non-Critical: performance counters, error reporting registers, breakpoint support
Critical: defects in memory, IO, etc.
Critical defects are Concurrent (all signals at the same time) or Complex (signals at different times)
Characterization
[Figure: chart with two segments labeled 31% and 69%]
When can the defects be detected?
[Diagram: signals from the ALU, memory, and IO feed a condition detector; a time axis marks the defect]
Pre Defect (63%)
Post Defect (37%): Local, Pipeline, Other
Outline
Analysis and Characterization
Architecture for Hardware Patching
Evaluation
Phoenix Conceptual Design
Signature Buffer: store defect signatures obtained from vendor; program the on-chip reconfigurable logic
Signal Selection Unit (SSU): tap signals from units; select a subset
Bug Detection Unit (BDU): collect signals from SSUs; compute defect conditions
Global Recovery Unit: initiate recovery if a defect condition is true
Reconfigurable Logic
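The division of work among these blocks can be sketched in software. The model below is a rough illustration only: the class names, the `Signature` layout, and the per-cycle `step` method are assumptions, not the hardware interface described in the talk.

```python
# Hypothetical software model of the conceptual blocks. Names and data layout
# are assumptions for illustration, not the actual hardware design.
from dataclasses import dataclass, field


@dataclass
class Signature:
    name: str
    tapped: tuple   # signals the SSU should select for this defect
    terms: tuple    # condition as an OR of AND-terms (tuples of signal names)


@dataclass
class Phoenix:
    signature_buffer: list = field(default_factory=list)

    def load(self, signature):
        """Signature Buffer: store a vendor-supplied signature."""
        self.signature_buffer.append(signature)

    def ssu_select(self, all_signals):
        """SSU: keep only the subset of unit signals that some signature needs."""
        wanted = {s for sig in self.signature_buffer for s in sig.tapped}
        return {name: value for name, value in all_signals.items() if name in wanted}

    def bdu_detect(self, selected):
        """BDU: return the names of signatures whose condition is true."""
        return [
            sig.name
            for sig in self.signature_buffer
            if any(all(selected.get(s, False) for s in term) for term in sig.terms)
        ]

    def step(self, all_signals):
        """One cycle: select, detect, and report whether recovery should start."""
        fired = self.bdu_detect(self.ssu_select(all_signals))
        return {"recover": bool(fired), "defects": fired}
```

In this model, loading a signature plays the role of programming the reconfigurable logic, and a true `recover` flag stands in for the signal sent to the Global Recovery Unit.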
Distributed Design of Phoenix
[Diagram: each subsystem contains an SSU and a BDU; the subsystems of a neighborhood share a HUB; links lead to the Recovery Unit]
Examples of Subsystems: IO Cntrl., L1 Cache, Fetch Unit, Virtual Mem., FP ALU, Inst. Cache
Overall Design
[Diagram: four neighborhoods, each with a HUB, and the Global Recovery Unit, all inside the chip boundary]
Software Recovery Handler
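[Flowchart: the handler branches on the type of defect: Pre, Pipeline Post, Local Post, or rest of Post. Actions shown: flush pipeline, reset module, turn condition off and continue. For the rest of Post, checkpointing support decides between rollback (yes) and an interrupt to the OS (no).]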
Designing Phoenix for a New Processor
New Processor: needs a list of signals and sizes of structures
List of Signals, Generic: learn from other processors (training data)
List of Signals, Specific: processor data sheets
Sizes of Structures: scatter plot of sizes vs. # of signals in unit (training data), then derive rules of thumb
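The slide does not say how the rules of thumb are derived from the scatter plot. One possible approach, shown here purely as an assumption, is a least-squares line through the (number of signals, structure size) points of the training processors. The function names and the linear model are hypothetical.

```python
# Hypothetical sketch: derive a sizing rule of thumb from training processors.
# The linear model is an assumption for illustration only.

def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rule_of_thumb(xs, ys):
    """Return a function that predicts a structure size from a signal count."""
    slope, intercept = fit_line(xs, ys)
    return lambda n_signals: slope * n_signals + intercept
```

Given signal counts and sizes from the training set, `rule_of_thumb(counts, sizes)` returns a predictor that could be applied to the signal count of each unit in a new processor.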
Designing Phoenix for a New Proc. – II
Generate list of signals to tap
Decide on breakdown of subsystems and neighborhoods
Place BDUs, SSUs, and HUBs
Size structures using the rules of thumb
Route all signals and realize the logic function of defects
Outline
Analysis and Characterization
Architecture for Hardware Patching
Evaluation
Signals Tapped
Generic Signals: L2 hit, low power mode, ALU access, etc.
Specific Signals: A20 pin set in Pentium 4, BAT mode in IBM 750FX
Generic + Specific: 150-270 signals
Defect Coverage Results
All Defects
Detect: Concurrent 69%, Complex 31%
Recover: Pre 63%, Post 37%
Training Set: Intel P3, P4, P-M, Itanium I & II; AMD K6, K7; AMD Opteron; IBM G3; Motorola G4
Test Set: UltraSparc II, Intel IXP 1200, Intel PXA 270, PPC 970, Pentium D
Test Processors: Detection Coverage 65%, Recovery Coverage 60%
Overheads
Area: programmable logic (PLA & interconnect), estimated using PLA layouts (Khatri et al.): 0.05%
Wiring: wires to route signals, estimated using Rent's rule: 0.48%
Timing: None
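For reference, Rent's rule relates the number of external terminals T of a logic block to its gate count g as T = t * g^p, where t is the average number of terminals per gate and p is the Rent exponent. The sketch below only evaluates that formula. The function name and default parameter values are illustrative assumptions, and how the rule was applied to reach the wiring figure is not stated on the slide.

```python
# Rent's rule: T = t * g**p
# t -- average number of terminals per gate, p -- Rent exponent (0 < p <= 1).
# The default values are illustrative assumptions, not figures from the talk.

def rent_terminals(gates, t=4.0, p=0.6):
    """Estimate the number of external terminals of a block with `gates` gates."""
    if gates <= 0:
        raise ValueError("gates must be positive")
    return t * gates ** p
```

Because p is below 1, the terminal count grows more slowly than the gate count, which is why the estimate of extra wiring stays small relative to the logic it serves.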
Impact of Training Set Size
Training set only needs to have 7 processors
Coverage in new processors is very high
Conclusion
We analyzed the defects in 10 processors
Phoenix: novel on-chip programmable HW
Evaluated impact:
150-270 signals tapped
Negligible area, wiring, and performance overhead
Defect coverage: 69% detected, 63% recovered
Algorithm to automatically size Phoenix for new procs
We can now live with defects!!!
Backup
Phoenix Algorithm for New Processors
Generate signal list
Place an SSU-BDU pair in each subsystem
Use k-means clustering to group subsystems in neighborhoods
Size hardware using the rules of thumb
Map signals in errata to signals in the list
Route all signals and realize the logic function
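The k-means step can be illustrated with a small sketch of Lloyd's algorithm. The slide does not say which features are clustered, so using 2-D floorplan coordinates of each subsystem is an assumption here, as are the function name and the deterministic choice of starting centers.

```python
# Hypothetical sketch of k-means (Lloyd's algorithm) grouping subsystems into
# neighborhoods. Clustering on 2-D floorplan coordinates is an assumption.

def kmeans(points, k, iterations=20):
    """Cluster 2-D points; returns a list with the cluster index of each point."""
    centers = list(points[:k])  # deterministic start: the first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
            for x, y in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels
```

Each resulting cluster index would correspond to one neighborhood, with the subsystems sharing an index attached to the same HUB.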
Defect Coverage for New Processors
Similar results obtained for 9 Sun processors: UltraSparc III, III+, III++, IIIi, IIIe, IV, IV+, Niagara I and II
Where are the Critical defects?
The core is well debugged
Most of the defects are in the mem. system