ICPE 2020 — icpe2020.spec.org
TRANSCRIPT

Optimizing Interrupt Handling Performance for Memory Failures in Large Scale Data Centers
Harish Dattatraya Dixit, Fred Lin, Bill Holland, Matt Beadon, Zhengyu Yang, Sriram Sankar
Hardware Sustaining, Infrastructure, Facebook
[Slide figure: per-app logos with MAP counts — 2.45B, 1.6B, 1.3B, 1B]

Globally, there are more than 2.8B people using Facebook, WhatsApp, Instagram or Messenger each month.
Source: Facebook data, Q4 2019. *MAP — Monthly Active People
Contents
• Server Architecture
• Intermittent Errors
• Memory Error Reporting
• Interrupt Handling
  • System Management Interrupts (SMI)
  • Corrected Machine Check Interrupts (CMCI)
• Experiment Infrastructure
• Observations
Server Architecture
• Compute Units
  • Central Processing Unit (CPU)
  • Graphics Processing Unit (GPU)
• Memory
  • Dual In-line Memory Modules (DIMM)
• Storage
  • Flash, Disks
• Network
  • NIC, Cables
• Interfaces
  • PCIe, USB
• Monitoring
  • Baseboard Management Controller (BMC)
  • Sensors, Hot Swap Controller (HSC)

[Diagram: server block diagram — CPUs, DIMMs, PCH, storage, GPU, USB, BMC, TPM, sensors/HSC, fan controller, peripherals, NIC]
Intermittent Errors – Occurrence and Impact
• CPU: Machine Check Exceptions → System Reboots
• DIMMs: Correctable Errors → Bitflips, Interrupt Storms, System Stalls; Uncorrectable Errors → Data Corruptions, System Reboots
• Storage (e.g., Flash, Disks): ECC Errors, RS Encoding Errors → Retries, I/O Bandwidth Loss
• Network (e.g., NICs, Cables): CRC Errors, Packet Loss → Retries, Network Bandwidth Loss
• Interfaces (e.g., PCIe, USB): Correctable Errors, Uncorrectable Errors → Retries, Bandwidth Loss, System Reboots
Memory Error Reporting
• CEs and UCEs are reported via two paths:
  • System Management Interrupts (SMI): handled in System Management Mode, enabled via firmware config
  • Corrected Machine Check Interrupts (CMCI): handled by the Error Detection and Correction (EDAC) kernel driver
• mcelog: OS daemon that logs machine check events

[Diagram: error reporting paths from platform/processor through firmware (SMI/SMM) and kernel (CMCI/EDAC) to the mcelog OS daemon]
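On Linux, the CMCI/EDAC path surfaces its counters through the standard EDAC sysfs interface. The following is a minimal sketch (not the production tooling from the talk) that reads per-memory-controller corrected and uncorrected error counts from that interface; it assumes an EDAC driver is loaded.

```python
#!/usr/bin/env python3
"""Sketch: read corrected/uncorrected memory error counts from the
Linux EDAC sysfs interface (the OS-side counters that the CMCI/EDAC
reporting path feeds). Paths follow the standard EDAC sysfs layout."""
import glob
import os


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {"mc0": {"ce_count": N, "ue_count": M}, ...}."""
    counts = {}
    for mc in sorted(glob.glob(os.path.join(root, "mc*"))):
        entry = {}
        for name in ("ce_count", "ue_count"):
            path = os.path.join(mc, name)
            if os.path.exists(path):
                with open(path) as f:
                    entry[name] = int(f.read().strip())
        counts[os.path.basename(mc)] = entry
    return counts


if __name__ == "__main__":
    for mc, entry in read_edac_counts().items():
        print(mc, entry)
```

Polling these counters periodically (and diffing successive samples) gives a CE rate without any per-error interrupt handling.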
System Management Interrupts
SMI Trigger: memory correctable errors
SMI Handling (in System Management Mode, SMM):
1. Pause all CPU cores and enter SMM
2. Capture the physical address of the error
3. Perform Correctable Error (CE) logging
4. Return from SMM

[Diagram: platform/processor raises an SMI to the firmware logging handler; the Machine Check Handler and NMI Handler feed OS error handling]
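Because SMIs are handled entirely in firmware, the OS cannot observe them directly, but on Intel parts it can count them after the fact via the MSR_SMI_COUNT register (MSR 0x34). A hedged sketch, assuming the `msr` kernel module is loaded (`modprobe msr`) and root privileges:

```python
"""Sketch: count SMIs from the OS side by reading Intel's
MSR_SMI_COUNT (MSR 0x34), which counts SMIs since reset. Requires the
`msr` kernel module and root; the MSR number is Intel-specific."""
import struct

MSR_SMI_COUNT = 0x34  # Intel MSR: number of SMIs since reset


def read_msr(msr, cpu=0, path=None):
    """Read one 64-bit MSR via the msr module's per-CPU device file."""
    path = path or f"/dev/cpu/{cpu}/msr"
    with open(path, "rb") as f:
        f.seek(msr)  # the MSR number doubles as the file offset
        return struct.unpack("<Q", f.read(8))[0]


# Example: sample before and after a workload; the delta is the number
# of SMIs the firmware handled in between.
# before = read_msr(MSR_SMI_COUNT)
# ... run workload ...
# after = read_msr(MSR_SMI_COUNT)
```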
Corrected Machine Check Interrupts
CMCI Trigger: memory correctable errors
CMCI Handling (EDAC kernel driver for error data collection):
1. Collect CEs from each core
2. Aggregate CEs
3. Log aggregated CEs (count per poll) on a randomly assigned core — CPU stall on 1 core only
Repeat every specified polling interval.

[Diagram: platform/processor raises a CMCI; the EDAC/CMCI handler, Machine Check Handler, NMI Handler and SMI logging handler feed OS error handling]
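The three-step CMCI handling loop above can be sketched as follows. This is an illustrative skeleton, not the EDAC driver itself: `read_core_ce_counts` and `log` stand in for the real per-core collection and logging hooks.

```python
"""Sketch of the EDAC-style polling loop: collect per-core corrected
error counts, aggregate them, and log one count per poll from a single
randomly chosen core, so only that core pays the logging stall."""
import random


def poll_once(read_core_ce_counts, log, ncores):
    # 1. Collect CEs from each core.
    per_core = [read_core_ce_counts(core) for core in range(ncores)]
    # 2. Aggregate CEs across cores.
    total = sum(per_core)
    # 3. Log the aggregate from one randomly assigned core; only that
    #    core stalls while logging.
    logging_core = random.randrange(ncores)
    log(logging_core, total)
    return total
```

A real driver repeats `poll_once` every polling interval; the interval choice is the tradeoff explored in Observations 4–6.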
Experiment Infrastructure
Failure Detection – MachineChecker
• Runs hardware checks periodically (on the machine)
• Host ping, memory, CPU, NIC, dmesg, S.M.A.R.T., power supply, SEL, etc.
• A daemon runs the checks periodically and collects output; the Alert Manager (off the machine) creates an alert if a server check fails
Failure Digestion – FBAR
• Facebook Auto Remediation
• Picks up hardware failures, processes logged information, and executes custom-made remediation accordingly
Low-Level Software Fix – Cyborg
• Handles low-level software fixes such as firmware update and reimaging
Manual Fix – Repair Ticketing
• Creates repair tickets for DC technicians to carry out HW/SW fixes
• Provides detailed logs throughout the auto-remediation
• Logs repair actions for further analysis

Pipeline (off the machine): MachineChecker → Alert Manager → FBAR → Cyborg → Repair Ticketing
Experiment Infrastructure – Production System Setup
• Production machines configured with:
  • Step 1: SMI memory error reporting
  • Step 2: CMCI memory error reporting
• Remediation policy: swap memory at 10s of correctable errors per second
• Benchmarks:
  • Reproduce memory errors: Stressapptest
  • Detect performance impact: SPEC (perlbench), fine-grained stall detector
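The remediation policy above is a rate threshold over the CE counters. A minimal sketch, with an illustrative threshold value (the slide only says "10s of CEs per second", not the exact production number):

```python
"""Sketch of the remediation policy: swap a DIMM once the corrected
error rate crosses tens of CEs per second. The threshold value is
illustrative, not the production setting."""

CE_RATE_THRESHOLD = 10.0  # CEs per second ("10s of CEs/sec")


def should_swap_memory(ce_count_before, ce_count_after, interval_s):
    """Decide from two CE-counter samples taken interval_s apart."""
    rate = (ce_count_after - ce_count_before) / interval_s
    return rate >= CE_RATE_THRESHOLD
```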
Experiment Infrastructure
Memory errors in a production environment occur at random, with none of the fixed periodicity seen in an experimental error-injection setup.

[Figure: correctable error count per second (log scale) vs. time in seconds, for machine types M1, M2, M3]
Observation 1: System Management Interrupts (SMI) cause the machines to stall for hundreds of milliseconds, depending on the logging handler implementation — a measurable performance cost for reporting corrected errors.

Application Impact – Example Caching Service
Impact of SMI due to CEs:
• Application request efficiency drops by ~40%
• SMIs stall all cores of a CPU for 100s of ms
• Default configs trigger an SMI for every n errors (n = 1000)
• CEs increase (6200 → 7300) over the run

[Figure: request efficiency (%) and correctable error count vs. time in minutes; efficiency dips as the CE count climbs from 6200 to 7300]
Observation 2: Benchmarks like perlbench within SPEC are useful to quantify system performance, but for variable events we need to augment them with fine-grained detectors to capture performance deviations.

Detecting performance impact using benchmarking
• Perlbench: compare scores with and without SMI stalls — the benchmark returns the same scores (37.17 vs. 37.13)
• Stressapptest: helps surface memory correctable errors due to bad DIMMs
• Stall detection: CPU stall durations of 100s of ms require fine-grained stall detection to observe

No difference observed in scores with or without correctable errors (and the SMI stalls).

[Figure: benchmark scores — perlbench(+stressapp) = 37.17, perlbench(+stressapp + CEs) = 37.13]
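A fine-grained stall detector of the kind the slide calls for can be as simple as a tight loop that timestamps itself and records any gap far above the normal iteration time; averaged benchmark scores hide such gaps, but the loop sees them directly. A minimal sketch, with illustrative threshold and duration values:

```python
"""Minimal fine-grained stall detector: a tight loop timestamps itself
and records any gap well above the normal iteration time, surfacing
100s-of-ms stalls (e.g. from SMIs) that benchmark scores average away.
Threshold and duration values are illustrative."""
import time


def detect_stalls(duration_s=1.0, threshold_s=0.050, clock=time.monotonic):
    """Return the list of observed gaps >= threshold_s (in seconds)."""
    stalls = []
    start = last = clock()
    while last - start < duration_s:
        now = clock()
        gap = now - last
        if gap >= threshold_s:
            stalls.append(gap)  # the loop was stalled/preempted this long
        last = now
    return stalls
```

Injecting a fake clock makes the detector testable; in production the loop would run alongside the workload and report stall histograms.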
Minimizing performance impact using CMCI interrupts

[Slide repeats the CMCI handling flow from the "Corrected Machine Check Interrupts" slide: the EDAC driver collects CEs from each core, aggregates them, and logs the aggregate on one randomly assigned core every polling interval, stalling only that core.]
Observation 3: SMIs are several times more computationally expensive than CMCIs for correctable memory error reporting in a production environment.

SMI vs CMCI performance impact
• SMI: stalls all cores; provides the full physical address of the error
• CMCI: stalls 1 CPU core

[Figure: cumulative stall time (ms) vs. number of correctable errors (1000–8000), SMI (all cores) vs. CMCI (1 core); SMI stall time grows to ~3500 ms]

Results hold for M1, M2, M3 machine types since the stalls are a function of error counts.
Observation 4: With an increased polling interval, the time spent in each individual aggregate-logging stall by the EDAC driver increases.

Every polling interval:
• Log aggregated CEs (randomly assigned core)
• CPU stall on 1 core

Optimizing the polling interval:
• Tradeoff: error visibility frequency vs. individual CPU stall
• Modify the polling interval and measure the maximum individual stall time per core

[Figure: largest individual stall (ms) vs. EDAC polling interval (1–40 s); stalls grow from ~82 ms at 1 s to ~1244 ms at 40 s]
Observation 5: With an increased polling interval for EDAC, frequent context switches are reduced, so the total time a machine spends in stalls decreases.

Every polling interval:
• Log aggregated CEs (randomly assigned core)
• CPU stall on 1 core

Optimizing the polling interval:
• Tradeoff: error visibility frequency vs. total CPU stall
• Modify the polling interval and measure total stall times

[Figure: total stall time (ms) vs. EDAC polling interval (1–40 s); total stall drops from ~37250 ms at 1 s to ~8664 ms at 40 s]
Observation 6: With an increased polling interval for EDAC, we run the risk of overflow in error aggregation.

Every polling interval:
• Log aggregated CEs (randomly assigned core)
• CPU stall on 1 core

Optimizing the polling interval:
• Tradeoff: error visibility vs. CPU stalls
• Modify the polling interval; measure counter overflows and error count variations

[Figure: errors lost per poll vs. EDAC polling interval (1–450 s); no errors are lost up to ~35 s, with losses growing at longer intervals (up to ~421 errors per poll at 450 s)]
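The overflow risk follows from the finite width of the hardware corrected-error counters: if the poll interval exceeds counter capacity divided by the error rate, counts are lost. A back-of-envelope sketch, assuming the 15-bit corrected-error count field of Intel's IA32_MCi_STATUS register (numbers here are illustrative, not from the talk):

```python
"""Back-of-envelope sketch of the overflow tradeoff: per-bank
corrected-error counters have finite width (assumed here: the 15-bit
corrected-error count field of Intel's IA32_MCi_STATUS), so the
polling interval must stay under capacity / error-rate to avoid
losing counts."""

COUNTER_CAPACITY = (1 << 15) - 1  # 15-bit corrected-error count


def max_safe_polling_interval(ce_per_second, capacity=COUNTER_CAPACITY):
    """Longest interval (s) that drains the counter before overflow."""
    return capacity / ce_per_second
```

At error rates in the hundreds of CEs per second this lands in the tens of seconds, the same range the polling-interval experiments above explore.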
Minimizing performance impact using CMCI interrupts

Recommendations:
• For measuring 10s of CEs per second, use CMCI with a polling interval of ~37 s
• At this point the tradeoff is balanced: error visibility is maximized while total stall time is minimized

[Figure: total stall time (ms) and missed error count per poll vs. EDAC polling interval (1–40 s)]
Post Package Repair (PPR)
Memory Error Repair
• DDR4 feature: remaps faulty cells to healthy spare cells in memory
• Requires the physical address of the error to perform PPR
  • SMI provides the physical address of the error; CMCI does not
• Hard PPR (preferred): persistent across reboots
• Soft PPR: not persistent across reboots

To overcome this, use a hybrid approach: CMCI in the production flow, SMI in the remediation flow.

[Diagram: the address decoder remaps a bad bit in the memory cells to a spare cell (Post Package Repair), transparently to data I/O]
Hybrid Error Reporting Approach
• Production machines run with CMCI interrupts
• The daemon runs MachineChecker periodically and collects output; the Alert Manager raises an alert if the error count exceeds the PPR threshold
• FBAR then drives remediation:
  1. Change interrupts (CMCI to SMI) and reduce SMI trigger thresholds
  2. Run benchmarks (memory stress) to reproduce errors and capture physical addresses
  3. Perform Hard PPR
  4. Change interrupts back (SMI to CMCI)
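The hybrid remediation flow above can be sketched as a short sequence. All the hooks on `host` here are hypothetical stand-ins for the FBAR/Cyborg machinery, and the threshold value is illustrative:

```python
"""Sketch of the hybrid reporting flow, with hypothetical host hooks:
production runs with CMCI; when the error count exceeds the PPR
threshold, remediation switches to SMI (which supplies physical
addresses), stresses memory to capture failing addresses, performs
hard PPR, then switches back to CMCI."""


def hybrid_ppr_remediation(host, ppr_threshold=1000):
    if host.error_count <= ppr_threshold:
        return "no-op"                    # stay in CMCI production mode
    host.set_interrupt_mode("SMI")        # SMI reports physical addresses
    host.reduce_smi_trigger_threshold()   # surface errors faster
    addresses = host.run_memory_stress()  # reproduce CEs, capture addresses
    for addr in addresses:
        host.hard_ppr(addr)               # persistent spare-row remap
    host.set_interrupt_mode("CMCI")       # back to low-overhead reporting
    return "repaired"
```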
Conclusion
SMI vs CMCI
• SMIs result in stalls of 100s of ms in production environments
• Benchmarks can be augmented to be sensitive to fine-grained stalls
• CMCI is more efficient for reporting memory errors in production
• CMCI can be further optimized by tweaking polling intervals
PPR
• A hybrid implementation reduces performance impact in production while retaining the benefits of PPR
Questions
Thank you