ICPE 2020 — icpe2020.spec.org
TRANSCRIPT

Optimizing Interrupt Handling Performance for Memory Failures in Large Scale Data Centers
Harish Dattatraya Dixit, Fred Lin, Bill Holland, Matt Beadon, Zhengyu Yang, Sriram Sankar
Hardware Sustaining, Infrastructure, Facebook
[Slide figure: per-app logos with MAP counts — 2.45B, 1.6B, 1.3B, 1B]

Globally, there are more than 2.8B people using Facebook, WhatsApp, Instagram or Messenger each month.
Source: Facebook data, Q4 2019. *MAP — Monthly Active People
Contents
• Server Architecture
• Intermittent Errors
• Memory Error Reporting
• Interrupt Handling
  • System Management Interrupts (SMI)
  • Corrected Machine Check Interrupts (CMCI)
• Experiment Infrastructure
• Observations
Server Architecture
• Compute Units
  • Central Processing Unit (CPU)
  • Graphics Processing Unit (GPU)
• Memory
  • Dual In-line Memory Modules (DIMM)
• Storage
  • Flash, Disks
• Network
  • NIC, Cables
• Interfaces
  • PCIe, USB
• Monitoring
  • Baseboard Management Controller (BMC)
  • Sensors, Hot Swap Controller (HSC)

[Diagram: server block diagram — CPUs, DIMMs, PCH, storage, GPU, USB, BMC, TPM, sensors/HSC, fan controller, peripherals, NIC]
Intermittent Errors – Occurrence and Impact
• CPU: Machine Check Exceptions → System Reboots
• DIMMs: Correctable Errors → Bitflips, Interrupt Storms, System Stalls; Uncorrectable Errors → Data Corruptions, System Reboots
• Storage (e.g., Flash, Disks): ECC Errors, RS Encoding Errors → Retries, I/O Bandwidth Loss
• Network (e.g., NICs, Cables): CRC Errors, Packet Loss → Retries, Network Bandwidth Loss
• Interfaces (e.g., PCIe, USB): Correctable Errors, Uncorrectable Errors → Retries, Bandwidth Loss, System Reboots
Memory Error Reporting
• CEs and UCEs are reported via two paths:
  • System Management Interrupts (SMI): handled in System Management Mode, enabled via firmware config
  • Corrected Machine Check Interrupts (CMCI): handled by the Error Detection and Correction (EDAC) kernel driver
• mcelog: OS daemon that logs machine check events

[Diagram: error reporting paths from platform/processor through firmware (SMI/SMM) and kernel (CMCI/EDAC) to the mcelog OS daemon]
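On Linux, the CMCI/EDAC path surfaces its counters through the standard EDAC sysfs interface. The following is a minimal sketch (not the production tooling from the talk) that reads per-memory-controller corrected and uncorrected error counts from that interface; it assumes an EDAC driver is loaded.

```python
#!/usr/bin/env python3
"""Sketch: read corrected/uncorrected memory error counts from the
Linux EDAC sysfs interface (the OS-side counters that the CMCI/EDAC
reporting path feeds). Paths follow the standard EDAC sysfs layout."""
import glob
import os


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {"mc0": {"ce_count": N, "ue_count": M}, ...}."""
    counts = {}
    for mc in sorted(glob.glob(os.path.join(root, "mc*"))):
        entry = {}
        for name in ("ce_count", "ue_count"):
            path = os.path.join(mc, name)
            if os.path.exists(path):
                with open(path) as f:
                    entry[name] = int(f.read().strip())
        counts[os.path.basename(mc)] = entry
    return counts


if __name__ == "__main__":
    for mc, entry in read_edac_counts().items():
        print(mc, entry)
```

Polling these counters periodically (and diffing successive samples) gives a CE rate without any per-error interrupt handling.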
System Management Interrupts
SMI Trigger: memory correctable errors
SMI Handling (in System Management Mode, SMM):
1. Pause all CPU cores and enter SMM
2. Capture the physical address of the error
3. Perform Correctable Error (CE) logging
4. Return from SMM

[Diagram: platform/processor raises an SMI to the firmware logging handler; the Machine Check Handler and NMI Handler feed OS error handling]
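Because SMIs are handled entirely in firmware, the OS cannot observe them directly, but on Intel parts it can count them after the fact via the MSR_SMI_COUNT register (MSR 0x34). A hedged sketch, assuming the `msr` kernel module is loaded (`modprobe msr`) and root privileges:

```python
"""Sketch: count SMIs from the OS side by reading Intel's
MSR_SMI_COUNT (MSR 0x34), which counts SMIs since reset. Requires the
`msr` kernel module and root; the MSR number is Intel-specific."""
import struct

MSR_SMI_COUNT = 0x34  # Intel MSR: number of SMIs since reset


def read_msr(msr, cpu=0, path=None):
    """Read one 64-bit MSR via the msr module's per-CPU device file."""
    path = path or f"/dev/cpu/{cpu}/msr"
    with open(path, "rb") as f:
        f.seek(msr)  # the MSR number doubles as the file offset
        return struct.unpack("<Q", f.read(8))[0]


# Example: sample before and after a workload; the delta is the number
# of SMIs the firmware handled in between.
# before = read_msr(MSR_SMI_COUNT)
# ... run workload ...
# after = read_msr(MSR_SMI_COUNT)
```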
Corrected Machine Check Interrupts
CMCI Trigger: memory correctable errors
CMCI Handling (EDAC kernel driver for error data collection):
1. Collect CEs from each core
2. Aggregate CEs
3. Log aggregated CEs (count per poll) on a randomly assigned core — CPU stall on 1 core only
Repeat every specified polling interval.

[Diagram: platform/processor raises a CMCI; the EDAC/CMCI handler, Machine Check Handler, NMI Handler and SMI logging handler feed OS error handling]
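The three-step CMCI handling loop above can be sketched as follows. This is an illustrative skeleton, not the EDAC driver itself: `read_core_ce_counts` and `log` stand in for the real per-core collection and logging hooks.

```python
"""Sketch of the EDAC-style polling loop: collect per-core corrected
error counts, aggregate them, and log one count per poll from a single
randomly chosen core, so only that core pays the logging stall."""
import random


def poll_once(read_core_ce_counts, log, ncores):
    # 1. Collect CEs from each core.
    per_core = [read_core_ce_counts(core) for core in range(ncores)]
    # 2. Aggregate CEs across cores.
    total = sum(per_core)
    # 3. Log the aggregate from one randomly assigned core; only that
    #    core stalls while logging.
    logging_core = random.randrange(ncores)
    log(logging_core, total)
    return total
```

A real driver repeats `poll_once` every polling interval; the interval choice is the tradeoff explored in Observations 4–6.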
Experiment Infrastructure
Failure Detection – MachineChecker
• Runs hardware checks periodically (on the machine)
• Host ping, memory, CPU, NIC, dmesg, S.M.A.R.T., power supply, SEL, etc.
• A daemon runs the checks periodically and collects output; the Alert Manager (off the machine) creates an alert if a server check fails
Failure Digestion – FBAR
• Facebook Auto Remediation
• Picks up hardware failures, processes logged information, and executes custom-made remediation accordingly
Low-Level Software Fix – Cyborg
• Handles low-level software fixes such as firmware update and reimaging
Manual Fix – Repair Ticketing
• Creates repair tickets for DC technicians to carry out HW/SW fixes
• Provides detailed logs throughout the auto-remediation
• Logs repair actions for further analysis

Pipeline (off the machine): MachineChecker → Alert Manager → FBAR → Cyborg → Repair Ticketing
Experiment Infrastructure – Production System Setup
• Production machines configured with:
  • Step 1: SMI memory error reporting
  • Step 2: CMCI memory error reporting
• Remediation policy: swap memory at 10s of correctable errors per second
• Benchmarks:
  • Reproduce memory errors: Stressapptest
  • Detect performance impact: SPEC (perlbench), fine-grained stall detector
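The remediation policy above is a rate threshold over the CE counters. A minimal sketch, with an illustrative threshold value (the slide only says "10s of CEs per second", not the exact production number):

```python
"""Sketch of the remediation policy: swap a DIMM once the corrected
error rate crosses tens of CEs per second. The threshold value is
illustrative, not the production setting."""

CE_RATE_THRESHOLD = 10.0  # CEs per second ("10s of CEs/sec")


def should_swap_memory(ce_count_before, ce_count_after, interval_s):
    """Decide from two CE-counter samples taken interval_s apart."""
    rate = (ce_count_after - ce_count_before) / interval_s
    return rate >= CE_RATE_THRESHOLD
```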
Experiment Infrastructure
Memory errors in a production environment occur at random, with none of the fixed periodicity seen in an experimental error-injection setup.

[Figure: correctable error count per second (log scale) vs. time in seconds, for machine types M1, M2, M3]
Observation 1: System Management Interrupts (SMI) cause the machines to stall for hundreds of milliseconds, depending on the logging handler implementation — a measurable performance cost for reporting corrected errors.

Application Impact – Example Caching Service
Impact of SMI due to CEs:
• Application request efficiency drops by ~40%
• SMIs stall all cores of a CPU for 100s of ms
• Default configs trigger an SMI for every n errors (n = 1000)
• CEs increase (6200 → 7300) over the run

[Figure: request efficiency (%) and correctable error count vs. time in minutes; efficiency dips as the CE count climbs from 6200 to 7300]
Observation 2: Benchmarks like perlbench within SPEC are useful to quantify system performance, but for variable events we need to augment them with fine-grained detectors to capture performance deviations.

Detecting performance impact using benchmarking
• Perlbench: compare scores with and without SMI stalls — the benchmark returns the same scores (37.17 vs. 37.13)
• Stressapptest: helps surface memory correctable errors due to bad DIMMs
• Stall detection: CPU stall durations of 100s of ms require fine-grained stall detection to observe

No difference observed in scores with or without correctable errors (and the SMI stalls).

[Figure: benchmark scores — perlbench(+stressapp) = 37.17, perlbench(+stressapp + CEs) = 37.13]
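A fine-grained stall detector of the kind the slide calls for can be as simple as a tight loop that timestamps itself and records any gap far above the normal iteration time; averaged benchmark scores hide such gaps, but the loop sees them directly. A minimal sketch, with illustrative threshold and duration values:

```python
"""Minimal fine-grained stall detector: a tight loop timestamps itself
and records any gap well above the normal iteration time, surfacing
100s-of-ms stalls (e.g. from SMIs) that benchmark scores average away.
Threshold and duration values are illustrative."""
import time


def detect_stalls(duration_s=1.0, threshold_s=0.050, clock=time.monotonic):
    """Return the list of observed gaps >= threshold_s (in seconds)."""
    stalls = []
    start = last = clock()
    while last - start < duration_s:
        now = clock()
        gap = now - last
        if gap >= threshold_s:
            stalls.append(gap)  # the loop was stalled/preempted this long
        last = now
    return stalls
```

Injecting a fake clock makes the detector testable; in production the loop would run alongside the workload and report stall histograms.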
Minimizing performance impact using CMCI interrupts

[Slide repeats the CMCI handling flow from the "Corrected Machine Check Interrupts" slide: the EDAC driver collects CEs from each core, aggregates them, and logs the aggregate on one randomly assigned core every polling interval, stalling only that core.]
Observation 3: SMIs are several times more computationally expensive than CMCIs for correctable memory error reporting in a production environment.

SMI vs CMCI performance impact
• SMI: stalls all cores; provides the full physical address of the error
• CMCI: stalls 1 CPU core

[Figure: cumulative stall time (ms) vs. number of correctable errors (1000–8000), SMI (all cores) vs. CMCI (1 core); SMI stall time grows to ~3500 ms]

Results hold for M1, M2, M3 machine types since the stalls are a function of error counts.
Observation 4: With an increased polling interval, the time spent in each individual aggregate-logging stall by the EDAC driver increases.

Every polling interval:
• Log aggregated CEs (randomly assigned core)
• CPU stall on 1 core

Optimizing the polling interval:
• Tradeoff: error visibility frequency vs. individual CPU stall
• Modify the polling interval and measure the maximum individual stall time per core

[Figure: largest individual stall (ms) vs. EDAC polling interval (1–40 s); stalls grow from ~82 ms at 1 s to ~1244 ms at 40 s]
Observation 5: With an increased polling interval for EDAC, frequent context switches are reduced, so the total time a machine spends in stalls decreases.

Every polling interval:
• Log aggregated CEs (randomly assigned core)
• CPU stall on 1 core

Optimizing the polling interval:
• Tradeoff: error visibility frequency vs. total CPU stall
• Modify the polling interval and measure total stall times

[Figure: total stall time (ms) vs. EDAC polling interval (1–40 s); total stall drops from ~37250 ms at 1 s to ~8664 ms at 40 s]
Observation 6: With an increased polling interval for EDAC, we run the risk of overflow in error aggregation.

Every polling interval:
• Log aggregated CEs (randomly assigned core)
• CPU stall on 1 core

Optimizing the polling interval:
• Tradeoff: error visibility vs. CPU stalls
• Modify the polling interval; measure counter overflows and error count variations

[Figure: errors lost per poll vs. EDAC polling interval (1–450 s); no errors are lost up to ~35 s, with losses growing at longer intervals (up to ~421 errors per poll at 450 s)]
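The overflow risk follows from the finite width of the hardware corrected-error counters: if the poll interval exceeds counter capacity divided by the error rate, counts are lost. A back-of-envelope sketch, assuming the 15-bit corrected-error count field of Intel's IA32_MCi_STATUS register (numbers here are illustrative, not from the talk):

```python
"""Back-of-envelope sketch of the overflow tradeoff: per-bank
corrected-error counters have finite width (assumed here: the 15-bit
corrected-error count field of Intel's IA32_MCi_STATUS), so the
polling interval must stay under capacity / error-rate to avoid
losing counts."""

COUNTER_CAPACITY = (1 << 15) - 1  # 15-bit corrected-error count


def max_safe_polling_interval(ce_per_second, capacity=COUNTER_CAPACITY):
    """Longest interval (s) that drains the counter before overflow."""
    return capacity / ce_per_second
```

At error rates in the hundreds of CEs per second this lands in the tens of seconds, the same range the polling-interval experiments above explore.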
Minimizing performance impact using CMCI interrupts

Recommendations:
• For measuring 10s of CEs per second, use CMCI with a polling interval of ~37 s
• At this point the tradeoff is balanced: error visibility is maximized while total stall time is minimized

[Figure: total stall time (ms) and missed error count per poll vs. EDAC polling interval (1–40 s)]
Post Package Repair (PPR)
Memory Error Repair
• DDR4 feature: remaps faulty cells to healthy spare cells in memory
• Requires the physical address of the error to perform PPR
  • SMI provides the physical address of the error; CMCI does not
• Hard PPR (preferred): persistent across reboots
• Soft PPR: not persistent across reboots

To overcome this, use a hybrid approach: CMCI in the production flow, SMI in the remediation flow.

[Diagram: the address decoder remaps a bad bit in the memory cells to a spare cell (Post Package Repair), transparently to data I/O]
Hybrid Error Reporting Approach
• Production machines run with CMCI interrupts
• The daemon runs MachineChecker periodically and collects output; the Alert Manager raises an alert if the error count exceeds the PPR threshold
• FBAR then drives remediation:
  1. Change interrupts (CMCI to SMI) and reduce SMI trigger thresholds
  2. Run benchmarks (memory stress) to reproduce errors and capture physical addresses
  3. Perform Hard PPR
  4. Change interrupts back (SMI to CMCI)
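The hybrid remediation flow above can be sketched as a short sequence. All the hooks on `host` here are hypothetical stand-ins for the FBAR/Cyborg machinery, and the threshold value is illustrative:

```python
"""Sketch of the hybrid reporting flow, with hypothetical host hooks:
production runs with CMCI; when the error count exceeds the PPR
threshold, remediation switches to SMI (which supplies physical
addresses), stresses memory to capture failing addresses, performs
hard PPR, then switches back to CMCI."""


def hybrid_ppr_remediation(host, ppr_threshold=1000):
    if host.error_count <= ppr_threshold:
        return "no-op"                    # stay in CMCI production mode
    host.set_interrupt_mode("SMI")        # SMI reports physical addresses
    host.reduce_smi_trigger_threshold()   # surface errors faster
    addresses = host.run_memory_stress()  # reproduce CEs, capture addresses
    for addr in addresses:
        host.hard_ppr(addr)               # persistent spare-row remap
    host.set_interrupt_mode("CMCI")       # back to low-overhead reporting
    return "repaired"
```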
Conclusion
SMI vs CMCI
• SMIs result in stalls of 100s of ms in production environments
• Benchmarks can be augmented to be sensitive to fine-grained stalls
• CMCI is more efficient for reporting memory errors in production
• CMCI can be further optimized by tweaking polling intervals
PPR
• A hybrid implementation reduces performance impact in production while retaining the benefits of PPR
Questions
Thank you