
Slide 1

What Happens Before A Disk Fails?

Randi Thomas, Nisha Talagala

http://www.cs.berkeley.edu/~randit/Iram/disklogs.html


Slide 2

Motivation

• ISTORE:
– Proposes to take advantage of predicted failures to improve system robustness
– Uses a switched network design to connect intelligent devices to each other to improve system performance
» Therefore ISTORE devices do not share electrical connections
» Is this another ISTORE advantage?

• This talk examines:
– The potential to predict failures for disk devices
– If and how the failure of a device sharing electrical connections with other devices affects those other devices


Slide 3

Just Before a Disk Fails...

• Can we predict the disk failure? To answer we will investigate:
– What kind of log messages does the system generate?
– When do these messages get generated?
– How do we distinguish a failing disk from a non-failing disk?

• Are the other connected devices in the system affected in any way? To answer we will investigate:
– Are there correlations between the logged messages?


Slide 4

* Which Logs on What System?
– The Error Logs Generated by Berkeley’s Tertiary Disk System
– Log Dates: January to November, 1998

* The Tertiary Disk Application
– A Web Accessible Image Collection
– Available 24 hours/day, 7 days/week


Slide 5

Outline

* Tertiary Disk Architecture
• Example of a Log Message
• What Kind of Messages are generated?
• Can we predict the disk failure?
• Are the other connected devices in the system affected in any way?
• Summary and Conclusion


Slide 6

The Tertiary Disk Architecture

• 20 PCs (m0-m19):
– 200 MHz Pentium Pros
– 96 MB of RAM
– Running FreeBSD version 2.2
– Connected through a switched Ethernet network
– Hosts a set of disks using fast-wide SCSI 2 in the single ended mode
» Using twin channel SCSI controllers

• Total of 368 Disks
– 8 GB each
– State of the Art in 1996


Slide 7

The Tertiary Disk Architecture

• 4 PCs (m0 - m3) have 28 or more disks each:
– 2-3 SCSI Chains per PC
– 9-15 Disks per SCSI chain

• 16 PCs (m4 - m19) have 16 disks each:
– 2 SCSI Chains per PC
– 8 Disks per SCSI chain

• SCSI bus made up of:
– SCSI cable: Connects the controller and enclosure
– Backplane of the enclosure


Slide 8

The Tertiary Disk Architecture

[Figure: diagram of one Tertiary Disk node. Labels: To Ethernet Switch, Ethernet, SCSI Controller, SCSI Cable, Disk Enclosure, SCSI Backplane, Terminator.]


Slide 9

Outline

• Tertiary Disk Architecture
* Example of a Log Message
• What Kind of Messages are generated?
• Can we predict the disk failure?
• Are the other connected devices in the system affected in any way?
• Summary and Conclusion


Slide 10

Example of A Log Message

Oct 22 14:53:50 m6 /kernel: (da1:ahc0:0:1:0): WRITE(06). CDB: a c b1 bf 80 0

Oct 22 14:53:50 m6 /kernel: (da1:ahc0:0:1:0): HARDWARE FAILURE info:cb1bf asc:44,0

Oct 22 14:53:50 m6 /kernel: (da1:ahc0:0:1:0): Internal target failure field replaceable unit: 1 sks:80,3

• Month Day Time --> Oct 22 14:53:50
• Machine name --> m6
• Source of message --> kernel reporting message
• Error Device --> disk = da1, SCSI bus = ahc0
• Description of Error --> Write request had a write fault and caused a HW Failure
• More information --> Driver & SCSI Controller Codes
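
The field breakdown above can be sketched as a small parser. This is only an illustration: the regular expression is inferred from the three sample lines on this slide, the name `parse_line` is hypothetical, and the `year` argument is an assumption, since the log lines carry no year (the logs cover January to November 1998).

```python
import re
from datetime import datetime

# Pattern inferred from the three sample lines only; real logs may contain
# other layouts that this expression does not match.
LOG_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<machine>\S+)\s+(?P<source>\S+):\s+"
    r"\((?P<disk>\w+):(?P<bus>\w+):[\d:]+\):\s+(?P<text>.*)$"
)

def parse_line(line, year=1998):
    """Split one log line into the fields named on the slide (hypothetical helper)."""
    m = LOG_RE.match(line)
    if m is None:
        return None  # not a line in the layout shown above
    stamp = datetime.strptime(
        f"{year} {m['month']} {m['day']} {m['time']}", "%Y %b %d %H:%M:%S"
    )
    return {
        "when": stamp,            # Month Day Time (year is assumed)
        "machine": m["machine"],  # e.g. m6
        "source": m["source"],    # e.g. /kernel
        "disk": m["disk"],        # e.g. da1
        "bus": m["bus"],          # e.g. ahc0
        "text": m["text"],        # description and codes
    }
```

Applied to the second sample line, this would yield machine `m6`, disk `da1`, bus `ahc0`, and the text `HARDWARE FAILURE info:cb1bf asc:44,0`.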


Slide 11

Outline

• Tertiary Disk Architecture
• Example of a Log Message
* What Kind of Messages are generated?
• Can we predict the disk failure?
• Are the other connected devices in the system affected in any way?
• Summary and Conclusion


Slide 12

What kind of messages are generated?

• Data Disk Error Messages:
– Hardware Error: The command terminated unsuccessfully due to a non-recoverable hardware failure. (Type is given in the message)
– Medium Error: The operation was unsuccessful due to a flaw in the medium --> usually recommends reassigning sectors
– Recoverable Error: The last command completed with the help of some error recovery at the target --> e.g. if the drive dynamically reassigned a bad sector to an available spare sector
– Not Ready: The drive cannot be accessed at all

• SCSI Error Messages:
– Time Outs: Can happen in any of the SCSI bus phases, i.e. message, data, idle. Response: a BUS RESET command
– Parity: Cause of an aborted request
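
One way to picture the two message groups is a keyword lookup over the message text. The sketch below is hypothetical: only the string "HARDWARE FAILURE" appears in the sample log on the earlier slide, and every other keyword is an assumed spelling chosen for illustration, not taken from the actual logs.

```python
# (keyword, (group, category)). Only "HARDWARE FAILURE" is attested in the
# sample log; all other keywords are assumed spellings.
CATEGORIES = [
    ("HARDWARE FAILURE", ("disk", "Hardware Error")),
    ("MEDIUM ERROR", ("disk", "Medium Error")),
    ("RECOVERED ERROR", ("disk", "Recoverable Error")),
    ("NOT READY", ("disk", "Not Ready")),
    ("TIMED OUT", ("scsi", "Time Out")),
    ("PARITY ERROR", ("scsi", "Parity")),
]

def classify(text):
    """Return (group, category) for a message text, or None if nothing matches."""
    upper = text.upper()
    for keyword, label in CATEGORIES:
        if keyword in upper:
            return label
    return None
```

Keeping the group ("disk" or "scsi") next to the category matters later in the talk, where the distinction between disk messages and SCSI messages is used to separate failing disks from other problems.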


Slide 13

Outline

• Tertiary Disk Architecture
• Example of a Log Message
• What Kind of Messages are generated?
* Can we predict the disk failure?
• Are the other connected devices in the system affected in any way?
• Summary and Conclusion


Slide 14

m0: SCSI Time Outs+Recovered Errors

[Figure: x-axis: date, 4/15/98 to 12/21/98; y-axis: SCSI Bus 0 Disks (0 to 16); series: SCSI Time Outs, Disk Recovered Errors.]


Slide 15

m0: SCSI Time Outs+Recovered Errors

[Figure: x-axis: date, 4/15/98 to 12/21/98; y-axis: SCSI Bus 4 Disks (0 to 16); series: SCSI Time Outs, Disk Recovered Errors.]


Slide 16

m0: SCSI Time Outs+Recovered Errors

[Figure: x-axis: date and time, 10/15/98 12:00 to 10/18/98 0:00; y-axis: SCSI Bus 0 Disks (0 to 16); series: Disk Recovered Errors, SCSI Time Outs.]


Slide 17

m0: SCSI Time Outs+Recovered Errors

[Figure: x-axis: time on 10/16/98, 12:28 to 14:24; y-axis: SCSI Bus 0 Disks (0 to 16); series: Disk Recovered Errors, SCSI Time Outs.]


Slide 18

Can we predict a disk failure?

• Yes, we can look for Recovered Error messages --> on 10-16-98:
– There were 433 Recovered Error Messages
– These messages lasted for slightly over an hour, between:
» 12:43 and 14:10

• On 11-24-98: Disk 5 on m0 was “fired”, i.e. it was about to fail so it was swapped

• Another example...
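
The observation above (hundreds of Recovered Error messages within roughly an hour and a half) suggests a simple burst check over message timestamps. The following is a minimal sketch, not the method used in the talk: the function name `first_burst`, the two-hour window, and the threshold of 100 messages are all assumptions picked for illustration.

```python
from collections import deque
from datetime import timedelta

def first_burst(timestamps, window=timedelta(hours=2), threshold=100):
    """Return the time at which `threshold` messages first fall inside one
    trailing `window`, or None. Window and threshold are assumed values."""
    recent = deque()
    for t in sorted(timestamps):
        recent.append(t)
        # Drop messages that are older than the trailing window.
        while t - recent[0] > window:
            recent.popleft()
        if len(recent) >= threshold:
            return t
    return None
```

Fed the timestamps of one disk's Recovered Error messages, a check like this would flag the 10-16-98 flurry on m0 well before the disk was swapped on 11-24-98, provided the threshold is set below the 433 messages reported.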


Slide 19

m11: SCSI Time Outs

[Figure: x-axis: date, 8/17/98 to 8/27/98; y-axis: SCSI Bus 0 Disks (0 to 10); series: SCSI Time Outs.]


Slide 20

m11: SCSI Time Outs+ Hardware Failures

[Figure, first chart: x-axis: date, 8/17/98 to 8/27/98; y-axis: SCSI Bus 0 Disks (0 to 10); series: SCSI Time Outs.]

[Figure, second chart: x-axis: date, 8/15/98 to 8/31/98; y-axis: SCSI Bus 0 Disks (0 to 10); series: Disk Hardware Failures, SCSI Time Outs.]


Slide 21

Can we predict a disk failure?

• Yes, we can also look for Hardware Failure messages -->
– These messages lasted for 8 days between:
» 8-17-98 and 8-25-98
– On disk 9 there were:
» 1763 Hardware Failure Messages, and
» 297 Timed Out Messages

• Disk 9 on SCSI Bus 0 of m11 was “fired”, i.e. it was about to fail so it was swapped on 8-28-98


Slide 22

Outline

• Tertiary Disk Architecture
• Example of a Log Message
• What Kind of Messages are generated?
• Can we predict the disk failure?
* Are the other connected devices in the system affected in any way?
• Summary and Conclusion


Slide 23

Are the other connected devices in the system affected in any way?

• Yes, observe the Time Out message traffic on other disks on the same SCSI bus for -->
– The same 8 day period:
» 8-17-98 and 8-25-98

• What about predicting other kinds of failures besides just disk failures? -->
– Distinguishing between failing and non-failing disks...
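
The check on neighbouring disks described in the first bullet can be sketched as a per-disk count of Time Out messages on the failing disk's bus. The record layout (dictionaries with `machine`, `bus`, `disk`, and `category` keys) and the function name are assumptions for illustration; they are not taken from the original analysis.

```python
from collections import Counter

def timeouts_on_same_bus(records, machine, bus, failing_disk):
    """Count Time Out messages per disk on one SCSI bus, leaving out the
    failing disk itself. Record layout is an assumed one."""
    counts = Counter()
    for rec in records:
        same_bus = rec["machine"] == machine and rec["bus"] == bus
        if same_bus and rec["category"] == "Time Out" and rec["disk"] != failing_disk:
            counts[rec["disk"]] += 1
    return counts
```

A non-empty result for records restricted to the 8-17-98 to 8-25-98 period would correspond to the effect the slide describes: Time Outs showing up on disks that share a bus with the failing one.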


Slide 24

m2: SCSI Bus 2 Parity Errors

[Figure: x-axis: date, 12/26/97 to 2/4/98; y-axis: SCSI Bus 2 Disks (0 to 15); series: SCSI Parity Errors.]


Slide 25

m2: SCSI Bus 2 Parity Errors

[Figure: x-axis: date, 9/2/98 to 10/22/98; y-axis: SCSI Bus 2 Disks (0 to 15); series: SCSI Parity Errors.]


Slide 26

Can We Predict Other Kinds of Failures?

• Yes, the flurry of parity errors on m2 occurred between:
– 1-1-98 and 2-3-98, as well as
– 9-3-98 and 10-12-98

• On 11-24-98
– m2 had a bad enclosure --> cables or connections defective
– The enclosure was then swapped

• Note: The activity logs are not available for the earlier time period.


Slide 27

Can We Distinguish a Failing Disk From a Non-Failing Disk?

• Yes...
• SCSI Error Messages alone --> No impending disk failure
– As in the m2 Parity example

• Disk Error Messages alone or accompanied by SCSI Error Messages --> High Probability of an impending disk failure e.g.
– ALONE: m0 had only Recovered Error Messages:
» Disk 5 was about to fail and therefore was “fired”
– BOTH: m11 had both Hardware Failure Disk Messages and Time Out SCSI Messages:
» Disk 9 was about to fail and therefore was “fired”
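
The rule on this slide can be written down as a short decision function. This is a sketch of the rule as stated, with a hypothetical function name and return strings; the talk itself rests on only two failed disks, so it should be read as a heuristic, not a validated predictor.

```python
def assess(disk_error_count, scsi_error_count):
    """Apply the slide's rule to message counts for one disk or bus.
    Return strings are illustrative labels, not terms from the talk."""
    if disk_error_count > 0:
        # Disk messages alone, or together with SCSI messages.
        return "disk failure likely"
    if scsi_error_count > 0:
        # SCSI messages alone, as in the m2 parity example.
        return "no impending disk failure"
    return "no evidence"
```

With the counts reported earlier, disk 9 on m11 (1763 Hardware Failure and 297 Timed Out messages) and disk 5 on m0 (433 Recovered Error messages) both land in the first branch, while a bus that logs only parity errors lands in the second.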


Slide 28

Outline

• Tertiary Disk Architecture
• Example of a Log Message
• What Kind of Messages are generated?
• Can we predict the disk failure?
• Are the other connected devices in the system affected in any way?
* Summary and Conclusion


Slide 29

Total Disk & SCSI Errors Per Machine

[Figure: Total SCSI & DISK Errors Per Machine. x-axis: Machine; y-axis: Number of Errors (0 to 2000); series: DISK, Parity, TimeOut.]


Slide 30

Summary and Conclusion

• Disks don’t fail very often
– In the 10 months of logs, only two disks failed
– We have only 2 data points for these conclusions!

• We can predict disk failures and other kinds of failures with enough time to do something about it

• There are correlations between the logged messages:
– Hardware Failure Messages on one disk device propagate as Time Out Messages on:
» not only the failing disk,
» but also other disks on the same SCSI bus


Slide 31

Back Up Slides


Slide 32

m0: SCSI Time Outs

[Figure: x-axis: date, 4/15/98 to 12/21/98; y-axis: SCSI Bus 2 Disks (0 to 16); series: SCSI Time Outs.]