reliability management 2

7/29/2019 Reliability Management 2


Reliability Management Overview

Course No: B03-004

Credit: 3 PDH

Daniel Daley, P.E., CMRP

Continuing Education and Development, Inc.
9 Greyridge Farm Court
Stony Point, NY 10980

P: (877) 322-5800
F: (877) 322-4774

[email protected]



Reliability Management Overview

Introduction

The field of Reliability Management, or more directly Reliability Engineering, has been in existence for a number of years but remains misunderstood by many individuals in the industry. You will find a number of individuals assigned to positions in which they carry the title of Reliability Engineer when in fact they perform the role of a Structural Engineer, a Rotating Equipment Engineer, an Electrical Engineer or an Instrumentation Engineer. While the functions are not totally exclusive, engineers in the more conventional roles tend to focus on the functionality or integrity of an asset rather than the reliability, availability or maintainability of an asset.

While a comprehensive treatment of the entire field of reliability would take more time than is available here, it is the objective of this course to provide a high level overview of the subject. At the conclusion of this course, the student will understand the difference between the role of a true Reliability Engineer and engineers in more conventional roles. The student will also understand many of the elements that must be managed to ensure that each asset will perform in a reliable manner.

What Do You Have a Right to Expect?

Whether they realize it or not, when most people purchase a new asset, they have certain expectations concerning the reliability of that asset. Over the course of the life of that asset, those same individuals have continuing expectations concerning how that asset should perform.

In a commercial or industrial setting, the same is true. Senior managers of corporations have certain expectations concerning the reliability and availability of the assets they manage. As a result, they have expectations concerning how much product can be produced and how much income the organization will produce.

As with any other characteristic of a physical asset, an important question one should ask is, what do you have a right to expect? For instance, if you somehow remotely ordered a new car but did not specify the color, and the new car was bright yellow when it was delivered, is there a basis for a complaint? If color was not specified, one might assume that color was not important to the buyer.

While many of the details leading to reliability performance are far more subtle than the color of a new car, the steps leading to assurance that they exist are equally black-and-white. If the owner wants an asset to perform with a specific level of reliability, he must take the actions that will produce that performance. If he fails to take those steps, the resulting performance is much like the color of the car described above. It will be the luck of the draw and will depend on factors the owner has chosen not to manage.

Each and every phase during the life of an asset contains situations that can lead to either good reliability performance or to poor reliability performance. If the owner wants to ensure good reliability performance, it is important that he takes steps during those situations in a way that will ensure the level of reliability performance he hopes to achieve. After providing some of the background needed by the student to understand details that will be later described, this course will provide brief descriptions of the steps that owners should take to ensure the desired level of reliability performance.

Definitions

There are a number of important definitions associated with the study of reliability. The definitions are critical in understanding the subtle elements that determine reliability performance and the importance of the tools used to ensure adequate performance for each of those elements.

Let's begin with the definition of the word Reliability. In this context, I will use the word Reliability (capital R) to identify the definition of the term most people are thinking of when they say reliability. I will use the word reliability (small r) when referring to the technical definition of the specific property of asset reliability.

When most people use the term Reliability, they are actually thinking of a characteristic that involves several distinct characteristics. When discussing good reliability or bad reliability, they are actually thinking about the widest spectrum of characteristics that can either give them confidence or cause them costs.

When using the term Reliability, most people are actually thinking of a characteristic that contains the characteristics of reliability, availability and maintainability. Each of these characteristics is distinct and each is the product of separate activities. On the other hand, each of these characteristics has a bearing on the others and therefore should be discussed together.

Defined technically, reliability is a measure of the likelihood that a device will avoid failure over a specific interval of time. As a result, reliability is a statistical measure that results from understanding the actual number of failures that an entire population of a device can be expected to endure in a given interval.
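As a rough illustration of reliability as a population statistic, the sketch below estimates reliability over an interval as the fraction of a device population that survives that interval. The function name and the sample counts are hypothetical and are not taken from the course.

```python
def estimate_reliability(population_size: int, failures_in_interval: int) -> float:
    """Fraction of a device population expected to survive the interval."""
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    if not 0 <= failures_in_interval <= population_size:
        raise ValueError("failures must be between 0 and population_size")
    return (population_size - failures_in_interval) / population_size


# Hypothetical example: 1,000 identical pumps, 50 of which fail within one year.
r_one_year = estimate_reliability(1000, 50)
print(f"Estimated one-year reliability: {r_one_year:.2f}")
```

The estimate is only as good as the failure records behind it, which is one reason the course stresses recording failure information for your own assets.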

It is important to understand that the actual reliability of a component is not the same as the reliability of a complete asset. The reliability of a component can vary based on how it is applied and how it is used. The reliability of a device is typically associated with one or more specific Failure Modes that determine how and when the device will fail.

Another term using the word reliability is the term Inherent Reliability. The Inherent Reliability of an asset is the likelihood that the entire asset will survive without failure over a specific period of time. The Inherent Reliability of a complete asset is based on the configuration of the asset as well as the individual reliability of the components used to construct that asset. For instance, if the configuration of an asset includes redundant components in highly critical locations, it is likely that the Inherent Reliability of the asset will be higher. Also, if components with higher individual reliability are selected over less expensive components with lower reliability, it is likely that the asset will have a higher Inherent Reliability.

The Inherent Reliability of an asset defines the upper limit or the maximum reliability performance the asset can achieve. Achieving the full Inherent Reliability requires that the asset be operated and maintained as well as possible. If the asset is operated or maintained in a sub-optimum manner, it will not be possible to achieve the full Inherent Reliability of the asset.
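To make the effect of configuration concrete, the sketch below uses the standard series and parallel reliability models. These models are not stated in the course and they assume that component failures are independent. The component names and reliability values are hypothetical.

```python
from math import prod


def series_reliability(reliabilities):
    """All components must survive, so multiply the individual reliabilities."""
    return prod(reliabilities)


def parallel_reliability(reliabilities):
    """A redundant set fails only if every member of the set fails."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical asset: motor (0.99), pump (0.90) and controller (0.98) in series.
single_pump = series_reliability([0.99, 0.90, 0.98])

# Same asset with a redundant second pump in the critical location.
redundant_pump = series_reliability(
    [0.99, parallel_reliability([0.90, 0.90]), 0.98]
)

print(f"Single pump:    {single_pump:.4f}")
print(f"Redundant pump: {redundant_pump:.4f}")
```

Under these assumptions the redundant configuration has the higher calculated reliability, which matches the qualitative point made above about redundancy in critical locations.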

Another term that most people roll into their intuitive definition of Reliability is the characteristic of availability. Availability is a measure of the portion of time an asset is able to perform its intended function. The total availability of an asset is typically reduced by two factors:

1. Availability is reduced by the amount of time required to recover from unplanned failures. This portion of lost availability is dependent on both the unreliability (or frequency of failures) and the owner's ability to respond to the failures in a timely and effective manner.

2. Availability is also reduced by the amount of time the asset is shut down to perform planned predictive or preventive maintenance.

In simple terms:

Availability = (Total Time - (Planned Down Time + Unplanned Down Time)) / Total Time

Where,

Unplanned Down Time = Sum (Unplanned Failures x Time to Respond to each failure)
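The availability relationship can be sketched in a few lines, reading it as total time minus all down time, divided by total time. The function names and the hour values are hypothetical and are chosen only to show the arithmetic.

```python
def unplanned_down_time(response_times_hours):
    """Sum of the time taken to respond to each unplanned failure."""
    return sum(response_times_hours)


def availability(total_hours, planned_down_hours, response_times_hours):
    """(Total Time - (Planned + Unplanned Down Time)) / Total Time."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    down = planned_down_hours + unplanned_down_time(response_times_hours)
    if down > total_hours:
        raise ValueError("down time cannot exceed total time")
    return (total_hours - down) / total_hours


# Hypothetical year: 8,760 hours, 200 planned down hours, and three
# unplanned failures that took 10, 24 and 6 hours to recover from.
a = availability(8760, 200, [10, 24, 6])
print(f"Availability: {a:.4f}")
```

Keeping planned and unplanned down time as separate inputs mirrors the two factors listed above, so the effect of fewer failures can be compared with the effect of shorter planned outages.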

Typically, major assets will require some form of major maintenance event at a number of points over the life of the asset. These major maintenance events are called overhauls, turnarounds or outages. Because these major maintenance events take so long and cost so much money, they are a major concern.


Within each major asset, there are one or more components that tend to determine how frequently the outages will need to take place. There are also one or more components that determine how long the asset will be shut down for repairs or renewal. We will call the components that determine the maximum length of time between outages "run-limiters" and the components that determine the amount of time the asset must be shut down "duration-setters".

The run-limiter is typically a wearing component that becomes worn to the point that it can no longer perform its intended function. By analyzing the asset and identifying the run-limiter, it is possible to make that component more robust or capable of enduring more wear and thus extend the period of time between outages.

The duration-setter is typically a component that is either buried deep in the asset or one that requires a time-consuming renewal process that entails a critical path duration longer than any other component in the asset. Again, by identifying the duration-setter it is possible to redesign the asset in a manner that reduces the critical path duration and therefore the amount of time the asset is out of service.
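One simple way to picture the analysis is to tabulate, for each component, how long it can run and how long it takes to renew. The sketch below does this with hypothetical component data and hypothetical field names. It treats the shortest wear life as the run-limiter and the longest renewal time as the duration-setter, which is a simplification of a real critical path analysis.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    wear_life_months: float  # how long it can run before renewal is needed
    renewal_hours: float     # critical-path time to renew it during an outage


def find_run_limiter(components):
    """The component with the shortest wear life sets the time between outages."""
    return min(components, key=lambda c: c.wear_life_months)


def find_duration_setter(components):
    """The component with the longest renewal time sets the outage length."""
    return max(components, key=lambda c: c.renewal_hours)


# Hypothetical compressor data, for illustration only.
parts = [
    Component("seal", wear_life_months=18, renewal_hours=12),
    Component("bearing", wear_life_months=36, renewal_hours=30),
    Component("rotor", wear_life_months=60, renewal_hours=96),
]
print("Run-limiter:", find_run_limiter(parts).name)
print("Duration-setter:", find_duration_setter(parts).name)
```

Once the two components are identified, the redesign effort described above can be aimed at them first.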

The other characteristic mentioned as an element that many people include in their intuitive definition of Reliability is maintainability. Maintainability is a measure of the ability to restore the Inherent Reliability in a ratable period of time. There are two characteristics important to maintainability. The first is the ability to restore the Inherent Reliability and the second is the ability to do so in a ratable period of time. A ratable period of time is a known or repeatable amount of time.

The characteristic of maintainability is most easily described by showing how one might perform a maintainability review of a new asset.

Each component of a new asset has a specific reliability based on one or more specific Failure Modes and an expected usable life. During the usable life of the asset, each component will require both proactive maintenance and reactive maintenance. The proactive maintenance consists of the predictive and preventive maintenance tasks needed to minimize unplanned failures by preventing deterioration and to restore the component to good-as-new condition at the end of its useful life. The reactive maintenance consists of repairs needed to restore asset functionality and Inherent Reliability after an unplanned failure.

Once all forms of proactive and reactive maintenance that will be needed over the entire life of an asset have been identified, it is possible to review those tasks to see if they are maintainable. If the tasks include steps that are of an unsure duration or that produce uncertain results, the task is not maintainable. An example of a task of unsure duration is one that requires the mechanic or technician to work in an unsafe or awkward position. An example of a task that will produce uncertain results is one that concludes without a functionality test or one that contains a step requiring an attachment with an adhesive that requires special conditions to cure.

Knowing that specific tasks that will be required over the life of an asset are unmaintainable gives the reliability engineer the opportunity to redesign the asset to produce a product that is maintainable, thus ensuring the desired reliability and availability.
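The review described above can be sketched as a simple screen over the list of maintenance tasks. The two flags used here (repeatable duration and a concluding functionality test) are a hypothetical reduction of the criteria in the text, and the task names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class MaintenanceTask:
    name: str
    duration_is_repeatable: bool   # False if, e.g., work is done in an awkward position
    ends_with_function_test: bool  # False if the result cannot be verified


def is_maintainable(task: MaintenanceTask) -> bool:
    """A task is maintainable only if its duration and its result are both certain."""
    return task.duration_is_repeatable and task.ends_with_function_test


def review(tasks):
    """Return the names of tasks that are candidates for redesign."""
    return [t.name for t in tasks if not is_maintainable(t)]


# Hypothetical task list for a new asset.
tasks = [
    MaintenanceTask("replace filter", True, True),
    MaintenanceTask("renew seal under deck", False, True),
    MaintenanceTask("bond sensor with adhesive", True, False),
]
print(review(tasks))
```

A real review would record why each task fails the screen, since the reason points to the redesign that is needed.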

Returning to the introduction of this short course, the student should think about the assets with which he is familiar. How many of those assets have been exposed to a comprehensive maintainability review? How many have been analyzed to identify the run-limiters and the duration-setters or have been redesigned to increase the availability? Absent the maintainability review, how is it possible to ensure the asset can perform at the desired reliability and availability over its entire life? Also, look around at your current organization: who is expected to perform the maintainability review described above? While some organizations have individuals with the title Reliability Engineer, few have constructed roles for individuals in those positions that will ensure reliability, availability and maintainability performance at specific desired levels.

Patterns and Relationships

Like so much of engineering, the management of Reliability depends on observation of patterns of events and the relationship of those patterns of events with failures. Unlike many other engineering disciplines, the observations of patterns and relations have not been established and codified by individuals with names like Newton, Planck, Bernoulli, Ohm and others found in textbooks. In the business of reliability of your assets, you are the one who will need to record information and analyze it to identify the patterns and relationships leading to the failures of your assets. Even if two of the same assets were purchased on the same day by two different companies, they would be likely to have different Reliability performance. That is because no two companies use their assets in exactly the same way. As a result, the patterns of events leading up to a failure and the usable life and failure modes are likely to differ. For that reason, it is important that the Reliability Management Process and the Reliability Engineers become experts on the patterns and relations specific to their own company.

One of the well-known ways of describing the relationships between a specific pattern of events and an associated failure is a diagram describing the P-F Interval of a specific Failure Mode. In this context, the term "P" refers to the earliest point that the potential for failure is known to exist. The term "F" refers to the failure event or the point at which the component in question has experienced the amount of deterioration needed to produce a failure.


The chart below is intended to describe the P-F interval starting with the point when the Total Base Number of a lubricant has become too low and ending with the point when a related failure occurs.


The following are the definitions of the sequential elements on the chart:

TBN Normal (inherent or additive-based reserve alkalinity, or ability to neutralize acidity)

TBN Marginal

TBN Low

Source of acidity introduced

TAN Increases (acidic concentration of oil)

pH reaches a point where component deterioration is possible, as evidenced by some other measure

Component Deterioration

Deterioration to the point of added costs to repair or replace components

Deterioration to the point that engine failure is possible

TBN is the Total Base Number of the oil sample; TAN is the Total Acid Number of the sample.

When analyzing the P-F chart in general, it is possible to say the following:

As long as the TBN is normal, there is little or no risk of damage to the system as a result of acid contamination.

When a marginal TBN is detected, there is a warning that something is consuming the reserve alkalinity. There is no immediate concern over acid-related damage because, with reserve alkalinity, the lubricant cannot become acidic.

When TBN is low, there is an immediate concern over introduction of acidity.

The introduction of a source of acidity is typically not something under the direct control of the operator, so this event can happen at any time.

Unless the oil sampling process is on a frequent and highly structured routine, finding a sample with high TAN can be a hit-or-miss process.

Once the pH of the oil becomes too low, any lubricated surface can be subject to acid attack.

Depending on how much acid exists in the system and how heavily the asset in question is currently loaded, deterioration to sensitive components can begin quite soon.

At some point the amount of damage to components will add to the repair costs.

Finally, depending on the amount of deterioration and the current load on the asset, a catastrophic failure can occur.
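The sequence above can be turned into a simple screen for oil sample results. In the sketch below, every threshold value is a placeholder chosen for illustration, not a recommended limit. As the course goes on to note, the right limits depend on the specific engine, oil and operating pattern.

```python
def classify_oil_sample(tbn, tan, tbn_marginal=6.0, tbn_low=3.0, tan_high=4.0):
    """Place an oil sample on the generic TBN/TAN sequence described above.

    All threshold defaults are placeholders, not recommended limits.
    """
    if tan >= tan_high:
        return "TAN high: acid attack on lubricated surfaces is possible"
    if tbn <= tbn_low:
        return "TBN low: immediate concern over introduction of acidity"
    if tbn <= tbn_marginal:
        return "TBN marginal: reserve alkalinity is being consumed"
    return "TBN normal: little or no risk from acid contamination"


# Hypothetical sample results (TBN and TAN in mg KOH/g).
for tbn, tan in [(9.0, 1.0), (5.0, 1.5), (2.5, 2.0), (2.0, 5.0)]:
    print(tbn, tan, "->", classify_oil_sample(tbn, tan))
```

The TAN check comes first because a high acid number signals a later stage on the P-F interval than a low base number.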

While the pattern described above is generic in nature, the timing and degree of the elements it contains are specific to each owner. For instance, one manufacturer of locomotive engines produces only four-cycle engines and the other produces only two-cycle engines. The likelihood of acid gas bypassing from engine exhaust gases is different for the two kinds of engines. As a result, the relative timing and associated risk of the P-F Interval for these two types of engines will differ. The manner in which a reliability engineer for a company using one type of engine responds will be different than that of a reliability engineer for a company exclusively using the other type of engine.

While the generic pattern and relationships for an entire industry may be very similar, ultimately, the patterns and relationships are very much company specific. A significant portion of Reliability Management and Reliability Engineering is understanding the patterns and relationships important to their industry. Even more important is applying that knowledge to the specific patterns of their own company and plant.

In the example provided above, the rate at which the total base number may deteriorate may be similar among a variety of different users. On the other hand, the timing at which the source of acidity is added may vary significantly. By way of example, there are two major manufacturers of freight locomotives. The engines in the units provided by one manufacturer are two-stroke engines and the other manufacturer produces four-stroke engines. This difference will result in some differences in the rate at which the lubricating oils are exposed to blow-by exhaust gases, the most common source of acidity in engines. As a result, the shape of the P-F curve and the timing of the relationship between TBN deterioration and engine damage would be different in these two cases. The reliability engineer working for a company using primarily two-stroke engines would need to respond to this pattern and the associated relationship differently than the reliability engineer working for a company using primarily four-stroke engines.

The following is another, far more common example that uses oil as a basis for understanding patterns and relationships and how they may vary from situation to situation:


In this case, the patterns begin with easily controllable activities including:

Timely oil changes

Timely oil sampling

Routine checking of oil temperature by sensing the temperature of the bearing housing

Routine observation of the oil color by viewing the oil bulb on the side of the bearing housing

The pattern of events leading to ultimate failure continues with the following steps:

Deterioration or thinning of the oil to the point that the film thickness between the shaft and the bearing is no longer adequate

Damage to the rotating shaft or the bearing

Damage to the overall pump where it will no longer perform its intended function

Catastrophic failure of the pump leading to a plant fire

The value of understanding the pattern of events and the relationship between early signs of deterioration and events leading to failure is that the reliability engineer can describe how best to respond to them.


Timely Oil Change               How determined?
Timely Oil Sampling             How determined?
Check Bearing Housing Temp.     Who does it and when?
Check Oil Color in Oiler        Who does it and when?
Oil Film Thickness              How can this be done?
Bearing Damage                  How detected?
Shaft Damage                    How detected?
Pump Damage                     How detected?
Fire - Plant Damage             How detected?

The first four items on the list are relatively simple to manage. For the first two, it is necessary only to determine the most cost-effective timing to perform the tasks. The second two items depend on deciding who in the organization has the best opportunities (available time and ready access to make observations).

The fifth item, determining the oil film thickness in a timely manner when it begins to thin, is impossible. The last four items are events that occur at a time over which the reliability engineer has no control. Once bearing damage has begun, the following events can happen instantaneously or they can take quite some time to occur. As a result, the P-F curve for this particular situation tells us that we need to act on the activities over which we have control. We need to change oil on time, sample it on a regular interval, monitor temperature regularly and monitor color on frequent, regular intervals.

While the two examples provided above are very specific in nature, the concept of being observant and identifying patterns leading to deterioration and their relationship with ultimate failures is one that can be applied to numerous situations in an industrial or commercial setting.

Roles in Reliability

While there are a number of roles in a typical maintenance or reliability organization,

there are several that should be emphasized as to their importance in relation to

performing the day-to-day activities involving processing information and investigating

failures.

Diagnostician: In handling the activities that occur as the result of a system failure, one role that is important to the effective and efficient use of information is the Diagnostician. While this might be a single job in a large organization, it might also be a shared job in a smaller organization. The important point is that the activity of performing a diagnosis is one that is separate and distinct.


A diagnosis is an activity that can be performed remotely based solely on the content of a properly structured Malfunction Report and the information contained in historical files and current operations data. Based on that information, a diagnostician should be able to identify the most likely Failure Mode as well as other possible Failure Modes in ranked order. The diagnostician provides instructions to the troubleshooter who is next involved in the chain of events leading to a repair.

Troubleshooter: Troubleshooting is an invasive activity that entails disassembly of the failed system to identify the failed component and the condition of that component. A troubleshooter should be provided with instructions concerning where to start by the diagnostician. Starting disassembly without knowing the location of the most likely Failure Mode is problematic, because it leads to wasted time and, occasionally, introduction of new defects into the system that did not exist before the troubleshooting began.

The troubleshooter identifies the failed component and provides a detailed description of all the actions needed to perform a complete and thorough repair.

Failure Analyst: Once the repairs are underway or complete, another step accomplished by an organization interested in learning and preventing future failures is Failure Analysis. Failure Analysis is the step that identifies the Failure Mechanism. While the Failure Mode is the result of deterioration that has been occurring for some time up to the failure, the Failure Mechanism is nature's tool for forcing the deterioration to occur.

In mechanical components, there are four forms of Failure Mechanisms:

Corrosion

Erosion

Fatigue

Overload

As described above, these mechanisms causing deterioration exist in nature, and if the techniques used to prevent them are not maintained, the deterioration will be allowed to proceed unabated until the Failure Mode is present and a failure can occur.

For instance, a protective coating may prevent uniform corrosion. Absent the coating, uniform corrosion can occur, resulting in metal loss, thinning and ultimate failure. Another example is the rubber seal that keeps moisture out of an electric cabinet. Inside the cabinet are a variety of dissimilar metals. If the seal is not maintained, water can intrude, setting up a battery between more noble and less noble metals. Ultimately this can either result in metal loss leading to failure or it can produce sufficient corrosion products that will find their way into the contacts and cause further deterioration and failure.

In order to understand the Failure Mechanism that led up to the Failure Mode, it is important for a Failure Analyst who is familiar with the various forms of Physics of Failure to investigate all significant failures. One point to keep in mind is that if some form of Failure Mechanism is actively working in one part of your asset, it is likely working in other parts. Identifying and eliminating that Failure Mechanism wherever it exists can prevent a number of failures, not just the one that was investigated.

Cause Analyst: While most failures have a variety of causes, there are at least three levels of cause:

Physical Cause

Human Cause

Systemic Cause

At the lowest level, there are always one or more Physical Causes that unleash nature's Failure Mechanism so deterioration can begin. In the cases described above, the absence of the protective coating or the absence of the rubber seal is the Physical Cause that allowed the direct contact of water with the metal surfaces. In each case, that resulted in the Failure Mechanism of Corrosion starting the process of deterioration.

At a level above the Physical Cause is the Human Cause. The Human Cause is a specific person who either acted or failed to act in a manner that resulted in the Physical Cause. In the case of the rubber seal on the electrical enclosure, several individuals may be the Human Cause:

If the maintenance work order called for maintaining the rubber seal at some point and the work was not done, the crafts person who was assigned to perform the work order may be the Human Cause.

If the electrical engineer assigned to create the maintenance work order failed to identify the need to regularly inspect and maintain the rubber seal, he might be the Human Cause.

If the manager over the area where the electrical enclosure is located decided to save some money by removing the seal maintenance task from the work order, he would be the Human Cause.

In any case, it is imperative that the specific person who is the Human Cause be identified. Speaking to that person is the only way the Systemic Cause will be identified.


The Systemic Cause is best described as a trap that exists in the organization, the procedures, the accepted practices or the behaviors of the overall organization that allows the Human Cause to act in a manner that produces the Physical Cause.

In the examples described above:

If the crafts person can pick and choose what portions of the work order he wants to complete, the culture has created a trap that has led to this failure.

If the electrical engineer does not have the time or has not been provided the guidance to get out in the field and identify the need for replacing the seal, there is another systemic weakness.

If cost cutting and budgetary restraints have caused the manager to remove needed work from the work order, there is a different systemic cause.

In any case, the Systemic Cause that is identified can affect a much broader array of issues than just the one being investigated. The person performing Cause Analysis can either be the same person as the Failure Analyst or a different person. The two roles typically require different skill sets and are accomplished in different settings, so a single individual assigned to both roles typically does one better than the other.

Governing Patterns in the Life of a Single Failure

The descriptions provided above tended to focus on specific details concerning either the patterns and relations leading to failures or the roles and skills involved in a reliability organization.

This section will provide the first of two general patterns that are important to understand when creating systems that will properly address failures, gather information on failures and take the steps needed to ensure a long and reliable life for an asset.

The first of the general patterns to be described is one that identifies the steps that occur in either an overt or passive manner during the life of a single failure event. If handled in the proper manner, these steps can lead to effective and efficient handling of the single event. If handled well, these steps can also lead to a permanent solution to the problems that caused the failure to occur. Handled poorly, the defect causing the immediate event may not be removed. This can lead to a repeat of the failure and a general deterioration of the inherent reliability of the system.

The title used herein for this pattern is the Path to Failure and Corrective Action because it identifies all the discrete steps leading to a failure and resolving it. While many organizations may not openly acknowledge the presence of each step in the process as it is described, the steps exist and are being handled in a passive manner without specific focus or intentional management.


As described above, any failure event typically begins with a Systemic Cause. The Systemic Cause creates a trap into which an individual can fall. The next step is the Human Cause in which a specific individual either does something or fails to do something that produces a Physical Cause. The third step is the Physical Cause. This is the step in which the form of prevention used to restrain nature's Failure Mechanisms is ignored.

The next step is the Failure Mechanism. The Failure Mechanism (such as corrosion, erosion, fatigue or overload) is a natural process that causes on-going deterioration. At some point in time, the amount of deterioration reaches the point at which a defect is present. The defect is not the same as the failure. In the case of corrosion, the defect might be the point at which deterioration has removed enough metal that the component is no longer able to handle the load at its maximum condition. Once the defect is present, the failure clock is ticking and the failure is waiting only on a situation when the loading is sufficient to cause a break at the point of maximum deterioration. In the case of a rusted pipe, the leak may not occur until the system pressure is raised to higher than normal pressure.
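The distinction between the defect and the failure can be illustrated with a minimal sketch. It assumes, purely for illustration, that deterioration removes a fixed amount of load capacity each year; the function name and all of the numbers are hypothetical and are not taken from the course material.

```python
def defect_and_failure_years(initial_capacity, loss_per_year,
                             max_design_load, yearly_loads):
    """Return (defect_year, failure_year); either value may be None.

    Assumes capacity falls linearly with time (a simplification).
    The defect exists once capacity drops below the maximum design load.
    The failure occurs only when an applied load actually exceeds capacity.
    """
    defect_year = None
    failure_year = None
    for year, load in enumerate(yearly_loads, start=1):
        capacity = initial_capacity - loss_per_year * year
        if defect_year is None and capacity < max_design_load:
            defect_year = year          # the failure clock starts ticking
        if capacity < load:
            failure_year = year         # loading finally exceeds capacity
            break
    return defect_year, failure_year


# Normal load of 60 every year, with one pressure excursion to 95 in year 8.
loads = [60] * 7 + [95] + [60] * 4
print(defect_and_failure_years(150, 10, 100, loads))  # (6, 8)
```

In this hypothetical case the defect is present from year 6, but the failure waits until the higher than normal load arrives in year 8. The interval between the two is the window in which an alert person can find the defect.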


Once the failure occurs, the next step is the Malfunction Report. In some cases, the structure used for issuing a Malfunction Report ensures that clear and accurate information is provided. In other cases, the information provided is useless and provides little help in driving toward an accurate diagnosis and an ultimate solution.

The next three steps tend to work together in the effort to identify the Failure Mode and the ultimate resolution. The Diagnosis is the hands-off, remote activity of using available information to identify likely Failure Modes. This information can be used to perform triage, or prioritization, of the handling of current issues. It can also be used for advance preparation of tools and materials. The next step is funneling, or analysis of all possible Failure Modes to determine which should be approached and in what order. The most likely Failure Mode is typically the first to be approached by the troubleshooter. On the other hand, there might be a relatively low-likelihood Failure Mode that requires little time or effort to reset. That item might be attempted first on the off-chance it produces immediate resolution. Restarting (power-cycling) a computer is an example of such a fix.
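One possible way to express the funneling order is sketched below. The rule used here (try anything that takes less than a set threshold of effort first, then work down the remaining Failure Modes by likelihood) is only one reasonable reading of the text. The class name, the threshold and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureModeCandidate:
    name: str
    likelihood: float     # estimated chance this mode explains the failure (0 to 1)
    effort_hours: float   # estimated time to check or reset it


def funnel(candidates, quick_check_hours=0.25):
    """Order candidates: cheap resets first, then the rest by likelihood."""
    quick = [c for c in candidates if c.effort_hours <= quick_check_hours]
    rest = [c for c in candidates if c.effort_hours > quick_check_hours]
    quick.sort(key=lambda c: c.likelihood, reverse=True)
    rest.sort(key=lambda c: c.likelihood, reverse=True)
    return quick + rest


candidates = [
    FailureModeCandidate("power supply failed", 0.5, 2.0),
    FailureModeCandidate("controller needs restart", 0.1, 0.1),
    FailureModeCandidate("loose connector", 0.3, 1.0),
]
for candidate in funnel(candidates):
    print(candidate.name)
```

The low-likelihood restart is listed first only because it costs almost nothing to attempt; the remaining candidates follow in order of likelihood.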

Troubleshooting is the third of these activities. Since it is hands-on and invasive, it takes the most time and it can also introduce new defects into the system. Highly directed troubleshooting typically leads to the most effective and efficient repairs.

The next two items also fit together in terms of identifying the Failure Mode that caused the failure. The troubleshooter should identify the defective component that caused the failure. He should also describe the condition of that component that led to the defect. Frequently, replaced components that restore functionality are not defective. Simply getting inside and shaking things up restores functionality by restoring poor contacts. In this case, the replaced component is not defective, the true defect has not been removed and the inherent reliability of the system has not been restored. Thus it is critical that the troubleshooter produce a truly defective component. It is also important that the troubleshooter describe the condition of the defective component. Without such a description, it is impossible to identify the Failure Mechanism. Without knowing the Failure Mechanism, it is impossible to eliminate the natural process forcing deterioration.

The next two steps in the process are Failure Analysis and Cause Analysis. As described earlier, Failure Analysis is an analysis of the physics of failure to determine the source of the deterioration leading to failure. Cause Analysis is the humanistic and organizational part of the Corrective Action process.

The chart provided above includes two other elements. The first is identification of Failure Mechanisms that are in the process of producing deterioration before failure. It is important to note that the Failure Mechanism is hard at work for a long period before the component is deteriorated to the point that failure is possible. If the Failure Mechanism is identified and remedied before the failure, it is possible to prevent the failure from occurring.

The second element highlights the opportunity to identify the defect after it has formed but before the failure has occurred. As mentioned earlier, the formation of the defect and the failure are not always simultaneous. If an alert person can find a defect that exists before a failure, it is also possible to prevent the failure from occurring.

When each of these steps is recognized and handled correctly, it is possible to:

- Handle each individual incident in an effective and efficient manner
- Gather the information needed to handle future incidents in an effective manner
- Gather information needed to prevent this incident in the future
- Gather information needed to prevent similar incidents caused by the same Failure Mechanism or Failure Mode

Investigation of the typical patterns of response to individual failures and how those

patterns ultimately result in either permanent solutions or reduced inherent reliability will

be useful in identifying the actions that must be taken to improve reliability

performance.

Governing Patterns in the Overall Life of an Asset

The overall lifecycle of an asset can be broken into a series of processes that ultimately determine the reliability, availability and longevity of the asset. Those processes include the following:

Design and build: The inherent reliability of an asset is based on the configuration of the asset as well as the individual reliability of the components that were selected. Many conventional design processes address only the functionality and the structural robustness of an asset. While necessary, these forms of analysis do little to ensure the reliability, availability or maintainability of an asset. Also, they do not determine the usable life of the asset. In order to address those characteristics, it is necessary to perform Design for Reliability concurrently with conventional design activities.

Operate: Even if an asset has been designed so that its Inherent Reliability is capable of providing the performance desired by the owner, that is only the beginning. In order to provide the required performance, the asset must be operated and maintained in a manner that harvests all the Inherent Reliability. There are two things that an operator of an asset can do to support good reliability.

First, the operator can operate the asset in such a manner that he will "Do No Harm." If the operator understands the Failure Modes and the Failure Mechanisms that may deteriorate the equipment he operates, and he understands the choices he can make that will either prevent or cause deterioration, he can choose the path that will reduce the amount of harm that results from poor operation.

Second, the operator can "Do Some Good" as part of his operating practices. In earlier discussions of P-F intervals, the opportunity for someone to regularly observe oil temperature and oil color was mentioned. Operators are typically the resource with the greatest and most frequent opportunity to make these observations. If the operator takes action when the bearing housings of equipment items are too hot or when the oil is just beginning to discolor, it is possible for the operator to "Do Some Good."

While the examples provided are fairly simple, there are myriad opportunities for operators to modify the way they interface with equipment to positively affect the equipment reliability so deterioration is avoided and the full Inherent Reliability is achieved.

Inspect: Throughout the life of an asset, there are situations in which individuals with specialized expertise perform inspections of the equipment. Frequently the individuals are trained to recognize corrosion, vibration, electrical system deterioration and a variety of other forms of specialized patterns that provide clues pertaining to incipient or on-going deterioration. These individuals are made more effective if they are made aware of Failure Modes and Failure Mechanisms known to be present in the actual systems being inspected. Rather than looking for everything, they can focus their attention on problems that have happened in the past and are likely to occur again in the future.

Maintain: One of the on-going routines during the lifecycle of an asset is the need to perform maintenance. There are two forms of maintenance: proactive and reactive. Proactive Maintenance can be either Predictive or Preventive. Predictive Maintenance is typically non-invasive and uses special tools and techniques to identify signs of incipient failure or on-going deterioration. Activities performed as part of an inspection are typically intended to be predictive in nature. Preventive Maintenance is another form of Proactive Maintenance. Preventive Maintenance consists of tangible activities performed to exchange or renew a component based on knowledge that the time has come for the replacement to occur.


Reactive maintenance is the form of maintenance that is accomplished in response to a failure. The Path to Failure and Corrective Action described above provides a comprehensive description of the steps included in reactive maintenance.

In either case (proactive or reactive maintenance) the objective is not simply to restore or ensure the functionality of an asset. The objective is to maintain or restore the inherent reliability of the asset.

Overhaul, Turnaround or Outage: On a regular but non-routine basis, major assets need to be maintained using a major effort that is called an overhaul, outage or turnaround depending on the industry. These events are highly costly and they have the single greatest impact on the availability of the asset involved. Most systems have either one or a small number of components that reach the end of their useful life and, as a result, set the timing when the overhaul, outage or turnaround must occur. For simplicity we can call these "run-limiters." There is also one or a small number of components that require the longest critical path of activities during the overhaul, outage or turnaround. Again for simplicity, we can call these items the "duration-setters."

Apart from reducing the number of failures and thus reducing the amount of unplanned outage time, the best way to improve the availability of an asset is to:

1. Make the run-limiters more robust so the asset can run longer between outages

2. Make the duration-setters more maintainable so the duration of outages is shorter
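The effect of these two levers on planned availability can be sketched with a simple ratio. The model below is a simplification that ignores unplanned failures, and the run lengths and outage durations are hypothetical values chosen only for illustration.

```python
def planned_availability(run_weeks, outage_weeks):
    """Fraction of each run-plus-outage cycle in which the asset is available."""
    return run_weeks / (run_weeks + outage_weeks)


# Hypothetical baseline: a 104-week run followed by a 6-week turnaround.
baseline = planned_availability(run_weeks=104, outage_weeks=6)

# Lever 1: more robust run-limiters extend the run to 156 weeks.
longer_run = planned_availability(run_weeks=156, outage_weeks=6)

# Lever 2: more maintainable duration-setters shorten the outage to 4 weeks.
shorter_outage = planned_availability(run_weeks=104, outage_weeks=4)

for label, value in [("baseline", baseline),
                     ("longer run", longer_run),
                     ("shorter outage", shorter_outage)]:
    print(f"{label}: {value:.2%}")
```

Either change raises availability relative to the baseline; which one is cheaper to achieve depends on the specific run-limiters and duration-setters of the asset.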

Modification: Major assets are frequently modified over their lifecycle. They can be modified to alter their performance or to increase their capacity. It is not uncommon for modifications to be accomplished without adequate attention paid to Design for Reliability. In those cases, the resulting reliability of the modified asset is less than that of the asset before modification.

Renewal: As with modification, a great many assets go through a renewal process to breathe new life into aged assets. As with modifications, it is not uncommon for inadequate attention to be paid to Design for Reliability during the renewal process. In this case, the asset is once again prepared for a long but unreliable life.

An important point to keep in mind is that it is critical to remain vigilant over the entire lifecycle of an asset to ensure good reliability. You cannot be vigilant 90% of the time and then drop your guard. The damage done during even a short period of inattention can result in a loss of critical characteristics that will be difficult or impossible to replace.

The importance of understanding the pattern of events and processes that occur over the entire lifecycle of an asset comes in being able to determine if reliability is being properly assessed at those times. For instance, there are organizations that have reliability analysis well integrated with the processes where the engineering discipline is involved. Since initial design, modifications and renewal typically involve engineers in the activity, those three activities would be the ones most likely to benefit from that involvement. In some organizations, the opposite can also be true. Activities handled by the plant resources (operation, maintenance and inspection) may benefit from the involvement of reliability engineers in those activities, while project managers who are focused solely on cost and schedule may refuse to address additional requirements that add cost or take more time.

For an asset to be truly reliable, it is important that elements and activities leading to good reliability are addressed on all occasions.

Tools for Each Phase in the Life of an Asset

Each point in the life of the asset has activities that can help improve the reliability, availability and maintainability of an asset. On the other hand, if these characteristics are ignored at any of those points, the characteristics needed to produce reliable performance can be lost. The following describes the tools typically associated with each phase in the lifecycle of an asset:

Design and build: During the design and build process, it is important to use a comprehensive Design for Reliability (DFR) process. The DFR activities need to be accomplished concurrently with conventional design activities. If DFR activities lag conventional design, components will already have been chosen and purchased and it will be difficult to make changes.

One of the key elements of the DFR process is a tool called Reliability Block Diagram (RBD) analysis. During RBD analysis a model of the ultimate system design is created. Each major component that is subject to failure is represented by a single box and each box contains the factors that represent the statistical reliability of the element it represents. After the model is assembled, it is then possible to calculate the system reliability using either repeated simulations or manual calculations. In either case, the results provide an estimate of the expected reliability of the system. If the calculated or simulated reliability is less than required, it is possible to add redundancy at specific locations or to increase the reliability by selecting more reliable components. In either case, re-running the calculations or simulations will show if the modifications are adequate or if additional changes need to be made to achieve the target reliability.
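A minimal sketch of the manual calculation for a small block diagram follows. It assumes that component failures are independent and that blocks are arranged in simple series or parallel (redundant) groups. The component names and reliability values are hypothetical; commercial RBD software handles far more complex configurations.

```python
from math import prod


def series(*reliabilities):
    """All blocks must work: multiply the individual reliabilities."""
    return prod(reliabilities)


def parallel(*reliabilities):
    """At least one redundant block must work: 1 minus the chance all fail."""
    return 1 - prod(1 - r for r in reliabilities)


# Hypothetical component reliabilities over the mission time of interest.
pump, valve, controller = 0.90, 0.99, 0.95

base_design = series(pump, valve, controller)
with_spare_pump = series(parallel(pump, pump), valve, controller)

target = 0.90
print(f"base design:     {base_design:.4f}  meets target: {base_design >= target}")
print(f"with spare pump: {with_spare_pump:.4f}  meets target: {with_spare_pump >= target}")
```

Re-running the calculation after adding the redundant pump shows whether the change is sufficient to reach the target, which mirrors the iterate-and-recalculate loop described above.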


Another step in the DFR process is to calculate the availability of the system. This can be done in two ways. First, if RBD software has been used to calculate the expected reliability, it is typically possible to add information describing the expected response time to failures and describing the anticipated amount of planned outage time to account for Predictive Maintenance, Preventive Maintenance and Outages for Overhaul or Turnaround.

It is also possible to perform a manual estimate of the availability. This is done by identifying the longest cycle between non-repetitive events, then identifying the total down time during that complete interval.

As an example, assume that a plant requires a minor outage for boiler inspection each year. The duration of each of these outages is one week. Also assume that the plant requires a two-week outage for catalyst change every other year. In those alternating years, the boiler inspection can be done during the catalyst renewal. Finally, assume that every ten years a six-week-long turnaround is required.

The cumulative downtime is as follows:

Year 1                  1 week
Year 2                  2 weeks
Year 3                  1 week
Year 4                  2 weeks
Year 5                  1 week
Year 6                  2 weeks
Year 7                  1 week
Year 8                  2 weeks
Year 9                  1 week
Year 10                 6 weeks

Total Down Time         19 weeks
Total time in cycle     520 weeks

Planned Availability = (520 - 19) / 520 = 96.35%


For the sake of completeness, let's assume that the plant also experienced one unplanned failure each year and that it required one week to recover from each unplanned outage. In that case the Unplanned Reliability is as follows:

Unplanned Reliability = (520 - 10) / 520 = 98.08%

The combined availability would be:

Availability = (520 - 19 - 10) / 520 = 94.42%

If the capacity demand for the product being provided by the plant is more than 95% of the design capacity of the plant, the availability would be inadequate. It would be necessary to take steps that would reduce either the planned outage time or the unplanned outage time.
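The manual estimate can be reproduced with a short script, which also makes it easy to test the effect of changing an outage duration. The sketch below encodes the outage schedule from the example (one week in ordinary years, two weeks in catalyst years, six weeks in the tenth year) along with the assumed one week of unplanned outage per year. The function and variable names are illustrative only.

```python
WEEKS_PER_YEAR = 52


def planned_outage_weeks(year):
    """Planned outage in a given year of the ten-year cycle."""
    if year % 10 == 0:
        return 6      # turnaround year
    if year % 2 == 0:
        return 2      # catalyst change (boiler inspection done concurrently)
    return 1          # boiler inspection only


def availability(total_weeks, *downtimes):
    """Fraction of the cycle remaining after subtracting all downtime."""
    return (total_weeks - sum(downtimes)) / total_weeks


cycle_years = 10
total_weeks = cycle_years * WEEKS_PER_YEAR
planned_down = sum(planned_outage_weeks(y) for y in range(1, cycle_years + 1))
unplanned_down = cycle_years * 1   # one 1-week unplanned outage per year

planned = availability(total_weeks, planned_down)
unplanned = availability(total_weeks, unplanned_down)
combined = availability(total_weeks, planned_down, unplanned_down)

required = 0.95   # demand as a fraction of design capacity
print(f"planned downtime: {planned_down} weeks of {total_weeks}")
print(f"planned only:     {planned:.2%}")
print(f"unplanned only:   {unplanned:.2%}")
print(f"combined:         {combined:.2%}  adequate: {combined >= required}")
```

Changing a single number, such as the turnaround duration or the recovery time per failure, and re-running the script shows how much outage time must be removed to meet the required availability.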

Another tool useful in performing the DFR analysis is Reliability Centered Maintenance (RCM) analysis. RCM analysis is useful in identifying all the predictive and preventive maintenance tasks that will be conducted over the life of the asset. It is also useful in identifying all the elements that will be allowed to run to failure so the reactive repair tasks can be identified. Once the complete list of Predictive, Preventive and Reactive repair tasks is identified, it is possible to describe the steps needed to complete the tasks. By performing a mental or physical walk-through of all tasks it will be possible to determine if the asset is maintainable. If tasks require an uncertain amount of time to complete or produce uncertain results, the tasks and the asset are not maintainable.

Operate: There are two popular tools for improving the operator interface with the equipment he operates. One is called Total Productive Maintenance; the other is Operator Driven Reliability. In both cases, the objective is to use added structure and discipline in the relationship between the operator and the equipment to facilitate the objectives of doing no harm to the equipment and performing some activities that will do some good.

Inspect: The inspection process used to identify and monitor on-going deterioration can be significantly enhanced using a well structured and disciplined Failure Mapping process as described in the Path to Failure and Corrective Action discussion earlier. Close tracking of Failure Modes and identification of associated Failure Mechanisms will identify the on-going deterioration that should be the focus of inspection efforts.

Maintain: The objective of both proactive and reactive maintenance tasks is to maintain and restore the inherent reliability of the asset. RCM is an excellent tool for identifying the tasks that will be completed over the lifecycle of an asset. Once tasks are identified and managed by the Computerized Maintenance Management System (CMMS), it is important to ensure that all the tasks are being done in a manner that restores the Inherent Reliability. The following are examples of situations that will fail to restore the Inherent Reliability:

- Fail to maintain redundancy as included in the initial design
- Do not use replacement parts with the same robustness as original parts
- Use inappropriate procedures
- Ignore quality control and quality assurance steps at the completion of tasks
- Take short cuts
- Ignore critical tolerances, fits and clearances in assembly of equipment

Overhaul, Turnaround or Outage: Overhauls, Turnarounds and Outages contain many of the same elements as simple maintenance above, so it is useful to review that section. In addition, these major events are typically intended to provide reliable operation of an asset for a much longer time than typical maintenance. As a result, it is critical that run-limiters receive special attention and that they be provided with sufficient wear allowance to ensure they will survive for the entire intended run-length.

Modification and Renewal: Both modification and renewal contain many of the same elements as the initial project design. If new choices are being made concerning changes to the configuration of the asset or the reliability of replaced components, RBD will be helpful in making the decisions.

Two additional tools that will assist in making sound decisions during modifications or renewal activities are Lifecycle Costing (LCC) and Total Cost of Ownership (TCO). LCC takes into consideration all costs that will occur over the entire lifecycle of the asset.
