reliability management 2
TRANSCRIPT
-
7/29/2019 Reliability Management 2
1/23
Reliability Management Overview
Course No: B03-004
Credit: 3 PDH
Daniel Daley, P.E., CMRP
Continuing Education and Development, Inc.
9 Greyridge Farm Court
Stony Point, NY 10980
P: (877) 322-5800
F: (877) 322-4774
Reliability Management Overview
Introduction
The field of Reliability Management, or more directly Reliability Engineering, has been in
existence for a number of years but remains misunderstood by many individuals in the
industry. You will find a number of individuals assigned to positions in which they carry
the title of Reliability Engineer when in fact they perform the role of a Structural
Engineer, a Rotating Equipment Engineer, an Electrical Engineer or an Instrumentation
Engineer. While the functions are not totally exclusive, engineers in the more
conventional roles tend to focus on the functionality or integrity of an asset rather
than the reliability, availability or maintainability of an asset.
While a comprehensive treatment of the entire field of reliability would take more time
than illustrated herein, it is the objective of this course to provide a high level overview
of the subject. At the conclusion of this course, the student will understand the
difference between the role of a true Reliability Engineer and engineers in more
conventional roles. The student will also understand many of the elements that must be
managed to ensure that each asset will perform in a reliable manner.
What Do You Have a Right to Expect?
Whether they realize it or not, when most people purchase a new asset, they have certain
expectations concerning the reliability of that asset. Over the course of the life of that
asset, those same individuals have continuing expectations concerning how that asset
should perform.
In a commercial or industrial setting, the same is true. Senior managers of corporations
have certain expectations concerning the reliability and availability of the assets they
manage. As a result, they have expectations concerning how much product can be
produced and how much income the organization will produce.
As with any other characteristic of a physical asset, an important question one should
ask is, what do you have a right to expect? For instance, if you somehow remotely
ordered a new car but did not specify the color, and when the new car was delivered
it was bright yellow, is there a basis for a complaint? If color was not specified, one
might assume that color was not important to the buyer.
While many of the details leading to reliability performance are far more subtle than the
color of a new car, the steps leading to assurance that they exist are equally
black-and-white. If the owner wants an asset to perform with a specific level of reliability, he
must take the actions that will produce that performance. If he fails to take those steps,
the resulting performance is much like the color of the car described above. It will be
the luck of the draw and will depend on factors the owner has chosen not to manage.
Each and every phase during the life of an asset contains situations that can lead to
either good reliability performance or to poor reliability performance. If the owner
wants to ensure good reliability performance, it is important that he takes steps during
those situations in a way that will ensure the level of reliability performance he hopes to
achieve. After providing some of the background needed by the student to
understand details that will be later described, this course will provide brief descriptions
of the steps that owners should take to ensure the desired level of reliability
performance.
Definitions
There are a number of important definitions associated with the study of reliability. The
definitions are critical in understanding the subtle elements that determine reliability
performance and the importance of the tools used to ensure adequate performance
for each of those elements.
Let's begin with the definition of the word "Reliability". In this context, I will use the word
"Reliability" (capital R) to identify the definition of the term most people are thinking of
when they say reliability. I will use the word "reliability" (small r) when referring to the
technical definition of the specific property of asset reliability.
When most people use the term Reliability, they are actually thinking of a characteristic
that involves several distinct characteristics. When discussing good reliability or bad
reliability they are actually thinking about the widest spectrum of characteristics
associated with the characteristic that can either cause them confidence or costs.
When using the term Reliability, most people are actually thinking of a characteristic
that contains the characteristics of reliability, availability and maintainability. Each of
these characteristics is distinct and each is the product of separate activities. On the
other hand, each of these characteristics has a bearing on the others and therefore
should be discussed together.
Defined technically, "reliability" is a measure of the likelihood that a device will avoid failure
over a specific interval of time. As a result, reliability is a statistical measure that results
from understanding the actual number of failures that an entire population of a device
can be expected to endure in a given interval.
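As a rough illustration of reliability as a population statistic, the following minimal sketch estimates reliability as the fraction of a device population that survives an interval. The function name and the pump counts are hypothetical, and the sketch assumes each device is counted as failed at most once in the interval.

```python
def estimate_reliability(population_size: int, failed_devices: int) -> float:
    """Fraction of a device population that survived the interval without failure.

    Assumption: failed_devices counts devices that failed at least once
    during the interval, so it can never exceed population_size.
    """
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    if not 0 <= failed_devices <= population_size:
        raise ValueError("failed_devices must be between 0 and population_size")
    return (population_size - failed_devices) / population_size


# Hypothetical data: 200 identical pumps, 14 of which failed during one year.
one_year_reliability = estimate_reliability(200, 14)
print(one_year_reliability)  # 0.93
```

The result applies only to the interval and the operating conditions under which the population was observed.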
It is important to understand that the actual reliability of a component is not the same
as the reliability of a complete asset. The reliability of a component can vary based on
how it is applied and how it is used. The reliability of a device is typically associated
with one or more specific Failure Modes that determine how and when the device will
fail.
Another term using the word reliability is the term Inherent Reliability. The Inherent
Reliability of an asset is the likelihood that the entire asset will survive without failure over
a specific period of time. The Inherent Reliability of a complete asset is based on the
configuration of the asset as well as the individual reliability of the components used to
construct that asset. For instance, if the configuration of an asset includes redundant
components in highly critical locations, it is likely that the Inherent Reliability of the asset
will be higher. Also, if components with higher individual reliability are selected over less
expensive components with lower reliability, it is likely that the asset will have a higher
Inherent Reliability.
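The effect of configuration on Inherent Reliability can be sketched with the standard series and parallel reliability formulas. This is a minimal sketch, assuming component failures are independent; the component values and the three-component layout are hypothetical and are not taken from the course.

```python
from math import prod


def series_reliability(reliabilities):
    """All components must survive, so the individual reliabilities multiply."""
    return prod(reliabilities)


def redundant_reliability(reliabilities):
    """Only one redundant component must survive: 1 minus the chance all fail."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical asset: motor (0.99), pump (0.90) and control valve (0.98) in series.
single_pump = series_reliability([0.99, 0.90, 0.98])

# Same asset with a second, redundant pump in the critical location.
pump_pair = redundant_reliability([0.90, 0.90])
dual_pump = series_reliability([0.99, pump_pair, 0.98])

print(round(single_pump, 4), round(dual_pump, 4))  # 0.8732 0.9605
```

Under these assumptions, adding the redundant pump raises the asset-level figure more than any plausible improvement to the already reliable motor or valve could.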
The Inherent Reliability of an asset defines the upper limit or the maximum reliability
performance the asset can achieve. Achieving the full Inherent Reliability requires that
the asset be operated and maintained as well as possible. If the asset is operated or
maintained in a sub-optimum manner, it will not be possible to achieve the full Inherent
Reliability of the asset.
Another term that most people roll into their intuitive definition of Reliability is the
characteristic of availability. Availability is a measure of the portion of time an asset is
able to perform its intended function. The total availability of an asset is typically
reduced by two factors:
1. Availability is reduced by the amount of time required to recover from
unplanned failures. This portion of lost availability is dependent on both the
unreliability (or frequency of failures) and on the owner's ability to respond to the
failures in a timely and effective manner.
2. Availability is also reduced by the amount of time the asset is shut down to
perform planned predictive or preventive maintenance.
In simple terms:
Availability = [Total Time - (Planned Down Time + Unplanned Down Time)] / Total Time
Where,
Unplanned Down Time = Sum (Unplanned Failures x Time to Respond to each failure)
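The availability relationship can be worked through with a few numbers. The following is a minimal sketch; the function names, the one-year period and the outage hours are hypothetical values chosen for illustration.

```python
def unplanned_down_time(response_hours_per_failure):
    """Sum of the time needed to respond to each unplanned failure."""
    return sum(response_hours_per_failure)


def availability(total_hours, planned_down_hours, unplanned_down_hours):
    """(Total Time - Planned Down Time - Unplanned Down Time) / Total Time."""
    down_hours = planned_down_hours + unplanned_down_hours
    if total_hours <= 0 or not 0 <= down_hours <= total_hours:
        raise ValueError("down time must lie between 0 and total_hours")
    return (total_hours - down_hours) / total_hours


# Hypothetical year: 8760 hours, 120 hours of planned maintenance,
# and three unplanned failures that took 10, 24 and 6 hours to resolve.
unplanned = unplanned_down_time([10, 24, 6])
yearly_availability = availability(8760, 120, unplanned)
print(unplanned, round(yearly_availability, 4))  # 40 0.9817
```

The two down-time terms are kept separate because they are driven by different activities: planned down time by the maintenance program, unplanned down time by failure frequency and response time.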
Typically, major assets will require some form of major maintenance event at a number
of points over the life of the asset. These major maintenance events are called
overhauls, turnarounds or outages. Because these major maintenance events take so
long and cost so much money, they are a major concern.
Within each major asset, there are one or more components that tend to determine
how frequently the outages will need to take place. There are also one or more
components that determine how long the asset will be shut down for repairs or renewal.
We will call the components that determine the maximum length of time between
outages "run-limiters" and the components that determine the amount of time the
asset must be shut down "duration-setters".
The run-limiter is typically a wearing component that becomes worn to the point that
it can no longer perform its intended function. By analyzing the asset and identifying
the run-limiter it is possible to make that component more robust or capable of
enduring more wear and thus extend the period of time between outages.
The duration-setter is typically a component that is either buried deep in the asset or
one that requires a time-consuming renewal process that entails a critical path duration
longer than any other component in the asset. Again, by identifying the
duration-setter it is possible to redesign the asset in a manner that reduces the critical path
duration and therefore the amount of time the asset is out of service.
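Identifying the run-limiter and the duration-setter amounts to finding the component with the shortest usable life and the one with the longest renewal time. The sketch below illustrates this with a hypothetical component list; the field names and figures are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    usable_life_months: float     # expected run time before renewal is needed
    renewal_duration_days: float  # critical-path time to renew it in an outage


def find_run_limiter(components):
    """The component whose wear forces the next outage soonest."""
    return min(components, key=lambda c: c.usable_life_months)


def find_duration_setter(components):
    """The component whose renewal keeps the asset shut down longest."""
    return max(components, key=lambda c: c.renewal_duration_days)


# Hypothetical pump components.
components = [
    Component("mechanical seal", usable_life_months=18, renewal_duration_days=2),
    Component("impeller", usable_life_months=36, renewal_duration_days=5),
    Component("casing liner", usable_life_months=60, renewal_duration_days=12),
]

print(find_run_limiter(components).name)      # mechanical seal
print(find_duration_setter(components).name)  # casing liner
```

In this example the two roles fall on different components, which is why the text treats them as separate redesign targets.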
The other characteristic mentioned as an element that many people include in their
intuitive definition of Reliability is maintainability. Maintainability is a measure of the
ability to restore the Inherent Reliability in a ratable period of time. There are two
characteristics important to maintainability. The first is the ability to restore the Inherent
Reliability and the second is the ability to do so in a ratable period of time. A ratable
period of time is a known or repeatable amount of time.
The characteristic of maintainability is most easily described by showing how one might
perform a maintainability review of a new asset.
Each component of a new asset has a specific reliability based on one or more specific
Failure Modes and an expected usable life. During the usable life of the asset, each
component will require both proactive maintenance and reactive maintenance. The
proactive maintenance consists of the predictive and preventive maintenance tasks
needed to minimize unplanned failures by preventing deterioration and to restore the
component to good-as-new condition at the end of its useful life. The reactive
maintenance consists of repairs needed to restore asset functionality and Inherent
Reliability after an unplanned failure.
Once all forms of proactive and reactive maintenance that will be needed over the
entire life of an asset have been identified, it is possible to review those tasks to see if
they are maintainable. If the tasks include steps that are of an unsure duration or that
produce uncertain results, the task is not maintainable. An example of a task of unsure
duration is one that requires the mechanic or technician to work in an unsafe or awkward
position. An example of a task that will produce uncertain results is one that concludes without a
functionality test or one that contains a step requiring an attachment with an adhesive
that requires special conditions to cure.
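The maintainability review described above can be sketched as a simple screening of the task list. This is only an illustration: the two flags mirror the two criteria in the text, and the task names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MaintenanceTask:
    name: str
    unsure_duration: bool   # e.g. work in an unsafe or awkward position
    uncertain_result: bool  # e.g. no functionality test at the end


def is_maintainable(task: MaintenanceTask) -> bool:
    """A task is maintainable only if its duration and its result are predictable."""
    return not (task.unsure_duration or task.uncertain_result)


def unmaintainable_tasks(tasks):
    """Names of tasks that are candidates for redesign."""
    return [task.name for task in tasks if not is_maintainable(task)]


# Hypothetical task list for a new asset.
tasks = [
    MaintenanceTask("replace filter element", False, False),
    MaintenanceTask("renew bearing behind fixed piping", True, False),
    MaintenanceTask("bond sensor mount with adhesive", False, True),
]

print(unmaintainable_tasks(tasks))
# ['renew bearing behind fixed piping', 'bond sensor mount with adhesive']
```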
Knowing that specific tasks that will be required over the life of an asset are
unmaintainable gives the reliability engineer the opportunity to redesign the asset
to produce a product that is maintainable, thus ensuring the desired reliability and
availability.
Returning to the introduction of this short course, the student should think about the
assets with which he is familiar. How many of those assets have been exposed to a
comprehensive maintainability review? How many have been analyzed to identify the
run-limiters and the duration-setters or have been redesigned to increase the
availability? Absent the maintainability review, how is it possible to ensure the asset can
perform at the desired reliability and availability over its entire life? Also, look around at
your current organization: who is expected to perform the maintainability review
described above? While some organizations have individuals with the title Reliability
Engineer, few have constructed roles for individuals in those positions that will ensure
reliability, availability and maintainability performance at specific desired levels.
Patterns and Relationships
Like so much of engineering, the management of Reliability depends on observation of
patterns of events and the relationship of those patterns of events with failures. Unlike
many other engineering disciplines, the observations of patterns and relations have not
been established and codified by individuals with names like Newton, Planck, Bernoulli,
Ohm and others found in textbooks. In the business of reliability of your assets, you are
the one who will need to record information and analyze it to identify the patterns and
relationships leading to the failures of your assets. Even if two of the same assets were
purchased on the same day by two different companies, they would be likely to have
different Reliability performance. That is because no two companies use their assets in
exactly the same way. As a result, the patterns of events leading up to a failure and
the usable life and failure modes are likely to differ. For that reason, it is important that
the Reliability Management Process and the Reliability Engineers become experts on
the patterns and relations specific to their own company.
One of the well-known ways of describing the relationships between a specific pattern
of events and an associated failure is a diagram describing the P-F Interval of a specific
Failure Mode. In this context, the term "P" refers to the earliest point that the potential
for failure is known to exist. The term "F" refers to the failure event or the point at which
the component in question has experienced the amount of deterioration needed to
produce a failure.
The chart below is intended to describe the P-F interval starting with the point when the
Total Base Number of a lubricant has become too low and ending with the point when a
related failure occurs.
The following are the definitions of the sequential elements on the chart:
• TBN Normal (Inherent or additive based reserve alkalinity or ability to neutralize acidity)
• TBN Marginal
• TBN Low
• Source of acidity introduced
• TAN Increases (Acidic concentration of oil)
• pH Reaches a point where component deterioration is possible (as evidenced by some
other measure)
• Component Deterioration
• Deterioration to the point of added costs to repair or replace components
• Deterioration to the point that engine failure is possible
TBN is the Total Base Number of the oil sample.
TAN is the Total Acid Number of the sample.
When analyzing the P-F chart in general, it is possible to say the following:
• As long as the TBN is normal, there is little or no risk of damage to the system as a
result of acid contamination.
• When a marginal TBN is detected, there is a warning that something is
consuming the reserve alkalinity. There is no immediate concern over
acid-related damage because with reserve alkalinity the lubricant cannot become
acidic.
• When TBN is low, there is an immediate concern over introduction of acidity.
• The introduction of a source of acidity is typically not something under the direct
control of the operator, so this event can happen at any time.
• Unless the oil sampling process is on a frequent and highly structured routine,
finding a sample with high TAN can be a hit-or-miss process.
• Once the pH of the oil becomes too low, any lubricated surface can be subject
to acid attack.
• Depending on how much acid exists in the system and how heavily the asset in
question is currently loaded, deterioration to sensitive components can begin
quite soon.
• At some point the amount of damage to components will add to the repair
costs.
• Finally, depending on the amount of deterioration and the current load on the
asset, a catastrophic failure can occur.
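The early stages of the P-F pattern above can be expressed as a simple screening rule for oil sample results. The following is a minimal sketch: TBN and TAN are conventionally reported in mg KOH/g, but the three threshold values are hypothetical placeholders. As the text stresses, real limits depend on the engine, the lubricant and the owner's own history.

```python
def classify_oil_sample(tbn, tan, tbn_marginal=6.0, tbn_low=3.0, tan_high=4.0):
    """Map one oil sample (TBN and TAN in mg KOH/g) to a stage on the P-F chart.

    The default thresholds are hypothetical and must be replaced with limits
    established for the specific engine and lubricant.
    """
    if tan >= tan_high:
        return "TAN high"      # acid is present; lubricated surfaces are at risk
    if tbn <= tbn_low:
        return "TBN low"       # reserve alkalinity nearly gone; immediate concern
    if tbn <= tbn_marginal:
        return "TBN marginal"  # something is consuming the reserve alkalinity
    return "TBN normal"        # little or no risk of acid damage


# Hypothetical sample history for one engine, oldest first.
samples = [(9.5, 1.0), (5.2, 1.4), (2.1, 2.0), (1.5, 4.8)]
for tbn, tan in samples:
    print(tbn, tan, classify_oil_sample(tbn, tan))
```

A rule like this only helps if samples are taken on a frequent, structured routine; otherwise the "TAN high" stage may be the first one observed.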
While the pattern described above is generic in nature, the timing and degree of the
elements it contains is specific to each owner. For instance, one manufacturer of
locomotive engines produces only four-cycle engines and the other produces only
two-cycle engines. The likelihood of acid gas bypassing from engine exhaust gases is
different for the two kinds of engines. As a result, the relative timing and associated risk
of the P-F Interval for these two types of engines will differ. The manner in which a
reliability engineer for a company using one type of engine responds will be different
than that of a reliability engineer for a company exclusively using the other type of engine.
While the generic pattern and relationships for an entire industry may be very similar,
ultimately, the patterns and relationships are very much company specific. A
significant portion of Reliability Management and Reliability Engineering is
understanding the patterns and relationships important to their industry. Even more
important is applying that knowledge to the specific patterns of their own company
and plant.
In the example provided above, the rate at which the total base number may
deteriorate may be similar among a variety of different users. On the other hand, the
timing at which the source of acidity is added may vary significantly. By way of
example, there are two major manufacturers of freight locomotives. The engines in the
units provided by one manufacturer are two-stroke engines and the other
manufacturer produces four-stroke engines. This difference will result in some
differences in the rate at which the lubricating oils are exposed to blow-by exhaust
gases, the most common source of acidity in engines. As a result, the shape of the P-F
curve and the timing of the relationship between TBN deterioration and engine
damage would be different in these two cases. The reliability engineer working for a
company using primarily two-stroke engines would need to respond to this pattern and
the associated relationship differently than the reliability engineer working for a
company using primarily four-stroke engines.
The following is another, far more common example that uses oil as a basis for
understanding patterns and relationships and how they may vary from situation to
situation:
In this case, the patterns begin with easily controllable activities including:
• Timely oil changes
• Timely oil sampling
• Routine checking of oil temperature by sensing the temperature of the bearing
housing
• Routine observation of the oil color by viewing the oil bulb on the side of the
bearing housing
The pattern of events leading to ultimate failure continues with the following steps:
• Deterioration or thinning of the oil to the point that the film thickness between
the shaft and the bearing is no longer adequate
• Damage to the rotating shaft or the bearing
• Damage to the overall pump where it will no longer perform its intended function
• Catastrophic failure of the pump leading to a plant fire
The value of understanding the pattern of events and the relationship between early
signs of deterioration and events leading to failure is that the reliability engineer can
describe how best to respond to them.
Timely Oil Change              How determined?
Timely Oil Sampling            How determined?
Check Bearing Housing Temp.    Who does it and when?
Check Oil Color in Oiler       Who does it and when?
Oil Film Thickness             How can this be done?
Bearing Damage                 How detected?
Shaft Damage                   How detected?
Pump Damage                    How detected?
Fire - Plant Damage            How detected?
The first four items on the list are relatively simple to manage. For the first two, it is
necessary only to determine the most cost-effective timing to perform the tasks. The second
two items depend on deciding who in the organization has the best opportunities (available
time and ready access to make observations).
The fifth item, determining the oil film thickness in a timely manner when it begins to thin,
is impossible. The last four items are events that occur at a time over which the
reliability engineer has no control. Once bearing damage has begun, the
following events can happen instantaneously or they can take quite some time to
occur. As a result, the P-F curve for this particular situation tells us that we need to act
on the activities over which we have control. We need to change oil on time, sample it
on a regular interval, monitor temperature regularly and monitor color on frequent,
regular intervals.
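Because the controllable activities are the only place to act, it can help to track them against fixed intervals. The sketch below flags overdue tasks; the interval values, task names and dates are hypothetical, and each organization would set its own most cost-effective timing.

```python
from datetime import date, timedelta

# Hypothetical intervals in days for the four controllable activities.
INTERVALS = {
    "oil change": 180,
    "oil sample": 30,
    "bearing housing temperature check": 7,
    "oil color check": 1,
}


def overdue_tasks(last_done, today, intervals=INTERVALS):
    """Tasks whose time since last completion exceeds their interval."""
    return [
        task
        for task, interval_days in intervals.items()
        if today - last_done[task] > timedelta(days=interval_days)
    ]


# Hypothetical completion dates for one pump.
last_done = {
    "oil change": date(2019, 1, 1),
    "oil sample": date(2019, 7, 10),
    "bearing housing temperature check": date(2019, 7, 20),
    "oil color check": date(2019, 7, 29),
}

print(overdue_tasks(last_done, today=date(2019, 7, 29)))
# ['oil change', 'bearing housing temperature check']
```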
While the two examples provided above are very specific in nature, the concept of
being observant and identifying patterns leading to deterioration and their relationship
with ultimate failures is one that can be applied to numerous situations in an industrial or
commercial setting.
Roles in Reliability
While there are a number of roles in a typical maintenance or reliability organization,
there are several that should be emphasized as to their importance in relation to
performing the day-to-day activities involving processing information and investigating
failures.
Diagnostician: In handling the activities that occur as the result of a
system failure, one role that is important to the effective and efficient use of information
is the Diagnostician. While this might be a single job in a large organization, it might also
be a shared job in a smaller organization. The important point is that the activity of
performing a diagnosis is one that is separate and distinct.
A diagnosis is an activity that can be performed remotely based solely on the content
of a properly structured Malfunction Report and the information contained in historical
files and current operations data. Based on that information, a diagnostician should be
able to identify the most likely Failure Mode as well as other possible Failure Modes in
ranked order. The diagnostician provides instructions to the troubleshooter who is next
involved in the chain of events leading to a repair.
Troubleshooter: Troubleshooting is an invasive activity that entails disassembly of the
failed system to identify the failed component and the condition of that component. A
troubleshooter should be provided with instructions concerning where to start by the
diagnostician. Starting disassembly without knowing the location of the most likely
Failure Mode is problematic, because it leads to wasted time and, occasionally,
introduction of new defects into the system that did not exist before the troubleshooting
began.
The troubleshooter identifies the failed component and provides a detailed description
of all the actions needed to perform a complete and thorough repair.
Failure Analyst: Once the repairs are underway or complete, another step
accomplished by an organization interested in learning and preventing future failures is
Failure Analysis. Failure Analysis is the step that identifies the Failure Mechanism. While
the Failure Mode is the result of deterioration that has been occurring for some time up
to the failure, the Failure Mechanism is nature's tool for forcing the deterioration to
occur.
In mechanical components, there are four forms of Failure Mechanisms:
• Corrosion
• Erosion
• Fatigue
• Overload
As described above, these mechanisms causing deterioration exist in nature and if the
techniques used to prevent them are not maintained, the deterioration will be allowed
to proceed unabated until the Failure Mode is present and a failure can occur.
For instance, a protective coating may prevent uniform corrosion. Absent the
coating, uniform corrosion can occur resulting in metal loss, thinning
and ultimate failure. Another example is the rubber seal that keeps moisture out of an
electric cabinet. Inside the cabinet are a variety of dissimilar metals. If the seal is not
maintained, water can intrude, setting up a battery between more noble and less noble
metals. Ultimately this can either result in metal loss leading to failure or it can produce
sufficient corrosion products that will find their way into the contacts and cause further
deterioration and failure.
In order to understand the Failure Mechanism that led up to the Failure Mode, it is
important for a Failure Analyst who is familiar with the various forms of Physics of Failure
to investigate all significant failures. One point to keep in mind is that if some form of
Failure Mechanism is actively working in one part of your asset, it is likely working in other
parts. Identifying and eliminating that Failure Mechanism wherever it exists can prevent
a number of failures, not just the one that was investigated.
Cause Analyst: While most failures have a variety of causes, there are at least three
levels of cause:
• Physical Cause
• Human Cause
• Systemic Cause
At the lowest level, there is always one or more Physical Causes that unleash nature's
Failure Mechanism so deterioration can begin. In the cases described above, the
absence of the protective coating or the absence of the rubber seal is the Physical
Cause that allowed the direct contact of water with the metal surfaces. In each case,
that resulted in the Failure Mechanism of Corrosion starting the process of deterioration.
At a level above the Physical Cause is the Human Cause. The Human Cause is a
specific person who either acted or failed to act in a manner that resulted in the
Physical Cause. In the case of the rubber seal on the electrical enclosure, several
individuals may be the Human Cause:
• If the maintenance work order called for maintaining the rubber seal at some
point and the work was not done, the crafts person who was assigned to
perform the work order may be the Human Cause.
• If the electrical engineer assigned to create the maintenance work order failed
to identify the need to regularly inspect and maintain the rubber seal, he might
be the Human Cause.
• If the manager over the area where the electrical enclosure is located decided
to save some money by removing the seal maintenance task from the work
order, he would be the Human Cause.
In any case, it is imperative that the specific person who is the Human Cause be
identified. Speaking to that person is the only way the Systemic Cause will be identified.
The Systemic Cause is best described as a trap that exists in the organization, the
procedures, the accepted practices or the behaviors of the overall organization that
allows the Human Cause to act in a manner that produces the Physical Cause.
In the examples described above:
• If the crafts person can pick and choose what portions of the work order he
wants to complete, the culture has created a trap that has led to this failure.
• If the electrical engineer does not have the time or has not been provided the
guidance to get out in the field and identify the need for replacing the seal,
there is another systemic weakness.
• If cost cutting and budgetary restraints have caused the manager to remove
needed work from the work order, there is a different systemic cause.
In any case, the Systemic Cause that is identified can affect a much broader array of
issues than just the one being investigated. The person performing Cause Analysis can
either be the same person or can be a different person than the Failure Analyst. The
two roles typically require different skill sets and are accomplished in different settings,
so a single individual assigned to both roles typically does one better than the other.
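One way to keep a cause analysis from stopping at the Physical Cause is to record all three levels explicitly for each investigated failure. The following sketch shows a hypothetical record structure; the class, field names and example entries are assumptions made for illustration and do not describe a prescribed form.

```python
from dataclasses import dataclass
from typing import List, Optional

# The four forms of Failure Mechanism named for mechanical components.
FAILURE_MECHANISMS = ("corrosion", "erosion", "fatigue", "overload")


@dataclass
class CauseAnalysis:
    failure_mode: str
    failure_mechanism: str
    physical_cause: Optional[str] = None
    human_cause: Optional[str] = None
    systemic_cause: Optional[str] = None

    def __post_init__(self):
        if self.failure_mechanism not in FAILURE_MECHANISMS:
            raise ValueError(f"unknown failure mechanism: {self.failure_mechanism}")

    def missing_levels(self) -> List[str]:
        """Cause levels that have not yet been identified."""
        levels = {
            "physical": self.physical_cause,
            "human": self.human_cause,
            "systemic": self.systemic_cause,
        }
        return [name for name, value in levels.items() if not value]


# Hypothetical record based on the electrical cabinet example.
record = CauseAnalysis(
    failure_mode="corroded contacts in electrical cabinet",
    failure_mechanism="corrosion",
    physical_cause="door seal not maintained, allowing water intrusion",
    human_cause="manager removed the seal task from the work order",
)

print(record.missing_levels())  # ['systemic']
```

A record that still lists "systemic" as missing signals that the investigation has found who acted, but not yet the trap that allowed it.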
Governing Patterns in the Life of a Single Failure
The descriptions provided above tended to focus on specific details concerning either
the patterns and relations leading to failures or the roles and skills involved in a reliability
organization.
This section will provide the first of two general patterns that are important to
understand when creating systems that will properly address failures, gather information
on failures and take the steps needed to ensure a long and reliable life for an asset.
The first of the general patterns to be described is one that identifies the steps that
occur in either an overt or passive manner during the life of a single failure event. If
handled in the proper manner, these steps can lead to effective and efficient handling
of the single event. If handled well, these steps can also lead to a permanent solution
to the problems that caused the failure to occur. Handled poorly, the defect causing
the immediate event may not be removed. This can lead to a repeat of the failure and
a general deterioration of the inherent reliability of the system.
The title used herein for this pattern is the Path to Failure and Corrective Action because
it identifies all the discrete steps leading to a failure and resolving it. While many
organizations may not openly acknowledge the presence of each step in the process
as it is described, the steps exist and are being handled in a passive manner without
specific focus or intentional management.
As described above, any failure event typically begins with a Systemic Cause. The
Systemic Cause creates a trap into which an individual can fall. The next step is the
Human Cause in which a specific individual either does something or fails to do
something that produces a Physical Cause. The third step is the Physical Cause. This is
the step in which the form of prevention used to restrain nature's Failure Mechanisms is
ignored.
The next step is the Failure Mechanism. The Failure Mechanism (like corrosion, erosion,
fatigue or overload) is a natural process that causes on-going deterioration. At some
point in time, the amount of deterioration reaches the point that a defect is
present. The defect is not the same as the failure. In the case of corrosion, the defect
might be the point at which deterioration has removed enough metal that the
component is no longer able to handle the load at its maximum condition. Once the
defect is present, the failure clock is ticking and the failure is waiting only on a situation
when the loading is sufficient to cause a break at the point of maximum deterioration.
In the case of a rusted pipe, the leak may not occur until the system pressure is raised to
higher than normal pressure.
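The distinction between the defect and the failure can be illustrated with a minimal sketch. It assumes a component whose load capacity drops by a fixed amount each year; the function name and all numbers are hypothetical and are not taken from the course material.

```python
def first_defect_and_failure(initial_capacity, loss_per_year, max_design_load, yearly_loads):
    """Return (defect_year, failure_year); either value may be None.

    Assumption: a defect exists once remaining capacity drops below the
    maximum design load. A failure occurs in the first year in which the
    actual load exceeds the remaining capacity.
    """
    defect_year = None
    failure_year = None
    for year, load in enumerate(yearly_loads, start=1):
        # Capacity remaining after `year` years of steady deterioration
        capacity = initial_capacity - loss_per_year * year
        if defect_year is None and capacity < max_design_load:
            defect_year = year
        if load > capacity:
            failure_year = year
            break
    return defect_year, failure_year


# Hypothetical pipe: capacity 150, corrosion removes 10 per year,
# maximum design load 100, normal load 80 with one spike to 95 in year 7.
loads = [80, 80, 80, 80, 80, 80, 95, 80]
print(first_defect_and_failure(150, 10, 100, loads))  # (6, 7)
```

In this sketch the defect forms in year 6, but nothing breaks until the higher-than-normal load arrives in year 7. The gap between the two is the window in which an alert person could still find the defect and prevent the failure.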
Once the failure occurs, the next step is the Malfunction Report. In some cases, the
structure used for issuing a Malfunction Report ensures that clear and accurate
information is provided. In other cases, the information provided is useless and provides
little help in driving toward an accurate diagnosis and an ultimate solution.
The next three steps tend to work together in the effort to identify the Failure Mode and
the ultimate resolution. The Diagnosis is the hands-off, remote activity of using available
information to identify likely Failure Modes. This information can be used to perform
triage, that is, to prioritize the handling of current issues. It can also be used for advance
preparation of tools and materials. The next step is funneling, or analysis of all possible
Failure Modes to determine which should be approached and in what order. The most
likely Failure Mode is typically the first to be approached by the troubleshooter. On the
other hand, there might be a relatively low-likelihood Failure Mode that requires little time
or effort to reset. That item might be attempted first on the off-chance it produces
immediate resolution. Power cycling a computer is an example of such a fix.
Troubleshooting is the third of these activities. Since it is hands-on and invasive, it takes
the most time and it can also introduce new defects into the system. Highly directed
troubleshooting typically leads to the most effective and efficient repairs.
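One way to picture the funneling step is as a ranking of candidate Failure Modes. The sketch below is only an illustration: ranking by likelihood per hour of effort is an assumed rule, not one stated in the course, and the `FailureMode` class, the `funnel` function and all figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    likelihood: float    # estimated chance that this mode explains the symptom
    effort_hours: float  # estimated time needed to check or reset it


def funnel(modes):
    """Order candidates by likelihood per hour of effort, highest first.

    Assumed ranking rule: it lets a quick, low-likelihood check jump ahead
    of a likely but time-consuming one.
    """
    return sorted(modes, key=lambda m: m.likelihood / m.effort_hours, reverse=True)


candidates = [
    FailureMode("worn bearing", 0.60, 4.0),
    FailureMode("controller needs restart", 0.10, 0.1),
    FailureMode("failed motor winding", 0.30, 8.0),
]
for mode in funnel(candidates):
    print(mode.name)
```

With these made-up figures the controller restart is tried first even though it is the least likely cause, because it costs almost nothing to attempt. The worn bearing, the most likely cause, comes next.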
The next two items also fit together in terms of identifying the Failure Mode that caused
the failure. The troubleshooter should identify the defective component that caused
the failure. He should also describe the condition of that component that led to the
defect. Frequently, replaced components that restore functionality are not defective.
Simply getting inside and shaking things up can restore functionality by re-seating poor
contacts. In this case, the replaced component is not defective, the true
defect has not been removed and the inherent reliability of the system has not been
restored. Thus it is critical that the troubleshooter produce a truly defective component.
It is also important that the troubleshooter describe the condition of the defective
component. Without such a description, it is impossible to identify the Failure
Mechanism. Without knowing the Failure Mechanism, it is impossible to eliminate the
natural process forcing deterioration.
The next two steps in the process are Failure Analysis and Cause Analysis. As described
earlier, Failure Analysis is an analysis of the physics of failure to determine the source of
the deterioration leading to failure. Cause Analysis is the humanistic and organizational
part of the Corrective Action process.
The chart provided above includes two other elements. The first is identification of
Failure Mechanisms that are in the process of producing deterioration before failure. It
is important to note that the Failure Mechanism is hard at work for a long period before
the component is deteriorated to the point that failure is possible. If the Failure
Mechanism is identified and remedied before the failure, it is possible to prevent the
failure from occurring.
The second element highlights the opportunity to identify the defect after it has formed
but before the failure has occurred. As mentioned earlier, the formation of the defect
and the failure are not always simultaneous. If an alert person can find a defect that
exists before a failure, it is also possible to prevent the failure from occurring.
When each of these steps is recognized and handled correctly, it is possible to:
Handle each individual incident in an effective and efficient manner
Gather the information needed to handle future incidents in an effective manner
Gather information needed to prevent this incident in the future
Gather information needed to prevent similar incidents caused by the same Failure Mechanism or Failure Mode
Investigation of the typical patterns of response to individual failures and how those
patterns ultimately result in either permanent solutions or reduced inherent reliability will
be useful in identifying the actions that must be taken to improve reliability
performance.
Governing Patterns in the Overall Life of an Asset
The overall lifecycle of an asset can be broken into a series of processes that ultimately
determine the reliability, availability and longevity of the asset. Those processes include
the following:
Design and build: The inherent reliability of an asset is based on the
configuration of the asset as well as the individual reliability of the components
that were selected. Many conventional design processes address only the
functionality and the structural robustness of an asset. While necessary, these
forms of analysis do little to ensure the reliability, availability or maintainability of
an asset. Also, they do not determine the usable life of the asset. In order to
address those characteristics, it is necessary to perform Design for Reliability
concurrently with conventional design activities.
Operate: Assuming that an asset has been designed so that its
Inherent Reliability is capable of providing the performance desired by the
owner, that is only the beginning. In order to provide the required performance,
the asset must be operated and maintained in a manner that harvests all the
Inherent Reliability. There are two things that an operator of an asset can do to
support good reliability.
First, the operator can operate the asset in a manner that will "Do No Harm."
If the operator understands the Failure Modes and the Failure Mechanisms that
may deteriorate the equipment he operates and he understands the choices he
can make that will either prevent or cause deterioration, he can choose the
path that will reduce the amount of harm that results from poor operation.
Second, the operator can "Do Some Good" as part of his operating practices.
In earlier discussions of P-F intervals, the opportunity for someone to regularly
observe oil temperature and oil color was mentioned. Operators are typically
the resource with the greatest and most frequent opportunity to make these
observations. If the operator takes action when the bearing housings of
equipment items are too hot or when the oil is just beginning to discolor, it is
possible for the operator to "Do Some Good."
While the examples provided are fairly simple, there are myriad ways for
operators to modify the way they interface with equipment to positively affect
the equipment reliability so deterioration is avoided and the full Inherent
Reliability is achieved.
Inspect: Throughout the life of an asset, there are situations in which individuals
with specialized expertise perform inspections of the equipment. Frequently the
individuals are trained to recognize corrosion, vibration, electrical system
deterioration and a variety of other specialized patterns that provide
clues pertaining to incipient or on-going deterioration. These individuals are
made more effective if they are made aware of Failure Modes and Failure
Mechanisms known to be present in the actual systems being inspected. Rather
than looking for everything, they can focus their attention on problems that
have happened in the past and are likely to occur again in the future.
Maintain: One of the on-going routines during the lifecycle of an asset is the
need to perform maintenance. There are two forms of maintenance: proactive
and reactive. Proactive Maintenance can be either Predictive or Preventive.
Predictive Maintenance is typically non-invasive and uses special tools and
techniques to identify signs of incipient failure or on-going deterioration.
Activities performed as part of an inspection are typically intended to be
predictive in nature. Preventive Maintenance is another form of Proactive
Maintenance. Preventive Maintenance consists of tangible activities performed to
exchange or renew a component based on knowledge that the time has come for
the replacement to occur.
Reactive maintenance is the form of maintenance that is accomplished in
response to a failure. The Path to Failure and Corrective Action described above
provides a comprehensive description of the steps included in reactive
maintenance.
In either case (proactive or reactive maintenance) the objective is not to simply
restore or ensure the functionality of an asset. The objective is to maintain or
restore the inherent reliability of the asset.
Overhaul, Turnaround or Outage: On some regular but non-routine basis, major
assets need to be maintained using a major effort that is called an overhaul,
outage or turnaround depending on the industry. These events are highly costly
and they have the single greatest impact on the availability of the asset
involved. Most systems have either one or a small number of components that
reach the end of their useful life and, as a result, set the timing when the
overhaul, outage or turnaround must occur. For simplicity we can call these
"run-limiters". There is also one or a small number of components that require the
longest critical path of activities during the overhaul, outage or turnaround.
Again for simplicity, we can call these items the "duration-setters".
Apart from reducing the number of failures and thus reducing the amount of
unplanned outage time, the best way to improve the availability of an asset is to:
1. Make the run-limiters more robust so the asset can run longer between outages
2. Make the duration-setters more maintainable so the duration of outages is shorter
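The effect of the two levers above on planned availability can be sketched with a simple ratio. The run lengths and outage durations below are hypothetical, and unplanned downtime is deliberately ignored in this sketch.

```python
def planned_availability(run_weeks, outage_weeks):
    """Fraction of each run-plus-outage cycle in which the asset is available."""
    return run_weeks / (run_weeks + outage_weeks)


# Hypothetical baseline: 100 weeks of running followed by a 6 week outage
base = planned_availability(run_weeks=100, outage_weeks=6)

# Lever 1: more robust run-limiters stretch the run to 150 weeks
longer_run = planned_availability(run_weeks=150, outage_weeks=6)

# Lever 2: more maintainable duration-setters cut the outage to 4 weeks
shorter_outage = planned_availability(run_weeks=100, outage_weeks=4)

print(f"{base:.2%} {longer_run:.2%} {shorter_outage:.2%}")  # 94.34% 96.15% 96.15%
```

With these assumed numbers, either lever raises planned availability from about 94.3% to about 96.2%. Which lever is cheaper to pull depends on the asset.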
Modification: Major assets are frequently modified over their lifecycle. They
can be modified to alter their performance or to increase their capacity. It is not
uncommon for modifications to be accomplished without adequate attention
paid to Design for Reliability. In those cases, the resulting reliability of the
modified asset is less than that of the asset before modification.
Renewal: As with modification, a great many assets go through a renewal
process to breathe new life into aged assets. As with modifications, it is not
uncommon for inadequate attention to be paid to Design for Reliability during
the renewal process. In this case, the asset is once again prepared for a long
but unreliable life.
An important point to keep in mind is that it is critical to remain vigilant over the entire
lifecycle of an asset to ensure good reliability. You cannot be vigilant 90% of the time
and then drop your guard. The damage done during even a short period of inattention can
result in a loss of critical characteristics that will be difficult or impossible to replace.
The importance of understanding the pattern of events and processes that occur over
the entire lifecycle of an asset comes in being able to determine if reliability is being
properly assessed at those times. For instance, there are organizations that have
reliability analysis well integrated with the processes where the engineering discipline is
involved. Since initial design, modifications and renewal typically involve engineers in
the activity, those three activities would be the ones most likely to benefit from that
involvement. In some organizations, the opposite can also be true. Activities handled
by the plant resources (operation, maintenance and inspection) may benefit from the
involvement of reliability engineers in those activities while project managers who are
focused solely on cost and schedule may refuse to address additional requirements
that add cost or take more time.
For an asset to be truly reliable, it is important that elements and activities leading to
good reliability are addressed on all occasions.
Tools for Each Phase in the Life of an Asset
Each point in the life of the asset has activities that can help improve the reliability,
availability and maintainability of an asset. On the other hand, if these characteristics
are ignored at any of those points, the characteristics needed to produce reliable
performance can be lost. The following describes the tools typically associated with
each phase in the lifecycle of an asset:
Design and build: During the design and build process, it is important to use a
comprehensive Design for Reliability (DFR) process. The DFR activities need to be
accomplished concurrently with conventional design activities. If DFR activities
lag conventional design, components will already have been chosen and purchased
and it will be difficult to make changes.
One of the key elements of the DFR process is a tool called Reliability Block
Diagram (RBD) analysis. During RBD analysis a model of the ultimate system
design is created. Each major component that is subject to failure is represented
by a single box and each box contains the factors that represent the statistical
reliability of the element it represents. After the model is assembled, it is then
possible to calculate the system reliability using either repeated simulations or
manual calculations. In either case, the results provide an estimate of the
expected reliability of the system. If the calculated or simulated reliability is less
than required, it is possible to add redundancy at specific locations or to increase
the reliability by selecting more reliable components. In either case, re-running
the calculations or simulations will show if the modifications are adequate or if
additional changes need to be made to achieve the target reliability.
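A manual RBD calculation of the kind described above can be sketched in a few lines. The sketch assumes that blocks fail independently and that redundant blocks are all active. The three-block system and its reliability figures are hypothetical, and the `series` and `parallel` helpers are illustrative names rather than part of any RBD software.

```python
from math import prod


def series(*reliabilities):
    """All blocks must work: multiply the individual reliabilities."""
    return prod(reliabilities)


def parallel(*reliabilities):
    """At least one block must work: one minus the product of the unreliabilities."""
    return 1 - prod(1 - r for r in reliabilities)


# Hypothetical system over a fixed mission time: pump -> exchanger -> compressor
pump, exchanger, compressor = 0.90, 0.98, 0.95

base = series(pump, exchanger, compressor)

# Add redundancy at the weakest location by installing a second, identical pump
with_spare_pump = series(parallel(pump, pump), exchanger, compressor)

print(f"{base:.4f} {with_spare_pump:.4f}")  # 0.8379 0.9217
```

If the target reliability were 0.90, the base design in this sketch would fall short and the design with the spare pump would meet it. Re-running the calculation after each change follows the same loop the paragraph describes.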
Another step in the DFR process is to calculate the availability of the system. This
can be done in two ways. First, if RBD software has been used to calculate the
expected reliability, it is typically possible to add information describing the
expected response time to failures and describing the anticipated amount of
planned outage time to account for Predictive Maintenance, Preventive
Maintenance and Outages for Overhaul or Turnaround.
It is also possible to perform a manual estimate of the availability. This is done by
identifying the longest cycle between non-repetitive events then identifying the
total down time during that complete interval.
As an example, assume that a plant requires a minor outage for boiler inspection
each year. The outage period for these outages is one week. Also assume that
the plant requires a two week outage for catalyst change every other year. In
alternating years, the boiler inspection can be done during the catalyst renewal.
Finally assume that every ten years a six week long turnaround is required.
The cumulative downtime is as follows:
Year 1                 1 week
Year 2                 2 weeks
Year 3                 1 week
Year 4                 2 weeks
Year 5                 1 week
Year 6                 2 weeks
Year 7                 1 week
Year 8                 2 weeks
Year 9                 1 week
Year 10                6 weeks
Total Down Time        19 weeks
Total time in cycle    520 weeks
Planned Availability = (520 - 19) / 520 = 96.35%
For the sake of completeness, let's assume that the plant also experiences one
unplanned failure each year and that it requires one week to recover from each
unplanned outage. In that case the Unplanned Reliability is as follows:
Unplanned Reliability = (520 - 10) / 520 = 98.08%
The combined availability would be:
Availability = (520 - 19 - 10) / 520 = 94.42%
If the capacity demand for the product being provided by the plant is more
than 95% of the design capacity of the plant, the availability would be
inadequate. It would be necessary to take steps that would reduce either the
planned outage time or the unplanned outage time.
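The manual availability estimate in this example can be reproduced with a short script. It is a minimal sketch that uses the outage schedule and the 95% demand figure from the example. The `availability` helper and the variable names are illustrative.

```python
def availability(total_weeks, *downtime_weeks):
    """Fraction of the cycle in which the plant is available."""
    return (total_weeks - sum(downtime_weeks)) / total_weeks


# Planned outage weeks for years 1 through 10, as listed in the example
planned_by_year = [1, 2, 1, 2, 1, 2, 1, 2, 1, 6]

total = 52 * len(planned_by_year)       # 520 weeks in the ten-year cycle
planned = sum(planned_by_year)          # 19 weeks of planned outages
unplanned = 1 * len(planned_by_year)    # one unplanned week per year

combined = availability(total, planned, unplanned)

print(f"planned only:   {availability(total, planned):.2%}")
print(f"unplanned only: {availability(total, unplanned):.2%}")
print(f"combined:       {combined:.2%}")
print("meets 95% demand:", combined >= 0.95)
```

Changing an entry in `planned_by_year`, or the unplanned allowance, shows quickly how much outage time would have to be removed before the combined figure clears the 95% threshold.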
Another tool useful in performing the DFR analysis is Reliability Centered
Maintenance (RCM) analysis. RCM analysis is useful in identifying all the
predictive and preventive maintenance tasks that will be conducted over the
life of the asset. It is also useful in identifying all the elements that will be allowed
to run to failure so the reactive repair tasks can be identified. Once the
complete list of Predictive, Preventive and Reactive repair tasks is identified, it
is possible to describe the steps needed to complete the tasks. By performing a
mental or physical walk through of all tasks it will be possible to determine if the
asset is maintainable. If tasks require an uncertain amount of time to complete or
produce uncertain results, the tasks and the asset are not maintainable.
Operate: There are two popular tools for improving the operator interface with
the equipment he operates. One is called Total Productive Maintenance, the
other is Operator Driven Reliability. In both cases, the objective is to use added
structure and discipline in the relationship between the operator and the
equipment to facilitate the objectives of doing no harm to the equipment and
performing some activities that will do some good.
Inspect: The inspection process used to identify and monitor on-going
deterioration can be significantly enhanced using a well structured and
disciplined Failure Mapping process as described in the Path to Failure and
Corrective Action discussion earlier. Close tracking of Failure Modes and
identification of associated Failure Mechanisms will identify the ongoing
deterioration that should be the focus of inspection efforts.
Maintain: The objective of both proactive and reactive maintenance tasks is
to maintain and restore the inherent reliability of the asset. RCM is an excellent
tool for identifying the tasks that will be completed over the lifecycle of an asset.
Once tasks are identified and managed by the Computerized Maintenance
Management System (CMMS), it is important to ensure that all the tasks are
being done in a manner that restores the Inherent Reliability. The following are
examples of practices that will fail to restore the Inherent Reliability:
Failing to maintain redundancy as included in the initial design
Using replacement parts that lack the robustness of the original parts
Using inappropriate procedures
Ignoring quality control and quality assurance steps at the completion of tasks
Taking short cuts
Ignoring critical tolerances, fits and clearances in assembly of equipment
Overhaul, Turnaround or Outage: Overhauls, Turnarounds and Outages
contain many of the same elements as simple maintenance above so it is useful
to review that section. In addition, these major events typically are intended to
provide reliable operation of an asset for a much longer time than typical
maintenance. As a result, it is critical that run-limiters receive special attention
and that they be provided with sufficient wear allowance to ensure they will
survive for the entire intended run-length.
Modification and Renewal: Both modification and renewal contain many of
the same elements as the initial project design. If new choices are being made
concerning changes to the configuration of the asset or choices of the reliability
of replacement components, RBD will be helpful in making the decisions.
Two additional tools that will assist in making sound decisions during modification or
renewal activities are Lifecycle Costing (LCC) and Total Cost of Ownership (TCO). LCC
takes into consideration all costs that will occur over the entire lifecycle of the asset.