White Paper on MTBF

FAILURE RATE
• The number of failures of an item within the population per unit of operation (time, cycles, miles, runs, etc.)
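As a minimal sketch of this definition, the failure rate can be computed as the observed number of failures divided by the total units of operation accumulated by the population. The function name and the test counts below are hypothetical and chosen only for illustration.

```python
def failure_rate(failures: int, total_operation: float) -> float:
    """Failures per unit of operation (hours, cycles, miles, runs, ...)."""
    if total_operation <= 0:
        raise ValueError("total_operation must be positive")
    return failures / total_operation


# Illustrative numbers only: 3 failures observed over 1,500,000 unit-hours.
rate_per_hour = failure_rate(3, 1_500_000)

# Rescale to the "failures per 10^6 hours" unit used later in this document.
rate_per_million_hours = rate_per_hour * 1e6
print(rate_per_million_hours)
```

The same function applies to cycles, miles or runs, as long as the numerator and denominator refer to the same population.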

ELECTRONIC SYSTEM RELIABILITY - WHY IMPORTANT?
• PROBLEMS
– Electronic systems use very large numbers of very similar components.
– The designer has little control over their production and manufacture but must specify catalogue items.
– The designer has little control over device reliability.
– Control of the production process is a major determinant of reliability.
– It is difficult to test for electronic component defects that do not immediately affect performance.
• SOLUTION: Very close attention must be paid to electronics part reliability. The design must involve a reliability team.

OUTLINE
• DEFINITIONS
• CAUSES OF ELECTRONIC COMPONENT FAILURE
• PREDICTION METHODS - TEST
• MIL-HDBK-217 PREDICTION METHODS - CALCULATIONS
– PARTS STRESS ANALYSIS PREDICTIONS
– PARTS COUNT RELIABILITY METHOD
– LIMITATIONS
• ADDITIONAL INFORMATION
– Other Failure Rate Data Sources
– Arrhenius Model

DEFINITIONS
• OPERATING STRESS
– The actual stress (or load) applied during operation of the part (e.g., voltage for capacitors, dissipated power for resistors).
• RATED STRESS
– The manufacturer's rating for the part.
• STRESS RATIO
– Ratio of operating stress to rated stress.
• PART GRADES
– Grade 1, 2, etc. designate high-quality standard parts.
– JAN, Industrial and Commercial Grades are designations for other parts that can be used.
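The stress ratio definition above can be sketched as a one-line calculation. The function name and the capacitor values are hypothetical; the only thing taken from the text is that the ratio is operating stress divided by rated stress, both in the same unit.

```python
def stress_ratio(operating_stress: float, rated_stress: float) -> float:
    """Ratio of operating stress to rated stress (same unit for both)."""
    if rated_stress <= 0:
        raise ValueError("rated_stress must be positive")
    return operating_stress / rated_stress


# Hypothetical capacitor: 16 V applied in operation, rated by the maker at 25 V.
ratio = stress_ratio(16.0, 25.0)
print(f"stress ratio = {ratio:.2f}")

# A ratio above 1.0 means the part is operated beyond its rating.
overstressed = ratio > 1.0
```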

BACKGROUND
• Reliability engineering and management grew up largely in response to the problems of electronic equipment reliability.
• Many reliability techniques have been developed from electronics applications.

CAUSES OF ELECTRONIC COMPONENT FAILURES
Electronic Failures = f(design, mfg. process, quality type, temperature, electrical load, vibration, chemical, stresses)

OTHER CAUSES OF ELECTRONIC COMPONENT FAILURES (con't)
Electrical Load
• Higher than anticipated voltage or current loads can cause arcing and other damage.
Vibration
• Shock and vibration can cause fatigue damage to even properly made components.
Chemical
• Contaminants introduced in the manufacturing process may eventually degrade an IC or other device.
• Environmental contaminants (moisture, etc.) may promote chemical attacks on components.

MIL-HDBK-217 PREDICTION METHODS

PARTS STRESS ANALYSIS PREDICTIONS
• This method is applicable when most of the design is completed and a detailed parts list including parts stresses is available.
• This model takes into account part quality, use environment, and the base failure rate (which includes electrical and temperature stresses).


PARTS STRESS ANALYSIS (con't)

$\lambda_p = \lambda_b \, \pi_T \, \pi_A \, \pi_R \, \pi_S \, \pi_C \, \pi_Q \, \pi_E$ (Failures/$10^6$ Hours)

where:
$\lambda_p$ = part failure rate (Failures/$10^6$ Hours)
$\lambda_b$ = base failure rate (often with electrical, temp. stress)
$\pi_T$ = Temperature Factor (dimensionless, typical 1 - 150)
$\pi_A$ = Applications Factor (dimensionless, typical 1 - 5)
$\pi_R$ = Power Rating Factor (dimensionless, typical 0.5 - 1.0)
$\pi_S$ = Voltage Stress Factor (dimensionless, typical 0.1 - 1.0)
$\pi_C$ = Construction Factor (dimensionless, typical 1 - 5)
$\pi_Q$ = Quality Factor (dimensionless, typical 0.7 - 8.0)
$\pi_E$ = Environmental Factor (dimensionless, typical 1 - 450)

Each device uses some or all of these factors. Other factors are also used.
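For part types that use the multiplicative form above, the calculation can be sketched as a base failure rate multiplied by whichever pi factors apply to that part. The function name, the dictionary keys and every numeric value below are hypothetical placeholders, not handbook data.

```python
from math import prod


def part_failure_rate(base_rate: float, pi_factors: dict[str, float]) -> float:
    """Part failure rate in failures per 10^6 hours.

    base_rate  -- base failure rate, failures per 10^6 hours
    pi_factors -- only the dimensionless factors that apply to this part type
    """
    return base_rate * prod(pi_factors.values())


# Hypothetical part that uses only three of the factors (values are made up).
example_factors = {"pi_T": 2.0, "pi_Q": 1.0, "pi_E": 4.0}
lambda_p = part_failure_rate(0.01, example_factors)
print(f"lambda_p = {lambda_p:.3f} failures per 10^6 hours")
```

Passing the factors as a dictionary reflects the note that each device uses only some of the factors; an omitted factor simply does not take part in the product.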


COMBINING RESULTS
• The general procedure for determining board level failure rate is to:
– Sum the individually calculated failure rates for each component.
– Add this summation to a failure rate for the circuit board (which includes the effects of soldering parts to it).
– Account for the effects of connecting circuit boards together by adding in a failure rate for each connector.
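The roll-up procedure above can be sketched as three additions. The function name and all rates are hypothetical; every value is assumed to be in failures per 10^6 hours.

```python
from typing import Iterable


def assembly_failure_rate(
    component_rates: Iterable[float],
    board_rate: float,
    connector_rates: Iterable[float] = (),
) -> float:
    """Sum component, bare-board and connector failure rates (same unit)."""
    return sum(component_rates) + board_rate + sum(connector_rates)


# Made-up example: three components, one circuit board, two connectors.
total_rate = assembly_failure_rate(
    component_rates=[0.05, 0.02, 0.10],
    board_rate=0.03,       # includes the effect of solder joints
    connector_rates=[0.01, 0.01],
)
print(f"assembly failure rate = {total_rate:.2f} failures per 10^6 hours")
```

Simple addition of rates treats the assembly as a series system in which any single part failure counts as an assembly failure.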


Non-operating Failures
• Parts continue to fail even when not in use. In general, electronic parts fail less frequently when not operating because failures are related to operating stress. But other components tend to degrade even when not in use. Examples:
– Hydraulic parts fail because organic rubber seals outgas and cross-link when exposed to heat and ultraviolet light.
– Solid rocket engines undergo chemical degradation and can develop cracks.
• $R_s = R_{\text{operating}} \, R_{\text{non-operating}}$
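The product of operating and non-operating reliability can be illustrated with a short sketch. It assumes a constant failure rate in each state, so that reliability over a period is exp(-rate × hours); that exponential model, the function names and all numbers are assumptions made for the example and are not stated in the slides.

```python
import math


def reliability(rate_per_hour: float, hours: float) -> float:
    """Survival probability under an assumed constant failure rate."""
    return math.exp(-rate_per_hour * hours)


def system_reliability(
    op_rate: float, op_hours: float, nonop_rate: float, nonop_hours: float
) -> float:
    """R_s = R_operating * R_non-operating."""
    return reliability(op_rate, op_hours) * reliability(nonop_rate, nonop_hours)


# Made-up year: 1,000 operating hours at 10 failures per 10^6 hours,
# and 7,760 dormant hours at 1 failure per 10^6 hours.
r_s = system_reliability(10e-6, 1_000, 1e-6, 7_760)
print(f"R_s = {r_s:.4f}")
```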


Parts Count Reliability Method
• Used early in the design or when detailed data is not available.
• Uses Generic Part Type, a Quality Factor and an Environmental Factor.
• Information needed:
– (1) generic part types (including complexity for microcircuits) and quantities,
– (2) part quality levels, and
– (3) equipment environment.


Parts Count Reliability Method

$\lambda_{EQUIP} = \sum_{i=1}^{n} N_i \, (\lambda_g \, \pi_Q)_i$

where:
$\lambda_{EQUIP}$ = Total equipment failure rate (Failures/$10^6$ hrs.)
$\lambda_g$ = Generic failure rate for the i-th generic part.
$\pi_Q$ = Quality factor for the i-th generic part.
$N_i$ = Quantity of the i-th generic part.
$n$ = Number of different generic part categories in the equipment.
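The parts count summation can be sketched as a loop over a parts list. The `PartLine` type, the function name and every number are hypothetical. The sketch also assumes that the equipment environment is already reflected in the generic failure rate chosen for each part category, which is why no separate environmental factor appears in the sum.

```python
from typing import NamedTuple


class PartLine(NamedTuple):
    quantity: int          # N_i
    generic_rate: float    # lambda_g, failures per 10^6 hours
    quality_factor: float  # pi_Q, dimensionless


def equipment_failure_rate(parts: list[PartLine]) -> float:
    """lambda_EQUIP = sum over i of N_i * (lambda_g * pi_Q)_i."""
    return sum(p.quantity * p.generic_rate * p.quality_factor for p in parts)


# Made-up parts list with three generic part categories (n = 3).
parts_list = [
    PartLine(quantity=40, generic_rate=0.002, quality_factor=1.0),
    PartLine(quantity=25, generic_rate=0.004, quality_factor=3.0),
    PartLine(quantity=4, generic_rate=0.05, quality_factor=2.0),
]
lambda_equip = equipment_failure_rate(parts_list)
print(f"lambda_EQUIP = {lambda_equip:.2f} failures per 10^6 hours")
```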


LIMITATIONS
• RELIABILITY PREDICTION MUST BE USED INTELLIGENTLY, WITH DUE CONSIDERATION TO ITS LIMITATIONS.
• FAILURE RATE MODELS ARE POINT ESTIMATES WHICH ARE BASED ON AVAILABLE DATA.
– THEY ARE VALID FOR THE CONDITIONS UNDER WHICH THE DATA WERE OBTAINED AND FOR THE DEVICES COVERED.
– MODELS ARE INHERENTLY EMPIRICAL.

1.1 Purpose - The purpose of this handbook is to establish and maintain consistent and uniform methods for estimating the inherent reliability (i.e., the reliability of a mature design) of military electronic equipment and systems. It provides a common basis for reliability predictions during acquisition programs for military electronic systems and equipment. It also establishes a common basis for comparing and evaluating reliability predictions of related or competitive designs. The handbook is intended to be used as a tool to increase the reliability of the equipment being designed.

1.2 Application - This handbook contains two methods of reliability prediction - "Part Stress Analysis" in Sections 5 through 23 and "Parts Count" in Appendix A. These methods vary in degree of information needed to apply them. The Part Stress Analysis Method requires a greater amount of detailed information and is applicable during the later design phase when actual hardware and circuits are being designed. The Parts Count Method requires less information, generally part quantities, quality level, and the application environment. This method is applicable during the early design phase and during proposal formulation. In general, the Parts Count Method will usually result in a more conservative estimate (i.e., higher failure rate) of system reliability than the Parts Stress Method.

1.3 Computerized Reliability Prediction - Rome Laboratory - ORACLE is a computer program developed to aid in applying the part stress analysis procedure of MIL-HDBK-217. Based on environmental use characteristics, piece part count, thermal and electrical stresses, subsystem repair rates and system configuration, the program calculates piece part, assembly and subassembly failure rates. It also flags overstressed parts, allows the user to perform tradeoff analyses and provides system mean-time-to-failure and availability. The ORACLE computer program software (available in both VAX and IBM compatible PC versions) is available at replacement tape/disc cost to all DoD organizations, and to contractors for application on specific DoD contracts as government furnished property (GFP). A statement of terms and conditions may be obtained upon written request to: Rome Laboratory/ERSR, Griffiss AFB, NY 13441-5700.

What is MTBF?

MTBF is an acronym for Mean Time Between Failures. In general, a higher MTBF number indicates a more reliable product. Beyond this simple definition, you’ll find a wide variety of special meanings.

In the military/aerospace industries, MTBF is defined by a specific set of calculations. The formula for system longevity is based on the thermal, electrical and environmental stresses on each component. The engineer evaluates the components and subassemblies in a particular product by these formulas and produces an overall number called calculated MTBF.

Another way to compute MTBF is to evaluate product reliability based on the product’s actual performance in the field. Instead of theoretical calculations of what might occur, field MTBF is a measure of the numbers and types of failures that the products actually experience in real applications.

At Liebert, we track two types of field MTBF statistics: critical bus MTBF and hardware MTBF. In the next few paragraphs, we will explain each of these.

Critical Bus MTBF

Our primary focus is on critical bus MTBF. This measures how effectively the UPS, batteries and bypass source can support the customer’s critical load without a failure attributable to the UPS or System Control Cabinet.

Liebert maintains a database with information on every Series 600 UPS ever shipped. We also keep records of all reported failures. Each quarter we evaluate the reliability information and tally up the critical bus outages that were attributable to the UPS or System Control Cabinet.

Some events are excluded from the total. For example, if a UPS experiences an alarm condition and successfully transfers the load to bypass, there is no critical bus outage.

Likewise if utility input power fails and the UPS and batteries support the critical load for the proper number of minutes, the UPS has done its job. If the utility power (or backup Diesel generator) is not available when the UPS has drained the batteries, the UPS -- with ample warning to the operator -- will perform an orderly shutdown. This is not a chargeable critical bus outage since the equipment performed as designed.

Other excluded situations are those caused by site conditions or operator error. For example, one customer wired his facility fire alarm system to trigger the Emergency Power Off circuit on the UPS. Unfortunately, he forgot to disconnect the circuit before performing a routine test of the fire alarm system. This caused a critical bus outage, but did not count against UPS MTBF.

What have we done lately?

Each quarter we tally up the cumulative system operating hours and the total number of critical bus outages reported since the introduction of the Series 600 UPS.

As of this writing, we have records of more than 7,000 Series 600 modules in more than 5,500 systems. Cumulative system operating hours exceed 220 million. Since shipments began in 1989, we have records of just 80 critical bus failures. Considering our exposure is approximately 4 million system operating hours per month, this is a remarkably small number of failures.

We compute our field MTBF numbers by dividing system operating hours by “failures plus one.” We do this to be conservative and to be consistent with earlier published documents. Dividing 220 million hours by 81 (80 + 1) gives us a number considerably in excess of 2 million hours. We recognize that some Series 600 sites are not under contract to Liebert Global Services and might not be reporting all failures. Therefore we choose not to advertise the exact calculated number. “In excess of one million hours” is sufficient.
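The "failures plus one" arithmetic above can be reproduced in a few lines. The function name is hypothetical; the operating hours and failure count are the figures quoted in the text.

```python
def field_mtbf(operating_hours: float, failures: int) -> float:
    """Conservative field MTBF: operating hours divided by (failures + 1)."""
    return operating_hours / (failures + 1)


# Figures quoted above: 220 million system operating hours, 80 failures.
critical_bus_mtbf = field_mtbf(220e6, 80)
print(f"critical bus MTBF = {critical_bus_mtbf:,.0f} hours")
```

Adding one to the failure count keeps the result finite when no failures have been reported, and always yields a lower figure than a plain hours-per-failure ratio.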

Module MTBF

The other way we track reliability is the field MTBF of the UPS modules. For these purposes, we count every type of module or System Control Cabinet failure that causes the module to take itself off-line. As before, we exclude incidents of operator error, site problems or instances of shutdown after successful discharge of batteries.

To compile this number, we have taken various sample periods. For a challenge, one of the periods was chosen to coincide with one of the worst heat waves on record in large portions of the Midwest and Northeast. A difficult test indeed!

During the sample periods, Series 600 UPS modules accumulated approximately 6 million operating hours and 35 hardware failures. Of these, only one caused a critical bus outage. The other 34 events featured the UPS successfully transferring the load to the bypass source. Dividing 6 million hours by 35 gives a module MTBF of approximately 170,000 hours.

Methodology

Description of Methodology

The parts count method is a technique for developing an estimate or prediction of the average life, the Mean Time Between Failures (MTBF), of an assembly. It is a prediction process whereby a numerical estimate is made of the ability, with respect to failure, of a design to perform its intended function. Once the failure rate is determined, MTBF is easily calculated as the inverse of the failure rate, as follows:

$\mathrm{MTBF} = \dfrac{1}{FR_1 + FR_2 + FR_3 + \cdots + FR_n}$

where $FR_i$ is the failure rate of the i-th component and n is the total number of components in the system.
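A minimal sketch of this inverse-of-the-sum calculation follows. The function name and the component failure rates are hypothetical; rates are expressed per hour so that the result is in hours.

```python
def mtbf_from_rates(failure_rates_per_hour: list[float]) -> float:
    """MTBF in hours as the inverse of the summed component failure rates."""
    total_rate = sum(failure_rates_per_hour)
    if total_rate <= 0:
        raise ValueError("total failure rate must be positive")
    return 1.0 / total_rate


# Made-up component failure rates, in failures per hour.
component_rates = [2.0e-6, 0.5e-6, 1.5e-6]
system_mtbf = mtbf_from_rates(component_rates)
print(f"system MTBF = {system_mtbf:,.0f} hours")
```

Because the rates add, the system MTBF is always lower than the MTBF of its least reliable component taken alone.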

The general procedure for determining a board level (or system level) failure rate is to sum individual failure rates for each component. For MIL-HDBK-217, the summation is then added to a failure rate for the circuit board, which includes the effect of solder joints. Component failure rates are provided by MIL-HDBK-217, "Military Handbook, Reliability Prediction of Electronic Equipment", as standard part failure rate models or directly from the manufacturers. The failure rates presented apply to equipment under normal operating conditions, i.e., with power on and performing its intended function in its intended environment. Consideration is given to various environments, component quality, and thermal aspects.

The Equations

A sample calculation for integrated circuits taken from MIL-HDBK-217 is as follows:

Failure Rate = (C1 * PiT + C2 * PiE) * PiQ * PiL

Each factor in this equation is dependent upon a certain part parameter. The end result of this equation is the failure rate of the integrated circuit.
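The integrated circuit equation above can be written as a small function. In MIL-HDBK-217F the terms are generally described as a die complexity rate (C1), a package rate (C2), and temperature (PiT), environment (PiE), quality (PiQ) and learning (PiL) factors. The function name and the numeric inputs below are hypothetical and are not taken from the handbook tables.

```python
def ic_failure_rate(
    c1: float, pi_t: float, c2: float, pi_e: float, pi_q: float, pi_l: float
) -> float:
    """Failure Rate = (C1 * PiT + C2 * PiE) * PiQ * PiL, per 10^6 hours."""
    return (c1 * pi_t + c2 * pi_e) * pi_q * pi_l


# Made-up inputs, chosen only to show the shape of the calculation.
ic_rate = ic_failure_rate(
    c1=0.02, pi_t=0.5,   # die term: 0.02 * 0.5
    c2=0.005, pi_e=2.0,  # package term: 0.005 * 2.0
    pi_q=1.0, pi_l=1.0,
)
print(f"IC failure rate = {ic_rate:.3f} failures per 10^6 hours")
```

Unlike a purely multiplicative model, the die and package terms here are added before the quality and learning factors are applied, so temperature scales only the die term and environment scales only the package term.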


Failure Rate, MTBF, and FITs

For this discussion, we will assume that the resulting failure rate is shown in failures per million hours. This is simply the number of failures that you would expect to have in a million hours of operation of your equipment. Failure rates for many basic devices are well below 1 failure per million hours, so these values may seem insignificant. But if you have hundreds of parts in your design and have a thousand systems operating in the field, you can see that the failure rates will quickly add up. MTBF, or Mean Time Between Failures, is the inverse of the failure rate and is the average time between failures. It is calculated from the failure rate as follows:

MTBF = 1,000,000/Failure Rate

You can choose the units in which the failure rate is shown. Another common unit used, besides failures/million hours, is failures per billion hours which is also known as FITs (Failures In Time).
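The unit conversions above, and the earlier point that small failure rates add up across many parts and systems, can be sketched as follows. All function names and numbers are hypothetical; the fleet example assumes, for simplicity, that every part has the same failure rate.

```python
def fits_from_fpmh(fpmh: float) -> float:
    """Convert failures per 10^6 hours to FITs (failures per 10^9 hours)."""
    return fpmh * 1_000


def mtbf_hours_from_fpmh(fpmh: float) -> float:
    """MTBF = 1,000,000 / failure rate (failures per 10^6 hours)."""
    return 1_000_000 / fpmh


def expected_fleet_failures(
    fpmh_per_part: float, parts_per_system: int, systems: int, hours: float
) -> float:
    """Expected failures across a fleet over the given operating hours."""
    return fpmh_per_part * parts_per_system * systems * hours / 1_000_000


part_rate = 0.05  # made-up part failure rate, failures per 10^6 hours
part_fits = fits_from_fpmh(part_rate)
part_mtbf = mtbf_hours_from_fpmh(part_rate)

# 300 such parts per system, 1,000 systems, one year (8,760 hours) each.
fleet_failures = expected_fleet_failures(part_rate, 300, 1_000, 8_760)
print(part_fits, part_mtbf, fleet_failures)
```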

What is MIL-HDBK-217?

MIL-HDBK-217 is a reliability prediction standard originally developed for defense and aerospace related organizations, but later adopted by many commercial and industrial companies. Often referred to simply as 217, MIL-HDBK-217 includes mathematical reliability models for nearly all types of electrical and electronic components. These reliability models are based on parameters of the components such as number of pins, number of transistors, power dissipation, and environmental factors. Results from MIL-HDBK-217 are provided as both a failure rate and as an MTBF (Mean Time Between Failures), where the MTBF is the mathematical inverse of the failure rate.
