The Journal of the Reliability Analysis Center
Second Quarter - 2005

RAC is a DoD Information Analysis Center Sponsored by the Defense Technical Information Center

INSIDE
Risk Management and Reliability
START Sheets
New DDR&E Portal
RMSQ Headlines
New DoD RAM Guide
Future Events
From the Editor
PRISM Column

RAC Being Replaced by RIAC Under New DoD Contract
Effective on the 21st of June, the Defense Information Systems Agency (DISA) has awarded a contract to a team led by Wyle Laboratories, Inc. to operate the Reliability Information Analysis Center (RIAC). The RIAC is the new name for the DoD's chartered center of excellence in the subject areas of reliability, maintainability, quality, supportability, and interoperability (RMQSI). The new name was chosen to emphasize that the Center is part of the DoD eleven-member Information Analysis Center (IAC) program administered by the Defense Technical Information Center. The DoD RMQSI IAC will continue to meet government and industry needs under its new name RIAC. RIAC is sponsored by the Office of Secretary of Defense (OSD). Members of the Wyle team competitively selected are Quanterion Solutions Incorporated, the University of Maryland, the Pennsylvania State University Applied Research Laboratory, and the State University of New York Institute of Technology (SUNYIT). The Center will be headquartered at SUNYIT in Utica, NY. A special edition of this Journal will highlight plans for the Center after the July transition from Alion Science and Technology.

Practical Considerations in Calculating Reliability of Fielded Products
By: Bill Lycette, Agilent Technologies

Introduction
A widely used measure of product reliability is Mean Time Between Failure (MTBF). Before making a purchase decision for manufacturing equipment or other items, customers frequently require the supplier to provide an MTBF. However, MTBF is not well understood and false assumptions often lead to poor decisions. Improper MTBF calculations used in head-to-head product comparisons can result in sales lost to competitors, higher procurement and maintenance costs, and customer dissatisfaction with product experience.

This article uses technical discussion, example, and practical application to unveil the mystery behind MTBF. Three of the most commonly used statistical distributions of field failure data (exponential, Weibull, and lognormal) are reviewed. MTBF formulas are presented for each distribution. Monte Carlo simulation is then used to compare and contrast five life data models. The relationship between MTBF and Annualized Failure Rate (AFR) is also discussed and a few of the complexities with performing AFR calculations are reviewed.

Readers will be shown common pitfalls associated with reliability metrics and how they can make more informed purchasing decisions that will lead to an improved customer experience.

The MTBF Riddle
Is it possible for two MTBFs with the same value to tell two completely different reliability stories? To answer this question, an experiment was conducted using Monte Carlo simulation to compare and contrast five different sets of life data that have the same MTBF, namely 50,000 hours. In each of these five cases, 100 data points were generated to fit a specified life data distribution using parameters and a seed selected to result in an MTBF of approximately 50,000 hours. Figures 1 through 5 show the resulting probability density functions (pdfs).

Reliability Statistics Fundamentals
This article uses three statistical distributions that are important in the field of reliability engineering to model life data, namely the Weibull, exponential and lognormal distributions. We also need to review reliability terminology and the relationship between three fundamental reliability equations.

Firstly, the pdf is given by $f(t)$ and represents the relative frequency of failures over time. Secondly, the reliability function is given by $R(t)$ and represents the probability that the product survives until time $t$. Thirdly, the failure rate function $\lambda(t)$ is given by the following equation, and is also referred to as the hazard rate $h(t)$ or instantaneous failure rate.

$$\lambda(t) = \frac{f(t)}{R(t)}$$

Weibull Distribution
The Weibull distribution is highly valued by the reliability engineer because of its flexibility to model many different life data scenarios. For life data that fits a Weibull distribution, the probability density function (pdf) is given by the following equation, where $\eta$ is the scale parameter in units of time (also referred to as the characteristic life), $\beta$ is the unit-less shape parameter, and $\gamma$ is the location parameter in units of time.

$$f(t) = \frac{\beta}{\eta}\left(\frac{t-\gamma}{\eta}\right)^{\beta-1} e^{-\left(\frac{t-\gamma}{\eta}\right)^{\beta}}$$

Often $\gamma$ represents shipment transit time of the product. The Weibull reliability function is given by:

$$R(t) = e^{-\left(\frac{t-\gamma}{\eta}\right)^{\beta}}$$

Therefore, the Weibull failure rate function is given by:

$$\lambda(t) = \frac{\beta}{\eta}\left(\frac{t-\gamma}{\eta}\right)^{\beta-1}$$

For any life data that follows a Weibull distribution, the characteristic life, $\eta$, is always the operating time, $t$, at which 63.2% of the population is expected to fail.

A $\beta$ value of less than one indicates a decreasing failure rate and is typical of infant mortality. When $\beta$ is equal to one, the failure rate function reduces to that given by the exponential distribution and the failure rate is constant at $1/\eta$. A $\beta$ value of greater than one indicates an increasing failure rate and is typical of wear-out mechanisms. When $\beta$ is equal to two, the pdf becomes the Rayleigh distribution and the failure rate function increases linearly with time. When the value of $\beta$ is between three and four, the pdf is "normal" in appearance.
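The Weibull reliability and failure rate functions described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original article: the function names and the sample scale parameter of 50,000 hours are assumptions chosen only to show how the shape parameter changes the trend of the failure rate.

```python
import math


def weibull_reliability(t, eta, beta, gamma=0.0):
    """R(t) = exp(-((t - gamma) / eta) ** beta); no failures before t = gamma."""
    if t <= gamma:
        return 1.0
    return math.exp(-(((t - gamma) / eta) ** beta))


def weibull_failure_rate(t, eta, beta, gamma=0.0):
    """lambda(t) = (beta / eta) * ((t - gamma) / eta) ** (beta - 1), for t > gamma."""
    if t <= gamma:
        raise ValueError("failure rate is evaluated only for t > gamma")
    return (beta / eta) * ((t - gamma) / eta) ** (beta - 1)


if __name__ == "__main__":
    eta = 50_000.0  # assumed characteristic life in hours (illustrative only)
    for beta in (0.5, 1.0, 2.0):
        early = weibull_failure_rate(1_000.0, eta, beta)
        late = weibull_failure_rate(100_000.0, eta, beta)
        if late < early:
            trend = "decreasing"
        elif late > early:
            trend = "increasing"
        else:
            trend = "constant"
        # At t = eta the surviving fraction is exp(-1), whatever the shape.
        survive = weibull_reliability(eta, eta, beta)
        print(f"beta={beta}: failure rate is {trend}; R(eta)={survive:.3f}")
```

Evaluating the reliability at the characteristic life returns the same surviving fraction for every shape parameter, which is the 63.2% failed point noted in the text.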
The MTBF of the Weibull distribution is given by the following equation, where $\Gamma$ is the gamma function.

$$\text{MTBF} = \gamma + \eta\,\Gamma\!\left(\frac{1}{\beta} + 1\right)$$

Since $\Gamma(2) = 1$, then the MTBF when $\beta$ is equal to one and $\gamma$ is equal to zero is:

$$\text{MTBF} = \eta$$

In other words, the only time that the MTBF equals the characteristic life is when $\beta$ is equal to one.

Exponential Distribution
For life data that fits an exponential distribution, the pdf is given by the following equation, where $\lambda$ is the failure rate expressed in failures per unit time and $\gamma$ is the location parameter in units of time.

$$f(t) = \lambda e^{-\lambda(t-\gamma)}$$

The exponential reliability function is given by:

$$R(t) = e^{-\lambda(t-\gamma)}$$

Therefore, the exponential failure rate function is given by:

$$\lambda(t) = \frac{\lambda e^{-\lambda(t-\gamma)}}{e^{-\lambda(t-\gamma)}} = \lambda$$

The exponential distribution is widely used (and often misused) because of its simplicity, and the fact that it has a constant failure rate $\lambda$. The MTBF for the exponential distribution (with $\gamma = 0$) is given by:

$$\text{MTBF} = \frac{1}{\lambda}$$

Note: When $\beta$ is 1.0, the Weibull distribution is equivalent to the exponential distribution, i.e., MTBF = $\eta$ = $1/\lambda$. It is also the only scenario when the MTBF can be directly calculated using the reciprocal of the failure rate.

Lognormal Distribution
The lognormal distribution often describes failure processes involving degradation over time, such as corrosion, metal migration, or chemical reaction. The failure rate function is complex and beyond the scope of this article. The times to failure are lognormally distributed if the natural logarithm of the times to failure is normally distributed. The MTBF of the lognormal distribution is given by:

$$\text{MTBF} = e^{\left(\mu + 0.5\sigma^{2}\right)}$$

For this equation, $\mu$ is the mean of the natural logarithm of the times to failure and $\sigma$ is the standard deviation of the natural logarithm of the times to failure.

Further details of these reliability distributions can be found in the literature (References 1 and 2). Instantaneous failure rates, reliability, and MTBFs can be easily calculated using commercially available software (Reference 3).
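As a rough check on the three MTBF formulas above, the following Python sketch evaluates them with the parameter values the article quotes for its five examples and compares each against the mean of a small simulated sample. The function names, the seed, and the use of Python's standard `random` module are assumptions made for illustration; the article does not say which tool or seed produced its simulation.

```python
import math
import random


def weibull_mtbf(eta, beta, gamma=0.0):
    """MTBF = gamma + eta * Gamma(1/beta + 1)."""
    return gamma + eta * math.gamma(1.0 / beta + 1.0)


def exponential_mtbf(lam):
    """MTBF = 1 / lambda (location parameter taken as zero)."""
    return 1.0 / lam


def lognormal_mtbf(mu, sigma):
    """MTBF = exp(mu + 0.5 * sigma**2)."""
    return math.exp(mu + 0.5 * sigma ** 2)


def simulated_mean(draw, n, seed):
    """Mean of n simulated times to failure; draw(rng) returns one sample."""
    rng = random.Random(seed)
    return sum(draw(rng) for _ in range(n)) / n


if __name__ == "__main__":
    # (label, closed-form MTBF, sampler); random.weibullvariate takes (scale, shape).
    examples = [
        ("Weibull beta=0.5", weibull_mtbf(25_000, 0.5), lambda r: r.weibullvariate(25_000, 0.5)),
        ("Weibull beta=1.0", weibull_mtbf(50_000, 1.0), lambda r: r.weibullvariate(50_000, 1.0)),
        ("Weibull beta=2.0", weibull_mtbf(56_419, 2.0), lambda r: r.weibullvariate(56_419, 2.0)),
        ("Weibull beta=3.0", weibull_mtbf(55_992, 3.0), lambda r: r.weibullvariate(55_992, 3.0)),
        ("Lognormal", lognormal_mtbf(10.3, 1.0196), lambda r: r.lognormvariate(10.3, 1.0196)),
    ]
    for label, mtbf, draw in examples:
        sample = simulated_mean(draw, n=100, seed=2005)  # seed is arbitrary
        print(f"{label}: formula {mtbf:,.0f} h, mean of 100 draws {sample:,.0f} h")
```

With only 100 draws per distribution the sample means scatter noticeably around the closed-form values, which is one reason a seed would need to be chosen deliberately to land near 50,000 hours.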
The MTBF Riddle Explained
Returning to the experiment where Monte Carlo simulation is used to create five unique models of life data, recall that they all have the same MTBF of 50,000 hours.

Figure 1 shows the 2-parameter Weibull distribution ($\gamma$ = 0), where $\beta$ is 0.5 and $\eta$ is 25,000. This example illustrates the classic case of infant mortality where the instantaneous failure rate is decreasing. The pdf shows a very high percentage of early failures followed by a steep decline in the number of failures.

In Figure 2, the 2-parameter Weibull distribution is shown where $\beta$ is 1.0 and $\eta$ is 50,000. This result is the same that one would get with the single-parameter exponential distribution, namely $1/\lambda = \eta$. A $\beta$ value of 1.0 yields a failure rate that is constant over time. It is also the only Weibull scenario where $\eta$ is equal to the MTBF. The pdf in this case starts off at a moderately high level and then the frequency of failures drops off steadily over time.

Figure 1. pdf for 2-Parameter Weibull Distribution: $\eta$ = 25,000 and $\beta$ = 0.5 [plot of f(t) versus Time, (t)]

Figure 2. pdf for 2-Parameter Weibull Distribution: $\eta$ = 50,000 and $\beta$ = 1.0 [plot of f(t) versus Time, (t)]

Figure 3 shows the 2-parameter Weibull distribution where $\beta$ is equal to 2.0 and $\eta$ is 56,419. The pdf has a slightly normal appearance and is positively skewed. The frequency of failures starts off low, steadily increases and then gradually tapers off.

A 2-parameter Weibull distribution where $\beta$ is 3.0 and $\eta$ is 55,992 is shown in Figure 4. The pdf appears to be normally distributed. In this scenario, a strong wear-out mechanism is at work. The frequency of failures starts out at a very low level, then increases rapidly and subsequently decreases rapidly.

Figure 3. pdf for 2-Parameter Weibull Distribution: $\eta$ = 56,419 and $\beta$ = 2.0 [plot of f(t) versus Time, (t)]

Figure 4. pdf for 2-Parameter Weibull Distribution: $\eta$ = 55,992 and $\beta$ = 3.0 [plot of f(t) versus Time, (t)]

Finally, in Figure 5, the lognormal distribution is illustrated where $\mu$ is 10.3 and $\sigma$ is 1.0196. The resulting pdf is similar in shape to what is seen in Figures 1 and 2, suggesting modest infant mortality where the failure frequency initially starts out high but then steeply declines.

Figure 5. pdf for Lognormal Distribution: $\mu$ = 10.3, $\sigma$ = 1.0196 [plot of f(t) versus Time, (t)]

In comparing the five examples, it is clear that MTBF on its own yields very little insight into: 1) the instantaneous failure rates expected over the service life of the product, or 2) the expected survival percentage (reliability function) at any point in time. In fact, without knowledge of how life data is distributed, mistakes in equipment or material procurement decisions are bound to occur.

Table 1 shows the expected reliability of the five life data examples. Suppose a design engineer is developing a system that uses a power supply assembly from two suppliers, both of whom offer an MTBF specification of 50,000 hours. If the required reliability is 80% at 10,000 hours, then selecting a supplier whose power supply life data behaves as shown in Example 1 ($\beta$ = 0.5) would yield extremely disappointing results for the customer who purchased the system.

On the other hand, suppose the customer-use model of the system dictates a service life of less than 1,000 hours at which point the system is discarded. Further assume that a reliability of 95% at 1,000 hours is acceptable. Lastly, assume that the power supply assembly from Supplier A has life data distributed as shown in Example 2 ($\beta$ = 1.0) and costs one-half of an equivalent power supply assembly from Supplier B that has life data distributed as shown in Example 4 ($\beta$ = 3.0). Clearly, considerable cost savings could be realized by purchasing from Supplier A.

Table 1. Reliability of Life Data Distributions that all Have an MTBF of 50,000 Hours (entries are reliability at mission end time)

| Mission End Time (hours) | Ex. #1 Weibull, $\eta$ = 25,000, $\beta$ = 0.5 | Ex. #2 Weibull*, $\eta$ = 50,000, $\beta$ = 1.0 | Ex. #3 Weibull, $\eta$ = 56,419, $\beta$ = 2.0 | Ex. #4 Weibull, $\eta$ = 55,992, $\beta$ = 3.0 | Ex. #5 Lognormal, $\mu$ = 10.3, $\sigma$ = 1.0196 |
|---|---|---|---|---|---|
| 100 | 0.936 | 0.998 | 1.000 | 1.000 | 1.000 |
| 200 | 0.912 | 0.996 | 1.000 | 1.000 | 1.000 |
| 500 | 0.865 | 0.990 | 1.000 | 1.000 | 1.000 |
| 1,000 | 0.815 | 0.981 | 1.000 | 1.000 | 1.000 |
| 5,000 | 0.636 | 0.906 | 0.992 | 0.999 | 0.924 |
| 10,000 | 0.529 | 0.820 | 0.967 | 0.994 | 0.777 |
| 20,000 | 0.409 | 0.671 | 0.876 | 0.953 | 0.536 |
| 30,000 | 0.335 | 0.548 | 0.743 | 0.850 | 0.382 |
| 40,000 | 0.284 | 0.447 | 0.589 | 0.681 | 0.282 |
| 50,000 | 0.245 | 0.365 | 0.438 | 0.473 | 0.214 |
| 60,000 | 0.215 | 0.300 | 0.305 | 0.275 | 0.166 |
| 70,000 | 0.190 | 0.243 | 0.199 | 0.129 | 0.132 |
| 80,000 | 0.170 | 0.198 | 0.122 | 0.047 | 0.106 |
| 90,000 | 0.153 | 0.161 | 0.070 | 0.013 | 0.087 |
| 100,000 | 0.139 | 0.131 | 0.037 | 0.003 | 0.072 |

*The exponential distribution is equivalent to the Weibull distribution when $\beta$ = 1.0.

Uncertainty in Reliability Metrics
Most hardware suppliers cite a single MTBF number, i.e., they provide a point estimate of the MTBF. However, sampling error associated with such a metric can be significant and can lead to costly problems. Understanding the suppliers' confidence bounds on the point estimate can have significant bearing on the buying decision.

Figures 6 and 7 illustrate how two identically-distributed sets of life data can have very different confidence bounds. In both cases, a 2-parameter Weibull distribution represents the underlying life data, with $\beta$ equal to 1.0 and $\eta$ equal to 50,000 hours. The only difference between the two examples is the number of failures: 10 failure events are modeled in Figure 6 and 100 failure events in Figure 7. Both examples show the 2-sided confidence bounds.

Significant uncertainty exists in the product's reliability function when the number of failures in the reliability model is low. Failure to factor in this uncertainty can lead to unexpected, disappointing, and costly results experienced by the customer.

Figure 6. Weibull Distribution: $\beta$ = 1.0, 10 Failures, 90% CB [plot of Reliability, R(t) versus Time, (t)]

Figure 7. Weibull Distribution: $\beta$ = 1.0, 100 Failures, 90% CB [plot of Reliability, R(t) versus Time, (t)]

Using AFR to Calculate MTBF
In the previous sections we saw how MTBF is calculated using statistical models of field failure data. Often, field failure data is incomplete or the expertise to create such a failure data model is unavailable. In the absence of such information or methods, the all-too-common (and flawed) practice is to use the familiar annualized failure rate (AFR). This method involves taking the reciprocal of the AFR and multiplying it by the hours per year of operation time $T$, that is

$$\text{MTBF} = \frac{T}{\text{AFR}}$$

Such a method has a number of problems associated with it.
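The reciprocal-of-AFR calculation just described can be sketched as follows. This is an illustrative example only: the function name, the 2% AFR, and the two operating-hours figures (continuous operation versus a single-shift work week) are assumptions, not values from the article.

```python
def mtbf_from_afr(afr, hours_per_year):
    """MTBF = T / AFR. Meaningful only if the failure rate is constant over time."""
    if afr <= 0:
        raise ValueError("AFR must be positive")
    return hours_per_year / afr


if __name__ == "__main__":
    afr = 0.02  # hypothetical annualized failure rate: 2% of units fail per year

    # Two assumed customer use models for the operating hours per year, T.
    hours_continuous = 24 * 365    # round-the-clock operation
    hours_single_shift = 8 * 5 * 52  # 8 hours a day, 5 days a week

    for label, hours in (("24x7", hours_continuous), ("8x5", hours_single_shift)):
        print(f"{label}: T = {hours} h/yr, MTBF = {mtbf_from_afr(afr, hours):,.0f} h")
```

The same AFR produces very different MTBF figures depending solely on the value assumed for T, so any MTBF derived this way should be quoted together with its use-model assumption.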
To begin with, there is the inherent assumption that the failure rate is constant over time, i.e., the life data follows an exponential distribution or Weibull distribution when $\beta$ is equal to 1.0. Another difficulty is determining what value of $T$ to use. Depending upon assumed customer use models, such as 24 hours per day, 7 days per week (24x7) or 8x5, the resulting MTBF can vary by as much as a factor of four.

The selection of AFR method can also introduce significant variability in MTBF results. Seemingly countless different methods can be used to calculate field failure rates. For instance, Agilent Technologies calculates AFR by taking the number of warranty failures in the reporting month, dividing by the number of units under warranty in that month, and then multiplying by 12 to annualize the result. Jon G. Elerath's paper on the subject does an excellent job in summarizing, comparing and contrasting a number of different techniques (Reference 4). While the reciprocal of AFR may be useful for making a reasonable estimate of MTBF in some cases, the reliability practitioner should at least be aware of the built-in assumptions and potential error that this method introduces.

Other AFR Considerations
In addition to the selection of AFR method, the reliability engineer must pay careful attention to several other variables that influence AFR. For instance, it is critical that failure mode classifications be treated consistently when making head-to-head AFR comparisons. Often, No Trouble Found (NTF) and overstress modes are included in one model but not another. Another important factor is the selection of an appropriate shipment window. Should it be based on the past one-month, six-month or 12-month shipment history? Or should the lowest AFR achieved over the past 12 months of shipment history be used?
Consistency of method, sustainability of reliability, and availability of sufficient life data to assure reasonable confidence bounds are important elements to consider.

Another important factor in calculating an accurate AFR is the use of complete and accurate life data. It is best to use warranty data for this calculation because it represents the most complete data set typically available. Customers have financial incentives to return warranty failures to the manufacturer for repair. This affords the greatest opportunity for the manufacturer to collect a complete set of failure data from a range of shipment dates. Out-of-warranty failures may be returned to the manufacturer for repair in only one-third or fewer instances, which makes out-of-warranty data useless for calculating AFR. Any AFR calculated from such a data set will yield an erroneous point estimate of MTBF.

Conclusions
MTBF is often cited by equipment manufacturers as the "go to" reliability metric. However, MTBF on its own provides very little insight into how the failure rate behaves over time or what the expected reliability will be at any given moment. It is also important to understand the uncertainty associated with an MTBF estimate.

In the absence of life data modeling, MTBF is often calculated by taking the reciprocal of AFR and multiplying it by an estimated number of hours per year of operation. This method assumes that the product's failure rate is constant over time; however, such an assumption is frequently far from true.

Without a solid understanding of a product's life data, substantial errors can occur when calculating MTBF and AFR. Decisions based on flawed methods such as these can result in lost sales to competitors, higher costs to procure equipment and material, and disappointed customers.

Acknowledgments
The author wishes to thank John Herniman, Scott Voelker, and Greg Larsen for their inspiration and insights that helped make this article possible.
References
1. Applied Reliability, Second Edition, Paul A. Tobias and David C. Trindade, CRC Press, 1995.
2. Practical Reliability Engineering, Fourth Edition, Patrick D.T. O'Connor, John Wiley & Sons, Inc., 2002.
3. Life Data Analysis Reference, ReliaSoft Publishing, ReliaSoft Corporation, Tucson, Arizona, 1997.
4. "AFR: Problems of Definition, Calculation and Measurement in a Commercial Environment", J.G. Elerath, Reliability and Maintainability Symposium Annual Proceedings, January 24-27, 2000, pp. 71-76.

About the Author
Bill Lycette is a Senior Reliability Engineer with Agilent Technologies. He has 24 years of engineering experience with Hewlett-Packard and Agilent Technologies, including positions

Risk Management and Reliability
By: Ned H. Criscimagna, Alion Science and Technology

Introduction
Risk management is one of the critical responsibilities of any manager. The term "risk management" is used by managers and analysts in a number of diverse disciplines. These include the fields of statistics, economics, psychology, social sciences, biology, engineering, toxicology, systems analysis, operations research, and decision theory.

Risk management means something slightly different in each of the disciplines just mentioned. For social analysts, politicians, and academics it is managing technology-generated macro-risks that appear to threaten our existence. To bankers and financial officers, it is usually the application of techniques such as currency hedging and interest rate swaps. To insurance buyers and sellers, it is insurable risks and the reduction of insurance costs. To hospital administrators it may mean "quality assurance." To safety professionals, it is reducing accidents and injuries.
For military acquisition managers, it means identifying, prioritizing, and managing the technical, cost, and schedule risks inherent in developing a new weapon system. This article discusses how an effective reliability program can be a valuable part of an overall risk management effort for military system acquisition programs.

What is Risk?
The American Heritage® and Webster dictionaries define the term similarly. These definitions can be summarized as:
1. Possibility of suffering harm or loss: Danger.
2. A factor, course, or element involving uncertain danger: Hazard.
3. The danger or probability of loss to an insurer.
4. The amount that an insurance company stands to lose.
5. One considered with respect to the possibility of loss to an insurer (e.g., a good risk).

A more general definition of risk, perhaps more appropriate for acquisition, is: Risk is the chance that an undesirable event might occur in the future that will result in some negative consequences. This latter definition of risk is often expressed as an equation (Reference 1):

Risk Severity = Probability of Occurrence × Potential Negative Impact

In the sense of the definition just given, risk is a part of everyday life. We all are faced with uncertainties in our lives, our careers, and our decisions. Since we cannot avoid such uncertainties, we must find ways to deal with them.

Similarly, the acquisition manager faces uncertainty concerning the technical challenges in designing a new system, and the cost and schedule estimates. Much effort is expended in trying to assess the technical challenges of a new program, in estimating the costs associated with that program, and in scheduling the program. In addition to the many constraints placed upon the manager, such as budgets, timeframes, and technical state-of-the-art, the uncertainties, or the risks, make the job of managing the program to a successful conclusion a difficult one.

Technical risk affects cost and schedule.
As stated in an article in the Journal of Defense Acquisition University (Reference 2):

There is no dispute that there is a strong relationship between technical risk and cost and schedule overruns, nor is there any dispute that DoD Project Offices must assess and mitigate technical risk if they are to be successful. However, what must be kept in mind is that technical risk in-and-of-itself does not directly result in cost and schedule overruns. The moderating variable is the manner in which a project's contract is crafted and how deftly the contract is administered, given the nature of a project's technical risk.

As an aside, in his 1999 thesis (Reference 3) written for the Naval Postgraduate School, James Ross identified poorly defined requirements as one of the highest risks during pre-solicitation. Without very clearly defined, justifiable, and realistic requirements, the already difficult task of risk management during program execution is even more difficult.

What is Risk Management?
One can compare the job of program management to that of a ship captain directing the safe passage of the vessel through waters