The Journal of the Reliability Analysis Center, Fourth Quarter - 2003
RAC is a DoD Information Analysis Center sponsored by the Defense Technical Information Center

Biomedical Survival Analysis vs. Reliability: Comparison, Crossover, and Advances
By: Larry George, Problem Solving Tools

Introduction

What can reliability people and biomedical statisticians learn from each other? This article begins with an example, then compares the state of the art (SOTA) in both fields and indicates crossovers: what each group can learn from the other. The article ends with a suggestion for both groups on the use of available data.

Let's compare methods for dealing with factors that affect system reliability. For example, is the central processor or parallel redundant chambers more important in semiconductor capital equipment? Answer this question by determining derivatives of system reliability with respect to part factors, because the larger the derivative, the more important the factor, assuming equal costs per unit change of factors.

Reliability Method. Reliability people often assume a constant failure rate. If true, the system reliability, the probability that the central processor (CP) and at least one chamber (CH) will survive to age t, is

$$R(t) = \exp[-t/\mathrm{MTBF}_{CP}] \left(1 - \left(1 - \exp[-t/\mathrm{MTBF}_{CH}]\right)^{3}\right)$$

Figure 1 graphs the derivatives of system MTBF (the integral of $R(t)$ for $\mathrm{MTBF}_{CP} = 100$ hours and $\mathrm{MTBF}_{CH} = 200$ hours). The derivative of system reliability with respect to central processor MTBF is larger. That means that the central processor MTBF has more effect on system reliability than chamber MTBF. Figure 2 shows the system MTBF for various part MTBFs. An increase in chamber MTBF of 100 hours increases system MTBF by 40 hours, but the same increase in central processor MTBF increases system MTBF from 50 to 100 hours, depending on chamber MTBF.

Figure 1. Derivatives of System Reliability with Respect to Part MTBFs. The upper curve is the derivative with respect to central processor MTBF. (Axes: MTBF, derivative.)

Figure 2. System MTBF as a Function of Part MTBFs. (Axes: $\mathrm{MTBF}_{CP}$, $\mathrm{MTBF}_{CH}$, MTBF.)

Survival Analysis Method. In contrast to the reliability assumption of constant failure rate, biomedical statisticians often assume the age-specific failure rate function is a "relative risk" or "proportional hazards" [Cox] model:

$$\lambda(t;Z) = \lambda_o(t) \exp(Z\beta)$$

In this equation, $\beta$ is a vector of regression coefficients, $\lambda_o(t)$ is the "base" failure rate when Z is zero, and Z is a vector of "concomitant" variables representing factors besides age t that can account for variation in $\lambda(t;Z)$ from $\lambda_o(t)$. Those factors are called concomitant because they accompany subjects with factors equal to Z. The proportional hazards model means that the failure (a.k.a. hazard) rate function of a subject is proportional to the base failure rate function. The $\exp(Z\beta)$ term is the risk relative to the base failure rate, $\lambda_o(t)$.

Figure 3 graphs the derivatives of system reliability with respect to the relative risks, $\exp(Z_{CP}\beta_{CP}) = 100$ and $\exp(Z_{CH}\beta_{CH}) = 200$, using the relative risk model.
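Both importance calculations above can be sketched numerically. The sketch below is an illustration under stated assumptions, not the author's computation: it takes the structure described above (one central processor in series with three parallel chambers), assumes the same structure carries over to the relative risk model with each part's cumulative failure rate equal to a common cumulative base rate times its relative risk, and evaluates that model at a hypothetical cumulative base rate of 0.002. All function and variable names are illustrative.

```python
import math

def system_reliability(t, mtbf_cp, mtbf_ch):
    """R(t): the CP survives and at least one of three chambers survives."""
    r_cp = math.exp(-t / mtbf_cp)
    q_ch = 1.0 - math.exp(-t / mtbf_ch)  # probability one chamber has failed by age t
    return r_cp * (1.0 - q_ch ** 3)

def system_mtbf(mtbf_cp, mtbf_ch):
    """Integral of R(t) over t >= 0, using 1 - (1 - x)^3 = 3x - 3x^2 + x^3."""
    a, b = 1.0 / mtbf_cp, 1.0 / mtbf_ch
    return 3.0 / (a + b) - 3.0 / (a + 2.0 * b) + 1.0 / (a + 3.0 * b)

def relative_risk_reliability(cum_base, risk_cp, risk_ch):
    """Same structure; each part's cumulative failure rate is cum_base * relative risk (assumed)."""
    r_cp = math.exp(-cum_base * risk_cp)
    q_ch = 1.0 - math.exp(-cum_base * risk_ch)
    return r_cp * (1.0 - q_ch ** 3)

def central_difference(f, x, h=1e-4):
    """Numerical derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

MTBF_CP, MTBF_CH = 100.0, 200.0  # hours, from the example
RISK_CP, RISK_CH = 100.0, 200.0  # relative risks, from the example
CUM_BASE = 0.002                 # hypothetical cumulative base failure rate

mtbf_sys = system_mtbf(MTBF_CP, MTBF_CH)
d_mtbf_cp = central_difference(lambda m: system_mtbf(m, MTBF_CH), MTBF_CP)
d_mtbf_ch = central_difference(lambda m: system_mtbf(MTBF_CP, m), MTBF_CH)
d_rel_cp = central_difference(
    lambda r: relative_risk_reliability(CUM_BASE, r, RISK_CH), RISK_CP)
d_rel_ch = central_difference(
    lambda r: relative_risk_reliability(CUM_BASE, RISK_CP, r), RISK_CH)

print(f"R(50 hours)             = {system_reliability(50.0, MTBF_CP, MTBF_CH):.4f}")
print(f"system MTBF             = {mtbf_sys:.1f} hours")
print(f"d(MTBF)/d(MTBF_CP)      = {d_mtbf_cp:.4f}")
print(f"d(MTBF)/d(MTBF_CH)      = {d_mtbf_ch:.4f}")
print(f"dR/d(relative risk CP)  = {d_rel_cp:.6f}")
print(f"dR/d(relative risk CH)  = {d_rel_ch:.6f}")
```

Under these assumptions the system MTBF works out to 90 hours, the derivative with respect to central processor MTBF is several times the derivative with respect to chamber MTBF, and both relative risk derivatives are negative, with the central processor derivative larger in magnitude.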
The lower curve is the derivative with respect to central processor relative risk; its greater magnitude indicates that central processor factors have more effect than chamber factors.

Figure 3. Derivatives of System Reliability With Respect to Part Relative Risks. The lower curve is the derivative with respect to central processor relative risk. (Axes: cumulative intensity, derivative.)

Review. This example came from a recent presentation in which the author, a reliability statistician, described randomizing part MTBFs and then using DoE and response surface analysis to answer the importance question. Field failure rates are seldom constant, but because data is claimed to be unavailable or expensive, and because of variations in process, customer, and environmental factors, people randomize MTBF. Randomizing MTBF is a weak alternative to getting data and doing statistical analysis. The only reason for randomizing distribution parameters is to represent sample uncertainty. Probability distributions themselves represent randomness, so it is unnecessary to randomize their parameters.

The biomedical relative risk model answers the importance question without assumptions about the failure rate, and it incorporates concomitant factors. It has passed the test of time; it predominates in biomedical survival analysis, even though it was introduced in 1972 [Cox].

The following two sections on the state of the art describe the objectives, data, subjects, profession, support, standardization, publications, software, and statistics used in biomedical survival analysis and reliability.

Biomedical Survival Analysis SOTA

Objectives. Analysis of ages at failures, usually lightly censored or truncated, to estimate the survivor function (a.k.a. reliability function); to do hypothesis tests, usually to compare treatment and control effects; to make forecasts; and to evaluate the effects of concomitant variables using regression and multivariate analysis.
Mind-boggling variations due to stratification, censoring, truncation, competing risk, and multistate models keep biomedical statisticians busy.

Data. Clinical trials use age-at-death data (duration of response to treatment, time to illness or recurrence) to test hypotheses and quantify treatment effects [Kalbfleisch and Prentice, Klein and Moeschberger]. Clinical trials cost money, and sometimes diseases are rare, so sample sizes can be small.

Subjects. In some ways, humans are relatively simple: subsystems are clear, there is no sell-through time, humans operate one hour per calendar hour, humans usually repair themselves spontaneously, and fairly specific failure modes are recorded in death certificates.

Standardization. Statistical use of age-at-failure data is standardized in Food and Drug Administration, National Institutes of Health (NIH), and drug company procedures and in insurance company actuarial methods. Human actuarial failure rates are published by the Centers for Disease Control and Prevention (CDC) and used by insurance companies and the Social Security Administration.

Profession. Biomedical statistics professional organizations (the American Statistical Association, with more than 16,000 members, the Bernoulli Society, the Biometric Society, and the Institute of Mathematical Statistics) are somewhat academic. Nearly every large nation has a professional statistical organization. The Royal Statistical Society (England) was inaugurated in 1834. The Society of Actuaries requires comprehensive examinations for regular membership.

Support. The biomedical statistics profession is well supported by the federal government: NIH, National Institute on Drug Abuse, National Center for Health Statistics, and CDC. Statistical programs are part of the federal budget, although many are baseball and census statistics. Drug companies and health organizations employ hundreds of statisticians.

Publications. Academic publications have a high standard.
The following web site lists relevant books: Medical journals abound with peer-reviewed case studies, sometimes contradictory. Newspaper headlines and news commentators report drug and treatment developments.

Software. Good statistical computer programs (expensive, current, and well supported) are available. Many include survival analysis.

Statistics. Normal distribution statistics predominate, with some nonparametric statistics. Relative risk and proportional hazards models are widely used to represent concomitant variables and quantify their effects. Biostatisticians use these models for testing hypotheses about concomitant variables, without estimating failure rate functions.

The stochastic integral-martingale representation of the cumulative failure rate function is widely used to prove asymptotic results, even to evaluate derivative options in stock markets (Black-Scholes).

Computer-aided tomography (CAT), nuclear magnetic resonance (NMR), positron emission tomography (PET), and encephalography statistical methods play an important role in biostatistics, although they aren't survival analysis methods.

Reliability SOTA

Objectives. Predict, monitor, and improve reliability and use reliability information for design, process, and service decisions. The emphasis is on prediction and test, often accelerated, in addition to analysis of field data in the hands of customers. Systems are complex, and failure modes are sometimes recorded, sometimes masked, and sometimes unknown. Sell-through time and age measures other than calendar age complicate reliability estimation and use. Service often takes a secondary role in companies eager to keep their sales revenue. Field data is usually highly censored.
Reliability shares with biomedical statistics variations due to censoring, truncation, competing risk, multistate models, and multivariate age measures, but not stratification. Step stress and fatigue failure models don't have counterparts in biostatistics.

Data. Age-at-failure data is expensive and corrupted by sell-through time and errors, and it may not be classified by failure mode. Many companies quit tracking products and service parts by serial number, so they don't have age-at-failure data. Reliability tests suffer from the same cost limitations as clinical trials. In the aviation industry, only about 75 "fracture-critical" parts per aircraft are tracked by tail number, hours, and cycles. The result is that field reliability is seldom known and used.

Standardization. The Federal Aviation Administration (FAA), National Highway Traffic Safety Administration (NHTSA), and Nuclear Regulatory Commission (NRC) rely on information from, and negotiate with, the organizations they regulate. Military Standards are either too procedural in nature or have been canceled. The Baldrige National Quality Program and ISO 9000 ignore reliability.

There are some bright spots, however. Markovian cost analysis of isotope separation plants was first done a long time ago. In the late 1950s, RAND adapted actuarial methods to engine management for the Air Force Logistics Command. NASA has good reliability-based diagnostics for the space station, but not for the space shuttles. Technical Committee 56 of the International Electrotechnical Commission has a series of standards that deal not only with the programmatic aspects of reliability but also with the associated statistical tools.

Profession.
Most professional reliability organizations are small divisions of relatively nonacademic organizations: Institute of Electrical and Electronics Engineers (IEEE), American Society for Quality (ASQ), Institute of Environmental Science (IES), Institute of Industrial Engineers (IIE), Society for Maintenance and Reliability Professionals (SMRP), and Society of Automotive Engineers (SAE). The Society of Reliability Engineers (SRE) nearly folded when George Chernowitz passed away. The Institute for Operations Research and the Management Sciences (INFORMS) and the Society for Industrial and Applied Mathematics (SIAM) are academic. ASQ sponsors a certification program for reliability engineers. Two major universities offer graduate degrees in reliability, and many others offer courses in designing for reliability and the use of statistics in assessing reliability.

Support. The Air Force Office of Scientific Research has had no statisticians for years. The military looks for new, advanced weapons while trying to maintain the old ones with old methods. NASA, NRC, and the military funded many potentially useful paper studies, which have been forgotten. The NHTSA won't support the statistics it needs. Companies lay off their reliability engineers or train people from other professions to act as reliability engineers.

Publications. Publications frequently print models, methods, and estimators using standard distributions. See for a list.

Software. The only commercially viable reliability software seems to be for MTBF prediction, FMEA, FRACAS, ALT, RCM, simulation, and Weibull analysis. For lists of available software, see and . Statistical software vendors recognize the mathematical equivalence of survival analysis and reliability in marketing their survival analysis software.

Statistics. Exponential and Weibull statistics predominate in reliability. MTBF prediction has no counterpart in biomedical statistics.
Reliability people use accelerated life models just like biomedical accelerated failure time models. The stochastic integral-martingale representation is beginning to appear in reliability, with the Nelson-Aalen cumulative failure rate estimator and other applications [Bagdonavicius and Nikulin, Aven and Jensen].

Crossover

Statisticians take pains to make statistical definitions consistent with lay usage. Unfortunately, lay people believe that MTBF is reliability. It's easy to measure something with a number and difficult to measure it with a function, but a function is necessary to quantify randomness. Not all reliability engineers even agree that reliability is a probability distribution function, and few managers understand the concept of probability distributions.

Reliability people can learn from biomedical survival analysis. Reliability is defined as "the probability of survival to specified ages under specified conditions," which requires estimating the survival distribution. That's what statisticians do. Of course, reliability people should use the information learned from statistical analyses, in addition to using their engineering skills.

Survival analysis includes two-sample tests that can be useful for comparing products, processes, environments, customers, before vs. after, and so on. Relative risk models can be used for evaluating alternatives and characterizing conditions. They are finally being used in reliability analyses [Bagdonavicius and Nikulin, George and Felthauser, Krivtsov et al] and in MTBF and reliability prediction [George 2003]. Nonparametric estimators should be adopted for field reliability estimation, because assuming mathematically convenient failure rate functions hides potentially actionable information. Also, using nonparametric estimators avoids having to defend assumptions.
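As a concrete illustration of the nonparametric estimators mentioned above, here is a minimal sketch of the Nelson-Aalen cumulative failure rate estimator together with the Kaplan-Meier (product-limit) survivor estimate for right-censored age-at-failure data. The ages and censoring flags are hypothetical, invented only for this example, and the function name is illustrative.

```python
def nelson_aalen(ages, failed):
    """Return [(age, cumulative failure rate, survivor estimate)] at each failure age.

    ages   -- age at failure or at censoring for each unit
    failed -- True if the unit failed at that age, False if it was censored
    """
    events = sorted(zip(ages, failed))
    at_risk = len(events)
    cum_rate = 0.0   # Nelson-Aalen: running sum of failures / number at risk
    survivor = 1.0   # Kaplan-Meier: running product of (1 - failures / number at risk)
    estimates = []
    i = 0
    while i < len(events):
        age = events[i][0]
        leaving = sum(1 for a, _ in events if a == age)        # failures plus censorings
        failures = sum(1 for a, f in events if a == age and f)
        if failures:
            cum_rate += failures / at_risk
            survivor *= 1.0 - failures / at_risk
            estimates.append((age, cum_rate, survivor))
        at_risk -= leaving
        i += leaving
    return estimates

# Hypothetical data: ages in months; False marks a unit still working when last observed.
ages = [2, 3, 3, 5, 7, 8, 8, 10]
failed = [True, True, False, True, False, True, True, False]

estimates = nelson_aalen(ages, failed)
for age, cum_rate, survivor in estimates:
    print(f"age {age:2d}: cumulative failure rate {cum_rate:.4f}, survivor {survivor:.4f}")
```

No distribution is assumed: the estimates change only at observed failure ages, and censored units contribute by staying in the at-risk count until they leave observation.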
The stochastic integral-martingale representation for failure rate intensity helps prove asymptotic properties of estimators. It can help prove properties of estimators described in the next section.

The reliability field can benefit from sharing information in the same way as the biomedical field. Some industry paranoia regarding failures prevents potentially useful comparisons and early problem detection. Some organizations try to collect field failure data, including Telcordia, the FAA, the Government Industry Data Exchange Program (GIDEP), and the Reliability Analysis Center (RAC). Information should be shared as freely as possible and used to estimate age-specific reliability and failure rate functions. Companies that share reliability information with customers have a competitive advantage over those that do not.

CAT, NMR, PET, and epileptiform foci mapping encephalography methods may be useful for condition monitoring and for security scanning. (Security, like dependability, is a lay synonym of reliability.) CAT and NMR estimate local density within a body, within 3D pixels. CAT, NMR, and radiation back-scattering measurement methods are already used in baggage inspection and other security devices. Epileptiform foci mapping searches for the source of the characteristic electrical signal of epilepsy from electroencephalograms. Reliability applications might need to identify and locate the source of an electrical signal among all those measurable at a surface.

Biomedical people can learn from reliability analyses too. Software reliability [Beizer] may be applicable to human cognition and (mis)behavior. Fatigue failure (Miner's rule) and step stress models may be useful in biomedical statistics to represent wearout and changes in treatment. Stress-strength, FTA, and load sharing models don't seem to have applications to biomedical survival analysis, but perhaps readers will recognize some potential use.
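For readers from the biomedical side, Miner's rule, mentioned above, is a linear cumulative-damage model: each stress level uses up a fraction of life equal to the cycles applied at that level divided by the cycles to failure at that level, and failure is predicted when the fractions sum to one. A minimal sketch follows; the cycle counts are hypothetical and the function name is illustrative.

```python
def miner_damage(cycles_applied, cycles_to_failure):
    """Cumulative damage under Miner's rule: sum of n_i / N_i over stress levels."""
    return sum(n / big_n for n, big_n in zip(cycles_applied, cycles_to_failure))

# Hypothetical loading history at three stress levels.
cycles_applied = [20_000, 5_000, 1_000]       # n_i: cycles experienced at each level
cycles_to_failure = [100_000, 20_000, 2_000]  # N_i: cycles to failure at each level alone

damage = miner_damage(cycles_applied, cycles_to_failure)
print(f"cumulative damage = {damage:.2f}")
print("failure predicted" if damage >= 1.0 else "life remaining")
```

A step stress model has the same flavor: exposure accumulated at one stress level is carried forward when the level changes, which is the feature that might represent changes in treatment.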
Relevation (good-as-old), renewal (good-as-new), and hysterical (somewhere in between) statistics for recurrent processes apply to humans as well as to products.

Preventive maintenance is widely practiced on humans, but not optimally [Aven and Jensen]. Opportunistic maintenance, driven by reliability, can also be applied in medical treatment. Opportunistic maintenance is the replacement of other parts at the same time as replacement of a failed part, because as long as the system is being repaired, the incremental cost of repairing or replacing other parts is less than the cost of waiting. Long ago, some surgeons removed your appendix if your abdomen was open. There may be other opportunities.

Reliability engineers are needed in medicine because of the complex machinery used in hospitals, clinics, and laboratories and because of the importance of safety. (I enjoyed working on clinical laboratory equipment reliability and contributed the optimal dilution for a cell counter and the discriminant algorithm for WBC diff [white blood cell type percentages].) As device implants become more common, perhaps reliability statistics will become part of biomedical survival analysis.

Potential Advances for Both Groups

Random samples of age-at-failure data, censored or not, make statistical analysis convenient. Suppose you only have ships (births, installed base, production, etc.) and returns (deaths, complaints, repairs, spares sales, etc.) counts by accounting interval. Ships and returns (warranty repairs) counts (Table 1) are statistically sufficient to make nonparametric estimates of reliability and failure rate functions, without tracking humans, parts, and products by serial number or name [George 1999].

Table 1. Monthly Ships and Warranty Repair Counts for 1988 Ford V-8 460 Drivetrain, August-December 1987

Month     Shipments   Repairs
Aug-87          213        18
Sep-87        6,439       797
Oct-87        6,951     1,291
Nov-87        5,715     1,511
Dec-87        5,390     1,791

Figures 4 and 5 show nonparametric estimates of monthly failure-rate functions for age at first warranty repair and for ages between subsequent warranty repairs. They were estimated by least squares () under the assumption that repairs were a renewal process in which the age at first warranty repair has a different failure rate function than the rest. Maximum likelihood estimators are also available [George 2002]. Figure 4 shows that almost 16% fail immediately and another 4% shortly thereafter, probably in the hands of new owners. Figure 5 shows that 13% fail immediately after repair, indicating that the problem wasn't fixed. (The 1988 Ford V-8 460 engine was the last carbureted engine Ford made. It had drivability problems.)

Figure 4. Monthly Warranty Failure Rate Functions for Age at First Warranty Failure. (Axes: t = age at first warranty failure, months; probability.)

Figure 5. Monthly Warranty Repair Rates for Ages Between Subsequent Repairs. (Axes: t = age between warranty failures, months; probability.)

These estimates have biomedical applications only for epidemics (hantavirus), new diseases (AIDS), transplants, and other transient processes, because steady state birth and death counts contain no information about age at failure. Without the linkage between births and deaths, there is no age-at-failure information, except during the transient portion of stochastic processes. Using population estimates from transient infection and death counts relieves the need for controls; this avoids the ethical dilemma of killing controls.

Estimates from ships and returns counts are applicable throughout industry for field reliability, because many products and most service parts survive their useful lives.
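Why ships and returns counts carry age-at-failure information during the transient period can be sketched with a deliberately simplified model. This is not the author's least squares or maximum likelihood estimator: it assumes first failures only (no renewal process), so that expected returns in period j are the sum over earlier shipment cohorts of ships times the probability of failing at the corresponding age. The shipment counts are those of Table 1; the age-specific failure probabilities are hypothetical, and the function names are illustrative.

```python
def expected_returns(ships, fail_prob):
    """Expected first-failure returns: r[j] = sum over i <= j of ships[i] * fail_prob[j - i]."""
    return [
        sum(ships[i] * fail_prob[j - i]
            for i in range(j + 1) if j - i < len(fail_prob))
        for j in range(len(ships))
    ]

def deconvolve(ships, returns):
    """Solve the triangular system above for fail_prob, one age at a time.

    Requires ships[0] > 0. No constraints or smoothing are applied.
    """
    probs = []
    for j in range(len(returns)):
        explained = sum(ships[i] * probs[j - i] for i in range(1, j + 1))
        probs.append((returns[j] - explained) / ships[0])
    return probs

ships = [213, 6439, 6951, 5715, 5390]           # monthly shipments from Table 1
assumed_prob = [0.05, 0.03, 0.02, 0.01, 0.01]   # hypothetical P(fail at age 0, 1, ... months)

returns = expected_returns(ships, assumed_prob)
recovered = deconvolve(ships, returns)
print([round(p, 6) for p in recovered])

# The same naive solve applied to the observed repair counts in Table 1.
naive = deconvolve(ships, [18, 797, 1291, 1511, 1791])
print([round(p, 3) for p in naive[:2]])
```

The round trip recovers the assumed probabilities without any serial-number tracking. Applied directly to the repair counts in Table 1, however, the unconstrained solve gives a value above 1 at the second age, which suggests why a constrained estimator and the renewal-process assumption described above are needed for real data.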
Generally accepted accounting principles require ships and returns counts for industrial revenue and service cost accounting, and they're population data, so they contain no sample uncertainty.

Privacy, important to Congress and the public, can be preserved by use of birth and death counts for survival analysis. The law requires that unique identification numbers be issued to all citizens, health care practitioners, health care institutions, employers, and insurance companies to facilitate linking of event information. This leads to public concern over medical privacy. The NHTSA requires that complaints be filed by vehicle identification number, personal identification, and detailed crash information. This led to a stalemate between the NHTSA, directed by Congress in the TREAD Act to collect the data, and insurance companies, which objected to providing private information. Linkages to identify age at failure are not necessary for reliability analysis and some survival analyses.

Conclusions

Biomedical statisticians and reliability engineers can learn much from each other despite different objectives, and there are some things to be learned by both. Survival analysts have thoroughly plowed the field of estimation and hypothesis testing from random, censored samples, helping the fortunate few reliability engineers who have age-at-failure and survivor data. Other reliability engineers can make do with analysis of ships and returns counts.

Free Nonparametric Estimates

For free nonparametric estimates of field reliability, send ships and returns counts to , or enter them in . The author will send back nonparametric estimates of field reliability and failure rate functions, free of charge.

References

1. Aven, Terje, and Uwe Jensen, "Stochastic Models in Reliability," Springer, Berlin, 1999.
2. Bagdonavicius, Vilijandas, and Mikhail Nikulin, "Accelerated Life Models: Modeling and Statistical Analysis," Chapman and Hall/CRC, Boca Raton, FL, 2002.
3. Beizer, Boris, "Black-Box Testing: Techniques for Functional Testing of Software and Systems," Wiley, New York, 1995.
4. Cox, D.R., "Regression Models and Life Tables (with discussion)," J. Roy. Statist. Soc. Ser. B, Vol. 34, pp. 187-220, 1972.
5. George, L.L., "Field Reliability Without Life Data," ASA, SPES Newsletter, , pp. 13-14, 1999.
6. George, L.L., "Renewal Distribution Estimation Without Renewal Counts," INFORMS, San Jose, , 2002.
7. George, L.L., "Credible Reliability Prediction," ASQ Reliability Division monograph, 2003.
8. George, L.L., and Mark Felthauser, "Reliability of Firestone Tires," , 2002.
9. Kalbfleisch, John D., and Ross L. Prentice, "The Statistical Analysis of Failure Time Data," Wiley, Hoboken, New Jersey, 2002.
10. Klein, John P., and Melvin L. Moeschberger, "Survival Analysis: Techniques for Censored and Truncated Data," Springer-Verlag, New York, 1997.
11. Krivtsov, V.V., D.E. Tananko, and T.P. Davis, "Regression approach to tire reliability analysis," Rel. Eng. and System Safety, Vol. 78, pp. 267-273, 2002.

Acknowledgements

Ned Criscimagna suggested that I write this article in retaliation for my suggesting that he should. Mark Felthauser, a real statistician, reviewed the article and suggested additions. Eva Langfeldt, , did a wonderful job of copyediting.

About the Author

Larry George is a Certified Reliability Engineer and Fellow of the American Society for Quality. His education includes a B.S. in Engineering, an M.B.A., and an M.S. and Ph.D. in industrial engineering and operations research with a minor in probability and statistics from the University of California at Berkeley. He taught for 11 years; worked for 11 years at Lawrence Livermore National Laboratory; and has more than 20 years' experience in industry, including several years for Abbott Laboratories' Diagnostics Division.