The Journal of the Reliability Analysis Center, First Quarter - 2005

RAC is a DoD Information Analysis Center Sponsored by the Defense Technical Information Center

INSIDE
Reliability Theory Explains Human Aging and Longevity
Form, Fit, Function, and Interface - An Element of an Open System Strategy
System Level Clues for Detailed Part Issues
RMSQ Headlines
Future Events
From the Editor
PRISM Column
Upcoming June Training

Reliability Analysis of Avionics in the Commercial Aerospace Industry

By: Jin Qin, Bing Huang, Joerg Walter, Joseph B. Bernstein, Michael Talmor
Reliability Engineering, University of Maryland, College Park

Abstract

The reliability of avionics using commercial-off-the-shelf (COTS) items and products is a concern for the aerospace industry. The results of collecting and analyzing field return records of avionics are documented in this article. Our analysis shows that the exponential distribution is still appropriate for describing the life of most avionics manufactured over the past 20 years. Results also show that failure rates decrease at the introduction of products. An increasing trend in failure rate can be noted for systems made after 1994, suggesting the need for further investigation.

Introduction

Microelectronic systems built with COTS are now widely used in the aerospace industry and are becoming increasingly important. After the Department of Defense (DoD) changed the acquisition process (one formerly based on military standards and specifications) in 1994, military-specified avionics have become rare. The aerospace industry's use of microelectronics is shrinking as a percentage of the entire market, so it must face the reality of a commercially-driven market. Commercial integrated circuit (IC) products' life cycles are decreasing to 2-4 years [Reference 6]. In contrast, the aerospace industry assumes the life of a Line Replaceable Unit (LRU) is more than 10 years. This discrepancy will worsen given the continuing advancement in functionality and speed in the microelectronic industry.

To understand the impact of technology advancement on avionics, we needed to find out what had happened in field operation. Field records of return-for-service of avionics in the past 20 years were collected and analyzed, and the results are documented herein.

Data Collection

Return-for-service records were collected from two major suppliers of avionics. Several types of systems were included, such as a flight control system, autopilot, flight director system, and symbol generator. Records from company A include eight systems dating from 1982 to 2002. Company B's records are dated from 1997 to 2002 and include one system. Most of these records include the unit serial number, date sold, return for service date, replaced IC types, and quantities.

Some of the original data were found to be insufficient for analysis. We compiled the original records to weed out and discard the useless ones; the remaining records had sufficient data to support statistically significant conclusions. We also made some assumptions to facilitate the statistical analysis. Our assumptions were as follows.

1. Systems were grouped by type and the year of "date sold," assuming they were manufactured and used in the same year.
2. For units with multiple returns, only the first return was counted and analyzed.
3. It is assumed that all ICs replaced in service have experienced failure. This assumption may have caused us to overestimate the number of failures.
4. Censor time, the time at which the status of each system is checked, is set to April 30, 2002.

Based on these assumptions, a C language program was used to select the useful records, check the end status of the systems, and calculate the service hours. The method used to calculate the service hours follows Figure 1, in which the different periods between the sold date and the return-to-supplier date are shown.

[Figure 1. Time Line of Field Records. SD: Sold Date; BISD: Begin In Service Date; FD: Failure Date; RTSD: Return To Supplier Date.]

The P1 interval between SD and BISD includes delivery time and installation time. The unit service period (days), P2, is:

P2 = (RTSD - SD) - P1 - P3     (1)

P3 is the return time from customers to suppliers. If the unit had not failed by the censor time, the service period runs from BISD to the censor time. Generally, there are only SD and RTSD in the raw records. P1 and P3 are estimated based on the information given by the suppliers. Different suppliers have different P1 and P3. Once P2 is found, the unit service hours are calculated from:

ServiceHours = Hon * P2

Hon is the system's power-on hours per day. Different companies give different values of Hon.

Data Analysis

Analysis of System Records from Company A. There are records for about 21,535 systems sold between August 17, 1982 and December 30, 2001 from company A. Categorized by system type and year of "date sold," there are 87 groups of data, which include 9 groups with zero failures and 6 groups with one failure. The statistical analysis process and results follow.

Probability plotting. The Weibull distribution, the generally accepted lifetime distribution in the microelectronic industry, is used to analyze the service hours. To verify its use, we drew probability plots and calculated the correlation coefficient (CC) for the Weibull distribution and the lognormal distribution respectively (groups with 0 or 1 failure are omitted). Results show that the CC of the Weibull distribution was greater than the CC of the lognormal distribution for 42 groups.
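The probability-plot comparison described above can be illustrated with a short sketch. It is not the authors' program: it assumes a complete sample (every unit failed, so no censored units), uses Benard's median-rank approximation for the plotting positions, and the function names are hypothetical. The field data in this study include censored units, which would need adjusted ranks.

```python
import math
from statistics import NormalDist


def median_ranks(n):
    """Benard's approximation to the median rank of the i-th of n failures (assumed plotting position)."""
    return [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def plot_correlations(failure_hours):
    """Return (Weibull CC, lognormal CC) for a complete sample of failure times."""
    times = sorted(failure_hours)
    ranks = median_ranks(len(times))
    log_t = [math.log(t) for t in times]
    # Weibull plot: ln(t) against ln(-ln(1 - F)) is a straight line for Weibull data.
    y_weibull = [math.log(-math.log(1.0 - f)) for f in ranks]
    # Lognormal plot: ln(t) against the standard normal quantile of F.
    y_lognormal = [NormalDist().inv_cdf(f) for f in ranks]
    return pearson(log_t, y_weibull), pearson(log_t, y_lognormal)
```

A group would be counted in favor of the Weibull distribution when the first returned value exceeds the second; comparing the Weibull CC against a tabulated critical CC is a separate step.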
The CCs of the Weibull distribution were also compared with the 90% critical CC [Reference 1] to determine whether the distribution is appropriate. Results show that the CC was greater than the given critical CC for 62 of the 72 groups.

Parameter estimation. The parameters of the Weibull distribution are estimated using the maximum likelihood estimation (MLE) method. The histogram of the estimated shape parameters is shown in Figure 2. Most of the shape parameters fall between 0.6 and 1.1.

Exponential distribution verification. Although the wide use of the exponential distribution has been questioned for a long time, it is unwise to blindly accept or reject it. The exponential distribution was theoretically shown to be the appropriate failure distribution for complex systems by R.F. Drenick [Reference 5]. He stated that "Under some reasonably general conditions, the distribution of the time between equipment failures tends to the exponential as the complexity and the time of operation increases; and somewhat less generally, so does the time up to the first failure of the equipment."

[Figure 2. Weibull Shape Parameter Histogram]

In the microelectronic industry, due to the advance of technology, chips are becoming more and more complex following Moore's law. Additionally, avionics have complex structures. A flight director system may consist of 460 digital ICs, 97 linear ICs, 34 memories, 25 ASICs, and 7 processors. The number of components in such a system is huge. These components can fail through external failure mechanisms caused by random factors such as electrical overstress, electrostatic discharge, and other environmental and human interaction, and through intrinsic failure mechanisms, which include dielectric breakdown, electromigration, and hot carrier injection.
These failure modes combine to form a constant failure rate process. Abernethy [Reference 2] stated that as the number of failure modes mixed together increases to five or more, the Weibull shape parameter will tend toward one unless all the modes have the same shape parameter and similar scale parameters.

Some recent research that focuses on intrinsic wearout failure mechanisms lends support to the exponential distribution. Degraeve [Reference 4], Stathis [Reference 7], and Alam [Reference 3] pointed out that the Weibull shape parameter of oxide breakdown is thickness dependent and goes to unity for ultra-thin oxides. As the Weibull shape parameter approaches 1, the intrinsic wearout becomes more random and the device times to failure become statistically indistinguishable from a random pattern of times to failure.

[Figure 2 axes: Weibull shape parameter Beta (horizontal) versus frequency (vertical).]

We use the likelihood ratio test to verify the hypothesis of the exponential distribution, which is the special case of the Weibull distribution with the shape parameter equal to 1. With the significance level set to 0.05, the likelihood ratio test is performed for the systems grouped by year using the following steps.

a. $H_0: \beta = 1$; $H_1: \beta \neq 1$
b. Calculate the statistic $T = 2(\hat{L} - \hat{L}_0)$, where $\hat{L}$ is the global maximum log likelihood and $\hat{L}_0$ is the constrained maximum log likelihood at $\beta = 1$.
c. If $T < \chi^2(0.95, 1)$, accept $H_0$; otherwise reject $H_0$.

The hypothesis test results show that the exponential distribution is acceptable for 56 groups.

System failure rate results. Since the exponential distribution is appropriate for most of those systems, we use MLE to calculate the failure rate. The systems' failure rates vs. year are shown in Figure 3.
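The likelihood ratio test steps and the exponential failure-rate estimate described above can be sketched as follows. This is a minimal illustration rather than the authors' program: it assumes right-censored data supplied as two lists of service hours (failed units and units still working at the censor time), finds the Weibull shape by bisection on the profile likelihood, and hard-codes 3.841 as the 0.95 quantile of the chi-square distribution with one degree of freedom. All function names are hypothetical.

```python
import math

CHI2_95_1DF = 3.841  # 0.95 quantile of chi-square with 1 degree of freedom


def exponential_rate(failures, censored):
    """Exponential MLE: number of failures divided by total accumulated service hours."""
    return len(failures) / (sum(failures) + sum(censored))


def likelihood_ratio_test(failures, censored):
    """Test H0: beta = 1 (exponential) against H1: beta != 1 (Weibull)."""
    failures, censored = list(failures), list(censored)
    # Rescale by the longest time so powers cannot overflow; T is unaffected by the scale.
    scale = max(failures + censored)
    f = [t / scale for t in failures]
    x = f + [t / scale for t in censored]
    r = len(f)
    sum_log_f = sum(math.log(t) for t in f)

    def profile_loglik(beta):
        # Log likelihood with the scale parameter replaced by its MLE for this beta.
        eta_pow = sum(t ** beta for t in x) / r
        return r * math.log(beta) - r * math.log(eta_pow) + (beta - 1.0) * sum_log_f - r

    def score(beta):
        # Derivative of profile_loglik / r; strictly decreasing in beta.
        w = [t ** beta for t in x]
        weighted = sum(wi * math.log(t) for wi, t in zip(w, x))
        return 1.0 / beta + sum_log_f / r - weighted / sum(w)

    # Bisection for the root of the score; assumes the root lies between 0.001 and 100.
    lo, hi = 1e-3, 100.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    beta_hat = 0.5 * (lo + hi)

    t_stat = 2.0 * (profile_loglik(beta_hat) - profile_loglik(1.0))
    return {"beta": beta_hat, "T": t_stat, "accept_exponential": t_stat < CHI2_95_1DF}
```

For a group in which the test accepts the exponential distribution, `exponential_rate` gives the failure rate per service hour; multiplying by 1E6 gives the units used on the failure rate axes of Figure 3.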
The data show that, with the exception of systems 2 and 8, the systems' failure rates decrease at the beginning of use. For systems 4, 5, 6, and 7, whose use spanned the 1980s and 1990s, an increasing trend in failure rate can be noted around 1994 and afterward. System 1 shows the same trend around 1997.

[Figure 3. Failure Rates of Systems with 90% Confidence Intervals. Eight panels, System 1 through System 8; horizontal axis: Year; vertical axis: Failure rate (E-6).]

Analysis of system records from company B. Records from company B are dated between January 14, 1988 and October 27, 2001. Since the population size and the number of failures in each year are small, we analyze the records in a moving five-year window using the exponential distribution to get better results. We also analyze all records of company A in the same way to compare the change in reliability. Figure 4 shows the overall failure rates of systems from companies A and B (the year on the X-axis is the midpoint of the moving five-year period). From this result, we determined that there is an increasing trend of failure rate after 1994 for systems from both companies.

[Figure 4. Overall System Failure Rates from Company A and B (90% Confidence Interval). Horizontal axis: Year; vertical axis: Failure Rate (E-6); series: Company A, Company B.]

IC failure analysis. We can get the type and number of replaced ICs from company A's records but only the number of failed ICs from company B's records. Since no information was available for tracing down the failure mechanism, we simply calculated the overall failure rate of all ICs from company A and from company B. For company B's records, we used the exponential distribution to analyze the moving five-year IC failure data because of the small number of failures in each year. Company A's IC failure records were analyzed in the same way. The results are shown in Figure 5.

[Figure 5. Overall IC Failure Rate (90% Confidence Interval). Horizontal axis: Year (sold); vertical axis: Failure Rate (FIT); series: Company A, Company B.]

Summary

Field data of microelectronic systems in the aerospace industry were collected and analyzed. Based on our statistical analysis results, we found that:

1. The exponential distribution is appropriate for most avionics lifetime analyses because the IC chips and system structure are becoming more complex.
2. System reliability generally improves in the first several years after introduction and drops off later. It follows very well the known phenomena of "infant mortality" or "learning curve."
3. According to the analysis, the failure rate of several systems increases, almost constantly, after 1994-1996. The increase is not large and is not statistically significant. No single specific reason for this trend can be postulated because of the lack of information. It could be due to design problems in replacing military-grade components with commercial ones, total redesign to introduce new technologies, the inherent reliability of commercial components, or manufacturing problems in introducing new packaging standards for avionic systems, etc.

This work presents some practical observations. Further investigation, including tracking of the failure data and failure analysis, is suggested.

For Further Reading

1. Abernethy, R.B., The New Weibull Handbook, Third Edition, page 3-3. North Palm Beach, Florida: R.B. Abernethy, 1998.
2. Abernethy, R.B., Ibid, page 3-14.
3. Alam, M., B. Weir, and P. Silverman, "A future of function or failure? (CMOS gate oxide scaling)," IEEE Circuits and Devices Magazine, Vol. 18, pages 42-48, March 2002.
4. Degraeve, R., "New insights in the relation between electron trap generation and the statistical properties of oxide breakdown," IEEE Transactions on Electron Devices, Vol. 45, pages 904-911, April 1998.
5. Drenick, R.F., "The failure law of complex equipment," Journal of the Society for Industrial and Applied Mathematics, Vol. 8, pages 680-690, December 1960.
6. Hnatek, E.R., Integrated Circuit Quality and Reliability, 2nd Edition, Marcel Dekker, Inc., 1995.
7. Stathis, J.H., "Percolation models for gate oxide breakdown," Journal of Applied Physics, Vol. 86, pages 5757-5766, November 1999.

About the Authors

Jin Qin is a PhD candidate in Reliability Engineering at the University of Maryland, College Park. His research topics include reliability testing, reliability data analysis, and microelectronic system reliability estimation. He holds a Master of Science in Reliability Engineering from the University of Maryland and a Master of Engineering in Management Science and Engineering from the University of Science and Technology of China.
Bing Huang is currently a PhD candidate in the Reliability Engineering program at the University of Maryland. He received a B.S. in Mining Engineering from the University of Science and Technology of Beijing, and an M.S. in Nuclear Engineering from Tsinghua University.

Dr. Joerg D. Walter is an Assistant Professor of Aerospace Engineering at the Air Force Institute of Technology (AFIT). He holds a PhD in Reliability Engineering from the University of Maryland (2003) and a Master of Science in Systems Engineering from AFIT (1997).

Dr. Joseph B. Bernstein is an Associate Professor of Reliability Engineering at the University of Maryland, College Park. Professor Bernstein's interests lie in several areas of microelectronics reliability and physics of failure research including system reliability modeling, gate oxide integrity, radiation effects, MEMS and laser programmable metal interconnect. Research areas include thermal, mechanical, and electrical interactions of failure mechanisms of ultra-thin gate dielectrics, next generation metallization, and power devices. Dr. Bernstein is currently a Fulbright Senior Scientist at Tel Aviv University in the Department of Electrical Engineering, Physical Electronics, where he started a Maryland/Israel Joint Center for Reliable Electronic Systems.

Michael Talmor is a Certified ASQC Quality (CQE) and Reliability (CRE) Engineer. He holds Master's degrees in Reliability and Quality Assurance from the Technion - Israel Institute of Technology in Haifa and in Electrical Engineering, Automatics and Telemechanics from the Electrotechnical Institute, Saint Petersburg, Russia. Michael is currently a Visiting Researcher at UMD during his sabbatical leave from RAFAEL Ltd, Israel.

Reliability Theory Explains Human Aging and Longevity

Reprinted with permission of Dr. Leonid A. Gavrilov, Center on Aging, NORC/University of Chicago

Our bodies' backup systems don't prevent aging; they make it more certain.

This is one offshoot of a new "reliability theory of aging and longevity" by two researchers at the Center on Aging, National Opinion Research Center (NORC) at the University of Chicago.

The authors presented their new theory at the National Institutes of Health (NIH) conference "The Dynamic and Energetic Bases of Health and Aging" (held in Bethesda, NIH). Their theory of aging has been published by the "Science" magazine department on aging research, Science's SAGE KE ("Science of Aging Knowledge Environment").

The authors say, "Reliability theory is a general theory about systems failure. It allows researchers to predict the age-related failure kinetics for a system of given architecture (reliability structure) and given reliability of its components."

"Reliability theory predicts that even those systems that are entirely composed of non-aging elements (with a constant failure rate) will nevertheless deteriorate (fail more often) with age, if these systems are REDUNDANT in irreplaceable elements. Aging, therefore, is a direct consequence of systems redundancy."

In their paper, "The quest for a general theory of aging and longevity" (Science's SAGE KE [Science of Aging Knowledge Environment] for 16 July 2003; Vol. 2003, No. 28, 1-10), Leonid Gavrilov and Natalia Gavrilova offer an explanation of why people (and other biological species as well) deteriorate and die more often with age.

Interestingly, the relative differences in mortality rates across nations and gender decrease with age: although people living in the U.S. have longer life spans on average than people living in countries with poor health and high mortality, those who reach the oldest-old ages in those countries die at rates roughly similar to the oldest-old in the U.S.

The authors explain that humans are built from the ground up, starting off with a few cells that differentiate and multiply to form the systems that keep us operating. But even at birth, the cells that make up our systems are full of faults that would kill primitive organisms lacking the redundancies that we have built in.

"It's as if we were born with our bodies already full of garbage," said Gavrilov. "Then, during our life span, we are assaulted by random destructive hits that accumulate further damage. Thus we age."

"At some point, one of those hits causes a critical system without a back-up redundancy to fail, and we die."

As the authors put it, "Reliability theory also predicts the late-life mortality deceleration with subsequent leveling-off, as well as the late-life mortality plateaus, as inevitable consequences of redundancy exhaustion at extreme old ages."
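The claim that redundancy alone produces aging can be illustrated with the simplest redundant architecture from reliability theory. The sketch below is an assumption chosen for illustration, not the authors' full model: a system of n identical, independent, irreplaceable elements in parallel, each with constant failure rate k, which fails only when the last element fails. The function name and parameter values are hypothetical.

```python
import math


def parallel_hazard(t, n, k):
    """Failure rate at age t of a system of n parallel elements, each with constant rate k."""
    if t == 0.0:
        # A single element can fail at once; a redundant system cannot.
        return k if n == 1 else 0.0
    q = -math.expm1(-k * t)  # probability that one element has failed by age t
    density = n * k * math.exp(-k * t) * q ** (n - 1)
    survival = 1.0 - q ** n  # the system works while at least one element works
    return density / survival


if __name__ == "__main__":
    for age in (1.0, 10.0, 50.0):
        print(age, parallel_hazard(age, n=1, k=0.1), parallel_hazard(age, n=3, k=0.1))
```

With n = 1 the failure rate stays at k for every age, so a single non-aging element does not age. With n = 3 the failure rate starts at zero, rises with age, and levels off just below k once the redundancy is effectively exhausted, which mirrors the late-life plateau described in the quotation above.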