Reliability growth is the improvement of a product's reliability over time (hence the term, growth) through learning about the deficiencies of the design and taking action to eliminate or minimize the effect of these deficiencies.
When the subject of reliability growth is discussed, it is usually reliability growth testing that is the focus of discussion. This focus is neither surprising nor altogether unwarranted. In general, testing is a necessary and standard part of development, needed to prove the merit of a design and the validity of the models and analytical tools used to develop the design. In regard to reliability growth testing (RGT), much work has gone into developing various statistical models for the purpose of planning and tracking reliability growth achieved through testing. Given the high cost of testing, the attention paid to the reliability growth test process is natural.
A commonly used model for reliability growth is the Duane model, named after its developer, J. T. Duane, who published a paper in 1964 on the subject. In his paper, Duane stated:
"Time variations of reliability presents (sic) problems only in the early stages of development. Once any specific equipment design has been fully developed and used for some time in service, its reliability stabilizes at a relatively fixed value. However, during test and initial application, deficiencies are often detected which require design changes to improve reliability. These changes are the source of the problem. They complicate the preparation of statistically valid analysis of equipment performance. They introduce conflict between the designer and the reliability engineer. The designer prefers to ignore any failures which he feels he has corrected, while the reliability engineer views these failures as the only meaningful data available. Resolution of the conflict is possible if a technique can be devised for use of the data from early failures and operating experience while taking proper account of improvements resulting from design changes. Since the basic process involved is one of learning through failures, knowledge of a generally applicable learning curve would provide a means of measuring and predicting reliability during this period of change. All of the available test data would be effectively utilized."
Table 1 provides additional information on the Duane model. Because of its simplicity, the Duane model is frequently used. At each failure, the accumulated test time is calculated and the cumulative failure rate (total failures/total test time) is plotted against it on log-log paper. The equation parameters (K and α) are then determined, often by fitting a straight line to the data points by least-squares analysis. With this information, the current failure rate and the time required to achieve a desired failure rate can be computed; a brief numerical sketch follows Table 1.
Table 1: Duane Model for Reliability Growth
Objective: Find failures during test and learn from those failures by redesigning to eliminate them.
Key Assumptions: The relationship between mean time between failure (MTBF) and test time will be a straight line when plotted on log-log paper. Requires that design changes (fixes) be incorporated immediately after a failure and before testing resumes.
Parameters:
α, the growth rate (the change in MTBF divided by the time interval over which the change occurred)
K, a constant which is a function of the initial MTBF
T, the test time
Equations:
Cumulative MTBF: MTBF_c = (1/K)T^α
Instantaneous MTBF: MTBF_i = MTBF_c/(1 - α)
Test Time: T = [(MTBF_i)(K)(1 - α)]^(1/α)
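As a rough illustration of the fitting procedure described above, the following Python sketch estimates K and α by a least-squares fit on the log-log scale and then applies the Table 1 equations. The failure times and the 500-hour MTBF goal are assumed values invented for the example, not data from the article.

```python
# Duane model fit - a minimal sketch using assumed, illustrative failure times (hours).
# Cumulative failure rate: lambda_c = K * T**(-alpha), so on log-log paper
# ln(lambda_c) = ln(K) - alpha * ln(T) is a straight line.
import numpy as np

failure_times = np.array([35.0, 110.0, 220.0, 410.0, 700.0, 1050.0])  # accumulated test hours at each failure
n_failures = np.arange(1, len(failure_times) + 1)                     # cumulative failure count

lam_c = n_failures / failure_times                                    # cumulative failure rate at each failure
slope, intercept = np.polyfit(np.log(failure_times), np.log(lam_c), 1)

alpha = -slope            # growth rate
K = np.exp(intercept)     # constant related to the initial MTBF

T = failure_times[-1]                       # total accumulated test time
mtbf_c = (1.0 / K) * T**alpha               # cumulative MTBF (Table 1)
mtbf_i = mtbf_c / (1.0 - alpha)             # instantaneous (current) MTBF (Table 1)

mtbf_goal = 500.0                           # assumed MTBF requirement, hours
T_goal = (mtbf_goal * K * (1.0 - alpha))**(1.0 / alpha)  # test time to reach the goal (Table 1)

print(f"alpha = {alpha:.2f}, K = {K:.4f}")
print(f"Current MTBF ~ {mtbf_i:.0f} h; ~{T_goal:.0f} test hours needed for a {mtbf_goal:.0f} h MTBF")
```

The least-squares fit is the traditional graphical approach; when confidence interval estimates are needed, the AMSAA/Crow model discussed next is generally used.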
Another popular growth model was developed at the U.S. Army Materiel Systems Analysis Activity (AMSAA) by Dr. Larry H. Crow. This model is based on the assumption that reliability growth is a non-homogeneous Poisson process. That is, the number of failures in an interval of time (or cycles, miles, etc., as appropriate) is a random variable distributed in accordance with the Poisson distribution, but the parameters of the Poisson distribution change with time. It is an analytical model which permits confidence interval estimates to be computed from the test data for current and future values of reliability (MTBF) or failure rate. In addition, the model can be applied to either continuous or discrete reliability systems, single or multiple systems, and tests which are time or failure truncated. Some details are provided in Table 2, and a brief numerical sketch follows the table. Complete details for using Crow's model are contained in MIL-HDBK-189, Reliability Growth Management.
Table 2: AMSAA/Crow Growth Model
Objective and Key Assumptions: Same as for the Duane model.
Parameters:
λ, the initial failure rate (1/MTBF)
β, the growth rate
T, the test time
Equations:
Cumulative Failure Rate: λ_c = λT^(β-1)
Instantaneous Failure Rate: λ_i = λβT^(β-1)
Test Time: T = [λ_i/(λβ)]^(1/(β-1))
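As a numerical sketch of how the Table 2 quantities are obtained in practice, the following Python fragment computes the standard maximum-likelihood point estimates of β and λ for a single system on a time-truncated test (the estimators given in MIL-HDBK-189). The failure times and total test time are assumed values for illustration only.

```python
# AMSAA/Crow model - minimal sketch for a single, time-truncated test.
# Failure times (cumulative hours) and total test time are assumed, illustrative values.
import math

failure_times = [35.0, 110.0, 220.0, 410.0, 700.0, 1050.0]  # cumulative hours at each failure
T_total = 1200.0                                             # total (truncation) test time, hours
N = len(failure_times)

# Maximum-likelihood point estimates (MIL-HDBK-189 covers confidence intervals
# and goodness-of-fit tests as well).
beta = N / sum(math.log(T_total / t) for t in failure_times)  # growth parameter
lam = N / T_total**beta                                        # scale parameter (related to the initial failure rate)

lam_c = lam * T_total**(beta - 1.0)          # cumulative failure rate (Table 2)
lam_i = lam * beta * T_total**(beta - 1.0)   # instantaneous failure rate (Table 2)

print(f"beta = {beta:.2f}, lambda = {lam:.4f}")
print(f"Current failure rate = {lam_i:.4f} per hour (MTBF ~ {1.0 / lam_i:.0f} h)")
```

A β of less than 1 indicates that the failure rate is decreasing with test time, i.e., that reliability growth is occurring.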
Reliability Growth Test Planning
RGT requires careful planning to avoid problems when the data are evaluated. Some factors to consider when evaluating a planned growth program using any growth model are listed below (a brief planning sketch follows the list):
The literature commonly cites test lengths of between 5 and 25 times the predicted MTBF
Growth rates of 0.5 or higher are rare (very high growth rates indicate that the test planners are either being too optimistic regarding the effectiveness of design changes or expect a lot of failures)
If the initial MTBF is very low, it may be that the equipment is entering growth testing too soon (i.e., the pure design process is being terminated prematurely)
On average, design changes are 70% effective in correcting a problem
The period of time needed to verify the effectiveness of a design change should be at least 3 times the mean time between occurrences of the failure mode being corrected (i.e., if the mode occurs every 50 hours, the verification time for the design change should be at least 150 hours)
The starting point of the growth curve can greatly influence the calculated growth rate during the early phases of the growth analysis
The growth rate experienced is a function of the design team's ability to identify and implement effective corrective actions.
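As a rough sketch of how these rules of thumb might be applied during planning, the Duane relationships of Table 1 can be used to estimate the required test length and check it against the commonly cited 5- to 25-times-MTBF range. All inputs below are assumed planning values, not figures from the article.

```python
# Growth-test planning check - a minimal sketch using the Table 1 (Duane) relationships.
# All inputs are assumed planning values for illustration, not values from the article.
mtbf_initial = 50.0   # cumulative MTBF (hours) assessed at the start of growth testing
t_initial = 100.0     # accumulated test hours at which mtbf_initial was assessed
alpha = 0.35          # planned growth rate (rates of 0.5 or higher are rarely realistic)
mtbf_goal = 300.0     # required instantaneous MTBF, hours

K = t_initial**alpha / mtbf_initial                            # from MTBF_c = (1/K) * T**alpha
t_required = (mtbf_goal * K * (1.0 - alpha))**(1.0 / alpha)    # Table 1 test-time equation

# Rule-of-thumb check: test lengths of 5 to 25 times the predicted MTBF
ratio = t_required / mtbf_goal
print(f"Planned test length: {t_required:.0f} h ({ratio:.1f} x the {mtbf_goal:.0f} h MTBF goal)")
if not 5.0 <= ratio <= 25.0:
    print("Outside the commonly cited 5x-25x range; revisit the growth rate or the goal.")
```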
Growth Without Testing?
For products such as one-of-a-kind satellites, testing of the types normally associated with reliability growth is seldom possible due to the high cost and limited (if any) availability of test articles. Can reliability growth be achieved for such systems? To answer this question, consider how a design evolves and is finally given form in a prototype model. Generally, iterations of the design are needed because the various performance requirements often conflict; optimizing the design to meet one requirement can result in the design failing to meet another requirement. Balancing the requirements is a demanding task. Iteration is also needed because not all analyses can be done simultaneously. Consequently, the design may be changed as the result of a particular analysis, only to be changed again when the results of a subsequent analysis are available. As iterations take place, the design is refined, and each revised design is (hopefully) an improvement over its predecessor. Some of the analyses conducted during the design process directly address the reliability of the design. So, the reliability of the design improves as successive design changes are made based on analysis.
With the preceding discussion in mind, reliability growth can be broadly defined as:
The process by which the reliability of an initial design is improved. Improvement results as the design is iterated, either on the basis of analytical evaluation and assessment or on test results (failures).
The concept of reliability growth suggested by this broader definition is illustrated in Figure 1. Ideally, when the product enters testing, all deficiencies have been eliminated through design changes made as a result of analyses. In practice, some design changes will be required as the result of undiscovered design deficiencies causing failures during development testing. A specific type of development testing often dedicated to the reliability growth process is the RGT.
Figure 1. Reliability Growth Begins with Design Iteration, Not Test
(Note that the pure design process and the design-test process are not purely separate processes occurring sequentially. They often overlap, although the pure design phase does begin before any testing begins).
Several key points are made in the stated definition, and each one will be addressed separately.
Many initial designs are extrapolations of previous designs. Some are truly "new" with no predecessor. In either case, the initial design is subjected to close scrutiny before any prototypes or breadboards are built for testing and long before the final design review. This scrutiny is in the form of analyses used to evaluate and assess the design. Many types of reliability analyses are used to evaluate and assess the reliability of the product, including failure modes and effects analysis, fault tree analysis, sneak circuit analysis, worst case circuit analyses, and finite element analysis. As weaknesses in the initial design are uncovered through analyses, the design is changed and the analyses repeated. The design-analyze-redesign process can be considered the pure design process.
This iterative pure design process normally continues until the designers believe that further design iteration on the basis of analyses alone, without some testing of prototypes, breadboards, etc., would add little or no value. This point is critical. Designers don't want to continue to spend scarce resources iterating the design if the potential payoff is small. If, however, the pure design process is stopped too soon, designers may implicitly rely too much on the development test process to find design shortcomings (a trial and error approach to design).
For large "one-shot" devices, prototypes and test articles of the entire product are too expensive to build. Test articles of subsystems and critical components may be built, but seldom is the total product tested to any great degree. In such cases, the need for a "complete" pure design process is obvious. In any case, building and testing hardware before the pure design process is complete is not prudent.
Development and Reliability Growth Testing
Ideally, the pure design process would be perfect, with no testing required to improve reliability to meet the requirement. However, analytical tools, models, and engineering judgement are not perfect, so some development testing is always needed to fill in the gaps in our knowledge and understanding. As performance deficiencies are observed and failures are uncovered, design engineers should take two distinct actions:
Examine the models and tools that were used, and revise, refine, or otherwise improve them. The improved tools and models can be used to improve the next design process.
Improve the design based on information gained through the analysis of test data. In the case of failures, each should be thoroughly analyzed.
To properly analyze failures, the following information regarding the failure must be recorded:
The conditions (environmental, operational, etc.) under which failure occurred
How the failure was discovered (what were the symptoms)
The effects of the failure
The probable consequences of the failure in actual use
The analysis itself must provide information on the underlying failure mechanism, the probability of recurrence in actual use, and the corrective actions that can be taken to prevent recurrence or minimize the effects of failure. If design changes are identified as the needed corrective action, reliability growth will occur when and if effective changes are incorporated. Often, improvements in reliability are claimed on the basis of planned changes that have yet to be validated. Making decisions based on planned changes is risky. Changes must be incorporated and the effectiveness of the changes in correcting the problem verified.
Dedicated Reliability Growth Testing
RGT is a type of development test. Traditionally, a special test is dedicated as a reliability growth test. Failures that occur during the test are analyzed and corrective actions are developed to prevent or mitigate the effects of recurrence. The time and resources available for such tests are limited. Many other development tests are conducted during a development program, including functional, environmental, and proof testing. Nothing in the underlying philosophy of the reliability growth process prohibits the analysis of failures that occur during such development tests. All types of testing are potential sources of failure information if appropriate data are collected to allow for the thorough analysis of the failure. Indeed, it is essential that failures from all types of development testing be analyzed to validate the design and the tools and models used to create that design.
It is in estimating the level of reliability being achieved that the use of failure data from all developmental testing presents some difficulty. Combining the data from dissimilar tests is a statistically complex issue. One way to avoid the issue is to use the failures from all testing for engineering purposes (i.e., validate the design and the tools and models used to create that design) and to base estimates of reliability only on the data from dedicated growth testing.
Use of Reliability Growth Test Results
The primary purpose of growth testing is to validate the design and the tools and models used to create that design. At times, managers have used RGT as the sole basis for determining compliance with contractual specifications. This tendency to change the purpose of growth testing is strongest when qualification or verification testing is greatly reduced or eliminated to save money or time. Using RGT for determining compliance can affect the way in which such testing is approached. From an engineering perspective, failures are not "bad," because they provide valuable information to the designer regarding the adequacy of the design. Through the design-test process the designers can refine the engineering and design tools and models they use and improve the design. When RGT is used to determine contractual compliance, it becomes a pass-fail test, and failures are an unwelcome occurrence. Debates as to whether a failure is "relevant," or whether an event was really a failure, are apt to become a normal part of the failure analysis process. The original purpose of the testing can become obscured, the motivation to uncover problems compromised, and the real value of the testing lost.
It is likely that RGT will continue to serve the two purposes just discussed. To ensure the original purpose is not totally lost, the ground rules of the testing must be well defined and agreed upon long before testing begins.
For Further Study
Crow, Dr. Larry H., "Reliability Growth Projection from Delayed Fixes," Proceedings, Annual Reliability and Maintainability Symposium, 1983, pp. 84-89.
Duane, J.T., "Learning Curve Approach to Reliability Monitoring," IEEE Transactions on Aerospace, Vol. 2, No. 2, 1964.
Meth, Martin A., "Practical 'Rules' for Reliability Test Programs," ITEA Journal of Test and Evaluation, Vol. 14, No. 4, Dec 93/Jan 94.
"Programmes for Reliability Growth," IEC 1014, 1989.
"RIAC Blueprints," Volumes 1 and 6, Reliability Analysis Center, Rome, NY 1996.
"Reliability Growth - Statistical Test and Estimation Methods," IEC 1164, 1995.
"Reliability Test Methods, Plans and Environments for Engineering, Development, Qualification and Production," MIL-HDBK-781A, 1 April 1996.
"Reliability Toolkit: Commercial Practices Edition," Reliability Information Analysis Center, Rome, NY, 1995.
Selby and Miller, "Reliability Planning and Management (RPM)," Paper presented at ASQC/SRE Seminar, Niagara Falls, NY, September 26, 1970.
MIL-HDBK-189, "Reliability Growth Management," February 13, 1981.
About the Author
* Note: The following information about the author(s) is the same as what was on the original document and may not be correct anymore.
Ned H. Criscimagna is a Senior Engineer with IIT Research Institute (IITRI). At IITRI, he has been involved in projects related to Defense Acquisition Reform. These have included a project for the Department of Defense in which he led an effort to benchmark commercial reliability practices. He led the development of a handbook on maintainability to replace MIL-HDBK-470 and MIL-HDBK-471, and the update to MIL-HDBK-338, "Electronic Reliability Design Handbook." Before joining IITRI, he spent 7 years with ARINC Research Corporation and, prior to that, 20 years in the United States Air Force. He has over 32 years of experience in project management, acquisition, logistics, availability, reliability, and maintainability.
Mr. Criscimagna holds a Bachelor's degree in Mechanical Engineering from the University of Nebraska-Lincoln, a Master's degree in Systems Engineering from the Air Force Institute of Technology, and he did post-graduate work in Systems Engineering and Human Factors at the University of Southern California. He completed the U.S. Air Force Squadron Officer School in residence, the U.S. Air Force Air Command and Staff College by seminar, and the Industrial College of the Armed Forces correspondence program in National Security Management. He is also a graduate of the Air Force Instructors course and has completed the ISO 9000 Assessor/Lead Assessor Training Course. Mr. Criscimagna is a member of the American Society for Quality (ASQ) and a Senior Member of the Society of Logistics Engineers (SOLE). He is a Certified Reliability Engineer, a Certified Professional Logistician, chairs the ASQ/ANSI Z-1 Dependability Subcommittee, is a member of the US TAG to IEC TC56, and is Secretary for the G-11 Division of the Society of Automotive Engineers.