Achieving High Reliability
By: Larry H. Crow, Ph.D.
This article discusses issues related to the concepts presented by the author in his paper, "On the Initial System Reliability," Proceedings 1986 Annual Reliability and Maintainability Symposium, pp. 115-119, Las Vegas, NV.
Introduction
In today's environment of reduced development budgets, faster times to market, reduced test time, and the wide use of non-developmental items, attaining high reliability for complex systems is very difficult but critical, because reliability affects not only system performance but also operating and support costs. Achieving high reliability is receiving increased interest and was addressed at the June 9-10, 2000 Committee on National Statistics Workshop on Reliability Issues for DoD Systems, held at the National Academy of Sciences, Washington, DC.
In the author's almost 30 years in the reliability field, he observed why high reliability requirements are not met. He then identified eight principles that consistently yield very high reliability systems. He found that applying these principles did not increase the overall costs of the reliability program but, as implementation was refined and better understood, actually decreased them by more than one third. This methodology simply integrates sound reliability and parts management strategies in early design. This article provides a discussion of the issues being addressed and an overview of the eight principles.
Discussion
Data presented at the June workshop showed that many of today's new DoD systems fall short of their operational reliability requirements based on the results of Operational Testing (OT). Typically, OT occurs after Development Testing (DT), and the OT reliability estimate is often the total test time divided by the total number of observed failures. This is an estimate of the MTBF, which is often the reliability parameter of choice but may not be the most meaningful one (particularly for systems consisting of both a repairable and a nonrepairable segment).
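The OT point estimate described above can be sketched in a few lines of Python. This is only an illustration: the function name and the test numbers are hypothetical and are not taken from the workshop data.

```python
def ot_mtbf_point_estimate(total_test_hours: float, observed_failures: int) -> float:
    """Return the simple MTBF point estimate: total test time / observed failures."""
    if total_test_hours <= 0:
        raise ValueError("total_test_hours must be positive")
    if observed_failures <= 0:
        # With zero observed failures this simple ratio is undefined.
        raise ValueError("observed_failures must be at least 1")
    return total_test_hours / observed_failures


# Hypothetical example: 2,000 OT hours with 8 observed failures.
example_mtbf = ot_mtbf_point_estimate(2000.0, 8)
print(f"Estimated MTBF: {example_mtbf:.0f} hours")  # prints "Estimated MTBF: 250 hours"
```

The estimate says nothing about which segment of the system the failures came from, which is one reason MTBF alone may not be the most meaningful parameter.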
Generally, DT objectives include evaluating performance and reliability parameters, identifying problems, and making management and engineering decisions on the incorporation of corrective actions. The measure of reliability during DT is a function of several factors, including the total amount of test time and the value of reliability at the beginning of this testing. All else being equal, the less test time available, the higher the initial reliability must be to reach the reliability goal at the end of DT. Another potentially significant factor is delaying corrective actions until late in testing, say just prior to OT. Assessing the impact of these delayed fixes on the total system reliability is generally not straightforward and requires the use of a proper projection methodology. A commonly used method overestimates the system reliability after delayed fixes and may indicate that the reliability meets requirements when in fact it does not. These are just some of the reasons that the OT reliability may be lower than desired.
The author recognized that reduced development budgets and schedules make a corresponding reduction in DT inevitable. Consequently, the initial reliability going into testing must be higher than it has been in the past. Initial reliability is the result of the early, basic engineering design effort for reliability and is the input into DT. Initial reliability is a key metric and a measure of how effective the basic reliability tasks, such as requirements analysis, trade studies, modeling, allocation, prediction, failure modes and effects analysis (FMEA), and parts and vendor selection, have been. What has the initial reliability been in the past? For an answer, we look at studies conducted by the US Army Materiel Systems Analysis Activity (AMSAA).
In 1984 and 1990, AMSAA conducted two studies of Army systems. Both studies showed that the ratio of the initial MTBF to the final mature system MTBF was about 1:4 to 1:3. If the final mature reliability was 1000 hours MTBF, for example, then the initial reliability coming out of early design and entering DT was an average of 250 to 300 hours MTBF. These studies also showed that the average amount by which a failure mode's rate of occurrence was reduced because of corrective actions, the Effectiveness Factor (EF), was about 70%. That is, corrective actions increase the failure mode MTBF by an average factor of about 3.3 (equivalently, a problem failure mode's rate is reduced, on average, by 70%). If we couple this fact with the concept that a valid reliability prediction estimates the inherent, mature failure rate of a failure mode, then we have a basis for a reliability growth (RG) metric in design.
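The link between the 70% EF and the factor of about 3.3 can be written out explicitly. The sketch below assumes a constant failure rate for the failure mode, so that its MTBF is the reciprocal of its rate; the subscripts "before" and "after" are notation introduced here for illustration and do not appear in the AMSAA studies.

```latex
\lambda_{\text{after}} = (1 - \mathrm{EF})\,\lambda_{\text{before}}
\quad\Longrightarrow\quad
\frac{\mathrm{MTBF}_{\text{after}}}{\mathrm{MTBF}_{\text{before}}}
  = \frac{\lambda_{\text{before}}}{\lambda_{\text{after}}}
  = \frac{1}{1 - \mathrm{EF}}
  = \frac{1}{1 - 0.7} \approx 3.3
```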
The 1:4 to 1:3 ratio may have been acceptable several years ago when more DT test time was available. Today, with much less test time, such a ratio will not allow the potential reliability to be reached. A logical solution to the problem of low OT results is to increase the initial reliability in early design. This can be accomplished by performing the same reliability tasks noted earlier, but somewhat differently, and by applying a metric that estimates the initial reliability during design. If the initial reliability is actually improving in design, as it should be, then reliability growth in design is occurring. With a higher initial reliability, the RG program in DT has a better chance of success. This integrated RG is the framework for the reliability management principles presented later.
The framework is based on systematically managing failure mode identification, classification, analysis, and mitigation. In this paper, a failure mode is a problem and a cause. A given problem can result from multiple causes, and corrective action takes place on a problem and cause basis. The 70% EF noted earlier applies to corrective action on a problem and cause, and relates to this definition of failure mode.
At the end of this discussion are listed the eight principles, or features, of a reliability program that the author has applied and that consistently yield high-reliability, state-of-the-art systems. Many others have successfully applied these basic principles, and examples were given at the June 2000 Workshop. In the author's applications, the programs had a preliminary design phase (PDP) and a final design phase (FDP). The PDP included requirements analysis, trade studies, preliminary modeling, allocation, redundancy analyses, preliminary prediction, preliminary FMEA, and preliminary parts and vendor selection. In the FDP, more complete reliability tasks were conducted. Also during this phase, potential problem failure modes were systematically identified and mitigated (RG in design), with metrics to track progress using the FMEA.
In the FMEA, failure modes are classified as either a potential A mode or a potential B mode. A failure mode is a B mode until it meets the criteria for an A mode, which are:
1. There is a numerical calculation of the failure rate.
2. This numerical calculation is substantiated by at least one of the following: analysis, analogy, or test.
3. The failure rate is acceptable given the system reliability requirement or goal.
In an ideal situation, if all failure modes are classified as A modes, then the overall system failure rate should equal or be close to the reliability prediction. An investigation may show that a failure mode classified as a potential B mode does not need any improvement; that is, it satisfies the A mode criteria. Otherwise, corrective action (i.e., reselection of a part or vendor, added redundancy, mitigation of environmental stress, better materials, wider design tolerances, or manufacturing changes) may be needed. The amount of actual improvement will depend on the EF. If an average EF such as 0.7 is applied, then the failure rate assigned to the B mode before investigation is about 3.3 times the predicted failure rate. Of course, any assigned EF or B mode failure rate deemed appropriate can be applied. This approach would estimate the initial MTBF to be somewhere between 30% and 100% of the predicted MTBF, depending on the percent of A modes assigned to the system in the FMEA. As potential B modes are mitigated, this estimate would increase. This is the reliability growth metric discussed under Principle 5. See Figure 1.
Figure 1. Metric is estimate of MTBF when design improvement is stopped
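A minimal sketch of how such an initial-reliability metric might be computed from an FMEA is shown below. It is not the author's implementation. It assumes a simple series system (mode failure rates add) and constant failure rates, and the class name, function name, mode names, and failure rates are all hypothetical. A modes keep their predicted rate; potential B modes are assigned the predicted rate divided by (1 - EF).

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    predicted_rate: float  # predicted (mature) failure rate, failures per hour
    is_a_mode: bool        # True once the three A mode criteria are met


def initial_mtbf(modes, ef=0.7):
    """Estimate the initial MTBF, assuming a series system (rates add)."""
    if not 0.0 <= ef < 1.0:
        raise ValueError("ef must be in [0, 1)")
    total_rate = 0.0
    for mode in modes:
        if mode.is_a_mode:
            total_rate += mode.predicted_rate
        else:
            # Potential B mode: inflate the predicted rate by 1 / (1 - EF).
            total_rate += mode.predicted_rate / (1.0 - ef)
    return 1.0 / total_rate


# Hypothetical FMEA extract; predicted rates sum to 0.001 per hour (1000 h MTBF).
fmea = [
    FailureMode("power supply", 0.0005, is_a_mode=True),
    FailureMode("connector", 0.0003, is_a_mode=False),
    FailureMode("cooling fan", 0.0002, is_a_mode=False),
]
predicted_mtbf = 1.0 / sum(m.predicted_rate for m in fmea)
metric = initial_mtbf(fmea)
print(f"Initial MTBF metric: {metric:.0f} h "
      f"({100 * metric / predicted_mtbf:.0f}% of predicted)")
```

Under these assumptions the metric equals the predicted MTBF when every mode is an A mode and falls to (1 - EF) times the prediction when every mode is a potential B mode, which matches the 30% to 100% range given in the text for EF = 0.7. Reclassifying a B mode as an A mode raises the metric.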
Recommended Principles for a Successful Reliability Program
1. Requirements and Failure Definition Analysis. Requirements must be fully understood and determined to be attainable using current technology. Also, failure should not be confused with performance. The most meaningful reliability metric may not be MTBF, particularly when the system consists of both a repairable and non-repairable segment. A useful metric in this case is Probability of Mission Success, which considers mission length, total calendar time for the mission, reliability, repair time, and total spares allocation.
2. Integrated Reliability Growth Testing (IRGT). In many cases, reliability problems surface early in engineering tests. The focus of these tests is typically on performance and not reliability. Therefore, if a problem is not brought to the attention of reliability engineering, it may not be corrected early, when correction is most cost effective and affects schedules least. IRGT simply piggybacks reliability failure reporting, in an informal fashion, on all engineering tests. When a potential reliability failure is observed, reliability engineering is notified.
3. Closed Loop Failure Mode Mitigation Process. Usually, patent or potential reliability problems can be mitigated by the reliability engineer and product design team. Sometimes, however, a potential problem needs special management attention due to high risks, costs, criticality, additional screening or testing, or schedule impact. Without a focused approach, resolution can be time consuming and expensive. For these critical problems, a reliability mitigation process at the system engineer and program manager level can greatly decrease the time and cost of a solution. In this process, the concern is documented and assigned to the appropriate person for resolution, in much the same way as a failure is reported. In this case, however, the failure has not yet occurred. The process is most effective when managed by the program manager, system engineer, and reliability manager.
4. The Parts and Vendor Selection Process Addresses Reliability. Parts and vendor selection must be conducted in early design, since most of the parts used in early design are used in the final design. Immediately after the design engineer has determined that a part can perform the desired function, it should undergo a parts and vendor selection process for reliability assessment and approval. That is, the part must be shown to provide the function and be reliable before being approved for use. Because vendor quality for that part affects the part's reliability, this process should evaluate the part and vendor combination, not just the part. Depending on the information obtained, this assessment will lead to a reliability estimate based on data or a prediction using, for example, the RAC's PRISM® model. Only if this estimate is consistent with the allocation or expectations will the part be formally approved. If it is not, some mitigation options are to consider other parts or vendors, subject the part to additional screening, incorporate redundancy, or accept more risk. When the mitigation options require additional resources or potential redesign, increase cost or schedule, or are high risk, then others (e.g., program manager, systems engineer, product team leader, design engineer, reliability engineer) may need to get involved. To do this efficiently, the closed loop failure mode mitigation process is used. This process focuses on a solution and risk management in a documented and effective manner.
5. Manage the Failure Mitigation Process with the FMEA and Calculate Metric. The FMEA should be used to identify the system's failure modes and also to identify potential problem areas affecting reliability and safety. This purpose can easily be met by adding a column to a standard FMEA sheet and classifying each failure mode according to its A-B mode status. In the PDP, the assigned reliability value for each failure mode would typically be the allocation or prediction. In the FDP, the A modes are given their calculated values and the potential B modes' failure rates are increased using the EF approach or some other method. These estimates are put into the system reliability model to generate an estimate of the initial reliability metric. As more B modes are mitigated, the metric will increase.
6. Formal Reviews for Reliability. A formal review for reliability should be held at least once in both the PDP and the FDP. These reviews give the latest reliability status of the system and baseline the reliability model to the current design. This assures that the reliability model and engineering design agree and that earlier proposed design changes (e.g., redundancy) are reflected in the current design. In the PDP, the allocations and early predictions are presented; in the FDP, the initial reliability metric is presented.
7. Link Design and Reliability Testing. For many complex systems, the initial reliability at the end of the FDP may still fall short of the requirement. This possibility should be planned for and a target minimum value of the initial reliability established. This value should be linked to the available amount of follow-on reliability DT. If it is not, and the initial reliability is too low or the allocated test time is too short, then the requirement will probably not be met.
8. Apply Valid Methodology for Assessing the Reliability in Testing. The caution here concerns estimating the impact of a group of delayed corrective actions on the reliability of the system. A common approach in practice significantly overestimates the actual reliability. If this approach is applied, the reliability may appear much higher than it actually is, contributing to lower than expected operational reliability. A valid methodology for estimating the reliability improvement due to delayed corrective actions exists and is recommended.