Risk Management and Reliability
By: Ned H. Criscimagna
Introduction
Risk management is one of the critical responsibilities of any manager. The term "risk management" is used by managers and analysts in a number of diverse disciplines. These include the fields of statistics, economics, psychology, social sciences, biology, engineering, toxicology, systems analysis, operations research, and decision theory.
Risk management means something slightly different in each of the disciplines just mentioned. For social analysts, politicians, and academics it is managing technology-generated macro-risks that appear to threaten our existence. To bankers and financial officers, it is usually the application of techniques such as currency hedging and interest rate swaps. To insurance buyers and sellers, it is insurable risks and the reduction of insurance costs. To hospital administrators it may mean "quality assurance." To safety professionals, it is reducing accidents and injuries. For military acquisition managers, it means identifying, prioritizing, and managing the technical, cost, and schedule risks inherent in developing a new weapon system.
This article discusses how an effective reliability program can be a valuable part of an overall risk management effort for military system acquisition programs.
What is Risk?
The American Heritage® and Webster dictionaries define the term similarly. These definitions can be summarized as:
- Possibility of suffering harm or loss: Danger.
- A factor, course, or element involving uncertain danger: Hazard.
- The danger or probability of loss to an insurer.
- The amount that an insurance company stands to lose.
- One considered with respect to the possibility of loss to an insurer (e.g., a good risk).
A more general definition of risk, perhaps more appropriate for acquisition, is:
Risk is the chance that an undesirable event might occur in the future that will result in some negative consequences.
This latter definition of risk is often expressed as an equation (Reference 1):
Risk Severity = Probability of Occurrence x Potential Negative Impact
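As a minimal sketch of how this equation can be used to rank risks, the following Python fragment computes a severity score for a few risks. The risk names, probabilities, and impact values (in thousands of dollars) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    probability: float  # probability of occurrence, 0.0 to 1.0
    impact: float       # potential negative impact (hypothetical units: $K)

    @property
    def severity(self) -> float:
        # Risk Severity = Probability of Occurrence x Potential Negative Impact
        return self.probability * self.impact


def rank_risks(risks):
    """Return the risks ordered from highest to lowest severity."""
    return sorted(risks, key=lambda r: r.severity, reverse=True)


# Hypothetical risk register entries.
risks = [
    Risk("Immature sensor technology", 0.30, 5000.0),
    Risk("Late availability of test range", 0.60, 800.0),
    Risk("Part obsolescence at supplier", 0.10, 2000.0),
]

for risk in rank_risks(risks):
    print(f"{risk.name}: severity {risk.severity:.0f}")
```

A score of this kind is only as good as the probability and impact estimates behind it, so it is best read as a ranking aid rather than a precise measure.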
In the sense of the definition just given, risk is a part of everyday life. We all are faced with uncertainties in our lives, our careers, and our decisions. Since we cannot avoid such uncertainties, we must find ways to deal with them.
Similarly, the acquisition manager faces uncertainty concerning the technical challenges in designing a new system, and the cost and schedule estimates. Much effort is expended in trying to assess the technical challenges of a new program, in estimating the costs associated with that program, and in scheduling the program. In addition to the many constraints placed upon the manager, such as budgets, timeframes, and technical state-of-the-art, the uncertainties, or the risks, make the job of managing the program to a successful conclusion a difficult one.
Technical risk affects cost and schedule. As stated in an article in the Journal of Defense Acquisition University (Reference 2):
There is no dispute that there is a strong relationship between technical risk and cost and schedule overruns, nor is there any dispute that DoD Project Offices must assess and mitigate technical risk if they are to be successful. However, what must be kept in mind is that technical risk in-and-of-itself does not directly result in cost and schedule overruns. The moderating variable is the manner in which a project's contract is crafted and how deftly the contract is administered, given the nature of a project's technical risk.
As an aside, in his 1999 thesis (Reference 3) written for the Naval Postgraduate School, James Ross identified poorly defined requirements as one of the highest risks during pre-solicitation. Without very clearly defined, justifiable, and realistic requirements, the already difficult task of risk management during program execution is even more difficult.
What is Risk Management?
One can compare the job of program management to that of a ship captain directing the safe passage of the vessel through waters filled with reefs and icebergs, sometimes fighting currents and foul weather. The captain may have navigation charts but they may be inaccurate or incomplete. Electronic navigation and communication equipment may be affected by interference or sun spot activity, and the vessel's power plant and other essential systems may have failures. It takes a lot of experience and great skill to cope with such risks and still bring the ship safely to its destination.
How does the acquisition manager navigate through the risks associated with a weapon system program? Risk management has always been an implicit part of the manager's job. It had its origins in the insurance industry, where the tools of the trade included actuarial tables. In the 1970s and 1980s, risk management started to gain recognition in other industries. Initially, the focus of risk management in these industries was similar to that of the insurance industry: protecting against catastrophe. It later evolved to protecting against unaffordable potential losses. In the 1980s, total quality management became formalized as a means for improving the quality of business processes. Today, modern risk management is widely implemented as a means of protecting the bottom line and ensuring long-term performance. For the acquisition manager, this translates into bringing a program in on schedule, on budget, and with all technical requirements satisfied.
Risk management consists of the following activities:
- Identify concerns.
- Identify risks and risk owners.
- Evaluate the risks as to likelihood and consequences.
- Assess the options for accommodating the risks.
- Prioritize the risk management efforts.
- Develop risk management plans.
- Authorize the implementation of the risk management plans.
- Track the risk management efforts and manage accordingly.
A variety of tools are available to help the acquisition manager. Some have evolved from the actuarial approach used in the insurance business. Others have been developed and tailored to specific fields of endeavor. These include Risk Ranking Tools, Probabilistic Risk Assessment (Reference 4), and Risk Software. NASA has developed a formal probabilistic risk assessment program (References 5-6).
Another important tool that can help in managing risk is an effective reliability program. This article explores the ways in which a reliability program can contribute to effectively managing risk. Although the focus is on managing risk in a military acquisition program, the discussion applies equally to the acquisition of any new product.
Managing risk may begin with acquisition but continues through the life of a system. Table 1 outlines some of the risk management activities associated with the three major phases of the life cycle.
Table 1. Risk-Related Objectives of Life Cycle Phases (Based on a Table in Reference 7)
| Life Cycle Phase | Objective |
|---|---|
| Concept definition, design, and development | · Identify major contributors to risk. · Assess overall design adequacy. · Provide input for establishing normal and emergency procedures. · Provide input for evaluating acceptability of proposed hazardous facilities or activities. |
| Construction, production, installation, operation, and maintenance | · Gauge and assess experience to compare actual performance with relevant requirements. · Update information on major risk contributors. · Provide input on risk status for operational decision-making. · Provide input for optimizing normal and emergency procedures. |
| Disposal (decommissioning) | · Provide input to disposal (decommissioning) policies and procedures. · Assess the risk associated with process disposal (decommissioning) activities so that appropriate requirements can be effectively satisfied. |
What is an Effective Reliability Program?
An effective reliability program during system acquisition includes the following:
- A documented process for developing requirements that meet customer needs, are realistic, and achievable within budget and schedule constraints.
- Activities for designing for reliability. This includes the use of analytical techniques such as Failure Modes and Effects Analysis, Finite Element Analysis, Failure Analysis, and Root Cause Analysis. It also includes Robust Design Techniques.
- Testing conducted to identify failure modes; support reliability growth through improving reliability by identifying design weakness, analyzing these weaknesses, and changing the design to eliminate or minimize the effect of failures; and to validate whether or not the reliability requirements have been met.
- A strong quality assurance program during manufacturing and production to translate the design into an actual system with as much fidelity as possible. Such a program includes statistical process control and periodic testing to ensure that the last product off the line has the same level of reliability as the first.
An effective reliability program cannot stand alone; it must be incorporated into the overall systems engineering and design effort. Thus, activities such as configuration management and control, design trades, cost-benefits analysis, and so forth apply to the reliability effort as much as to other performance parameters. The results of reliability analyses and tests, in turn, can be useful to the safety analyst, maintenance planner, and logistics staff. They can also be used as part of an overall risk management effort.
In NAVSO P-3686 (Reference 8) dated October 1998, the importance of systems engineering to the management of technical risk is stated as follows.
The Integrated Process/Product approach to technical risk management is derived primarily from the Critical Process approach and incorporates some facets of the Product/work breakdown structure (WBS) approach. The systems engineering function takes the lead in system development throughout any system's life cycle. The purpose of systems engineering is to define and design process and product solutions in terms of design, test, and manufacturing requirements. The WBS provides a framework for specifying the technical objectives of the program by first defining the program in terms of hierarchically related, product oriented elements and the work processes required for their completion.
This emphasis on systems engineering, including processes and technical risk, along with process and product solutions, validates and supports the importance of focusing on controlling the processes, especially the prime contractor and subcontractors [sic] critical processes. Such a focus is necessary to encourage a proactive risk management program, one that acknowledges the importance of understanding and controlling the critical processes especially during the initial phases of product design and manufacture.
As an important part of the overall systems engineering approach, the reliability program can be a valuable contributor to the management of risk.
Reliability as a Risk Management Tool
Few people in acquisition debate the need for an effective reliability program as part of a risk management program. The question is how can such a program assist in risk management? Let us examine the various tools of reliability and see how they can be used to help identify, prioritize, and manage risk.
Analytical Reliability Tools.
These include the Failure Modes and Effects Analysis (FMEA), Fault Tree Analysis (FTA), Root Cause Analysis (RCA), Worst Case Analysis, and Sneak Circuit Analysis (SCA).
1. FMEA. The FMEA is an analytical tool used throughout the design process. It can be used to examine increasing levels of indenture, usually starting at the assembly level and progressing up. Briefly, the analysis is conducted to identify:
- The various functions of the item being analyzed.
- The possible ways that the item could fail to perform each of its functions (failure modes).
- The likelihood of each failure mode occurring.
- The effect, should the failure mode occur, on the item and system operation.
- The root cause of each failure mode.
- The relative priority of each failure mode.
- Recommended actions to reduce the likelihood, effect, or both of the failure modes, beginning first with the highest priority modes.
Different standards are available that define the FMEA process. Although they may differ in the details, they all include similar steps. One of these is some way of prioritizing failure modes. The old military standard (MIL-STD-1629) describes a Failure Modes, Effects, and Criticality Analysis (FMECA) in which failure modes are prioritized based on their relative criticality, a function of the probability of occurrence and severity of effect. The Automotive Industry Action Group (AIAG) standard uses a risk priority number, also based on probability of occurrence, severity of effect, and other factors.
Whether the FMEA process described in the AIAG standard, the FMECA process described in MIL-STD-1629, or the process as documented in other standards is used, they share the common element of prioritizing risk. As such, the FMEA/FMECA process is an excellent tool for identifying technical risk. By tracking recommended actions for high-risk failure modes, and ensuring that the recommended design (or other) changes are effective, the technical risk can be managed.
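The prioritization step can be illustrated with a short sketch. It assumes the commonly used form of the risk priority number (RPN), the product of severity, occurrence, and detection ratings on 1-to-10 scales. The rating scales and the worksheet entries below are hypothetical, and the details differ among standards and their editions.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    item: str
    mode: str
    severity: int    # 1 (negligible effect) to 10 (most severe effect)
    occurrence: int  # 1 (remote likelihood) to 10 (very likely)
    detection: int   # 1 (almost certain to be detected) to 10 (undetectable)

    @property
    def rpn(self) -> int:
        # Assumed classic formulation: RPN = severity x occurrence x detection
        return self.severity * self.occurrence * self.detection


def prioritize(failure_modes):
    """Order failure modes so the highest-RPN modes are addressed first."""
    return sorted(failure_modes, key=lambda fm: fm.rpn, reverse=True)


# Hypothetical FMEA worksheet entries.
worksheet = [
    FailureMode("Power supply", "Output voltage low", 7, 4, 3),
    FailureMode("Actuator", "Jams in extended position", 9, 2, 6),
    FailureMode("Connector", "Intermittent contact", 5, 6, 7),
]

for fm in prioritize(worksheet):
    print(f"RPN {fm.rpn:3d}  {fm.item}: {fm.mode}")
```

Tracking the RPN of each high-priority mode before and after a recommended action gives a simple record of whether the action reduced the technical risk.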
2. FTA. Whereas the focus of the FMEA is on a subassembly, an assembly, and so forth, the FTA focuses on a specific event, usually an undesired event (i.e., a failure). By creating what are known as fault trees, one can then trace all of the possible events or combinations of events that could lead to the undesired event.
Not only can the FTA directly contribute to identifying design risks, but it can also reduce risk during operation. By its very nature, the FTA can help in developing the diagnostics so necessary to the maintenance of a system. (The FMEA can also contribute to the development of diagnostics).
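When probabilities can be assigned to the basic events, a fault tree can also be evaluated quantitatively. The sketch below assumes statistically independent basic events and uses a small hypothetical tree (loss of cooling occurs if power is lost, or if both the primary and backup pumps fail). The event names and probabilities are illustrative only.

```python
from math import prod


def and_gate(probabilities):
    """Output event occurs only if every input event occurs (independent events)."""
    return prod(probabilities)


def or_gate(probabilities):
    """Output event occurs if at least one input event occurs (independent events)."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical basic-event probabilities for one mission.
p_power_loss = 0.001
p_primary_pump_fails = 0.01
p_backup_pump_fails = 0.02

# Both pumps must fail for pumping to be lost (AND gate).
p_pumping_lost = and_gate([p_primary_pump_fails, p_backup_pump_fails])

# The top event occurs if power is lost OR pumping is lost (OR gate).
p_top_event = or_gate([p_power_loss, p_pumping_lost])

print(f"P(loss of cooling) = {p_top_event:.7f}")
```

In this example the single event under the OR gate (power loss) dominates the top-event probability, which is the kind of insight that helps direct design attention and diagnostics.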
3. RCA. Given the limited funds and schedule facing each program manager, it is critical that time and money are not expended ineffectively. When high-risk failures occur during testing, design changes usually are required to reduce the risk to an acceptable level. This reduction is achieved by eliminating a failure mode, reducing the frequency with which the mode will occur, minimizing the effect of the mode, or some combination of these alternatives.
To arrive at an effective design change, the underlying cause of each failure must be determined. This underlying cause is not the failure mode. A failure mode, such as an open in a resistor, can be compared to a symptom. When we are ill, our doctor (we hope) does not treat our symptoms. Instead, the doctor tries to determine the underlying reasons for our illness. To do so requires experience, good judgment, and the use of diagnostic tools, such as X-ray, blood tests, and so forth.
Just as doctors search for the underlying cause of an illness, engineers must determine the underlying reason for a failure mode. These reasons are often referred to as failure mechanisms. They are the physics of failure. A primary tool used to identify these failure mechanisms is Root Cause Analysis (RCA). RCA is experience, judgment, and specific activities applied in combination. The activities include non-destructive and destructive inspections. Table 2 lists just a few of these activities conducted for RCA.
Table 2. Typical Activities for Determining Root Cause
- Physical examination of failed item
- Fracture mechanics
- Nondestructive evaluation
  - X-ray
  - Thermography
  - Magnetic flux
  - Penetrant dye
  - Computerized tomography
  - Ultrasonics
- Mechanical testing
- Macroscopic examination and analysis
- Microscopic examination and analysis
- Comparison of failed items with non-failed items
- Chemical analysis
- Finite element analysis
4. Worst-Case Analysis. As part of the reliability and design programs, analysis can be performed under worst-case conditions to assure adherence to the specification requirements, reducing the risk of failure due to inadequate operating margins.
The design is examined to identify circuit tolerance to parameter drift of critical parts that may lead to out-of-specification conditions over the system's operating life.
The analysis demonstrates sufficient operating margins for the operating conditions of the circuits, taking into consideration:
- Parts parameter variations
- Initial tolerances
- Temperature
- Aging effects
- Radiation effects
- Power input line voltage variations
- Operational mode effects
- Circuit parameter variations due to loading & stimulus
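One simple form of worst-case analysis evaluates a circuit at every combination of parameter extremes. The sketch below applies that idea to a resistive voltage divider. The nominal values, the combined tolerances (taken here to lump together initial tolerance, temperature, and aging), and the specification limits are all hypothetical. Checking only the corners finds the true extremes only when the output varies monotonically with each parameter, which holds for this circuit.

```python
from itertools import product


def divider_output(v_in, r1, r2):
    """Output voltage of a resistive divider: Vout = Vin * R2 / (R1 + R2)."""
    return v_in * r2 / (r1 + r2)


def corners(nominal, tolerance):
    """Low and high extremes of a parameter with a fractional tolerance."""
    return (nominal * (1.0 - tolerance), nominal * (1.0 + tolerance))


def worst_case(v_in, r1, r2):
    """Evaluate the divider at all corner combinations; return (min, max) output."""
    outputs = [
        divider_output(v, a, b)
        for v, a, b in product(corners(*v_in), corners(*r1), corners(*r2))
    ]
    return min(outputs), max(outputs)


# Hypothetical (nominal, combined fractional tolerance) pairs.
V_IN = (5.0, 0.05)      # supply line voltage variation
R1 = (10_000.0, 0.03)   # initial tolerance + temperature + aging
R2 = (10_000.0, 0.03)

SPEC_MIN, SPEC_MAX = 2.25, 2.75  # hypothetical specification limits, volts

v_out_min, v_out_max = worst_case(V_IN, R1, R2)
print(f"Worst-case output: {v_out_min:.3f} V to {v_out_max:.3f} V")
print(f"Margin to lower limit: {v_out_min - SPEC_MIN:.3f} V")
print(f"Margin to upper limit: {SPEC_MAX - v_out_max:.3f} V")
```

Extreme-value evaluation of this kind is conservative. Statistical approaches such as root-sum-square or Monte Carlo analysis are alternatives when the corner case is considered too pessimistic.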
5. SCA. Many system failures are not caused by part failure. Design oversights can create conditions under which a system either does not perform an intended function or initiates an undesired function. Such events in modern weapon systems can result in hazardous and even dire consequences. A missile, for example, may be launched inadvertently because of an undetected design error.
A significant cause of such unintended events is the "sneak circuit." This is an unexpected path or logic flow that, under certain conditions, can produce an undesired result. The sneak path may lie in the hardware or software, in operator actions, or in some combination of these elements. Even though there is no "malfunction condition," i.e., all parts are operating within design specifications, an undesired effect occurs. Four categories of sneak circuits are listed in Table 3.
Table 3. Categories of Sneak Circuits
| Category | Characteristics |
|---|---|
| Sneak Paths | Unexpected paths along which current, energy, or logical sequence flows in an unintended direction. |
| Sneak Timing | Events occurring in an unexpected or conflicting sequence. |
| Sneak Indications | Ambiguous or false displays of system operating conditions that may cause the system or an operator to take an undesired action. |
| Sneak Labels | Incorrect or imprecise labeling of system functions (e.g., system inputs, controls, displays, buses) that may cause an operator to apply an incorrect stimulus to the system. |
Sneak circuit analysis is a generic term for a group of analytical techniques employed to methodically identify sneak circuits in hardware and software systems. Sneak circuit analysis procedures include Sneak Path Analysis, Digital Sneak Circuit Analysis, and Software Sneak Path Analysis.
Reliability Testing.
1. Reliability growth testing. The term Reliability Growth Testing (RGT) usually refers to a process by which the following three objectives are achieved:
- Failures are identified and analyzed.
- The design is improved to eliminate the failures, reduce the probability of their occurrence, reduce the effects of the failures, or some combination of these alternatives.
- The progress being made in the growth process is tracked with quantitative estimates of the reliability. Models, such as the Duane and AMSAA models, are used for making the estimates.
When all three of these objectives are being pursued, the RGT is a formal program for achieving growth. Growth can also be achieved by analyzing the failures from any and all testing and developing design changes to address the failures. However, quantitative estimates of reliability may not be possible because of the statistical limitations of combining data from different tests. For our purposes, we will refer to this latter process as Test-Analyze-And-Fix (TAAF).
Whether RGT or TAAF is used, the process of identifying and addressing failures helps reduce technical risk. The RGT also provides a quantitative means of assessing the risk of not meeting a specific reliability goal within budget and schedule.
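As an illustration of quantitative growth tracking, the sketch below fits the Duane model, in which cumulative MTBF is assumed to follow K·t^α and therefore plots as a straight line on log-log axes. The fit is an ordinary least-squares line through the log-transformed data, and the failure times are hypothetical. A formal program would normally follow the procedures of its governing handbook rather than this simplified fit.

```python
import math


def fit_duane(failure_times):
    """Fit cumulative MTBF = K * t**alpha by least squares on log-log axes.

    failure_times: cumulative test times at which successive failures
    occurred, in ascending order. Returns (alpha, K).
    """
    xs = [math.log(t) for t in failure_times]
    # Cumulative MTBF at the n-th failure is (cumulative time) / n.
    ys = [math.log(t / n) for n, t in enumerate(failure_times, start=1)]
    count = len(xs)
    x_bar = sum(xs) / count
    y_bar = sum(ys) / count
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    alpha = sxy / sxx                      # growth rate (slope)
    k = math.exp(y_bar - alpha * x_bar)    # scale factor (intercept)
    return alpha, k


def instantaneous_mtbf(alpha, k, t):
    """Duane instantaneous (current) MTBF at cumulative test time t."""
    return k * t ** alpha / (1.0 - alpha)


# Hypothetical cumulative test hours at each failure.
failure_times = [35.0, 110.0, 250.0, 480.0, 900.0, 1500.0, 2400.0]

alpha, k = fit_duane(failure_times)
print(f"Growth rate alpha = {alpha:.2f}")
print(f"Estimated current MTBF at 2400 h = {instantaneous_mtbf(alpha, k, 2400.0):.0f} h")
```

Projecting the fitted line forward gives an estimate of the test time needed to reach a reliability goal, which is how the model supports an assessment of schedule and budget risk.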
2. Life testing (Reference 9). Every product and system consists of hundreds, perhaps thousands, or hundreds of thousands of parts. The system reliability depends on how these parts are connected together, how they are applied, and the reliability of each. Some parts may have little impact on system reliability due to their application. Others may be critical to the continued and safe operation of the system.
It is obvious that selecting the "right" parts is important. A "right" part is one that:
- Performs the correct function
- Has sufficient reliability
- Meets other criteria such as support, obsolescence, and cost
Determining whether a part has the requisite reliability for a given application is an element of part characterization. Life testing is one method for characterizing a part from a reliability perspective. By testing a sample of parts, recording the times to failure for the parts, and analyzing these times to failure, the reliability of the population represented by the sample can be estimated. As importantly, some insight into the category of failure (wearout, infant mortality, random failure) can be gained. One common technique for analyzing the times-to-failure data is Weibull analysis (Reference 10).
Using life testing, engineers can determine if the reliability of each part is adequate and, if not, what changes might be necessary to attain the required level. By ensuring that the "right" parts are selected, technical risk is reduced.
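A minimal sketch of a Weibull analysis of life-test data follows. It assumes a complete sample (every part tested to failure), uses Benard's approximation for median ranks, and fits a least-squares line in Weibull plotting coordinates. The failure times are hypothetical, and the choice of regression direction and the handling of suspended (unfailed) units are simplified away here.

```python
import math


def fit_weibull(times_to_failure):
    """Estimate Weibull shape (beta) and scale (eta) by median rank regression."""
    times = sorted(times_to_failure)
    n = len(times)
    xs, ys = [], []
    for i, t in enumerate(times, start=1):
        median_rank = (i - 0.3) / (n + 0.4)  # Benard's approximation
        xs.append(math.log(t))
        ys.append(math.log(-math.log(1.0 - median_rank)))
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta = sxy / sxx                       # slope of the Weibull plot
    eta = math.exp(x_bar - y_bar / beta)   # characteristic life
    return beta, eta


def reliability(t, beta, eta):
    """Weibull reliability: probability of surviving beyond time t."""
    return math.exp(-((t / eta) ** beta))


# Hypothetical times to failure (hours) from a life test of eight parts.
sample = [410.0, 620.0, 790.0, 930.0, 1080.0, 1250.0, 1460.0, 1800.0]

beta, eta = fit_weibull(sample)
print(f"Shape beta = {beta:.2f}, characteristic life eta = {eta:.0f} h")
print(f"Estimated reliability at 500 h = {reliability(500.0, beta, eta):.2f}")
```

The shape parameter gives the insight into failure category mentioned above: a value below 1 suggests infant mortality, a value near 1 suggests random failures, and a value above 1 suggests wearout.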
3. Validation testing. Validation testing helps confirm whether the efforts during design have paid off or not. It can be done at the part level or at higher levels of indenture. The types of test often used for validation are listed in Table 4.
Table 4. Commonly Used Tests for Validation
| Level of Indenture: Parts | Level of Indenture: Assembly and Higher |
|---|---|
| · Weibull testing · Attribute testing | · Sequential testing · Fixed length testing · Attribute testing |
RGT, TAAF, and part characterization are done as part of the design process, a process in which the design is changing. There is no pass-fail criterion for such tests; the objective is to identify and address weaknesses in the design from a reliability perspective. Validation testing, on the other hand, is ideally done on the "finished" design and is a "go-no-go" or "pass-fail" test. Validation testing provides the best measure of the level of reliability achieved before a full production decision is made.
When validation tests are included as part of the contractual requirements, they provide an added incentive to contractor and government alike to do the requisite engineering starting early in the program. Neither the customer nor the contractor wants the system to fail the validation test. Knowing that the test is a hurdle that must be passed provides incentive to control technical risk throughout the design process.
Production Reliability Testing. After the design is accepted, validation tests have been passed, and the production processes have been brought under control, Production Reliability Testing (PRT) may be conducted, especially when the production takes place over an extended period of time.
PRT is intended to detect any degradation in reliability performance that may result from changes in suppliers, design processes, configuration, and so forth. When degradation is detected, PRT provides an early warning so that corrective actions can be considered before a large number of systems are delivered to the customer. Many of the same techniques used for validation purposes can be used for PRT. Thus, PRT helps reduce the risk of sending systems with poor reliability to the customer. In addition, the earlier that problems are detected, the lower the cost to correct the problems.
Reliability Predictions and Assessments.
Realistic and fact-based assessment of the level of reliability being achieved at any point in time is an important element of a comprehensive reliability program. The need for quantitative measures has been alluded to several times in this article (likelihood of a failure mode, reducing the probability of occurrence of a failure, and tracking growth in a formal RGT program).
The author distinguishes between a prediction and an assessment in the following way. A prediction is usually thought of as the quantitative output of a model, such as a parts count model, reliability block diagram, or simulation. An assessment is an overall evaluation of the reliability based on the output of models, test results, engineering judgment, and consideration of any assumptions and the limitations of the models and testing. The subject is much too broad and involved to cover here in detail. Two points, however, are important to the subject of risk management.
1. Point estimates versus confidence intervals (Reference 11). Predictions based on models and testing can always be expressed as a single, or point, value. The output of some types of models, empirical models for example, can only be stated as point values. Point estimates are probably the most common way that technical people communicate predictions and assessments to management.
The problem with a point estimate is that it incorrectly conveys certainty. When one states that the MTTF of a part is 10,000 fatigue cycles or that the MTBF of a subsystem is 2,200 operating hours, it is often interpreted in the same way as stating that the part is 3.5 cm long or the subsystem weighs 450 pounds. The latter two measures are deterministic and, within the limits of our measurement equipment and changes in temperature and humidity, do not vary from day to day.
Reliability, however, is a probabilistic concept. Reliability testing consists of testing samples. Even when several samples are taken from a population with a given distribution with known parameters, the parameters obtained from the sample testing vary in value. Given that the distribution of the population is never known, the variation in results from testing different samples can be very large. Thus, in accepting a point estimate as "gospel," we run the risk of being optimistic. Worse yet, we have no idea what the level of risk may be.
For those cases where a statistical model or test is used, we can provide confidence bounds on the point estimate. A confidence bound can be either one-sided (i.e., we are X% confident that the interval from the lower bound to infinity includes the true reliability) or two-sided (i.e., we are X% confident that the interval from a lower bound to an upper bound includes the true value of reliability). Consider the following statements concerning an item for which the MTBF requirement is 950 hours.
- The estimate of reliability is 1,000 hours MTBF.
- The 90% confidence interval for reliability is 700 to 1,500 hours MTBF.
Which does a better job of indicating that the estimate is inexact and carries with it a risk of being wrong (i.e., the achieved MTBF is less than the requirement)? If the manager desires a smaller interval, he or she must either be willing to invest in additional testing or accept a higher risk of being wrong.
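The sketch below shows one way a two-sided confidence interval on MTBF can be computed from test results. It assumes a constant failure rate (exponential times between failures) and a test stopped at a fixed total time, for which the bounds are based on the chi-square distribution. To stay within the Python standard library, the chi-square quantile is obtained from the Wilson-Hilferty approximation rather than an exact routine. The test hours and failure count are hypothetical.

```python
from statistics import NormalDist


def chi2_quantile(p, dof):
    """Approximate chi-square quantile (Wilson-Hilferty approximation)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * c ** 0.5) ** 3


def mtbf_confidence_interval(total_hours, failures, confidence=0.90):
    """Two-sided interval on MTBF for a time-terminated test.

    Assumes exponentially distributed times between failures and at
    least one observed failure. Returns (lower, point_estimate, upper).
    """
    alpha = 1.0 - confidence
    point = total_hours / failures
    lower = 2.0 * total_hours / chi2_quantile(1.0 - alpha / 2.0, 2 * failures + 2)
    upper = 2.0 * total_hours / chi2_quantile(alpha / 2.0, 2 * failures)
    return lower, point, upper


# Hypothetical test result: 10 failures in 10,000 cumulative operating hours.
lower, point, upper = mtbf_confidence_interval(10_000.0, 10, confidence=0.90)
print(f"Point estimate: {point:.0f} h MTBF")
print(f"90% confidence interval: {lower:.0f} h to {upper:.0f} h MTBF")
```

For these hypothetical inputs the point estimate is 1,000 hours, while the 90% interval runs from roughly 590 to 1,840 hours. Against a 950-hour requirement, the point estimate alone looks comfortable, but the interval shows that a true MTBF well below the requirement is still consistent with the data.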
2. The Reliability Case (Reference 12). The Reliability Case is an example of an assessment. It is a progressively expanding body of evidence that a reliability requirement is being met. Starting with the initial statement of the requirements, the "Reliability Case" subsequently includes identified, perceived, and actual risks; strategies; and an Evidence Framework referring to associated and supporting information. This information includes evidence and data from design activities and in-service and field data as appropriate.
The Reliability Case provides an audit trail of the engineering considerations starting with the requirements and continuing through to evidence of compliance. It provides traceability of why certain activities have been undertaken and how they can be judged as successful. It is initiated at the concept stage, and is revised progressively throughout the system life cycle. Typically it is summarized in Reliability Case Reports at predefined milestones. Often, it is expanded to include maintainability (the R&M Case). The Reliability Case is developed using:
- Calculations
- Analyses
- Testing
- Expert opinion
- Simulation
- Information from any previous use
Each Reliability Case report lists and cross references the parent requirements in the Evidence Framework, against which the evidence is to be judged, and is traceable to the original purchaser's requirement. The body of evidence traces the history of reviews and updates of the reliability design philosophy, targets, strategy and plan, which keep these in line with the changing status of the original risks and any new or emerging risks. The status of assumptions, evidence, arguments, claims, and residual risks is then summarized and discussed. Clearly, the Reliability Case can be an important part of the overall technical risk management effort.
Conclusions
Risk is always with us; there is no escaping it. However, we can deal with risk and keep it at an acceptable level by managing it. We can manage risk by using a variety of tools to:
- Identify risks
- Evaluate them as to likelihood and consequences
- Assess the options for accommodating the risks
- Prioritize the risk management efforts
- Develop risk management plans
- Track and manage the risk management efforts
One of the tools available to the manager for specifically addressing technical risk is an effective reliability program. Many of the activities conducted to develop a system having the requisite level of reliability can directly contribute to the management of technical risk. These include:
- Analyses
- Tests
- Predictions and Assessments
- The Reliability Case
By implementing reliability as part of a systems engineering approach, the results of reliability-focused activities can contribute to the many other activities that take place in a system acquisition program. The systems engineering approach capitalizes on the synergy of coordinated and synchronized technical activities. By eliminating duplicative effort and making maximum use of the results of activities, the systems engineering approach by its very nature helps minimize risk. Reliability, implemented as part of the systems engineering approach, can play a significant role in risk management.
References
1. "Understanding Risk Management," CROSSTALK, Software Technology Support Center, Ogden ALC, UT, February 2005, page 4.
2. Bolles, Mike, "Understanding Risk Management in the DoD," Journal of Defense Acquisition University, Volume 10, 2003, pages 141-154.
3. Ross, James P., "A Risk Management Model for the Federal Acquisition Process," Naval Postgraduate School Thesis, 1999, DTIC Document ADA368012.
4. Foote, Andrew J., "Is Probabilistic Risk Assessment the Answer?," Journal of the Reliability Analysis Center (RAC), First Quarter, 2003.
5. "The Inherent Values of Probabilistic Risk Assessment," Second NASA Probabilistic Risk Assessment Workshop, June 19, 2001.
6. "Probabilistic Risk Assessment Procedures Guide for NASA Managers and Practitioners," Version 1.1, August 2002.
7. Dhillon, B.S., Design Reliability, Fundamentals and Applications, CRC Press, NY, 1999.
8. "Top Eleven Ways to Manage Technical Risk," NAVSO P-3686, October 1998.
9. Kalbfleisch, J.D. and Prentice, Ross L., The Statistical Analysis of Failure Time Data, Wiley, New York, 1980.
10. Abernethy, Robert, The New Weibull Handbook, 4th Edition, Published by R. Abernethy, 1996, ISBN 0 9653062 0 8.
11. RAC-HDBK-1210: "Practical Statistical Tools for the Reliability Engineer," RAC, Rome, NY, 1999.
12. START 2004-2, "The R&M Case - A Reasoned, Auditable Argument Supporting the Contention that a System Satisfies its R&M Requirements," RAC, Rome, NY, 2004.