Hazardous Events
By: Vito Faraci Jr., BAE Systems
Abstract
Calculating the probability of "undesirable events" often involves analyzing the various ways equipment can fail. Today, Fault Tree Analysis (FTA) is by far the most commonly used tool for qualitative and quantitative risk analyses. FTA was introduced in 1962 at Bell Labs, and for about twenty years was the "de facto" standard of the engineering community. Starting in the early 80s, a group of NASA mathematicians performed studies that clearly exposed some very subtle FTA limitations. In an effort to overcome these limitations, NASA developed algorithms using Markov Analysis (MA), designed not necessarily to replace, but to support FTAs.
MA was introduced in 1907 by a Russian mathematician by the name of A.A. Markov. It is interesting to note that although this knowledge has been around for some time, it is only recently that the engineering community has taken advantage of this science. For example, within the past three years or so, NASA has been using Markov methods for probabilistic risk assessments for the Shuttle systems. In addition, FTA and Reliability Software manufacturers have integrated Markov techniques into their risk assessment software programs.
With respect to reliability and risk assessment, the integration of MA with FTA has been a giant step forward. Engineers can now solve more accurately a larger set of "risk" problems than they could before. However, due to a lack of documentation written in a clear common language, knowledge of MA still remains a little "sketchy" within the engineering community.
Objective
This article is not intended to be a "how to solve"
tutorial even though it will reveal some details. Its objective is simply to raise the level of awareness of Markov Analysis, what it is, why it is required, and what it does. To this end, several illustrative examples will be presented. The topics to be discussed are:
- Combinatorial vs. Non-combinatorial Logic
- 2. Fault Tree Advantages and Limitations
- Why Markov Analysis?
- Markov Analysis Compared with FTA
- Combinatorial & Non-combinatorial Type Problems
Combinatorial vs. Non-combinatorial Logic
Combinatorial Logic. Combinatorial logic is characterized by the following.
- Two or more input states define one or more output states. Output states are related by defined rules that are independent of previous states.
- Logic depends solely on combinations of inputs.
- Time is neither modeled nor recognized.
- Outputs change when inputs change irrespective of time.
- Output is a function of, and only of, the present input.
Non-combinatorial Logic (Sequential Logic). Logic of output(s) depends on combinations of present input states, and combinations of previous input states. That is, non-combinatorial logic has memory; combinatorial logic does not.
Fault Tree Advantages
The chief advantages of a fault tree are that it:
- Acts as a visual tool which can be used to pinpoint system weaknesses.
- Exhibits clear representation of logical processes that lead to a system or sub-system failure (clear qualitative representation of failure propagation).
- Reveals relatively simple equations for probability of failure (Pf) calculations yielding quantitative analyses that do not require high powered math.
- Is a very effective tool for the fault isolation process.
Fault Tree Limitations
The following is an excerpt from Aerospace Recommended Practices ARP4761 Issue 1996-12.
- [In a fault tree it is] Difficult to allow for transient & intermittent faults or standby systems with spares.
- If a system has many failure conditions, separate fault trees may need to be constructed for each one.
- [In a fault tree it is] Difficult to represent systems where failure rates or repair rates are state dependent.
The following is an excerpt from NASA Ref. Publication 1348:
Traditionally, the reliability analysis of a complex system has been accomplished with combinatorial mathematics. The standard fault-tree method of reliability analysis is based on such mathematics. Unfortunately, the fault-tree approach is somewhat limited and incapable of analyzing systems in which reconfiguration is possible. Basically, a fault tree can be used to model a system with:
- Only permanent faults (no transient or intermittent faults).
- No reconfiguration.
- No time or sequence failure dependencies.
- No state-dependent behavior (e.g., state-dependent failure rates).
Why Markov?
The following is another excerpt from ARP4761 Issue 1996-12 regarding fault trees and Markov Analysis.
- MA does not have these limitations.
- Sequence dependent events are included and handled naturally.
- Covers a much wider range of system behaviors.
Close examination of the NASA and ARP excerpts reveals the practical answer to the "Why Markov" question. It basically has to do with combinatorial vs. non-combinatorial type problems. FTA methods can only approximate and cannot precisely calculate solutions to non-combinatorial type problems. Markov techniques give us the ability to more accurately calculate solutions to non-combinatorial type problems.
Some Pros and Cons
Fault Tree Analysis handles combinatorial type problems both qualitatively and quantitatively extremely well, but has difficulty with non-combinatorial problems in both areas. Markov Analysis handles non-combinatorial as well as combinatorial problems, but may not be quite as intuitive as FTA, and usually requires some higher power math for the quantitative analyses.
Introduction to Markov Analysis
If a system or component can be in one of two states (i.e., failed, non-failed), and if we can define the probabilities associated with these states on a discrete or continuous basis, the probability of being in one or the other at a future time can be evaluated using a state-time analysis. In reliability and availability analysis, failure probability and the probability of being returned to an available state are the variables of interest. The best known statespace technique is Markov Analysis.
Example Markov State Diagram
Figure 1 illustrates a Markov state diagram. The following principles apply to this and all such diagrams.
- Various system states are represented.
- A transition rate is a function of the failure or repair rate.
- States are mutually exclusive.
- The sum of the probabilities must equal 1.
Figure 1. An Example of a Markov State Diagram: (Click to Zoom)
Markov Analysis Compared with FTA
Different types of failure rate characteristics are not an issue. FTA and Markov methods can handle both constant and nonconstant failure rates. The major factor that sets the two methods apart is in the handling of combinatorial and/or non-combinatorial type problems.
As mentioned before Markov handles combinatorial as well as non-combinatorial problems. Although there is no need of Markov for solving combinatorial type problems, (FTA handles them well enough) the next few examples will be demonstrated for the sake of illustration and comparison.
For purposes of simplification, the examples that follow will be limited to "constant failure rate" type problems. Solutions to
"non-constant failure rate" type problems require somewhat different techniques and need to be discussed separately.
Combinatorial Type Problems
Problem 1: 3 Components in Parallel (Combinatorial). Three black boxes start operation at the same time. Box A, B, and C have failure rates a, b, and c respectively. Successful system operation requires that Box A, B or C be functional. Find Pf. Note that the results shown in Figures 1A and 1B from each method, FTA and MA, are identical.
Figure 1A. Markov Model for Problem 1: (Click to Zoom)
Figure 1B. FTA Approach for Problem 1: (Click to Zoom)
Problem 2: Fault-Tolerant Diode Circuit, Probability of Circuit Short (Combinatorial). The diode circuit shown in Figure 2A is a model of a fault-tolerant diode configuration. The two possible failure modes for a diode are a SHORT circuit or an OPEN circuit. The failure rate for the SHORT mode (assuming identical diodes) is . Derive the equation for the probability of a "SHORT." Let a, b, c, and d = = failure rates of failure mode SHORT for diodes A, B, C, and D respectively. The Markov model for analyzing the diode circuit in Figure 2A is shown in Figure 2B. The FTA model is shown in Figure 2C.
Figure 2A. Diode Circuit Model 1: (Click to Zoom)
Note that Markov and FTA results are again the same since this is a combinatorial problem.
Figure 2B. Markov Model for Problem 2: (Click to Zoom)
Figure 2C. FTA Model for Problem 2: (Click to Zoom)
Problem 3: Fault-Tolerant Diode Circuit, Probability of Circuit Open (Combinatorial). The diode circuit in Figure 3A is a model of a fault-tolerant diode configuration. The two possible failure modes for a diode are: a SHORT circuit or an OPEN circuit. The failure rate for the OPEN mode (assuming identical diodes) is . Derive the equation for the probability of an OPEN circuit. Let a, b, c, and d = = failure rates of failure mode OPEN for diodes A, B, C, and D respectively.
Figure 3A. Diode Circuit Model 2: (Click to Zoom)
Figure 3B. Markov Model for Problem 3: (Click to Zoom)
Figure 3C. FTA Approach for Problem 3: (Click to Zoom)
Note that the Markov and FTA results are again the same since this is a combinatorial problem.
Each of the solutions in Table 1 can be expressed in terms of integral sums and products of their respective probabilities of successes or failures. In other words, coefficients and exponents of terms in column 3 of the table will all be integers. This is a telltale characteristic of all combinatorial type problems.
Table 1. Combinatorial Problem Summary Chart
Non-combinatorial Type Problems
Solutions to non-combinatorial problems require different techniques other than traditional combinatorial logic such as that found in FTAs. These methods include solving a set of simultaneous differential equations (DEs), Laplace Transforms, Convolution, and State Sequence Methods. These methods are subjects for another article. One non-combinatorial type problem in particular that has intrigued mathematicians for quite some time is the classic
"Standby Problem."
Note: For purposes of simplification, the following comparison examples will be limited to "constant failure rate" type problems. Solutions to "non-constant failure rate" type problems require somewhat different techniques and require a separate discussion.
Problem 4: 2 Components Standby Redundant (Non-combinatorial): Box A has failure rate "a" and Box B has failure rate "b."
Box A is powered on while Box B remains off. Immediately upon detection of Box A failure, Box B is powered on. Calculate the probability that both boxes fail. The solution using Markov analysis and FTA, are shown in Figures 4A and 4B respectively.
Figure 4A. Markov Model for Problem 4: (Click to Zoom)
Figure 4B. FTA Approach for Problem 4: (Click to Zoom)
Figure 4C. Graph a Standby Markov, FTA Comparison (0 to 10 hours): (Click to Zoom)
Figure 4D. Graph b Standby Markov, FTA Comparison (0 to 5,000 hours): (Click to Zoom)
This problem is another example of sequence failure dependency, and therefore a non-combinatorial type problem. Note again the FTA having difficulty tracking the Markov solution. However, for the first ten hours, the solutions are almost identical as shown in Figure 4C. However, as shown in Figure 4D, the FTA error becomes quite apparent as t gets large. In this example the MA results are larger than FTA. However, it is important to note that this is not always the case. In other problems, FTA results will exceed MA. In other words, the results can go either way.
Problem 5: Two components in Parallel with Required Order Factor (ROF) (Non-combinatorial). For this example, what is:
- The probability that both Boxes fail, and that A fails before B?
- The probability that both Boxes fail and that B fails before A?
The Markov model for this example is shown in Figure 5A.
Figure 5A. Markov Model for Problem 5: (Click to Zoom)
The Markov solution equations are:
- P(4) = a/(a + b) + [b/(a + b)] e-(a + b)t - e-bt
- P(5) = b/(a + b) + [a/(a + b)] e-(a + b)t - e-at
The fault tree for this example is shown in Figure 5B.
The fault tree equations are:
- Pf = ˝xy = ˝ (1 - e-at)(1 - e-bt)
- Pf = ˝xy = ˝ (1 - e-at)(1 - e-bt)
Figure 5B. Fault Tree Approach for Problem 5: (Click to Zoom)
Figure 5C. Graph a ROF Markov, FTA Comparison (0 to 10 hours): (Click to Zoom)
Figure 5D. Graph a ROF Markov, FTA Comparison (0 to 1,000 hours): (Click to Zoom)
Recall Item 3. of the NASA Excerpt. This ROF problem has a sequence failure dependency, and consequently a non-combinatorial type problem. As one can observe, the results for the FTA and Markov model are not the same. The difference is due to the fact that FTA has difficulty handling such problems. Figures 5C and 5D show the FTA error.
Problem 6: One Component with Repair (Non-combinatorial). A Black Box has failure rate "a" and an average repair rate "b."
Immediately upon detection of a failure, the Box goes into a repair process and is put back on line. Calculate the probabilities of States 1 (Full Up) and 2 (Fail). For this example,
- "Repair" can be categorized as an intermittent type problem. The device works, then it doesn't, then it works again. Recall Item 1. of the NASA Excerpt. Hence another example of a non-combinatorial problem.
- Markov has the capability of solving this problem on a continuous basis as shown in Figure 6.
Figure 6. Markov Model for Problem 6: (Click to Zoom)
The solution equations are:
Note from these equations, when t gets large P(1) approaches the value b/(a + b) which is commonly known as "Availability."
Problem 7: Load Sharing (Non-combinatorial). Consider a parallel load-sharing system consisting of two components A and B. Under the load sharing conditions, each component carries one-half of the load. If under half-load conditions, the failure rate for each component is one-third of the full load failure rate. The full-load component failure rate is "a." Figure 7A shows the Markov Model for this example.
Figure 7A. Markov Model for Problem 7: (Click to Zoom)
This is a very interesting problem. At first glance this problem appears to be combinatorial since its Markov Model looks very much like the Markov Model of Problem 3. Construction of an equivalent model, Figure 7B, reveals that it is non-combinatorial.
Figure 7B. Equivalent Markov Model for Problem 7: (Click to Zoom)
Problem 7: General Solution for n > 1 and n ≠ 2. If the previous problem read "If under half-load conditions, the failure rate for each device is 1/n times the full load failure rate," the solution would be:
Problem 7: General Solution for n = 2
Problem 8: Reconfiguration (Non-combinatorial): A system is made up of three computers with each computer having failure rate "a." Upon detection of failure of any one of the three, the remaining two reconfigure themselves at rate "b," and continue operating. Upon detection of a second failure, the remaining one reconfigures itself at rate "b," and continues operating until it fails. Note that if a computer should fail before a reconfiguration is completed, the system fails. Find Pf. The Markov model for this problem is shown in Figure 8.
Figure 8. Markov Model for Problem 8: (Click to Zoom)
The probability of a system failure is the probability that the system enters state 3 or 6 or 8, and therefore Pf = P(3) + P(6) + P(8). Note: As per Item 2. of the NASA Excerpt "Reconfiguration" is another example of a non-combinatorial problem.
Problem 9: Function Failure Undetected (Non-combinatorial): A certain system incorporates Built-In-Test (BIT) which detects 90% of function failures of an electrical device. The function has failure rate "f," and BIT has failure rate "b."
Assuming the function and BIT are checked during preflight, what is the probability of the function failing undetected? Figure 9A shows the Markov model and solution; Figure 9B the FTA model and solution.
FTA Solution:
- x = Prob (Function fails undetected) = 0.1 * f * t
- y = Prob (Function fails detected) = 0.9 * f * t
- z = Prob (BIT fails) = b * t where t is the elapsed time measured with pre-flight being start of count.
- Pf = x + yz/2 xyz/2 = x + yz(1-x)/2
- Pf = 1 [e-.1ft + e-ft + e-(.1f+b)t - e-(f+b)t]/2
Figure 9A. Markov Model and Solution for Problem 9: (Click to Zoom)
Figure 9B. FTA Model and Solution for Problem 12: (Click to Zoom)
Figures 9C and 9D, show graphs for the Undetected Failure problem, in which the Markov and FTA solutions are compared (0 to 100 hours and 0 to 10,000 hours).
Non-combinatorial Problem Summary Chart: Selected examples of non-combinatorial problems are summarized in Table 2.
Figure 9C. Graph for the Undetected Failure Problem, Markov vs. FTA (0 to 100 hours): (Click to Zoom)
Figure 9D. Graph for the Undetected Failure Problem, Markov vs. FTA (0 to 10,000 hours): (Click to Zoom)
Table 2. Summary of Selected Examples of Non-combinatorial Problems
Note: Solutions to non-combinatorial problems cannot be expressed in terms of integral sums, products, and exponents of their respective probabilities of successes or failures. Notice in column 3, the coefficients and exponents of terms are not all integers. This is the telltale characteristic of non-combinatorial type problems.
Methods for Solving Markov (Non-Combinatorial) Problems
Methods for solving non-combinatorial type problems include solving a set of simultaneous differential equations (DEs), Laplace Transforms, Convolution, and State Sequence Method.
Conclusions
- In the world of Risk Analyses (calculating probability of failure), there exists problems which by nature are non-combinatorial as well as combinatorial.
- Analysts need to recognize, and be able to distinguish between both combinatorial and non-combinatorial type problems.
- Analysts should have the tools to solve both types qualitatively and quantitatively.
- Since FTA is easy to understand, very well known, and handles combinatorial problems very well, it is suggested that the analyst continue to use FTA whenever dealing with combinatorial types.
- It is suggested that MA not be used as a substitute for FTA, but rather as a supplement whenever non-combinatorial type problems are encountered.
References
- Aerospace Recommended Practices ARP4761 Issue 1996-12.
- NASA Ref. Publication 1348.