A Markov Model for a Simple Redundant System

In Reference 8, we developed a statistical model for a non-maintained, simple redundant system composed of two identical devices in parallel. That approach was based on the two random variables (RVs) corresponding to the two device lives. In this section we again analyze a simple redundant system composed of two identical devices in parallel. The differences are that we now use a Markov Chain approach, and that system S is maintained and can function at a degraded level with only one unit. The advantages of Markov modeling of system Availability, as will become apparent from the numerical example that follows, increase as the system becomes more complex (as does the mathematics behind the analyses involved).

Let, as before, X(T) be the state of the system at time T (= 0,1,2, ... hours). Let State 0 be the Down state, where both devices have failed and one of them is being repaired. Let State 1 be the Degraded state, where one device has failed and is being repaired and the second is working (and the system is operable but with lesser capabilities). Finally, let State 2 be the Up state, where both units are operating and the system is working at full capacity. The state diagram for this model is shown in Figure 3.

Figure 3. Markov Chain for Redundant System

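The state equations are:

$$
\begin{aligned}
p_{01} &= P\{X(T) = 1 \mid X(T-1) = 0\} = q \\
p_{10} &= P\{X(T) = 0 \mid X(T-1) = 1\} = p \\
p_{12} &= P\{X(T) = 2 \mid X(T-1) = 1\} = q \\
p_{21} &= P\{X(T) = 1 \mid X(T-1) = 2\} = 2p \\
p_{ii} &= P\{X(T) = i \mid X(T-1) = i\} = 1 - \sum_{j \neq i} p_{ij}
\end{aligned}
$$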

As before, we can consider every step (hour) T as an independent trial, having probability of success pij corresponding to the feasible transitions from our current state "i" into state j = 0,1,2. Hence, we can again think of the time to every change of state (produced by the occurrence of a failure or a repair) as being geometrically distributed, the geometric being the discrete counterpart of the Exponential. It will have "probability of success" p = pij (corresponding to the change into that state) and a mean time to accomplish such a change of μ = 1/pij.

The transition probability matrix P for this model is given by:

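$$
P =
\begin{pmatrix}
p_{00} & p_{01} & p_{02} \\
p_{10} & p_{11} & p_{12} \\
p_{20} & p_{21} & p_{22}
\end{pmatrix}
=
\begin{pmatrix}
1-q & q & 0 \\
p & 1-p-q & q \\
0 & 2p & 1-2p
\end{pmatrix}
$$

Rows and columns are both indexed by States 0, 1, 2; the row is the current state and the column is the next state.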

Rows must add to one (probability is unity because the system is always in one of its three states). If we want to know the probability pij(n) of being in some state "j" after "n" steps, given that we started in some state "i" of the system, we raise matrix P to the power "n", as we did before, and look at entry (i, j) of the resulting matrix P^n. With the advent of modern computers and math software, these operations are no longer tedious or difficult.

Modify the numerical example of the previous section, now using two units instead of one. The probability p of either unit failing in the next hour is 0.002. The probability q of the repair crew completing a maintenance job in the next hour is 0.033. Only one failure is allowed in each unit time period, and only one repair can be undertaken at a time.
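As a minimal sketch of these operations, the following pure-Python snippet builds the transition matrix from the values of p and q given in the text and raises it to a power by repeated multiplication. The helper names `mat_mul` and `mat_pow`, and the choice of 24 hours, are illustrative assumptions and not part of the original example.

```python
# Values taken from the numerical example in the text.
p = 0.002  # probability that a working unit fails in the next hour
q = 0.033  # probability that the ongoing repair is completed in the next hour

# One-step transition matrix: row = current state, column = next state
# (0 = Down, 1 = Degraded, 2 = Up).
P = [
    [1 - q, q, 0.0],
    [p, 1 - p - q, q],
    [0.0, 2 * p, 1 - 2 * p],
]


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows (hypothetical helper)."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Return m raised to the non-negative integer power n (hypothetical helper)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Every row of P must add to one.
for row in P:
    assert abs(sum(row) - 1.0) < 1e-12

# n-step transition probabilities after an arbitrary 24 hours.
P24 = mat_pow(P, 24)
print(P24[2][0])  # probability of being Down after 24 hours, starting from Up
```

Repeated multiplication is enough for a three-state chain; for large matrices or large n, a numerical library would normally be preferable.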

With these new conditions, the probability that a degraded system (State 1) is again degraded after two hours is the sum of the probabilities corresponding to three events. First, that the system status never changes (p11 p11). Second, that the failed unit is repaired in the first hour and then a unit fails during the second hour (p12 p21). Third, that the remaining unit fails in the first hour (the entire system goes down) and then a repair is completed in the second hour, so that the system comes back up at the degraded level (p10 p01):

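$$
\begin{aligned}
p^{(2)}_{11} = [P \times P]_{11} &= p_{10}p_{01} + p_{11}p_{11} + p_{12}p_{21} \\
&= pq + (1 - p - q)^2 + 2pq \\
&= 0.002 \times 0.033 + (1 - 0.035)^2 + 2 \times 0.002 \times 0.033 \\
&= 0.9314
\end{aligned}
$$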
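The two-step result can be checked by summing over the intermediate state visited after the first hour. The sketch below does this directly with the values from the text; the variable names are illustrative.

```python
p, q = 0.002, 0.033

# One-step transition matrix (0 = Down, 1 = Degraded, 2 = Up).
P = [
    [1 - q, q, 0.0],
    [p, 1 - p - q, q],
    [0.0, 2 * p, 1 - 2 * p],
]

# Each term is P[1][k] * P[k][1]: start Degraded, pass through state k
# after the first hour, and be Degraded again after the second hour.
paths = {k: P[1][k] * P[k][1] for k in range(3)}
p11_two_step = sum(paths.values())

print(paths)                   # contribution of each intermediate state
print(round(p11_two_step, 4))  # 0.9314
```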

We are also interested in the mean time that the system spends in any given state. For example, System S can change from state Degraded to Up or to Down in one step, with probabilities q and p, respectively. Hence, S will remain in the state Degraded with probability 1 - p - q. Then, on average, S will spend a "sojourn" of length 1/{1 - (1 - p - q)} = 1/0.035 = 28.57 consecutive hours in the Degraded state, before moving out to either the Up or the Down state.
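The same reasoning applies to each state: the mean sojourn is the reciprocal of the probability of leaving the state in one step. The sketch below applies it to all three states. Only the Degraded value is quoted in the text; the other two simply follow from the same formula, and the dictionary names are illustrative.

```python
p, q = 0.002, 0.033

# Probability of staying in each state for one more hour (diagonal of P).
stay_probability = {
    "Down": 1 - q,
    "Degraded": 1 - p - q,
    "Up": 1 - 2 * p,
}

# Geometric sojourn: mean number of consecutive hours = 1 / P(leave in one step).
mean_sojourn = {state: 1.0 / (1.0 - stay)
                for state, stay in stay_probability.items()}

for state, hours in mean_sojourn.items():
    print(f"{state}: {hours:.2f} hours")
```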

Let's now analyze "Availability at time T" = A(T) = P{S is Available at T}. This just means that system S is not Down at time "T" (it can be Up or Degraded). In addition, S could have initially been Up, Down or Degraded. Hence, A(T) depends on the initial state of S (States 0,1,2), the actual system availability level (States 1,2) and time (T). Assume we are interested in S being "Degraded Available" at T, given that it was Degraded at T = 0: p11(T). Since every row of the matrix P^T has to add to one, we can obtain this Availability via:

$$
p_{11}(T) = 1 - p_{10}(T) - p_{12}(T)
$$
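As a rough illustration, the state probabilities at time T can be obtained by propagating the initial state through the chain one hour at a time, which is equivalent to reading a row of P raised to the power T. In the sketch below, the function names `state_probabilities` and `availability` are hypothetical, and the 48-hour horizon is an arbitrary choice.

```python
p, q = 0.002, 0.033

# One-step transition matrix (0 = Down, 1 = Degraded, 2 = Up).
P = [
    [1 - q, q, 0.0],
    [p, 1 - p - q, q],
    [0.0, 2 * p, 1 - 2 * p],
]


def state_probabilities(start, hours):
    """Distribution over states after `hours` steps, starting in `start`."""
    dist = [0.0, 0.0, 0.0]
    dist[start] = 1.0
    for _ in range(hours):
        dist = [sum(dist[i] * P[i][j] for i in range(3)) for j in range(3)]
    return dist


def availability(start, hours):
    """A(T): probability that S is not Down (Degraded or Up) at time T."""
    return 1.0 - state_probabilities(start, hours)[0]


T = 48  # arbitrary horizon in hours, chosen only for illustration
dist = state_probabilities(1, T)
print(dist[1])             # "Degraded Available" at T, starting Degraded
print(availability(1, T))  # Degraded or Up at T, starting Degraded
```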

We may instead be interested in "long run averages" or "state occupancies". These are the asymptotic probabilities of system S being in each one of its possible states at any time T, or the percent of time spent in these states, irrespective of the state the system was in initially. These results are obtained by considering the Vector (denoted Π) of "long run" probabilities, Π = (Π0, Π1, Π2), where Πj is the long run probability of State j.

Vector Π fulfills two important properties that allow the calculation of such values:

$$
\Pi \times P = \Pi, \qquad \sum_{i} \Pi_i = 1
$$

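In plain English, Π × P = Π (Vector Π times the matrix P equals Π) defines a system of linear equations, which is "normalized" by the second property (that the probabilities in the components of Vector Π add to one). For our example, we have the following:

$$
\begin{aligned}
\Pi_0 &= 0.967\,\Pi_0 + 0.002\,\Pi_1 \\
\Pi_1 &= 0.033\,\Pi_0 + 0.965\,\Pi_1 + 0.004\,\Pi_2 \\
\Pi_2 &= 0.033\,\Pi_1 + 0.996\,\Pi_2 \\
1 &= \Pi_0 + \Pi_1 + \Pi_2
\end{aligned}
$$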

The solution of this linear system of equations yields the long run or asymptotic occupancy rates:

Π = (Π0, Π1, Π2 ) = (0.0065, 0.1074, 0.8861)
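One simple way to reproduce these occupancy rates numerically is to apply the matrix P to a probability vector repeatedly until it stops changing. This is a sketch of that iterative technique, not necessarily the method used to obtain the figures above; the function name `stationary` and its tolerance are assumptions.

```python
p, q = 0.002, 0.033

# One-step transition matrix (0 = Down, 1 = Degraded, 2 = Up).
P = [
    [1 - q, q, 0.0],
    [p, 1 - p - q, q],
    [0.0, 2 * p, 1 - 2 * p],
]


def stationary(matrix, tol=1e-13, max_steps=100_000):
    """Iterate pi <- pi x P from a uniform start until successive vectors agree."""
    n = len(matrix)
    pi = [1.0 / n] * n
    for _ in range(max_steps):
        nxt = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("iteration did not converge")


pi = stationary(P)
print([round(x, 4) for x in pi])  # [0.0065, 0.1074, 0.8861]
print(round(1 - pi[0], 4))        # long run Availability: 0.9935
```

The iteration converges here because every state can be reached from every other and each state can be held for consecutive hours; solving the linear system directly gives the same values.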

A Π2 = 0.8861 indicates that the system S is operating at full capacity about 88.6% of the time. A Π1 = 0.1074 means that S is operating at a Degraded capacity about 10.7% of the time. Only Π0, the probability corresponding to State 0 (Down state), is associated with the system being Unavailable. The "long run" system Availability is then: 1 - Π0 = 1 - 0.0065 = 0.9935.

Finally, we are also interested in the expected times for System S to go Down if initially S was in State Degraded (denoted V1) or Up (V2), that is, in the average time S spends in these states before going "Down". We obtain them by assuming Down is an "absorbing" state (one that, once entered, can never be left) and solving the linear system of equations that covers all such possible situations. That is, at least one step is always taken (the minimum occurs when the system goes Down directly). If S is not absorbed in that step, then it necessarily remains in or moves to one of the non-absorbing (Up or Degraded) states, with the corresponding probability, and the process restarts.

$$
\begin{aligned}
V_1 &= 1 + p_{11}V_1 + p_{12}V_2 = 1 + 0.965\,V_1 + 0.033\,V_2 \\
V_2 &= 1 + p_{21}V_1 + p_{22}V_2 = 1 + 0.004\,V_1 + 0.996\,V_2
\end{aligned}
$$

Solving this system yields the average times until System S goes Down: V1 = 4,625 hours (starting in state Degraded) and V2 = 4,875 hours (starting Up). For comparison, the non-maintained version of the system referred to initially would operate for an Expected 3/(2λ) = 3/0.004 = 750 hours before going Down (Reference 8), where λ = p = 0.002 is the hourly failure rate. The fact that maintenance is now possible while S continues operating in a Degraded state (with a single unit) results in an increase of μ/(2λ²) = 0.033/(2 x 0.002²) = 4,125 hours in its Expected Time to go Down (from Up), where μ = q = 0.033 is the hourly repair rate. Verify that the new Expected Time is the sum of the Expected time to failure without maintenance and the gain due to maintenance: V2 = 3/(2λ) + μ/(2λ²) = 750 + 4,125 = 4,875.
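The two equations for V1 and V2 form a 2 x 2 linear system that can be solved by hand or, as in this small sketch, with Cramer's rule. The coefficient names are illustrative; the last lines compare V2 with the closed-form sum quoted above, taking λ = p and μ = q.

```python
p, q = 0.002, 0.033

# Rearranged system with Down treated as absorbing:
#   (1 - p11) V1 -       p12  V2 = 1
#       - p21 V1 + (1 - p22) V2 = 1
# where 1 - p11 = p + q, p12 = q, p21 = 2p and 1 - p22 = 2p.
a11, a12 = p + q, -q
a21, a22 = -2 * p, 2 * p
b1, b2 = 1.0, 1.0

# Cramer's rule for a 2 x 2 system.
det = a11 * a22 - a12 * a21
v1 = (b1 * a22 - a12 * b2) / det  # expected hours to Down, starting Degraded
v2 = (a11 * b2 - a21 * b1) / det  # expected hours to Down, starting Up

closed_form_v2 = 3 / (2 * p) + q / (2 * p ** 2)

print(round(v1), round(v2))   # 4625 4875
print(round(closed_form_v2))  # 4875
```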