|
|
| RAC is a DoD Information Analysis Center Sponsored by the Defense Technical Information Center and Operated by IIT Research Institute
INSIDE
T h e J o u r n a l o f t h e
7
Industry News
12
RAC Introduces Its
New Professional
Sustainment
Program Series
Insert
Training Flyer
13
Program Managers
Handbook
Common Practices
to Mitigate the Risk
of Obsolescence
19
PRISM Column
21
Future Events
22
From the Editor
Reliability Analysis Center
Fourth Quarter - 2002
Introduction
The complexity of modern digital systems, made
more problematic by the microelectronics revo-
lution, is a two-edged sword. Complexity cre-
ates both the need for fault tolerance and the
means to implement various solutions.
With
fault-tolerant computing, correct output is
obtained in spite of the presence of internal errors
called faults. Any system with redundant com-
ponents or functions can be made fault-tolerant.
For the designer who keeps fault-tolerance in
mind during the early stages of design, many
techniques are available: error-correcting codes
added to input, internal, and output data streams;
parallel, standby, or voting redundancy schemes;
and where applicable, parallel paths through a
network. Software also can be made redundant if
one employs two independent software teams to
develop different yet functionally identical ver-
sions of the software; however, any common
errors will limit the redundant effect.
Availability.
Redundancy also provides an
opportunity to significantly increase the reliabili-
ty, the probability of no system failure, by intro-
ducing the possibility of repair during system
operation.
Furthermore, the availability, the
probability that the system is up, is often com-
puted from the ratio of (uptime)/(uptime + down-
time), which also measures the beneficial effects
of repair. The analysis of system availability
generally involves the formulation and solution
of a Markov probability model.
The use of
Laplace transforms and appropriate use of sim-
plifying theorems greatly facilitates the analysis
effort and provides insight into system behavior.
The design of a large system often is driven by
functional requirements; however, when reliabil-
ity considerations are included, the apportion-
ment of redundancy throughout the system can
be used as a means of optimizing redundancy.
There are several simple, heuristic apportion-
ment schemes as well as a new technique of reli-
ability optimization, which uses upper and lower
bounds to limit the computations to a modest
number of feasible solutions.
Typical System Reliability and Availability. The
reader can appreciate some of the gains in system
reliability and availability achieved in the last
half of the 20th century by studying the compar-
ison in Table 1.
Coding
Everyone knows the cliché describing computer
systems Garbage in, garbage out. One might
describe error-correcting codes as Garbage in,
correct information out. This is accomplished by
adding extra (redundant) bits to the input that act
as error checks. Schemes such as adding redun-
dant bits are used to detect and correct errors in bit
transmissions that occur in communication
(e.g., modems, satellites, etc.), memory reads and
writes in a CPU, and computer network messages.
One can assume that errors in transmitted binary
words occur due to noise that affects individual
bits in a word, causing errors, i.e., zero bits cor-
rupted to ones or one bits corrupted to zeroes.
Assume that noise is rare and occurs in short
pulses with duration less than the width of a
transmission pulse. The simplest technique for
error checking is to transmit each word twice,
and then sequentially compare each bit in the
first transmitted word with the corresponding bit
in the second transmitted word. If a given bit is
the same in both words, the bit is passed. If not,
an error is detected and word can be retransmit-
ted until two identical copies are then obtained,
correcting the error. Clearly, this scheme at least
By: Martin L. Shooman, Polytechnic University, Brooklyn, NY
and Martin L. Shooman & Associates
Reliability of Fault-Tolerant Computing Systems
and Networks
T h e J o u r n a l o f t h e R e l i a b i l i t y A n a l y s i s C e n t e r
F o u r t h Q u a r t e r - 2 0 0 2
2
doubles the length of each transmitted word (i.e., 100% or more
overhead is required for error detection). More efficient meth-
ods exist with less overhead.
Parity Bit Codes. The simplest scheme is to append a single bit
called a parity bit to the word. If an even or odd parity scheme is
chosen, the parity bit is set to 1 or 0 so that the number of 1s in the
word (including the parity bit) is an even or odd number respec-
tively. The parity bit is generally computed using a tree network
of EXOR gates. The information word with parity bit attached is
then transmitted and when the word is received, another EXOR
tree is used to check that the parity is still even or odd. Clearly this
scheme detects single, triple, and any odd number of bit errors. If
the scheme is applied to an 8-bit byte, the parity bit is a ninth bit
and the overhead due to checking is 12.5%. Parity bit error detec-
tion codes have such small overhead that they are used in a wide
variety of applications. There are inexpensive, standard code gen-
eration (coder) and checking (decoder) digital circuits (74LS280).
Of course the code does not detect double, quadruple, or any
even number of errors since the parity is not changed in these
cases. The probability of an undetected error is the probability
that an even number of errors occur, which can be approximated
by the most significant term the Binomial probability of two
errors out of nine bits, which is given by 36q2 (1 - q)7 where q is
the probability of bit error. Without parity bit checking, all errors
are undetected and the binomial probability of one or more
errors in 8 bits is given by 1 - (1 - q)8. The ratio of these two
expressions gives the error reduction factor due to parity codes
that can be as large as 2 x 107 when q = 10-8. This analysis neg-
lects the possibility that an undetected error can occur due to a
failure of the coding or decoding circuitry. When these failure
probabilities are included in the analysis, the error reduction fac-
tor is decreased, especially for low transmission rates and small
values of q. For such values one has to wait many days or years
before multiple errors occur and the probability of chip failure
during this time period becomes significant. The results for a
parity bit code are shown in Figure 1, where the downward curve
at the top left of the diagram is due to chip failures.
Hamming Codes. The parity bit code is the least complex of the
many code types commonly used and is the simplest member of
an entire family of codes developed by Hamming [Reference 1]
that use extra check bits to detect multiple errors and correct
some errors. Although the more complex Hamming codes have
higher overhead, their error-correcting ability accounts for their
wide usage. Thus, Hamming Single Error Correcting and Single
Error Detecting (SECSED) and Hamming Single Error
Correcting and Double Error Detecting (SECDED) codes are
popular. All Hamming codes assume that the probability of mul-
tiple errors in a transmitted word is small; however in the case of
many storage devices, such as CDs, CD-ROMs, DVDs, hard and
floppy disk drives, errors often occur in bunches called bursts.
Bursts require a different formulation and the Reed-Solomon
(RS) code is commonly employed, [Reference 5].
Redundancy
Four major approaches are used to improve system reliability:
(1) simplify the system to use fewer components or simpler algo-
rithms and software; (2) lower component and software failure
rates; (3) provide redundant components or logical paths through
the system; and (4), for redundant components, repair failed
hardware and software components restoring full redundancy. In
this section, we will assume that the system designer has taken
advantage of (1) and (2) although this is not always the case in
practice, and focus on (3) and (4). One can view the material
covered in the last section of this article on apportionment and
redundancy optimization as an extension of redundancy as
described in this section.
Table 1. Comparison of Reliability and Availability for Various Fault-Tolerant Systems (Reference 3, pages 16, 17)
Application
Reliability
Availability*
1964 NASA Saturn Launch computer
R(250 hr.) = 0.99
------
Apollo NASA Moon Mission
R(mission) = 15/16 = 0.938
------
NASA Space Shuttle
R(mission) = 109/110 = 0.991
Through flight 110 in April 2002.
Note: Challenger flight exploded on 1/28/86.
------
Bell Labs ESS Telephone Switching System,
1966
------
Requirement of 2 hr. of downtime in 40 yr.
0.9999943
Tandem computer goals in 1980s
------
Tandem goals 0.999996
Stratus computer (web site quote 2001)
------
Stratus quote 0.9999905
Vintage 1985
Single CPU, transaction-processing
------
0.997 [Reference 4, p. 586]
Vintage 1985
Two in parallel CPU, transaction-processing
------
0.999982
Vintage 1985
Two in standby CPU, transaction-processing
------
0.9999911
*Note that the ESS availability goal of 40 years is clearly a steady state availability. Although not stated, the Tandem, Stratus, and Vintage 1985 num-
bers as well as the derived parallel and standby system availabilities are undoubtedly steady state values.
T h e J o u r n a l o f t h e R e l i a b i l i t y A n a l y s i s C e n t e r
F o u r t h Q u a r t e r - 2 0 0 2
3
In general redundancy improves system reliability (and availabil-
ity) because it provides alternate paths through the system so that
if one fails, there is another that continues operation. Redundancy
was prevalent during the era of analog circuitry in the mid 20th
century. However, with the advent of digital circuitry in the later
half of last century, it became easy to implement a wide variety of
simple as well as complex redundancy schemes.
System and Component Redundancy. Most systems are com-
posed of components. Thus, reliability can be considered at a
system or component level. Consider, for example, a system
composed of two components A and B that must both work for
successful operation. If one employs system redundancy, a sec-
ond system A'B' is connected in parallel so that if either system
fails, the second system takes over. If components A and B have
an equal probability of failure q, then a single system has an
approximate probability of failure of 2q. In the case of system
redundancy, both AB and A'B' must fail for the system to fail and
if the two replicas are identical the approximate probability of
failure is (2q) x (2q) = 4q2.
However, there is another possibility of using component redun-
dancy, i.e., to parallel A and A' and parallel B and B'. In such a
case, the failure probability of the AA' cluster is approximately
q2 and that of the BB' cluster is approximately q2 and the failure
of either cluster 2q2. Thus, the reliability using system redun-
dancy is superior to a single system (2q > 4q2), and the reliabili-
ty using component redundancy is still better (2q > 4q2 > 2q2).
Detailed analysis verifies this result.
Thus far, we have neglected the device needed to connect items in
parallel, which can be termed the coupling device. A complete
analysis of the two redundancy schemes just discussed requires
including the reliability of the coupling device. Clearly, the relia-
bility of the coupling device appears as a series element and when
system reliability is used one coupling device is required and the
reliability of the coupler is rc. If there are five major components
in the system and component redundancy is used, the reliability of
the system of couplers is (rc)5. If coupler unreliability is signifi-
cant, it can erode the advantages of component reliability over
system reliability, and a detailed analysis is needed.
Standby Systems. In a parallel system the extra redundant sys-
tem(s) or component(s) is powered up and can fail although only
one system or component is needed for system operation. A
standby system improves on this situation by using one on-line
system or component to provide operation and keeping the
power off for the standby elements. As long as the non-powered
standby elements have very low or zero failure rates, the stand-
by system is superior to the parallel system. The reliability of a
single element with a constant failure rate l is rs = e-
lt
and that of
a parallel system is rp = 2e-
lt
- e-2
lt
. One can show that the relia-
bility of a standby system is rs = e-
lt
+ lte-
lt
and that it is always
better than a parallel system.
However, the reliability of the coupler can have a significant
effect on the results. In the case of a standby system, the cou-
pler is more complex than in a parallel system. In essence it is
a smart switch, which detects the failure of the on-line unit, pow-
ers up the standby unit, and switches the inputs and outputs to
the standby unit. A simple comparison of the two systems can
be made by neglecting the coupler effect in a parallel system and
assuming that the reliability of the switch in a standby system is
given by e-
lSt
and that it multiplies the expression for rs. A com-
parison of the two expressions is given in Figure 2. This analy-
sis shows that the standby system is superior to the parallel sys-
tem as long as the failure rate of the switch is 0.5l or smaller.
Figure 1. Undetected Error Improvement Ratio for a Single Byte with a Parity Bit Versus Error Probability q
and Transmission Rate B in Bits per Second (Reference 3, Figure 2.5, page 45)
10-8
10-7
10-6
10-5
107
106
105
Bit Error Probability, q
B = infinity
B = 56000
B = 9600
B = 1200
B = 300
Improv
eme
ntRat
io
T h e J o u r n a l o f t h e R e l i a b i l i t y A n a l y s i s C e n t e r
F o u r t h Q u a r t e r - 2 0 0 2
4
Repair. Repair can further improve reliability in a system with
redundancy when repairs of failed redundant paths can be made
while the system continues to operate. The beneficial effects of
repair in the case of a parallel or standby system are obvious.
When one item fails, it is repaired while the other item continues
operation. The system only fails if the remaining unit goes down
before the failed unit is repaired and placed back in service.
Thus, it is a race to see whether repair occurs before a second
failure happens. However, the repair rate is much larger than the
failure rate in all practical cases; thus, repair continues to win for
many cycles and the time to system failure is greatly increased
by repair. For example, in the case of an automobile, we might
estimate that the failure rate (unscheduled maintenance) is once
per year. However, the repair plus waiting time is typically
about one day yielding a repair rate of 365 repairs per year.
Markov Models. For a complete analysis of repair effects, we
must make a Markov model for the system with repair. The first
step is to identify the system states. As an example let us con-
sider two elements with repair (either parallel or standby). The
elements are one and two and the state where both are good is S0
= x1x2 where both are bad S2 =
1 2
and where one is good S1
=
1
x2 + x1
2
. These three states are shown as nodes in Figure
3. The arcs between states, which go from left to right, represent
failures and are labeled with failure rates l and the arcs directed
from right to left represent repair and are labeled with repair
rates m. The probability of being in each of the states is obtained
by solving a first order differential equation for each probability
and, since the equations are coupled, the solution becomes that
of a third order differential equation. The Laplace transform
method can be used to simplify the solution work, however
much algebra is still involved. However, if we are satisfied with
less than the complete solution three Laplace transform theorems
for the initial value, the final value, and the integral of a function
can be used to easily find: (a) the mean time to failure, (b) the
final value, and (c) the Taylor series approximation for the initial
behavior. If we are only interested in the system mean time
between failure, MTBF, we obtain the results given in Table 2.
Note that if the repair rate is 10 times the failure rate, a standby
system has 12 times the MTTF of a single element, which is 6
times better than without repair. For practical systems where the
repair rate greatly exceeds the failure rate, very large gains in
system MTTF are realized.
Figure 3. A Markov Reliability Model for Two Identical
Parallel Elements and k Repairmen (Reference 3, Figure 3.14,
page 112)
Table 2. Comparison of the MTTF for the Various Systems
Represented by Figure 3 (Reference 3, page 115)
Figure 2. A Comparison of a Two-Element Ordinary Parallel System with a Two-Element Standby
System with Imperfect Switch Reliability (Reference 3, Figure 3.13, page 110)
1.0
0.8
0.6
0.4
0.2
0
0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1.6
1.8
2.0
Normalized Time t = l t
Standby ls = 0.5l
Standby ls = l
Standby ls = 0
(perfect switch)
Standby ls = 0.1l
Parallel
Reliability
R(t)
x x
x
x
1- l' Dt
1- (l + m') Dt
1
S0 = x1 x2
S1 = x1 x2 + x1 x2
S2 = x1 x2
l Dt
l' Dt
m' Dt
Element
Formula
MTBF For l = 1, m = 10
Single
1/l
1.0
2 parallel no repair
1.5/l
1.5
2 standby no repair
2/l
2.0
2 parallel repair
(3l + m)/2l2
6.5
2 standby repair
(2l + m)/l2
12.0
where:
l' = 2l (ordinary system)
l' = l (standby system)
m' = m (one repairman)
m' = km (k repairman)
T h e J o u r n a l o f t h e R e l i a b i l i t y A n a l y s i s C e n t e r
F o u r t h Q u a r t e r - 2 0 0 2
5
Related Concepts. Space does not allow a full description of
many related issues; however, a few will be briefly discussed.
Availability: Reliability measures the probability that a sys-
tem is always up, working in interval 0 to t. However, cases
when a system goes down for a few minutes and is quickly
repaired are considered examples of good availability.
Availability is the probability of being up at time t (even if it has
failed and been repaired one or more times) and can be calculat-
ed from a Markov model such as the one shown in Figure 3, if
an additional repair branch is added from state two (S2) back to
state one (S1).
Coverage: In a standby system the coupler must sense
failure of the on-line element and switch in the standby one; oth-
erwise the system fails. Since the coupler cannot detect all fail-
ures, one should model the fraction detected (called coverage) in
a Markov analysis. Typical values for coverage are 0.90 and
0.95. Low coverage can significantly decrease the improvement
in reliability achieved with redundancy.
N-Modular Redundancy: The problems in designing a reli-
able coupling device in a parallel or standby system can be
avoided if N-Modular redundancy is used. In the most common
case, N = 3, and the method is called TMR, triple modular redun-
dancy, where three copies of a digital circuit are employed and
their outputs are compared in a simple digital circuit that pro-
duces an output agreeing with the majority. Thus, the system
succeeds if there are no failures or one failure, the reliability is
better than a single component in the high reliability region, (lt
is small), and the availability can be high with repair.
NonStop Computing: Many computing applications require
very high levels of reliability and availability and Tandem com-
puters (now a division of Compaq) coined the term NonStop
computing in the 1980s to appeal to the on-line transaction pro-
cessing market. Their system utilized software and hardware
redundancy, as did their competitor Stratus, to achieve high lev-
els of availability, (see Table 1).
RAID: Data storage in modern computer system is general-
ly implemented using a Redundant Array of Independent Disks
(RAID). Many RAID techniques employ redundant disks and
error-correcting bits applied to the data and spread over many
disks. These techniques utilize 50 to several hundred disks and
achieve huge storage densities and high reliability.
Networks
Much of the world is interconnected by a network(s). For many,
this is the Internet at home and a local area network at work link-
ing computers, printers, workstations, and the Internet. Several
methods can be used to compute the reliability and availability
of a network. However, we will focus on the most straightfor-
ward approach. First, one should state that the network field uni-
formly computes the availability of a network and call this the
network reliability. The model of a network is based on the
mathematical graph of the network where network nodes (ver-
tices) represent computers, printers, terminals, etc. and the
branches (arcs) in the network represent the communication
links between the nodes. The simplest model assumes that the
nodes do not fail but the links do. We define a path as a collec-
tion of branches that provides communications between a speci-
fied pair of nodes.
The four-node network shown in Figure 4 has six branches. One
method of analyzing the network is to examine all combinations
of the six branches, where each branch can be either up or down.
For example, the combination called event 41 is denoted by E41
= 123456 where links 1, 3, and 4 are down but 2, 5, and 6 are
up. The combination E41 provides a path between nodes a and b,
so we would say that E41 provides two terminal connections
between ab. A detailed study of E41 shows that it also provides
a connection between all the other node pairs. Thus, we can say
that E41 provides all terminal connection. Since each of the
branches has two states (up and down) and there are 6 branches,
there are 26 = 64 different combinations of events. All of the
events as formulated are mutually exclusive. Thus, the two ter-
minal reliability between nodes s and t is the sum of the proba-
bilities of each event that provides connection between s and t.
Similarly, the all-terminal reliability is the sum of the probabili-
ties of all the events that provide all terminal connection. Again,
we emphasize that the term reliability really means availability
because we speak of each branch being up or down, which
means that they can fail and be repaired.
Figure 4. A Four-Node Graph Representing a Computer or
Communication Network (Reference 3, Figure 6.1, page 285)
The preceding method for calculating reliability is called State-
Space Enumeration and will become clearer as we fill in the
details for calculating the two terminal reliability, Rab. The eas-
iest way to explore the 64 combinations is to group them by the
number of failed branches and explore each grouping. There is
only one combination of no failures, E1 = 123456, and clearly
this provides a connection between a and b. There are 6 combi-
nations of one failure and all provide connection between a and
b: E2 = 123456, E3 = 123456, E4 = 123456, E5 = 123456, E6
= 123456, and E7 = 123456. Clearly, as long as edge 1 is up,
there is a connection, and E3 through E7 provide a connection. In
the case of E2 although 1 is down, 25 provides a connection. In
a similar manner, all the events are explored and the conclusion
is that there are 48 good events, [the details are given in
1
2
3
4
5
6
b
cd
a
T h e J o u r n a l o f t h e R e l i a b i l i t y A n a l y s i s C e n t e r
F o u r t h Q u a r t e r - 2 0 0 2
6
Reference 3, pp. 288-292]. The reliability expression is given by
Equation (1).
Rab = [P(E1)] + [P(E2) + ×××× + P(E7)] + [P(E8)
(1)
+ ×××× + P(E22)] + [P(E23) + ×××× + P(E34)
+ P(E37) + ×××× + P(E42)] + [P(E43)
+ ×××× + P(E47) + P(E50) + P(E56)] + [P(E58)]
If all edges are identical and have a probability of being up of p
and a probability of being down of q, then Equation 1 becomes:
Rab = p6 + 6 p5q + 15 p4q2 + 18 p3q3 + 7 p2q4 + pq5
(2)
For the case where p = 0.9 and q = 0.1, substitution in Equation (2)
yields an availability of 0.997848. To improve network reliability
we would have to increase p or add more branches to the network.
Many other more sophisticated techniques are available for com-
puting the availability of a network, including approximate tech-
niques and computer programs to help with the computations.
There are network design techniques that help with the design of
a reliable network given a set of nodes to connect, and others that
suggest how to obtain the biggest increase in reliability by
adding additional branches to an existing network.
Apportionment and Redundancy Optimization
In discussing redundancy in the preceding section, it was
assumed that the unit in question was a component or a modest-
ly sized system. A more global issue is, given a large system that
has unsatisfactory reliability or availability, how can we best
improve the reliability or availability by using a limited amount
of redundancy in the system. This is essentially top down redun-
dancy and although less efficient than bottom up redundancy, it
fits in well with a system engineering approach to system design.
Assume a block diagram represents the overall structure of the
system, where each level of decomposition is represented by n
submembers. If the decomposition produces 2 or 3 submembers,
too little decomposition occurs; if there are more than 10 sub-
members things are too complex. For a rationale for why 5-9 sub-
members is a good goal [Reference 3, pp. 337-342]. In general,
one has an overall system goal Rs (or As) and the reliability of all
the submembers combine to yield the system reliability. If all the
n submembers are independent and have an equal reliability r,
then Rs = rn. Since we know Rs we can solve for r obtaining r =
(Rs)1/n. This is a gross approximation and an easy first start.
Assigning a set of reliabilities for the submembers that equals the
system goal is generally called apportionment.
A brief study of the reliability of each of the submembers will
show how easy it is to reach the r goal for each submember. In
general, some are reachable and some will be very difficult.
Seldom does one discover that all the submember goals are easily
met. However, sometimes all of them appear to be insurmountable
goals. A better procedure than the equal reliability assumption is
to use the preliminary analysis of the subsystems to establish a set
of ratios of the initial failure rates. Suppose that n = 5 and the ini-
tial design of subsystems one and two seem to have roughly equal
reliabilities and a failure rate l, then l1 = l2 = l. However, sup-
pose three and four are more difficult and have roughly double the
failure rate, l3 = l4 = 2l. Furthermore, subsystem 5 is the most
difficult and has a failure rate l5 = 6l. Although these are all just
initial estimates, one can use them to obtain a better second round
of estimates. The system reliability becomes:
Rs = (e-
lt
)(e-
lt
)(e-2
lt
)(e-2
lt
)(e-6
lt
) = e-12
lt
(3)
Solving for l yields l = -ln(Rs)/12 and, using this value for l, we
obtain reliability goals for each of the submembers. It is always
useful to apply these simple techniques of apportionment (and
sometimes the more complex procedures [Reference 3, pp. 342-
351]) in the initial design stages.
One can formulate a more exact approach to reliability opti-
mization using redundancy. As an example, assume that n = 3,
that the initial estimates for the reliability of each subsystem is
R1 = 0.85, R2 = 0.5, and R3 = 0.3. The system goal is Rg = 0.9;
however, the initial design has a reliability of 0.85 x 0.5 x 0.3 =
0.1275. Each subsystem has a cost of 1 unit and the total cost
budget is 16 units; thus 13 units are available to assign parallel
elements to the three subsystems to achieve the system reliabili-
ty goal. Our objective is to find the set of solutions where the
system reliability Rs ³ Rg, the cost of each of these solutions, and
then to choose among these possibilities.
A brute force approach is to recognize that given the cost con-
straints, we could assign to subsystem one, 0 to 13 parallel redun-
dant units. Similarly, there are 14 different choices for subsystems
two and three, and the total space of possible combinations is 14 x
14 x 14 = 2,744! Clearly with modern computational power, a
brute force computation and ordering of the 2,744 cases is possi-
ble. However, such an approach becomes impossible for any prac-
tical size system. In the past, gradient search approaches (greedy
algorithms) and dynamic programming have been used to reduce
the number of cases that must be compared [Reference 3, pp. 369-
378]. Recently, a bounded enumeration approach has been dis-
covered that allows a great reduction in effort only 10 combina-
tions need to be explored [Reference 3, pp. 351-366]. The first
bounds to apply are the lower bounds that are derived from the fact
that each subsystem must have a reliability equal to or exceeding
the system goal and these bounds assume all other subsystems
have a reliability equal to unity. Thus,
(1 - [1 - Ri]n
i
) > Rg
(4)
Solving for ni from Equation (4) yields:
ni = log(1 - Rg)/log(1 - Ri)
(5)
For the three subsystems of our example, n1 = 1.21, n2 = 3.32, and
n3 = 6.46. Rounding to the next highest integer, n1 = 2, n2 = 4,
(Continued on page 11)
T h e J o u r n a l o f t h e R e l i a b i l i t y A n a l y s i s C e n t e r
F o u r t h Q u a r t e r - 2 0 0 2
11
Unreliable System Costs >75,000 Patients Lives Annually,
Reliability Review, published by the Reliability Division of
ASQ, September 2002, page 3. Poor process reliability is result-
ing in the spread of infections in hospitals leading to more than
75,000 deaths annually. These infections are preventable if the
same approach used to develop and control industrial processes
is applied by hospitals.
Step-Stress Accelerated Life Testing: Test Plan, Model, and
Application, Reliability Review, published by the Reliability
Division of ASQ, September 2002, page 13. A plan and model
for conducting accelerated life testing is presented together with
a case study in which the model is applied.
Human Systems Integration, Program Manager, published for
the Defense Acquisition University, July-August 2002, page 88.
The authors explore the importance of designing for interaction
of human beings with everything in the environment associated
with systems. They describe the organizational and analytical
processes associated with human systems integration.
ISO/TS 16949 the Clear Choice for Automotive Suppliers,
Quality Progress, published by ASQ, October 2002, page 44.
General Motors Corp., Daimler-Chrysler, and Ford Motor
Company are requiring their suppliers to transition from QS-
9000 to ISO/TS 16949 and to be certified for the latter by July
2004. The change is intended to facilitate access to various
world markets.
Reliability of Fault-Tolerant . . .
(Continued from page 6)
and n3 = 7. Thus, 13 units have been allocated (including the ini-
tial 3), and only 3 cost units are left. Consequently, four addi-
tional possibilities remain for each of the three units, 0,1,2,3 addi-
tional units, and the number of cases to enumerate is 4 x 4 x 4 =
64. This number can be further reduced by calculating the upper
bounds. Consider the allocation of 3 units to subsystem 1. If this
is done, no additional redundancy can be applied to subsystems 2
and 3 because all the resources are expended.
If we consider allocations above the minimum bound as incremen-
tal policies, Dn1 = 3, Dn2 = 0, Dn3 = 0, or, more compactly, the incre-
mental policy is (3,0,0), and sixteen of the 64 combinations are
eliminated. Similar reasoning leads to the conclusion that (2,1,0)
and (2,0,1) are allowed policies. Similarly (1,2,0), (1,1,1), and
(1,0,2) are possible policies. Lastly, (0,3,0), (0,2,1), (0,1,2), and
(0,0,3) are possible solutions for a total of 10 policies to explore.
The solutions outlined are clearly displayed by the search tree
shown in Figure 5. The branches radiating from the top level node
represents the four incremental choices for n1, and the branches
radiating from these four nodes represent the nine possible combi-
nations for n1 and n2. The final branches represent the six addi-
tional choices that can be made. The numbers at the bottom of each
terminating path are the reliabilities achieved by the policy.
Inspection of Figure 5 shows that the optimum incremental pol-
icy is (1,1,1) with a reliability of 0.9098 and that the incremen-
tal policy (0,1,2) is a close second with a reliability of 0.9087.
Practical considerations that are unstated may make the choice
of the second or perhaps third best policy advisable in some
cases. This simple example explains the method and illustrates
the huge reduction in computational cases due to the lower and
upper bounds. It is fairly easy to program the algorithm and a
personal computer provides ample computational power even in
a large-scale problem.
Conclusion
This article has highlighted some of the more important topics
that must be considered in the area of fault-tolerant computing.
Because of the complexity of digital systems, many of these
techniques must be used in combination to design a highly reli-
able, highly available complex system.
References
1.
Hamming, R.W., Error Detecting and Correcting Codes,
Bell System Technical Journal, 29 April 1950, pp. 147-160.
2.
Shooman, M.L. and C. Marshall, A Mathematical
Formulation of Reliability Optimized Design, Proceedings
of 16th IFIP Conference on System Modeling and
Optimization, Compiegne, France, July 5-9, 1993. (Also
published in Henry, J. and J.P. Yvon, System Modeling and
Optimization, Lecture Notes in Control and Information
Sciences 197, Springer-Verlag, London, 1994, pp. 941-950.)
n1
n2
n3
0.8602
0.8830
0.8885
0.9002
0.9098
0.8966
0.9067
0.9087
0.8905
0.8900
3
2
1
0
0
1
2
01
0123
1
2
31
2
1
Figure 5. A Search Tree for the Example (Reference 3, Figure 7.6, page 359)
T h e J o u r n a l o f t h e R e l i a b i l i t y A n a l y s i s C e n t e r
F o u r t h Q u a r t e r - 2 0 0 2
12
3.
Shooman, M.L., Reliability of Computer Systems and
Networks: Fault Tolerance, Analysis, and Design, John
Wiley, New York, 2002.
4.
Siewiorek, D.P. and R.S. Swarz, Reliable Computer Systems
Design and Evaluation, 3rd ed., A.K. Peters, , 1998.
5.
Wicker, S.B. and V.K. Bhargava, Reed-Solomon Codes and
their Applications, IEEE Press, New York, 1994.
About the Author
Martin L. Shooman, PhD, is currently President of the consult-
ing firm of Martin L. Shooman & Associates. He served for
many years as a Professor of Electrical Engineering and
Computer Science at Polytechnic University in Brooklyn, NY.
Dr. Shooman has been a Visiting Professor at MIT and Hunter
College, and a consultant to Bell Laboratories, NASA, IBM, the
U.S. Army, and many other government and commercial organ-
izations. A Fellow of the IEEE, Dr. Shooman has received five
Best Paper awards from the IEEE Reliability and Computer
Societies. He has contributed to over 100 papers and reports to
the research literature, and has given special courses in Britain,
Canada, France, Israel, and throughout the U.S.
Dr. Shooman is the author of Probabilistic Reliability: An
Engineering Approach and Software Engineering: Design,
Reliability, and Management. His latest book is Reliability of
Computer Systems and Networks: Fault Tolerance, Analysis, and
Design. Published in January 2002, the book is 552 pages and is
priced at $110.00. It can be ordered from the publisher, John
Wiley & Sons.
ISBN: 0-471-29342-3
NORTH AMERICA: John Wiley & Sons, Inc.
1-800-225-5945
REST OF THE WORLD
John Wiley & Sons, Ltd.
+44(0) 1243 779777
RAC Introduces Its New Professional Sustainment Program Series
By: Tim Cathcart and Joel Manary
Introduction
Sustainment professionals need a parallel set of skills and tools.
One set needs to focus on management aspects of integration of
support elements and integration of sustainment issues with other
program management functional areas. Numerous training and cer-
tification programs address this need. The other set needs to focus
on the engineering and other technical aspects of sustainment in a
practical and user-oriented fashion. No training and certification in
a condensed and user-oriented context has been available.
The Reliability Analysis Center (RAC) has recently developed a
new professional certificate program specifically designed for
the unique problems currently facing the sustainment communi-
ty. The Professional Sustainment Program provides the essential
technical skills, methods, and tools to implement many new
strategies and principles required in economically sustaining an
enterprise and the products created by that enterprise.
The Professional Sustainment Program series consist of several
courses that comprehensively integrate traditional sustainability
analysis and operational effectiveness methodologies with new
research findings from several fields of studies, such as supply
chain management, integrated lean enterprises, and e-logistics.
The Office of the Secretary of Defense has directed all DoD
logistic-wide initiatives to adopt commercially proven prac-
tices and strategies. These logistic transformation objectives
include the implementation of many commercial practices such
as supply chain management and agile manufacturing and MRO
(Maintenance Repair Overhaul) concepts. Many DoD organiza-
tions have established transformation offices to implement these
new strategies. The organizations using system continue to
demand increased attention to sustainment issues affecting sys-
tems readiness, and the responsiveness of the logistics infra-
structure. The rush to field systems without settling sustainabil-
ity requirements continues to plague acquisition projects.
Professional Sustainment Program
The Professional Sustainment Program series was developed to
provide an educational resource for the sustainment community.
It will help them develop the management and technical skills
needed to design and implement cost-effective, integrated sus-
tainment networks and agile organizational structures, while still
addressing the unique problems facing the sustainment commu-
nity such as aging systems and providing life cycle support for
Commercial-Off-The-Shelf items.
Commercially proven supply chain management and lean enter-
prise practices have significantly benefited the manufacturing
and retail industries but have been difficult to apply in the
defense industry because of the high degree of variability in both
source material and low volume production requirements.
Under ideal conditions, a sustainment supply chain network
would be responsive and flexible enough to meet varying
demand conditions. The right types of material and parts would
be available in the right quantities, at the right place, at the right
time, at an affordable cost. Parts and material shortages coupled
with increased maintenance requirements are just some of the
issues facing the sustainment community in todays environ-
ment. The logistic transformation from a cold war mass pro-
duction operational model into a smaller and more mobile lean
and agile post cold war operational model will require signifi-
cant cultural and infrastructure changes.
|
|
|
|