| RAC is a DoD Information Analysis Center Sponsored by the Defense Technical Information Center
The Journal of the Reliability Analysis Center
Second Quarter - 2005

INSIDE
Risk Management and Reliability
START Sheets
New DDR&E Portal
RMSQ Headlines
New DoD RAM Guide
Future Events
From the Editor
PRISM Column
Introduction
A widely used measure of product reliability is Mean Time Between Failure
(MTBF). Before making a purchase decision for manufacturing equipment or
other items, customers frequently require the supplier to provide an MTBF.
However, MTBF is not well understood and false assumptions often lead to
poor decisions. Improper MTBF calculations used in head-to-head product
comparisons can result in sales lost to competitors, higher procurement and
maintenance costs, and customer dissatisfaction with product experience.
This article uses technical discussion, example, and practical application
to unveil the mystery behind MTBF. Three of the most commonly used
statistical distributions of field failure data (exponential, Weibull, and
lognormal) are reviewed. MTBF formulas are presented for each distribution.
Monte Carlo simulation is then used to compare and contrast five life data
models. The relationship between MTBF and Annualized Failure Rate (AFR) is
also discussed and a few of the complexities with performing AFR
calculations are reviewed.
Readers will be shown common pitfalls associated with reliability metrics
and how they can make more informed purchasing decisions that will lead to
an improved customer experience.
The MTBF Riddle
Is it possible for two MTBFs with the same value to tell two completely
different reliability stories? To answer this question, an experiment was
conducted using Monte Carlo simulation to compare and contrast five
different sets of life data that have the same MTBF, namely 50,000 hours.
In each of these five cases, 100 data points were generated to fit a
specified life data distribution using parameters and a seed selected to
result in an MTBF of approximately 50,000 hours. Figures 1 through 5 show
the resulting probability density functions (pdfs).
Reliability Statistics Fundamentals
This article uses three statistical distributions that are important in the
field of reliability engineering to model life data, namely the Weibull,
exponential and lognormal distributions. We also need to review reliability
terminology and the relationship between three fundamental reliability
equations. Firstly, the pdf is given by f(t) and represents the relative
frequency of failures over time. Secondly, the reliability function is
given by R(t) and represents the probability that the product survives
until time t. Thirdly, the failure rate function λ(t) is given by the
following equation, and is also referred to as the hazard rate h(t) or
instantaneous failure rate.
Weibull Distribution
The Weibull distribution is highly valued by the reliability engineer
because of its flexibility to model many different life data scenarios.
For life data that fits a Weibull distribution, the probability density
function (pdf) is given by the following equation, where η is the scale
parameter in units of time (also referred to as the characteristic life),
β is the unit-less shape parameter, and γ is the location parameter in
units of time.
By: Bill Lycette, Agilent Technologies
Practical Considerations in Calculating Reliability
of Fielded Products
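Failure rate function:
$$\lambda(t) = \frac{f(t)}{R(t)}$$
Weibull pdf:
$$f(t) = \frac{\beta}{\eta}\left(\frac{t-\gamma}{\eta}\right)^{\beta-1} e^{-\left(\frac{t-\gamma}{\eta}\right)^{\beta}}$$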
RAC Being Replaced by RIAC Under New DoD Contract
Effective on the 21st of June, the Defense Information Systems Agency (DISA) has awarded a contract to a team led by Wyle Laboratories, Inc. to operate the Reliability Information Analysis
Center (RIAC). The RIAC is the new name for the DoD's chartered center of excellence in the subject areas of reliability, maintainability, quality, supportability, and interoperability
(RMQSI). The new name was chosen to emphasize that the Center is part of the DoD eleven-member Information Analysis Center (IAC) program administered by the Defense Technical
Information Center. The DoD RMQSI IAC will continue to meet government and industry needs under its new name RIAC. RIAC is sponsored by the Office of Secretary of Defense
(OSD). Members of the Wyle team competitively selected are Quanterion Solutions Incorporated, the University of Maryland, the Pennsylvania State University Applied Research
Laboratory, and the State University of New York Institute of Technology (SUNYIT). The Center will be headquartered at SUNYIT in Utica, NY. A special edition of this Journal will highlight
plans for the Center after the July transition from Alion Science and Technology.
Often, γ represents the shipment transit time of the product. The
Weibull reliability function is given by:
Therefore, the Weibull failure rate function is given by:
For any life data that follows a Weibull distribution, the characteristic
life, η, is always the operating time, t, at which 63.2% of
the population is expected to fail.
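The 63.2% figure follows directly from the Weibull reliability function. As a short worked step, assuming the 2-parameter case (location parameter equal to zero) and the usual symbols η for the characteristic life and β for the shape parameter, evaluating the unreliability F(t) = 1 - R(t) at t = η gives:

```latex
\[
F(\eta) = 1 - R(\eta)
        = 1 - e^{-\left(\eta/\eta\right)^{\beta}}
        = 1 - e^{-1}
        \approx 0.632
\]
```

The shape parameter cancels out of the exponent, so the result holds for any value of β.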
A β value of less than one indicates a decreasing failure rate and
is typical of infant mortality. When β is equal to one, the failure
rate function reduces to that given by the exponential distribution
and the failure rate is constant at 1/η. A β value of greater than
one indicates an increasing failure rate and is typical of wear-out
mechanisms. When β is equal to two, the pdf becomes the
Rayleigh distribution and the failure rate function increases linearly
with time. When the value of β is between three and four,
the pdf is "normal" in appearance.
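The effect of the shape parameter on the failure rate can be checked numerically. The sketch below is illustrative only: the function name `weibull_hazard`, the scale value of 50,000 hours, and the evaluation times are assumptions chosen for the example, not values prescribed by the article.

```python
def weibull_hazard(t, eta, beta, gamma=0.0):
    """Weibull failure rate: (beta/eta) * ((t - gamma)/eta)**(beta - 1)."""
    return (beta / eta) * ((t - gamma) / eta) ** (beta - 1)

# Illustrative operating times (hours) and scale parameter; both are assumptions.
times = [1_000, 10_000, 50_000, 100_000]
eta = 50_000.0

for beta in (0.5, 1.0, 2.0):
    rates = [weibull_hazard(t, eta, beta) for t in times]
    # beta < 1: rates fall with time; beta = 1: constant; beta = 2: linear growth
    print(f"beta={beta}: " + ", ".join(f"{r:.2e}" for r in rates))
```

With a shape parameter of 1.0 every printed rate equals 1/η, and with 2.0 the rate doubles whenever the operating time doubles.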
The MTBF of the Weibull distribution is given by the following
equation, where Γ is the gamma function.
Since Γ(2) = 1, the MTBF when β is equal to one and γ is
equal to zero is:
MTBF = η
In other words, the only time that the MTBF equals the characteristic
life is when β is equal to one.
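As a quick check of the Weibull MTBF formula, the sketch below evaluates the location parameter plus the scale parameter times Γ(1/β + 1) for the scale and shape values that the article quotes for Figures 1 through 4. The function name `weibull_mtbf` is an assumption made for illustration; `math.gamma` is the standard-library gamma function.

```python
import math

def weibull_mtbf(eta, beta, gamma=0.0):
    """Mean of a Weibull distribution: gamma + eta * Gamma(1/beta + 1)."""
    return gamma + eta * math.gamma(1.0 / beta + 1.0)

# (scale eta in hours, shape beta) pairs quoted for Figures 1 through 4.
cases = [(25_000, 0.5), (50_000, 1.0), (56_419, 2.0), (55_992, 3.0)]

for eta, beta in cases:
    print(f"eta={eta:>6,}, beta={beta}: MTBF = {weibull_mtbf(eta, beta):,.0f} hours")
```

All four parameter pairs come out within a few hours of 50,000, even though only the second pair has a scale parameter equal to the MTBF.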
Exponential Distribution
For life data that fits an exponential distribution, the pdf is given
by the following equation, where λ is the failure rate expressed
in failures per unit time and γ is the location parameter in units
of time.
The exponential reliability function is given by:
Therefore, the exponential failure rate function is given by:
The exponential distribution is widely used (and often misused)
because of its simplicity, and the fact that it has a constant failure
rate λ. The MTBF for the exponential distribution is given by:
Note: When β is 1.0, the Weibull distribution is equivalent to the
exponential distribution, i.e., MTBF = η = 1/λ. It is also the only
scenario when the MTBF can be directly calculated using the
reciprocal of the failure rate.
Lognormal Distribution
The lognormal distribution often describes failure processes
involving degradation over time, such as corrosion, metal migration,
or chemical reaction. The failure rate function is complex
and beyond the scope of this article. The times to failure are
lognormally distributed if the natural logarithm of the times to
failure is normally distributed.
The MTBF of the lognormal distribution is given by:
For this equation, μ is the mean of the natural logarithm of the
times to failure and σ is the standard deviation of the natural
logarithm of the times to failure.
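The lognormal MTBF formula, exp(μ + 0.5σ²), can be evaluated with the parameter values the article quotes for Figure 5 (10.3 for the log-mean and 1.0196 for the log-standard deviation). The sketch below is illustrative; the function names are assumptions, and the median, exp(μ), is a standard lognormal property included only for comparison, not something the article computes.

```python
import math

def lognormal_mtbf(mu, sigma):
    """Mean of a lognormal distribution: exp(mu + 0.5 * sigma**2)."""
    return math.exp(mu + 0.5 * sigma ** 2)

def lognormal_median(mu):
    """Median of a lognormal distribution: exp(mu)."""
    return math.exp(mu)

mu, sigma = 10.3, 1.0196  # values quoted for Figure 5
print(f"MTBF   = {lognormal_mtbf(mu, sigma):,.0f} hours")
print(f"Median = {lognormal_median(mu):,.0f} hours")
```

The MTBF lands very close to 50,000 hours, while the median is considerably lower, which is one way to see how strongly skewed this distribution is.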
Further details of these reliability distributions can be found in
the literature (References 1 and 2). Instantaneous failure rates,
reliability, and MTBFs can be easily calculated using commercially
available software (Reference 3).
The MTBF Riddle Explained
Returning to the experiment where Monte Carlo simulation is
used to create five unique models of life data, recall that they all
have the same MTBF of 50,000 hours.
Figure 1 shows the 2-parameter Weibull distribution (γ = 0),
where β is 0.5 and η is 25,000. This example illustrates the classic
case of infant mortality where the instantaneous failure rate is
decreasing. The pdf shows a very high percentage of early failures
followed by a steep decline in the number of failures.
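Weibull reliability function:
$$R(t) = e^{-\left(\frac{t-\gamma}{\eta}\right)^{\beta}}$$
Weibull failure rate function:
$$\lambda(t) = \frac{\beta}{\eta}\left(\frac{t-\gamma}{\eta}\right)^{\beta-1}$$
Weibull MTBF:
$$\mathrm{MTBF} = \gamma + \eta\,\Gamma\!\left(\frac{1}{\beta}+1\right)$$
Exponential pdf:
$$f(t) = \lambda e^{-\lambda(t-\gamma)}$$
Exponential reliability function:
$$R(t) = e^{-\lambda(t-\gamma)}$$
Exponential failure rate function:
$$\lambda(t) = \frac{\lambda e^{-\lambda(t-\gamma)}}{e^{-\lambda(t-\gamma)}} = \lambda$$
Exponential MTBF:
$$\mathrm{MTBF} = \frac{1}{\lambda}$$
Lognormal MTBF:
$$\mathrm{MTBF} = e^{\left(\mu + 0.5\sigma^{2}\right)}$$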
In Figure 2, the 2-parameter Weibull distribution is shown where
β is 1.0 and η is 50,000. This result is the same that one would
get with the single-parameter exponential distribution, namely
1/λ = η. A β value of 1.0 yields a failure rate that is constant over
time. It is also the only Weibull scenario where η is equal to the
MTBF. The pdf in this case starts off at a moderately high level
and then the frequency of failures drops off steadily over time.
Figure 1. pdf for 2-Parameter Weibull Distribution: η = 25,000 and β = 0.5
Figure 2. pdf for 2-Parameter Weibull Distribution: η = 50,000 and β = 1.0
Figure 3 shows the 2-parameter Weibull distribution where β is
equal to 2.0 and η is 56,419. The pdf has a slightly normal
appearance and is positively skewed. The frequency of failures
starts off low, steadily increases and then gradually tapers off.
A 2-parameter Weibull distribution, where β is 3.0 and η is
55,992, is shown in Figure 4. The pdf appears to be normally
distributed. In this scenario, a strong wear-out mechanism is at
work. The frequency of failures starts out at a very low level,
then increases rapidly and subsequently decreases rapidly.
Figure 3. pdf for 2-Parameter Weibull Distribution: η = 56,419 and β = 2.0
Figure 4. pdf for 2-Parameter Weibull Distribution: η = 55,992 and β = 3.0
Finally, in Figure 5, the lognormal distribution is illustrated
where μ is 10.3 and σ is 1.0196. The resulting pdf is similar in
shape to what is seen in Figures 1 and 2, suggesting modest
infant mortality where the failure frequency initially starts out
high but then steeply declines.
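The five life data sets can be imitated with a few lines of simulation. The sketch below is not the author's procedure or software: it uses Python's standard `random` module, an arbitrary seed, and the nominal parameters quoted for Figures 1 through 5, and the names `MODELS` and `sample_mean` are assumptions made for illustration. It compares the sample mean of 100 simulated lifetimes with that of a much larger sample.

```python
import random
import statistics

# Nominal parameters quoted for Figures 1 through 5.
# random.weibullvariate(alpha, beta): alpha is the scale, beta is the shape.
# random.lognormvariate(mu, sigma): parameters of the underlying normal.
MODELS = {
    "Weibull, shape 0.5": lambda rng: rng.weibullvariate(25_000, 0.5),
    "Weibull, shape 1.0": lambda rng: rng.weibullvariate(50_000, 1.0),
    "Weibull, shape 2.0": lambda rng: rng.weibullvariate(56_419, 2.0),
    "Weibull, shape 3.0": lambda rng: rng.weibullvariate(55_992, 3.0),
    "Lognormal": lambda rng: rng.lognormvariate(10.3, 1.0196),
}

def sample_mean(draw, n, seed=2005):
    """Mean of n simulated times to failure (the seed is an arbitrary choice)."""
    rng = random.Random(seed)
    return statistics.fmean(draw(rng) for _ in range(n))

for name, draw in MODELS.items():
    small = sample_mean(draw, 100)
    large = sample_mean(draw, 100_000)
    print(f"{name:20s} n=100: {small:>9,.0f}   n=100,000: {large:>9,.0f}")
```

With only 100 points the sample mean can sit well away from 50,000 hours, especially for the heavily skewed cases, which is consistent with the article's remark that a seed was selected to land near the target MTBF.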
In comparing the five examples, it is clear that MTBF on its own
yields very little insight into: 1) the instantaneous failure rates
expected over the service life of the product, or 2) the expected
survival percentage (reliability function) at any point in time. In
fact, without knowledge of how life data is distributed, mistakes in
equipment or material procurement decisions are bound to occur.
[Figures 1 through 4: plots of the pdf, f(t), versus Time, (t); vertical axis 0 to 3.0E-5, horizontal axis 0 to 150,000.]

Table 1 shows the expected reliability of the five life data examples.
Suppose a design engineer is developing a system that uses
a power supply assembly from one of two suppliers, both of whom offer
an MTBF specification of 50,000 hours. If the required reliability
is 80% at 10,000 hours, then selecting a supplier whose
power supply life data behaves as shown in Example 1 (β = 0.5)
would yield extremely disappointing results for the customer
who purchased the system.
Figure 5. pdf for Lognormal Distribution: μ = 10.3, σ = 1.0196
On the other hand, suppose the customer-use model of the system
dictates a service life of less than 1,000 hours at which point
the system is discarded. Further assume that a reliability of 95%
at 1,000 hours is acceptable. Lastly, assume that the power supply
assembly from Supplier A has life data distributed as shown
in Example 2 (β = 1.0) and costs one-half of an equivalent power
supply assembly from Supplier B that has life data distributed as
shown in Example 4 (β = 3.0). Clearly, considerable cost savings
could be realized by purchasing from Supplier A.
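The two purchasing scenarios can be checked against the Weibull reliability function. The sketch below is a minimal illustration, assuming a zero location parameter and the nominal scale and shape values of Examples 1, 2, and 4; the helper names are hypothetical. Because it uses the closed-form function rather than simulated data, the results may differ slightly from the tabulated values.

```python
import math

def weibull_reliability(t, eta, beta, gamma=0.0):
    """R(t) = exp(-((t - gamma)/eta)**beta)."""
    return math.exp(-(((t - gamma) / eta) ** beta))

def meets_requirement(t, required, eta, beta):
    """True if the reliability at mission time t is at least the required value."""
    return weibull_reliability(t, eta, beta) >= required

# Nominal (eta, beta) values for the examples discussed in the text.
example_1 = (25_000, 0.5)
example_2 = (50_000, 1.0)
example_4 = (55_992, 3.0)

# Scenario 1: 80% reliability required at 10,000 hours.
print("Example 1 at 10,000 h:", round(weibull_reliability(10_000, *example_1), 3),
      meets_requirement(10_000, 0.80, *example_1))

# Scenario 2: 95% reliability required at 1,000 hours.
for label, params in (("Example 2", example_2), ("Example 4", example_4)):
    print(f"{label} at 1,000 h:", round(weibull_reliability(1_000, *params), 3),
          meets_requirement(1_000, 0.95, *params))
```

Under these assumptions Example 1 falls well short of the first requirement, while both Example 2 and Example 4 satisfy the second, so the lower-cost supplier is adequate for the short service life.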
Uncertainty in Reliability Metrics
Most hardware suppliers cite a single MTBF number, i.e., they
provide a point estimate of the MTBF. However, sampling error
associated with such a metric can be significant and can lead to
costly problems. Understanding the suppliers' confidence
bounds on the point estimate can have significant bearing on the
buying decision.
Figures 6 and 7 illustrate how two identically-distributed sets of
life data can have very different confidence bounds. In both cases,
a 2-parameter Weibull distribution represents the underlying life
data, with β equal to 1.0 and η equal to 50,000 hours. The only
difference between the two examples is the number of failures: 10
failure events are modeled in Figure 6 and 100 failure events in
Figure 7. Both examples show the 2-sided confidence bounds.
[Figure 5: plot of the pdf, f(t), versus Time, (t); vertical axis 0 to 3.0E-5, horizontal axis 0 to 150,000.]
Table 1. Reliability of Life Data Distributions that all Have an MTBF of 50,000 Hours

                          Reliability at Mission End Time
                          Ex. #1       Ex. #2       Ex. #3       Ex. #4       Ex. #5
Life Data Distribution    Weibull      Weibull*     Weibull      Weibull      Lognormal
                          η = 25,000   η = 50,000   η = 56,419   η = 55,992   μ = 10.3
                          β = 0.5      β = 1.0      β = 2.0      β = 3.0      σ = 1.0196
Mission End Time (hours)
100                       0.936        0.998        1.000        1.000        1.000
200                       0.912        0.996        1.000        1.000        1.000
500                       0.865        0.990        1.000        1.000        1.000
1,000                     0.815        0.981        1.000        1.000        1.000
5,000                     0.636        0.906        0.992        0.999        0.924
10,000                    0.529        0.820        0.967        0.994        0.777
20,000                    0.409        0.671        0.876        0.953        0.536
30,000                    0.335        0.548        0.743        0.850        0.382
40,000                    0.284        0.447        0.589        0.681        0.282
50,000                    0.245        0.365        0.438        0.473        0.214
60,000                    0.215        0.300        0.305        0.275        0.166
70,000                    0.190        0.243        0.199        0.129        0.132
80,000                    0.170        0.198        0.122        0.047        0.106
90,000                    0.153        0.161        0.070        0.013        0.087
100,000                   0.139        0.131        0.037        0.003        0.072

*The exponential distribution is equivalent to the Weibull distribution when β = 1.0.
[Plot: Reliability, R(t), versus Time, (t); vertical axis 0 to 1.0, horizontal axis 0 to 150,000.]
Significant uncertainty exists in the product's reliability function
when the number of failures in the reliability model is low.
Failure to factor in this uncertainty can lead to unexpected,
disappointing, and costly results experienced by the customer.
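The effect of the number of failures on the uncertainty of an MTBF point estimate can be illustrated by simulation. The sketch below is an assumption-laden illustration, not the confidence-bound method behind Figures 6 and 7: it repeatedly draws 10 or 100 exponential lifetimes with a true MTBF of 50,000 hours, records the sample mean each time, and reports the range covering the middle 90% of those estimates. The function name, trial count, and seed are arbitrary choices.

```python
import random

def mtbf_estimate_spread(n_failures, true_mtbf=50_000.0, trials=2_000, seed=1):
    """Return the 5th and 95th percentiles of simulated MTBF point estimates."""
    rng = random.Random(seed)
    estimates = sorted(
        # random.expovariate takes the rate (1 / mean) as its argument
        sum(rng.expovariate(1.0 / true_mtbf) for _ in range(n_failures)) / n_failures
        for _ in range(trials)
    )
    low = estimates[int(0.05 * trials)]
    high = estimates[int(0.95 * trials) - 1]
    return low, high

for n in (10, 100):
    low, high = mtbf_estimate_spread(n)
    print(f"{n:>3} failures: middle 90% of estimates spans {low:,.0f} to {high:,.0f} hours")
```

The interval for 10 failures is several times wider than the one for 100 failures, which mirrors the qualitative difference between the two figures.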
Figure 6. Weibull Distribution: β = 1.0, 10 Failures, 90% CB
Figure 7. Weibull Distribution: β = 1.0, 100 Failures, 90% CB
Using AFR to Calculate MTBF
In the previous sections we saw how MTBF is calculated using
statistical models of field failure data. Often, field failure data is
incomplete or the expertise to create such a failure data model is
unavailable. In the absence of such information or methods, the
all-too-common (and flawed) practice is to use the familiar
annualized failure rate (AFR). This method involves taking the
reciprocal of the AFR and multiplying it by the hours per year of
operation time T, that is:
Such a method has a number of problems associated with it. To
begin with, there is the inherent assumption that the failure rate
is constant over time, i.e., the life data follows an exponential
distribution or Weibull distribution when β is equal to 1.0.
Another difficulty is determining what value of T to use.
Depending upon assumed customer use models, such as 24 hours
per day, 7 days per week (24x7) or 8x5, the resulting MTBF can
vary by as much as a factor of four.
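The sensitivity of this method to the assumed operating hours can be shown in a few lines. The sketch below is illustrative only: the function name, the 52-week year, and the 5% AFR are assumptions, and the calculation is meaningful only under the constant failure rate assumption described above.

```python
def mtbf_from_afr(afr, hours_per_year):
    """MTBF = T / AFR, where AFR is expressed as a fraction failing per year."""
    if afr <= 0:
        raise ValueError("AFR must be positive")
    return hours_per_year / afr

HOURS_24X7 = 24 * 7 * 52  # 8,736 hours per year, assuming 52 weeks
HOURS_8X5 = 8 * 5 * 52    # 2,080 hours per year, assuming 52 weeks

afr = 0.05  # hypothetical annualized failure rate of 5%

for label, hours in (("24x7", HOURS_24X7), ("8x5", HOURS_8X5)):
    print(f"{label}: MTBF = {mtbf_from_afr(afr, hours):,.0f} hours")

print(f"Ratio of the two results: {HOURS_24X7 / HOURS_8X5:.1f}")
```

The same AFR produces MTBF values that differ by a factor of about four, depending purely on the usage model assumed.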
The selection of AFR method can also introduce significant variability
in MTBF results. Seemingly countless different methods
can be used to calculate field failure rates. For instance, Agilent
Technologies calculates AFR by taking the number of warranty
failures in the reporting month, dividing by the number of units
under warranty in that month, and then multiplying by 12 to
annualize the result. Jon G. Elerath's paper on the subject does an
excellent job in summarizing, comparing and contrasting a number
of different techniques (Reference 4). While the reciprocal of
AFR may be useful for making a reasonable estimate of MTBF in
some cases, the reliability practitioner should at least be aware of
the built-in assumptions and potential error that this method
introduces.
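The monthly AFR calculation described above can be written out directly. The sketch below follows the description in the text (warranty failures in the month, divided by units under warranty in that month, multiplied by 12); the function name and the sample counts are hypothetical.

```python
def annualized_failure_rate(warranty_failures, units_under_warranty):
    """AFR = 12 * (failures in the reporting month / units under warranty)."""
    if units_under_warranty <= 0:
        raise ValueError("units_under_warranty must be positive")
    return 12.0 * warranty_failures / units_under_warranty

# Hypothetical reporting month: 25 warranty failures, 10,000 units under warranty.
afr = annualized_failure_rate(25, 10_000)
print(f"AFR = {afr:.1%}")
```

A different reporting month, warranty population, or failure classification would change the result, which is why the choice of AFR method matters when MTBF is derived from it.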
Other AFR Considerations
In addition to the selection of AFR method, the reliability engineer
must pay careful attention to several other variables that influence
AFR. For instance, it is critical that failure mode classifications be
treated consistently when making head-to-head AFR comparisons.
Often, No Trouble Found (NTF) and overstress modes are included
in one model but not another. Another important factor is the
selection of an appropriate shipment window. Should it be based
on the past one-month, six-month or 12-month shipment history?
Or should the lowest AFR achieved over the past 12 months of
shipment history be used? Consistency of method, sustainability
of reliability, and availability of sufficient life data to assure
reasonable confidence bounds are important elements to consider.
Another important factor in calculating an accurate AFR is the
use of complete and accurate life data. It is best to use warranty
data for this calculation because it represents the most complete
data set typically available. Customers have financial incentives
to return warranty failures to the manufacturer for repair. This
affords the greatest opportunity for the manufacturer to collect a
complete set of failure data from a range of shipment dates.
Out-of-warranty failures may be returned to the manufacturer for
repair in only one-third or fewer of instances, which makes
out-of-warranty data useless for calculating AFR. Any AFR calculated
from such a data set will yield an erroneous point estimate of MTBF.
Conclusions
MTBF is often cited by equipment manufacturers as the "go to"
reliability metric. However, MTBF on its own provides very little
insight into how the failure rate behaves over time or what the
expected reliability will be at any given moment. It is also
important to understand the uncertainty associated with an
MTBF estimate.
In the absence of life data modeling, MTBF is often calculated
by taking the reciprocal of AFR and multiplying it by an estimated
number of hours per year of operation. This method
assumes that the product's failure rate is constant over time;
however, such an assumption is frequently far from true.
Without a solid understanding of a product's life data, substantial
errors can occur when calculating MTBF and AFR. Decisions
based on flawed methods such as these can result in lost sales to
competitors, higher costs to procure equipment and material, and
disappointed customers.
Acknowledgments
The author wishes to thank John Herniman, Scott Voelker, and
Greg Larsen for their inspiration and insights that helped make
this article possible.
[Plot: Reliability, R(t), versus Time, (t); vertical axis 0 to 1.0, horizontal axis 0 to 150,000.]
MTBF calculated from AFR:
$$\mathrm{MTBF} = \frac{T}{\mathrm{AFR}}$$
References
1. Applied Reliability, Second Edition, Paul A. Tobias and David C. Trindade, CRC Press, 1995.
2. Practical Reliability Engineering, Fourth Edition, Patrick D.T. O'Connor, John Wiley & Sons, Inc., 2002.
3. Life Data Analysis Reference, ReliaSoft Publishing, ReliaSoft Corporation, Tucson, Arizona, 1997.
4. "AFR: Problems of Definition, Calculation and Measurement in a Commercial Environment", J.G. Elerath, Reliability and Maintainability Symposium Annual Proceedings, January 24-27, 2000, pp. 71-76.
About the Author
Bill Lycette is a Senior Reliability Engineer with Agilent
Technologies. He has 24 years of engineering experience with
Hewlett-Packard and Agilent Technologies, including positions
By: Ned H. Criscimagna, Alion Science and Technology
Risk Management and Reliability
Introduction
Risk management is one of the critical responsibilities of any
manager. The term "risk management" is used by managers and
analysts in a number of diverse disciplines. These include the
fields of statistics, economics, psychology, social sciences, biology,
engineering, toxicology, systems analysis, operations
research, and decision theory.
Risk management means something slightly different in each of
the disciplines just mentioned. For social analysts, politicians,
and academics it is managing technology-generated macro-risks
that appear to threaten our existence. To bankers and financial
officers, it is usually the application of techniques such as currency
hedging and interest rate swaps. To insurance buyers and
sellers, it is insurable risks and the reduction of insurance costs.
To hospital administrators it may mean "quality assurance." To
safety professionals, it is reducing accidents and injuries. For
military acquisition managers, it means identifying, prioritizing,
and managing the technical, cost, and schedule risks inherent in
developing a new weapon system.
This article discusses how an effective reliability program can be
a valuable part of an overall risk management effort for military
system acquisition programs.
What is Risk?
The American Heritage® and Webster dictionaries define the
term similarly. These definitions can be summarized as:
1. Possibility of suffering harm or loss: Danger.
2. A factor, course, or element involving uncertain danger:
Hazard.
3. The danger or probability of loss to an insurer.
4. The amount that an insurance company stands to lose.
5. One considered with respect to the possibility of loss to
an insurer (a good risk, e.g.)
A more general definition of risk, perhaps more appropriate for
acquisition, is:
Risk is the chance that an undesirable event might occur in
the future that will result in some negative consequences.
This latter definition of risk is often expressed as an equation
(Reference 1):
Risk Severity = Probability of Occurrence x Potential
Negative Impact
In the sense of the definition just given, risk is a part of everyday
life. We all are faced with uncertainties in our lives, our careers,
and our decisions. Since we cannot avoid such uncertainties, we
must find ways to deal with them.
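The risk severity equation lends itself to a simple ranking exercise. The sketch below is purely illustrative: the risk items, probabilities, and impact scores (on an arbitrary 1 to 10 scale) are hypothetical and do not come from any program discussed in the article.

```python
def risk_severity(probability, impact):
    """Risk Severity = Probability of Occurrence x Potential Negative Impact."""
    return probability * impact

# Hypothetical risk register: (description, probability, impact on a 1-10 scale).
risks = [
    ("Supplier schedule slip", 0.50, 4),
    ("Immature component technology", 0.30, 8),
    ("Test facility unavailable", 0.10, 6),
]

# Rank the risks from most to least severe.
ranked = sorted(risks, key=lambda r: risk_severity(r[1], r[2]), reverse=True)

for description, probability, impact in ranked:
    print(f"{description:32s} severity = {risk_severity(probability, impact):.1f}")
```

A less likely event can still rank first when its potential impact is large enough, which is the point of combining the two factors.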
Similarly, the acquisition manager faces uncertainty concerning
the technical challenges in designing a new system, and the cost
and schedule estimates. Much effort is expended in trying to
assess the technical challenges of a new program, in estimating
the costs associated with that program, and in scheduling the program.
In addition to the many constraints placed upon the manager,
such as budgets, timeframes, and technical state-of-the-art,
the uncertainties, or the risks, make the job of managing the program
to a successful conclusion a difficult one.
Technical risk affects cost and schedule. As stated in an article
in the Journal of Defense Acquisition University (Reference 2):
There is no dispute that there is a strong relationship
between technical risk and cost and schedule overruns, nor
is there any dispute that DoD Project Offices must assess
and mitigate technical risk if they are to be successful.
However, what must be kept in mind is that technical risk
in-and-of-itself does not directly result in cost and schedule
overruns. The moderating variable is the manner in which a
project's contract is crafted and how deftly the contract is
administered, given the nature of a project's technical risk.
As an aside, in his 1999 thesis (Reference 3) written for the
Naval Postgraduate School, James Ross identified poorly
defined requirements as one of the highest risks during
pre-solicitation. Without very clearly defined, justifiable, and realistic
requirements, the already difficult task of risk management during
program execution is even more difficult.
What is Risk Management?
One can compare the job of program management to that of a ship
captain directing the safe passage of the vessel through waters