RAC is a DoD Information Analysis Center Sponsored by the Defense Technical Information Center
The Journal of the Reliability Analysis Center
Fourth Quarter - 2003
INSIDE
Keys to Reliability Relevance
Mechanical Stress/Strength Interference Theory
Applying RCM Analysis to EA-6B Corrosion Failure Modes
PRISM Column
Future Events
From the Editor
RMSQ Headlines
Introduction
What can reliability people and biomedical statisticians learn from
each other? This article begins with an example, then compares the
state of the art (SOTA) in both fields and indicates crossovers: what
each group can learn from the other. The article ends with a
suggestion for both groups on the use of available data.
Let's compare methods for dealing with factors that affect system
reliability. For example, which is more important in semiconductor
capital equipment: the central processor or the parallel redundant
chambers? Answer this question by determining derivatives of system
reliability with respect to part factors, because the larger the
derivative, the more important the factor, assuming equal costs per
unit change of factors.
Reliability Method. Reliability people often assume a constant
failure rate. If true, the system reliability, the probability that
the central processor (CP) and at least one chamber (CH) will survive
to age t, is
$$R(t) = \exp[-t/\mathrm{MTBF}_{CP}] \cdot \left(1 - \left(1 - \exp[-t/\mathrm{MTBF}_{CH}]\right)^{3}\right)$$
Figure 1 graphs the derivatives of system MTBF (the integral of R(t))
for $\mathrm{MTBF}_{CP}$ = 100 hours and $\mathrm{MTBF}_{CH}$ = 200
hours. The derivative of system reliability with respect to central
processor MTBF is larger. That means that the central processor MTBF
has more effect on system reliability than chamber MTBF.
Figure 2 shows the system MTBF for various part MTBFs. An increase in
chamber MTBF of 100 hours increases system MTBF by 40 hours, but the
same increase in central processor MTBF increases system MTBF by 50
to 100 hours, depending on chamber MTBF.
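The constant-failure-rate calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function names are hypothetical, the three-chamber parallel structure is read from the exponent in the R(t) formula, and the closed-form system MTBF comes from expanding 1 - (1 - x)^3 = 3x - 3x^2 + x^3 and integrating each exponential term. The derivatives are taken with respect to the part MTBFs by central differences.

```python
import math


def system_reliability(t, mtbf_cp, mtbf_ch, n_chambers=3):
    """R(t): the central processor and at least one chamber survive to age t."""
    r_cp = math.exp(-t / mtbf_cp)
    r_ch = math.exp(-t / mtbf_ch)
    return r_cp * (1.0 - (1.0 - r_ch) ** n_chambers)


def system_mtbf(mtbf_cp, mtbf_ch):
    """Integral of R(t) over t >= 0 for three parallel chambers.

    With a = 1/MTBF_CP and b = 1/MTBF_CH the integral is
    3/(a + b) - 3/(a + 2b) + 1/(a + 3b).
    """
    a, b = 1.0 / mtbf_cp, 1.0 / mtbf_ch
    return 3.0 / (a + b) - 3.0 / (a + 2.0 * b) + 1.0 / (a + 3.0 * b)


def central_difference(f, x, h=1e-3):
    """Numerical derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


cp, ch = 100.0, 200.0  # part MTBFs in hours, as in the example
base = system_mtbf(cp, ch)
d_cp = central_difference(lambda m: system_mtbf(m, ch), cp)
d_ch = central_difference(lambda m: system_mtbf(cp, m), ch)
print(f"system MTBF = {base:.1f} hours")
print(f"d(MTBF)/d(MTBF_CP) = {d_cp:.3f}")
print(f"d(MTBF)/d(MTBF_CH) = {d_ch:.3f}")
```

Under these assumptions the derivative with respect to central processor MTBF is the larger of the two, which agrees with the conclusion drawn from Figure 1.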
Figure 1. Derivatives of System Reliability with
Respect to Part MTBFs. The upper curve is the
derivative with respect to central processor MTBF
Figure 2. System MTBF as a Function of Part
MTBFs
Biomedical Survival Analysis vs. Reliability:
Comparison, Crossover, and Advances
By: Larry George, Problem Solving Tools
Survival Analysis Method. In contrast to the reliability assumption
of constant failure rate, biomedical statisticians often assume the
age-specific failure rate function is a "relative risk" or
"proportional hazards" [Cox] model:
$$\lambda(t;Z) = \lambda_0(t)\,\exp(Z\beta)$$
In this equation, $\beta$ is a vector of regression coefficients,
$\lambda_0(t)$ is the "base" failure rate when Z is zero, and Z is a
vector of "concomitant" variables representing factors besides age t
that can account for variation in $\lambda(t;Z)$ from $\lambda_0(t)$.
Those factors are called concomitant because they accompany subjects
with factors equal to Z. The proportional hazards model means that
the failure (a.k.a. hazard) rate function of a subject is
proportional to the base failure rate function. The $\exp(Z\beta)$
term is the risk relative to the base failure rate, $\lambda_0(t)$.
Figure 3 graphs the derivatives of system reliability with respect
to the relative risks, $\exp(Z_{CP}\beta_{CP}) = 100$ and
$\exp(Z_{CH}\beta_{CH}) = 200$, using the relative risk model. The
lower curve is the derivative with respect to central processor
relative risk; its greater magnitude indicates that central processor
factors have more effect than chamber factors.
Figure 3. Derivatives of System Reliability With Respect to
Part Relative Risks. The lower curve is the derivative with
respect to central processor relative risk
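One way to reproduce the comparison behind Figure 3 is sketched below. It rests on assumptions that the article does not state: that both part types share one base cumulative failure rate, that each part's cumulative failure rate is that base value times its relative risk (so its survival probability is exp(-relative risk x base cumulative rate)), and that the system is the same series-parallel structure as in the reliability method. The function names are hypothetical.

```python
import math


def system_reliability_ph(cum_base, rr_cp, rr_ch, n_chambers=3):
    """System reliability under the assumed relative risk model.

    cum_base is the base cumulative failure rate; rr_cp and rr_ch are
    the relative risks exp(Z*beta) of the processor and of a chamber.
    """
    s_cp = math.exp(-rr_cp * cum_base)
    s_ch = math.exp(-rr_ch * cum_base)
    return s_cp * (1.0 - (1.0 - s_ch) ** n_chambers)


def d_wrt_rr_cp(cum_base, rr_cp, rr_ch, n_chambers=3):
    """Analytic derivative of system reliability w.r.t. processor relative risk."""
    return -cum_base * system_reliability_ph(cum_base, rr_cp, rr_ch, n_chambers)


def d_wrt_rr_ch(cum_base, rr_cp, rr_ch, n_chambers=3):
    """Analytic derivative of system reliability w.r.t. chamber relative risk."""
    s_cp = math.exp(-rr_cp * cum_base)
    s_ch = math.exp(-rr_ch * cum_base)
    return -s_cp * n_chambers * (1.0 - s_ch) ** (n_chambers - 1) * cum_base * s_ch


rr_cp, rr_ch = 100.0, 200.0  # relative risks quoted in the text
for cum_base in (0.002, 0.004, 0.006, 0.008, 0.010):
    print(
        f"{cum_base:.3f}  "
        f"dR/dRR_CP = {d_wrt_rr_cp(cum_base, rr_cp, rr_ch):+.5f}  "
        f"dR/dRR_CH = {d_wrt_rr_ch(cum_base, rr_cp, rr_ch):+.5f}"
    )
```

Both derivatives are negative, because a larger relative risk lowers reliability, and in this sketch the processor derivative has the larger magnitude at every value of the base cumulative failure rate.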
Review. This example came from a recent presentation in which
the author, a reliability statistician, described randomizing part
MTBFs and then using DoE and response surface analysis to
answer the importance question. Field failure rates are seldom
constant. Because data is claimed to be unavailable or expensive,
and because of variations in process, customer, and environmental
factors, people randomize MTBF.
Randomizing MTBF is a weak alternative to getting data and
doing statistical analysis. The only reason for randomizing dis-
tribution parameters is to represent sample uncertainty.
Probability distributions themselves represent randomness, so it
is unnecessary to randomize their parameters.
The biomedical relative risk model answers the importance ques-
tion without assumptions about the failure rate, and it incorpo-
rates concomitant factors. It has passed the test of time; it pre-
dominates in biomedical survival analysis, even though it was
introduced in 1972 [Cox].
The following two sections on the state of the art describe the
objectives, data, subjects, profession, support, standardization,
publications, software, and statistics used in biomedical survival
analysis and reliability.
Biomedical Survival Analysis SOTA
Objectives. Analysis of ages at failures, usually lightly censored
or truncated, to estimate the survivor function (a.k.a. reliability
function); to do hypothesis tests, usually to compare treatment
and control effects; to make forecasts; and to evaluate the effects
of concomitant variables using regression and multivariate
analysis. Mind-boggling variations due to stratification, censor-
ing, truncation, competing risk, and multistate models keep bio-
medical statisticians busy.
Data. Clinical trials use age-at-death data (duration of response
to treatment, time to illness or recurrence) to test hypotheses and
quantify treatment effects [Kalbfleisch and Prentice, Klein and
Moeschberger]. Clinical trials cost money, and sometimes dis-
eases are rare, so sample sizes can be small.
Subjects. In some ways, humans are relatively simple: subsys-
tems are clear, there is no sell-through time, humans operate one
hour per calendar hour, humans usually repair themselves spon-
taneously, and fairly specific failure modes are recorded in death
certificates.
Standardization. Statistical use of age-at-failure data is stan-
dardized in Food and Drug Administration, National Institutes of
Health (NIH), and drug company procedures and in insurance
company actuarial methods. Human actuarial failure rates are
published by the Centers for Disease Control and Prevention
(CDC) and used by insurance companies and the Social Security
Administration.
Profession. The biomedical statistics professional organizations
(the American Statistical Association, with more than 16,000
members; the Bernoulli Society; the Biometric Society; and the
Institute of Mathematical Statistics) are somewhat academic. Nearly
every large nation has a professional statistical organization. The
Royal Statistical Society (England) was inaugurated in 1834.
The Society of Actuaries requires comprehensive examinations
for regular membership.
Support. The biomedical statistics profession is well supported
by the federal government: NIH, National Institute on Drug
Abuse, National Center for Health Statistics, and CDC.
Statistical programs are part of the federal budget, although
many are baseball and census statistics. Drug companies and
health organizations employ hundreds of statisticians.
Publications. Academic publications have a high standard. The
following web site lists relevant books:
Medical journals abound with peer-reviewed case studies, some-
times contradictory. Newspaper headlines and news commenta-
tors report drug and treatment developments.
Software. Good statistical computer programs (expensive, current,
and well supported) are available. Many include survival
analysis.
Statistics. Normal distribution statistics predominate, with some
nonparametric statistics. Relative risk and proportional hazards
models are widely used to represent concomitant variables and
quantify their effects. Biostatisticians use these models for test-
ing hypotheses about concomitant variables, without estimating
failure rate functions.
The stochastic integral-martingale representation of the cumula-
tive failure rate function is widely used to prove asymptotic
results, even to evaluate derivative options in stock markets
(Black-Scholes).
Computer-aided tomography (CAT), nuclear magnetic resonance
(NMR), positron emission tomography (PET), and encephalog-
raphy statistical methods play an important role in biostatistics,
although they aren't survival analysis methods.
Reliability SOTA
Objectives. Predict, monitor, and improve reliability and use
reliability information for design, process, and service decisions.
The emphasis is on prediction and test, often accelerated, in
addition to analysis of field data in the hands of customers.
Systems are complex, and failure modes are sometimes record-
ed, sometimes masked, and sometimes unknown. Sell-through
time and age measures other than calendar age complicate relia-
bility estimation and use. Service often takes a secondary role in
companies eager to keep their sales revenue. Field data is usu-
ally highly censored. Reliability shares with biomedical statis-
tics variations due to censoring, truncation, competing risk, mul-
tistate models, and multivariate age measures, but not stratifica-
tion. Step stress and fatigue failure models don't have counter-
parts in biostatistics.
Data. Age-at-failure data is expensive and corrupted by sell-
through time and errors, and it may not be classified by failure
mode. Many companies quit tracking products and service parts
by serial number, so they don't have age-at-failure data.
Reliability tests suffer from the same cost limitations as clinical
trials. In the aviation industry, only about 75 "fracture-critical"
parts per aircraft are tracked by tail number, hours, and cycles.
The result is that field reliability is seldom known and used.
Standardization. The Federal Aviation Administration (FAA),
National Highway Traffic Safety Administration (NHTSA),
and Nuclear Regulatory Commission (NRC) rely on information
from, and negotiate with, the organizations they regulate.
Military Standards are either too procedural in nature or have
been canceled. The Baldrige National Quality Program and ISO
9000 ignore reliability.
There are some bright spots, however. Markovian cost analysis
of isotope separation plants was first done a long time ago. In
the late 1950s, RAND adapted actuarial methods to engine man-
agement for the Air Force Logistics Command. NASA has good
reliability-based diagnostics for the space station, but not for the
space shuttles. Technical Committee 56 of the International
Electrotechnical Commission has a series of standards that deal
not only with the programmatic aspects of reliability but also
with the associated statistical tools.
Profession. Most professional reliability organizations are small
divisions of relatively nonacademic organizations: Institute of
Electrical and Electronics Engineers (IEEE), American Society
for Quality (ASQ), Institute of Environmental Sciences (IES),
Institute of Industrial Engineers (IIE), Society for Maintenance
and Reliability Professionals (SMRP), and Society of Automotive
Engineers (SAE). The Society of Reliability Engineers (SRE) nearly
folded when George Chernowitz passed away. The Institute for
Operations Research and the Management Sciences (INFORMS)
and the Society for Industrial and Applied Mathematics (SIAM) are
academic. ASQ sponsors a certification program for reliability
engineers. Two major universities offer graduate degrees in reli-
ability and many others offer courses in designing for reliability
and the use of statistics in assessing reliability.
Support. The Air Force Office of Scientific Research has had no
statisticians for years. The military looks for new, advanced
weapons while trying to maintain the old ones with old methods.
NASA, NRC, and the military funded many potentially useful
paper studies, which have been forgotten. The NHTSA won't
support the statistics it needs. Companies lay off their reliabili-
ty engineers or train people from other professions to act as reli-
ability engineers.
Publications. Publications frequently print models, methods,
and estimators using standard distributions. See for a list.
Software. The only commercially viable reliability software
seems to be for MTBF prediction, FMEA, FRACAS, ALT, RCM,
simulation, and Weibull analysis. For lists of available software,
see
and . Statistical software ven-
dors recognize the mathematical equivalence of survival analysis
and reliability in marketing their survival analysis software.
Statistics. Exponential and Weibull statistics predominate in reli-
ability. MTBF prediction has no counterpart in biomedical statis-
tics. Reliability people use accelerated life models just like bio-
medical accelerated failure time models. The stochastic integral-
martingale representation is beginning to appear in reliability,
with the Nelson-Aalen cumulative failure rate estimator and other
applications [Bagdonavicius and Nikulin, Aven and Jensen].
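For readers unfamiliar with the Nelson-Aalen estimator named above, the following is a minimal sketch. At each distinct failure age it adds the number of failures divided by the number of units still at risk. The function name and the sample data are hypothetical and are chosen only to show how censored units leave the risk set without adding to the estimate.

```python
from collections import Counter


def nelson_aalen(ages, failed):
    """Return [(age, cumulative failure rate)] at each distinct failure age.

    ages[i] is the age at failure or at censoring of unit i;
    failed[i] is True for a failure and False for a censored unit.
    """
    deaths = Counter(a for a, f in zip(ages, failed) if f)
    exits = Counter(ages)  # failures and censorings both leave the risk set
    at_risk = len(ages)
    cumulative = 0.0
    estimate = []
    for age in sorted(exits):
        if deaths[age]:
            cumulative += deaths[age] / at_risk
            estimate.append((age, cumulative))
        at_risk -= exits[age]
    return estimate


# Hypothetical ages (any time unit); False marks a censored unit.
ages = [2, 3, 3, 5, 7, 8]
failed = [True, True, False, True, False, True]
for age, value in nelson_aalen(ages, failed):
    print(age, round(value, 4))
```

The exponential of the negative of the cumulative failure rate gives a corresponding nonparametric estimate of the reliability function.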
Crossover
Statisticians take pains to make statistical definitions consistent
with lay usage. Unfortunately, lay people believe that MTBF is
reliability. It's easy to measure something with a number and
difficult to measure it with a function, but a function is necessary
to quantify randomness. Not all reliability engineers even agree
that reliability is a probability distribution function, and few
managers understand the concept of probability distributions.
Reliability people can learn from biomedical survival analysis.
Reliability is defined as "the probability of survival to specified
ages under specified conditions," which requires estimating the
survival distribution. That's what statisticians do. Of course,
reliability people should use the information learned from statis-
tical analyses, in addition to using their engineering skills.
Survival analysis includes two-sample tests that can be useful for
comparing products, processes, environments, customers, before
vs. after, and so on. Relative risk models can be used for evalu-
ating alternatives and characterizing conditions. They are final-
ly being used in reliability analyses [Bagdonavicius and Nikulin,
George and Felthauser, Krivtsov et al] and MTBF and reliability
prediction [George 2003]. Nonparametric estimators should be
adopted for field reliability estimation, because assuming math-
ematically convenient failure rate functions hides potentially
actionable information. Also, using nonparametric estimators
avoids having to defend assumptions. The stochastic integral-
martingale representation for failure rate intensity helps prove
asymptotic properties of estimators. It can help prove properties
of estimators described in the next section.
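The two-sample comparison mentioned above is commonly done with the log-rank test. The sketch below is one standard form of that test, written for illustration rather than taken from the article. The function name and the "before" and "after" samples are hypothetical, and the p-value uses the chi-square approximation with one degree of freedom, which is rough for samples this small.

```python
import math


def logrank_test(ages_a, failed_a, ages_b, failed_b):
    """Two-sample log-rank test; returns (chi-square statistic, p-value).

    Assumes at least one failure age at which both groups are still at
    risk, so that the variance is positive.
    """
    data = [(t, f, 0) for t, f in zip(ages_a, failed_a)]
    data += [(t, f, 1) for t, f in zip(ages_b, failed_b)]
    failure_ages = sorted({t for t, f, _ in data if f})
    observed_a = expected_a = variance = 0.0
    for age in failure_ages:
        n_a = sum(1 for t, _, g in data if t >= age and g == 0)  # at risk, group A
        n_b = sum(1 for t, _, g in data if t >= age and g == 1)  # at risk, group B
        d_a = sum(1 for t, f, g in data if t == age and f and g == 0)
        d_b = sum(1 for t, f, g in data if t == age and f and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi_square = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi_square / 2.0))  # chi-square tail, 1 d.f.
    return chi_square, p_value


# Hypothetical ages at failure before and after a design change (no censoring).
before = [1, 2, 3, 4, 5]
after = [10, 11, 12, 13, 14]
chi_square, p_value = logrank_test(before, [True] * 5, after, [True] * 5)
print(f"chi-square = {chi_square:.2f}, p = {p_value:.4f}")
```

A small p-value suggests that the two failure rate functions differ; the same comparison applies to products, processes, environments, or customers.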
The reliability field can benefit from sharing information in the
same way as the biomedical field. Some industry paranoia regard-
ing failures prevents potentially useful comparisons and early
problem detection. Some organizations try to collect field failure
data, including Telcordia, the FAA, the Government Industry Data
Exchange Program (GIDEP), and the Reliability Analysis Center
(RAC). Information should be shared as freely as possible and
used to estimate age-specific reliability and failure rate functions.
Companies that share reliability information with customers have
a competitive advantage over those that do not.
CAT, NMR, PET, and epileptiform foci mapping encephalogra-
phy methods may be useful for condition monitoring and for
security scanning. (Security, like dependability, is a lay syn-
onym of reliability.) CAT and NMR estimate local density with-
in a body, within 3D pixels. CAT, NMR, and radiation back-
scattering measurement methods are already used in baggage
inspection and other security devices. Epileptiform foci map-
ping searches for the source of the characteristic electrical signal
of epilepsy from electroencephalograms. Reliability applica-
tions might need to identify and locate the source of an electrical
signal among all those measurable at a surface.
Biomedical people can learn from reliability analyses too.
Software reliability [Beizer] may be applicable to human cognition
and (mis)behavior. Fatigue failure (Miner's rule) and step stress
models may be useful in biomedical statistics to represent wearout
and changes in treatment. Stress-strength, FTA, and load sharing
models don't seem to have applications to biomedical survival
analysis, but perhaps readers will recognize some potential use.
Relevation (good-as-old), renewal (good-as-new), and hysterical
(somewhere in between) statistics for recurrent processes apply
to humans as well as to products. Preventive maintenance is
widely practiced on humans, but not optimally [Aven and
Jensen]. Opportunistic maintenance, driven by reliability, can
also be applied in medical treatment. Opportunistic maintenance
is the replacement of other parts at the same time as replacement
of a failed part, because as long as the system is being repaired,
the incremental cost of repairing or replacing other parts is less
than the cost of waiting. Long ago, some surgeons removed your
appendix if your abdomen was open. There may be other oppor-
tunities.
Reliability engineers are needed in medicine because of the com-
plex machinery used in hospitals, clinics, and laboratories and
because of the importance of safety. (I enjoyed working on clin-
ical laboratory equipment reliability and contributed the optimal
dilution for a cell counter and the discriminant algorithm for
WBC diff [white blood cell type percentages]). As device
implants become more common, perhaps reliability statistics
will become part of biomedical survival analysis.
Potential Advances for Both Groups
Random samples of age-at-failure data, censored or not, make
statistical analysis convenient. Suppose you only have ships
(births, installed base, production, etc.) and returns (deaths, com-
plaints, repairs, spares sales, etc.) counts by accounting interval.
Ships and returns (warranty repairs) counts (Table 1) are statisti-
cally sufficient to make nonparametric estimates of reliability
and failure rate functions, without tracking humans, parts and
products by serial number or name [George 1999].
Table 1. Monthly Ships and Warranty Repair Counts for 1988
Ford V-8 460 Drivetrain, August-December 1987
Figures 4 and 5 show nonparametric estimates of monthly fail-
ure-rate functions for age at first warranty repair and for ages
between subsequent warranty repairs. They were estimated by
least squares ()
under the assumption that repairs were a renewal process in
which the age at first warranty repair has a different failure rate
function than the rest. Maximum likelihood estimators are also
available [George 2002]. Figure 4 shows that almost 16% fail
immediately and another 4% shortly thereafter, probably in the
hands of new owners. Figure 5 shows that 13% fail immediate-
ly after repair, indicating that the problem wasn't fixed. (The
1988 Ford V-8 460 engine was the last carbureted engine Ford
made. It had drivability problems.)
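To see why ships and returns counts carry age-at-failure information, consider the simplest case, sketched below under assumptions that are deliberately narrower than the article's method. It counts only first failures (no renewals), uses noise-free hypothetical counts rather than the Table 1 data, and solves the resulting triangular system recursively instead of by least squares or maximum likelihood. The function names are hypothetical.

```python
def expected_returns(ships, failure_pmf):
    """Expected first-failure returns per period.

    Units shipped in period i that fail at age k are returned in period
    i + k, so returns[j] = sum over i <= j of ships[i] * failure_pmf[j - i].
    failure_pmf needs at least len(ships) entries.
    """
    return [
        sum(ships[i] * failure_pmf[j - i] for i in range(j + 1))
        for j in range(len(ships))
    ]


def estimate_failure_pmf(ships, returns):
    """Solve the triangular system above for the failure pmf, one age at a time."""
    pmf = []
    for j in range(len(returns)):
        # Returns in period j already explained by later cohorts at younger ages.
        explained = sum(ships[i] * pmf[j - i] for i in range(1, j + 1))
        pmf.append((returns[j] - explained) / ships[0])
    return pmf


# Hypothetical, noise-free counts (not the Table 1 data).
ships = [1000, 1200, 900, 1100]
true_pmf = [0.10, 0.05, 0.02, 0.01]  # P(first failure at age 0, 1, 2, 3 periods)
returns = expected_returns(ships, true_pmf)
estimate = estimate_failure_pmf(ships, returns)
print([round(p, 4) for p in estimate])
```

With real counts the recursion amplifies noise, especially when the first cohort is small, which is one reason to prefer constrained least squares or maximum likelihood estimators such as those the article cites.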
These estimates have biomedical applications only for epidemics
(hantavirus), new diseases (AIDS), transplants, and other tran-
sient processes, because steady state birth and death counts con-
tain no information about age at failure. Without the linkage
between births and deaths, there is no age-at-failure information,
except during the transient portion of stochastic processes.
Using population estimates from transient infection and death
counts relieves the need for controls; this avoids the ethical
dilemma of killing controls.
Month     Shipments   Repairs
Aug-87          213        18
Sep-87        6,439       797
Oct-87        6,951     1,291
Nov-87        5,715     1,511
Dec-87        5,390     1,791
Figure 4. Monthly Warranty Failure Rate Functions for Age at
First Warranty Failure
Figure 5. Monthly Warranty Repair Rates for Ages Between
Subsequent Repairs
Estimates from ships and returns counts are applicable throughout
industry for field reliability, because many products and most
service parts survive their useful lives. Generally accepted
accounting principles require ships and returns counts for
industrial revenue and service cost accounting, and they're population
data, so they contain no sample uncertainty.
Privacy, important to Congress and the public, can be preserved
by use of birth and death counts for survival analysis. The law
requires that unique identification numbers be issued to all citi-
zens, health care practitioners, health care institutions, employ-
ers, and insurance companies to facilitate linking of event infor-
mation. This leads to public concern over medical privacy. The
NHTSA requires that complaints be filed by vehicle identifica-
tion number, personal identification, and detailed crash informa-
tion. This led to a stalemate between the NHTSA, directed by
Congress in the TREAD act to collect the data, and insurance
companies, which objected to providing private information.
Linkages to identify age at failure are not necessary for reliabil-
ity analysis and some survival analyses.
Conclusions
Biomedical statisticians and reliability engineers can learn much
from each other despite different objectives, and there are some
things for both to learn. Survival analysts have thoroughly plowed
the field of estimation and hypothesis testing from random, censored
samples, helping the fortunate few reliability engineers who
have age-at-failure and survivor data. Other reliability engineers
can make do with analysis of ships and returns counts.
Free Nonparametric Estimates
For free nonparametric estimates of field reliability, send ships
and returns counts to , or enter it in
.
The author will
send back nonparametric estimates of field reliability and failure
rate functions, free of charge.
References
1. Aven, Terje, and Uwe Jensen, "Stochastic Models in
Reliability," Springer, Berlin, 1999.
2. Bagdonavicius, Vilijandas, and Mikhail Nikulin, "Accelerated
Life Models: Modeling and Statistical Analysis," Chapman
and Hall/CRC, Boca Raton, FL, 2002.
3. Beizer, Boris, "Black-Box Testing: Techniques for Functional
Testing of Software and Systems," Wiley, New York, 1995.
4. Cox, D.R., "Regression Models and Life Tables (with
discussion)," J. Roy. Statist. Soc. Ser. B, Vol. 34, pp. 187-220, 1972.
5. George, L.L., "Field Reliability Without Life Data," ASA,
SPES Newsletter, , pp. 13-14, 1999.
6. George, L.L., "Renewal Distribution Estimation Without Renewal
Counts," INFORMS, San Jose, , 2002.
7. George, L.L., "Credible Reliability Prediction," ASQ Reliability
Division monograph, 2003.
8. George, L.L. and Mark Felthauser, "Reliability of Firestone
Tires," ,
2002.
9. Kalbfleisch, John D. and Ross L. Prentice, "The Statistical
Analysis of Failure Time Data," Wiley, Hoboken, New
Jersey, 2002.
10. Klein, John P. and Melvin L. Moeschberger, "Survival
Analysis, Techniques for Censored and Truncated Data,"
Springer-Verlag, New York, 1997.
11. Krivtsov, V.V., D.E. Tananko, and T.P. Davis, "Regression
approach to tire reliability analysis," Rel. Eng. and System
Safety, Vol. 78, pp. 267-273, 2002.
Acknowledgements
Ned Criscimagna suggested that I write this article in retaliation for
my suggesting that he should. Mark Felthauser, a real statistician,
reviewed the article and suggested additions.
Eva Langfeldt,
, did a wonderful job of copyediting.
About the Author
Larry George is a Certified Reliability Engineer and Fellow of
the American Society for Quality. His education includes B.S. in
Engineering, M.B.A., and M.S. and Ph.D. in industrial engineer-
ing and operations research with a minor in probability and sta-
tistics from the University of California at Berkeley. He taught
for 11 years; worked for 11 years at Lawrence Livermore
National Laboratory; and has more than 20 years experience in
industry, including several years for Abbott Laboratories'
Diagnostics Division.
[Figure 4 plot: probability (0 to 0.16) vs. t = age at first warranty failure, months (0 to 12)]
[Figure 5 plot: probability (0 to 0.16) vs. t = age between warranty failures, months (0 to 12)]