
The Journal of the Reliability Analysis Center, First Quarter 2001
lar industry is overrepresented on the committee. Membership
is free, but members must pay any expenses associated with
attending meetings. A Membership Application must be completed,
and applicants must submit a biography (half-page maximum) with
the application. The biography should include qualifications
(degrees, experience, etc.), industry represented, and special
interests (e.g., maintainability, human reliability, etc.).
Membership application forms can be requested from Patricia
Kopp at .
For more information on Z1 and the Dependability Subcommittee,
refer to the US Standards Group web site or contact the Chair
of the Dependability Subcommittee, Ned H. Criscimagna, at
(301) 918-1526, .
New ISO President
The International Organization for Standardization (ISO) has
announced that Mr. Mário Gilberto Cortopassi from Brazil took
office as the organization's new President on January 1, 2001.
Mr. Cortopassi will serve a two-year term.
Mr. Cortopassi is a successful industrialist. His formal training
is as a chemist, and he has gained a wealth of experience in the
textile and synthetic fiber industries. He has been actively
involved in standardization for over 30 years.
Mr. Cortopassi stated in his inaugural message that international
standards are more necessary than ever to facilitate business,
encourage free trade, and foster progress in society. He singled
out standardization, metrology, testing, conformity assessment,
and certification as key instruments in achieving business
success in a global market.
The new President cited ISO's success in responding to
market-driven requirements by modernizing its own processes to
deliver standards in a timely and efficient manner. Mr. Cortopassi
called for even stronger support for ISO from its constituent
members, pointing out that ISO's success greatly contributes to
the efficiency of the global marketplace, which in turn extends
prosperity to all nations.
Statistical Analysis of Reliability Data, Part 1: Random Variables,
Distributions, Parameters, and Data
Introduction
Engineers sometimes have trouble understanding the basis for the
statistical procedures they need when analyzing reliability data.
This is not surprising. In many engineering curricula, the study
of statistics is limited to one or two courses (3 to 4 credit
hours). These courses are usually theoretical, do not address
data analysis, and cover a wide range of statistical techniques.
Moreover, other engineering courses emphasize the physical
(deterministic) rather than the stochastic laws governing the
processes under analysis.
This article is the first of a series written to provide engineers
with a practical understanding of statistical analysis of
reliability data. This article discusses random variables,
statistical distributions and their parameters, and data
collection issues, including the special problem of outlier (or
extreme value) detection and treatment. The second article
addresses parameter estimation and hypothesis testing, emphasizing
goodness-of-fit procedures used to identify and select suitable
distributions from a given set of data. In the third article, the
concepts from the first two articles will be applied to
reliability estimation and assessment problems. The fourth
article discusses data collection and data quality problems.
Statistical Distributions
Statistics deals with the study of phenomena and processes that
(1) yield more than one outcome, and (2) occur in a random
fashion [1, 2, 3, 4, 6]. Results of the random processes under
observation are called random variables (RVs) and are denoted
with a capital letter, say X. Specific outcomes (denoted in lower
case) are called events, and the set of all possible RV outcomes
is called the sampling space. For example, from the process of
rolling two dice and taking their sum, we observe X, the random
variable "sum of both dice." Similarly, from the process of life
testing we observe X, the random variable "life of the device."
In the dice example, the sampling space consists of the integers
2 through 12. An event {X = n} is rolling a given sum, and it
occurs with probability P{X = n} (Figure 1). For the life testing
example, the sampling space consists of all positive values of
time, and an event {X < t} is observing a life of less than t
units (Figure 2).
The graphical pattern of occurrence of such random outcomes
(e.g., Figures 1 and 2) provides an intuitive way to understand
the meaning of the statistical distribution of an RV. The
abscissa of such a graph represents the sampling space of X (all
possible outcomes), and the ordinate represents a value
proportional to the frequency of occurrence of the outcomes.
Such graphs represent the probability density function (pdf) when
the sampling space of X is continuous (Figure 2), or the mass
function when it is discrete (Figure 1). The area under the curve
of the mass/density function is one. The Cumulative Distribution
Function (CDF) of an RV is nondecreasing, has a value between
zero and one, and is defined for both the mass and density
functions as:
F(a) = P{X ≤ a}, where a is any feasible value in the
sampling space of X.
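These definitions can be made concrete with a short Python sketch (illustrative only; the names `pmf` and `cdf` are ours): the mass function of the dice sum is built from the 36 equally likely outcomes, and the CDF simply accumulates it.

```python
from fractions import Fraction
from itertools import product

# Mass function of X = sum of two honest dice: count the ways each
# sum can occur among the 36 equally likely (die 1, die 2) pairs.
pmf = {}
for d1, d2 in product(range(1, 7), repeat=2):
    s = d1 + d2
    pmf[s] = pmf.get(s, Fraction(0)) + Fraction(1, 36)

# CDF: F(a) = P{X <= a}, obtained by accumulating the mass function.
def cdf(a):
    return sum(p for n, p in pmf.items() if n <= a)

print(pmf[7])   # 1/6  (a sum of 7 is the most likely outcome)
print(cdf(3))   # 1/12 (P{X <= 3} = 1/36 + 2/36)
print(cdf(12))  # 1    (the whole sampling space)
```

Note that the total mass is one, matching the area-under-the-curve property stated above.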
By: Jorge Luis Romeu, IIT Research Institute
Hence, all random variables have a distribution, uniquely
described by one or more parameters.
The mass/density functions provide an objective, precise way to
describe the probability mechanism governing the random process
that produces them. For example, contrast the (graphical) flat
pattern from rolling an honest die, where the occurrence of any
of its six sides is equally likely, with that of the sum of two
dice (Figure 1), where a sum of 7 is more likely than a sum of
12, or with the decreasing pattern of the exponential (Figure 2).
Such patterns (distributions) can be numerically described by a
set of fixed numbers called parameters. In the sum-of-two-dice
example, the set (1/36, 2/36, 3/36, ..., 1/36) of frequencies
associated with the possible sums uniquely describes its
distribution (pattern). In the exponential case, the mean
describes it.
Statistics is about investigating those distributions and
parameters. This series of articles addresses both quantitative
and qualitative RVs. Quantitative RVs are numerical and exhibit
the mathematical properties of order and distance. These RVs are
said to have a stronger measurement scale level, which allows the
use of certain statistical methods that are not always
appropriate for qualitative variables [5]. Qualitative RVs (e.g.,
attributes such as pass/fail) are categorical or, at best, can be
ordered.
Statistical distributions can be discrete or continuous,
according to whether their corresponding RV sampling space is
discrete or continuous. The result of rolling a die is an example
of a discrete RV; the life of a device is an example of a
continuous RV. Their corresponding graphical patterns yield step
or continuous mass/density functions. For discrete RVs, the
probabilities of individual outcomes (e.g., rolling a sum of 2,
or observing 3 failures in the field) can be calculated. For
continuous RVs, the probabilities of ranges (e.g., that a device
life is longer than ten hours, or between three and ten hours)
can be calculated. For example, the probability of rolling a sum
of three or less (denoted P{X ≤ 3}) is obtained by adding the
discrete mass function; the probability of observing a life of
less than three hours (denoted P{X < 3}) is obtained by
integrating the continuous pdf. These examples illustrate the
one-to-one relationship between distributions and their
corresponding mass/density functions.
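The two calculations can be sketched side by side in Python (an illustrative aside; the exponential here has mean 10 hours, as in Figure 2, and the variable names are ours): the discrete probability is a finite sum, while the continuous one is an integral with a known closed form.

```python
import math

# Discrete case: P{X <= 3} for the sum of two dice is a finite sum
# over the mass function (sums of 2 and 3).
p_dice = 1/36 + 2/36

# Continuous case: P{X < 3} for an exponential life with mean 10 is
# the integral of the pdf f(t) = (1/10) e^(-t/10) from 0 to 3, which
# has the closed form 1 - e^(-3/10).
rate = 1 / 10
p_life = 1 - math.exp(-rate * 3)

# Cross-check the closed form with a crude numeric integration of the pdf.
dt = 1e-4
p_numeric = sum(rate * math.exp(-rate * i * dt) * dt
                for i in range(int(3 / dt)))

print(round(p_dice, 4))     # 0.0833
print(round(p_life, 4))     # 0.2592
print(round(p_numeric, 4))  # 0.2592
```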
In addition to being discrete or continuous, distributions can be
symmetric or skewed, according to whether their mass/density
functions are or are not symmetric with respect to one point in
their sampling space. Distributions can also be unimodal or
multimodal, or have no mode, according to whether their
mass/density functions have one or more (local) maximums (modes).
The distribution of the RV "sum of two dice" in Figure 1 is an
example of a symmetric, unimodal distribution. Its mean and mode
are both 7, about which the distribution is symmetric. The
exponential distribution, in turn, is skewed to the right and has
no mode (peak).
As may be surmised, the number of statistical distributions that
can arise is infinite, posing a difficult practical problem. To
deal with it, well-known and thoroughly studied families of
statistical distributions have been developed that are easy to
manipulate, fit different patterns, and have a small and easily
interpreted number of parameters. Two examples of discrete
families of distributions (and their respective parameters) are
the Binomial (with parameters n, the number of trials, and p, the
probability of success at any trial) and the Poisson (with rate
of occurrence λ). Two examples of continuous distributions are
the Normal (with mean μ and standard deviation σ) and the
exponential (with mean 1/λ).
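The two discrete families just named have simple closed-form mass functions; the following Python sketch (function names are ours) writes them down and checks that each sums to one over its sampling space, as any mass function must.

```python
import math

# Binomial(n, p): probability of k successes in n independent trials,
# each succeeding with probability p.
def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Poisson(lam): probability of k occurrences when events arrive at
# rate lam.
def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

# Each mass function sums to one over its sampling space.
print(round(sum(binom_pmf(k, 10, 0.3) for k in range(11)), 6))  # 1.0
print(round(sum(poisson_pmf(k, 4.0) for k in range(100)), 6))   # 1.0
```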
Often, the exact distribution of a random process under study is
unknown but can be satisfactorily approximated by one of these
well-known distribution families, by finding suitable
combinations of parameters. If we can live with the difference
between the exact probability of any event and its approximation,
then we can work with the latter as if it were the exact
distribution. Much statistical work is spent in (1) selecting a
well-suited family of distributions, (2) verifying that the
selection is correct, (3) estimating adequate parameters, and (4)
deriving probabilistic results with them.
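These four steps can be sketched for the exponential family (the data below are simulated for illustration, not taken from the article; for the exponential, the maximum-likelihood estimate of the mean is simply the sample mean):

```python
import math
import random

random.seed(42)

# Hypothetical data: device lives drawn from an exponential with a
# true mean of 100 hours (rate 0.01). In practice this would be the
# observed sample.
sample = [random.expovariate(0.01) for _ in range(5000)]

# Steps (1)-(2): the decreasing histogram pattern suggests the
# exponential family. Step (3): estimate its single parameter; for
# the exponential, the MLE of the mean is the sample mean.
mean_hat = sum(sample) / len(sample)

# Step (4): derive a probabilistic result from the fitted model and
# compare it with the raw empirical frequency.
p_model = math.exp(-500 / mean_hat)          # model P{life > 500}
p_empirical = sum(t > 500 for t in sample) / len(sample)
print(round(mean_hat, 1), round(p_model, 4), round(p_empirical, 4))
```

The close agreement between the model probability and the empirical frequency is the kind of check that step (2), verification, formalizes.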
[Figure 1. Dice graphical pattern: the mass function of X, the sum
of two honest dice, where P{X = n} is the probability of the two
dice adding up to a particular value, n:

n:        2      3      4      5      6      7      8      9      10     11     12
P{X = n}: 0.028  0.056  0.083  0.111  0.139  0.167  0.139  0.111  0.083  0.056  0.028]
[Figure 2. Exponential distribution with mean of 10: the pdf
f(t) = λe^(-λt), with λ = 0.1, decreases over t = 0 to 25; the
median (6.93) lies to the left of the mean (1/λ = 10).]
The previous discussion shows that it is important to fully
understand the concepts of RVs, their distributions, and their
corresponding parameters, because they provide an objective and
precise way of describing a random phenomenon under study.
Applying these concepts to a data set provides practical, useful
probabilistic statements on events of interest, such as "what is
the reliability of the device, if its mission time is ten hours?"
Conversely, a pre-specified probability (e.g., Reliability =
0.99) may be required by designers or the procurement office as
the performance measure of a device. Samples of such devices may
then be drawn and tested for compliance with this requirement.
Distribution Parameters
Parameters are population-fixed values that uniquely characterize
and help describe the distribution of an RV (e.g., λ in the
exponential distribution). Parameters allow the graphing of the
RV's specific mass/density function (outcome patterns). The
location, dispersion, shape, scale, and threshold parameters, all
of which are widely used in reliability applications, are
discussed below.
Location parameters answer the question "Where is the
distribution?" A particularly useful subset of the location
parameters is the three measures of central tendency: mean,
median, and mode. The mean is the outcome located at the center
of gravity of the mass/density function graph. The median is the
outcome such that half the population scores below (or above) it.
The mode is the value where the mass/density function peaks (the
most frequent outcome). The mean and median are unique, but
multiple modes may coexist (in a multimodal distribution). If a
distribution is symmetric and unimodal (e.g., Normal), then the
mean, median, and mode coincide. If it is skewed (e.g.,
exponential), they will differ.
If a distribution is skewed (non-symmetric), then one tail is
longer than the other, and the mean is less informative than the
median and mode. For example, the mean of the RV "household
income" may have little meaning if the population consists of a
few billionaires and millions of landless peasants (it provides
little information about the situation). In such a case, (1) the
median income level is such that half the population income lies
above it and half below, and (2) the modal income level is the
most frequent one, around which there is some population
clustering. These two parameters provide more useful and
meaningful information about the population income. In addition,
if we add (or subtract) a few billionaires to the population, the
mean will be affected, whereas the mode and median will be much
more resilient to such changes. Such resilience is referred to as
the robustness of a parameter and is considered a good quality.
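The household-income example can be made concrete with a small Python sketch (the income figures below are invented for illustration):

```python
from statistics import mean, median

# A skewed "household income" population: many modest incomes plus a
# handful of huge ones (hypothetical values, in dollars).
incomes = [20_000] * 95 + [2_000_000_000] * 5

print(round(mean(incomes)))  # 100019000 -- dominated by the billionaires
print(median(incomes))       # 20000.0   -- unaffected by the extremes

# Adding a few more extreme values barely moves the median.
incomes += [5_000_000_000] * 3
print(median(incomes))       # 20000     -- still the typical income
```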
Other location parameters of interest are the quartiles and the
percentiles. A percentile is an outcome value within the sampling
space of the RV such that a given percent of the population
scores a result less than or equal to that outcome. For example,
by definition the median is the fiftieth percentile (because 50%
of the population scores less than or equal to it). Other
important percentiles are the lower (1st) and upper (3rd)
quartiles, the values at or below which 25% and 75% of the
population, respectively, fall. Between the 1st and 3rd quartiles
lies the half of the population closest to the center (median).
The Characteristic Life of the Weibull distribution is an example
of a percentile (the 63.2nd) with a well-known engineering
interpretation.
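The reason the Characteristic Life is always the 63.2nd percentile is that at t equal to the characteristic life η, the Weibull CDF F(t) = 1 - exp(-(t/η)^β) reduces to 1 - e^(-1) ≈ 0.632 regardless of the shape β. A quick Python check (the function name is ours):

```python
import math

# Two-parameter Weibull CDF with shape beta and characteristic life
# eta: F(t) = 1 - exp(-(t/eta)**beta). At t = eta the shape cancels,
# so the characteristic life is always the 63.2nd percentile.
def weibull_cdf(t, beta, eta):
    return 1 - math.exp(-((t / eta) ** beta))

for beta in (0.5, 1.0, 2.0, 3.5):
    print(round(weibull_cdf(200.0, beta, 200.0), 4))  # 0.6321 every time
```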
Dispersion parameters answer the question "How does the random
process vary about some location parameter?" Some well-known
dispersion parameters are the variance, the range, and the
Interquartile Range (IQR). The standard deviation is the square
root of the variance. In a Normal distribution, the standard
deviation yields the distance from the mean to the abscissa of
the inflection point of the density function. The range is the
difference between the maximum and minimum outcomes. The IQR is
the difference between the upper and lower quartiles.
Dispersion parameters are used to characterize or compare
population variability, and in statistics variability is always
associated with risk. If, for example, the means of two positive
RVs are the same, their variances can be compared directly. But
if the means differ, then an indirect dispersion parameter, such
as the Coefficient of Variation (the ratio of the standard
deviation to the mean, for a positive RV), is used. Also, as
distributions depart from symmetry, the IQR becomes more useful
than the variance, for the same reasons that the median and mode
are more useful than the mean.
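Both measures can be sketched with the standard library (the two samples below are hypothetical; `b` is simply `a` scaled by 5, so their CVs agree while their raw spreads differ):

```python
from statistics import mean, stdev, quantiles

# Two hypothetical positive samples with different means.
a = [98, 100, 102, 101, 99, 100, 103, 97]
b = [490, 500, 510, 505, 495, 500, 515, 485]   # = a scaled by 5

def cv(data):
    # Coefficient of Variation: relative dispersion for a positive RV.
    return stdev(data) / mean(data)

def iqr(data):
    # Interquartile Range from the three quartile cut points
    # (statistics.quantiles defaults to the "exclusive" method).
    q1, _, q3 = quantiles(data, n=4)
    return q3 - q1

print(round(cv(a), 4), round(cv(b), 4))  # 0.02 0.02 -- same relative spread
print(iqr(a), iqr(b))                    # 3.5 17.5  -- raw spreads differ
```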
By varying the shape and scale parameters, a specific family of
distributions can describe a specific population (i.e., by
obtaining a good fit or approximation to the exact RV
distribution). A Weibull, for example, can approximate a Normal,
or exactly describe the exponential, by adjusting its shape
parameter. Other useful parameters include the threshold
parameter, which provides a lower bound for the RV's range of
possible values. The Weibull [4] is a good example of such a
three-parameter distribution. It is also worth noting that, in
most distributions, the mean and variance are not themselves
density function parameters (as they are in the Normal) but are
obtained as functions of the shape and scale. Finally, Skewness
and Kurtosis are two parameters that describe a distribution's
degree of asymmetry and peakedness. Parameters help us visualize
the outcome patterns of an RV and thus better understand them.
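For the two-parameter Weibull, for instance, the mean and variance follow from the shape β and scale η through the gamma function (a standard result, sketched below; the function names are ours):

```python
import math

# For a two-parameter Weibull with shape beta and scale eta:
#   mean     = eta * Gamma(1 + 1/beta)
#   variance = eta**2 * (Gamma(1 + 2/beta) - Gamma(1 + 1/beta)**2)
def weibull_mean(beta, eta):
    return eta * math.gamma(1 + 1 / beta)

def weibull_var(beta, eta):
    g1 = math.gamma(1 + 1 / beta)
    g2 = math.gamma(1 + 2 / beta)
    return eta**2 * (g2 - g1**2)

# With shape 1 the Weibull reduces to the exponential with mean eta.
print(weibull_mean(1.0, 100.0))            # 100.0
print(weibull_var(1.0, 100.0))             # 10000.0
print(round(weibull_mean(2.0, 100.0), 2))  # the mean shifts with the shape
```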
Extreme Values or Outliers
Data analysis begins with identifying a suitable family of
distributions, and its corresponding set of parameters, that
accurately characterizes the random phenomenon under study. We
can then analyze the distribution's behavior, especially in the
tails, where the real action takes place. For it is in the
distribution tails that distinctive behavior really occurs, a
fact particularly important in hypothesis testing. Hypothesis
testing allows us to ascertain whether an unusually high or low
observation has a reasonable probability of occurrence, or
whether such an unusual observation constitutes a rare event
under the current model assumptions, signaling a possible anomaly
(e.g., that some assumptions made are wrong).
An outlier, or rare event, is defined as an observation (in the
tails of the RV range) that occurs with a very small probability.
It is incorrect to believe that an outlier is always an erroneous
observation or that it should be automatically removed from the
sample. In the dice example, the sum 12 occurs with probability
1/36 = 0.028, but it may occur on any trial with that
probability. We may perform the dice experiment three times in a
row and roll three sums of 12 (an event that occurs with
probability 2.14 x 10^-5, very small but not zero). As another
example, if the life of a device is exponentially distributed
with a mean (i.e., 1/λ) of 100 hours, we may observe one device
that lasts more than 500 hours, although the probability of such
an event is only 0.0067. These outlying events seldom occur, but
they can, and sometimes do! They may provide grounds for us to
believe either (1) that the dice are loaded or that the actual
mean life of the device is more than 100 hours, or (2) that we
have been extremely lucky or unlucky and have observed a rare
event. The occurrence of low-probability results raises a red
flag but does not prove foul play. What statistics provides is a
useful and scientific context in which to analyze them.
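Both probabilities are quick to verify (a short Python check):

```python
import math

# Three sums of 12 in a row: each roll has probability 1/36, and the
# rolls are independent, so the probabilities multiply.
p_three_twelves = (1 / 36) ** 3
print(f"{p_three_twelves:.2e}")  # 2.14e-05

# Exponential life with mean 100 hours: P{life > 500} = e^(-500/100).
p_long_life = math.exp(-500 / 100)
print(round(p_long_life, 4))     # 0.0067
```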
For example, in a particular life test we may observe that a
large number of otherwise acceptable devices fail. We observe
that in all previous life tests (say 99) of the same device, we
did not observe such a high number of failures. Such a result is
a rare event (it occurs once in 100 times), and we may be tempted
to automatically discard it as an anomaly and assume the
information provided is erroneous. But we may well be discarding
very useful information. It may happen that, say, an unusual
combination of humidity, temperature, and pressure, one that
occurs only once in 100 times, greatly affects the failure
mechanism of the device. And it may be that the life test in
which we observed such a large number of failures was conducted
precisely under those unusual conditions. If, instead of
discarding these unusual test results as outliers, we submit them
to further lab and statistical analyses, we may be able to
discover the real reasons behind them.
On the other hand, rare events and outlying observations often
result from clerical errors or other unrelated circumstances. In
such cases the unusual observation no longer represents the
population under analysis, and only then is it proper to remove
it from the data set.
Data Collection
We have been discussing observations of events: data points
obtained by gathering information from the population of interest
or under study. Such data constitute the lifeblood of statistical
analysis. Hence, the next few paragraphs focus on the important
subject of data collection.
We collect a sample of data from a population because we want to
study the population but do not have the time or means to examine
it in its entirety. Yet we want our data analysis results to be
valid for the entire population, not just the sample. To extend
our analysis results from the sample to the population (called
extrapolation in statistical terms), the sample must meet several
criteria.
The sample must be representative of the population. Hence, the
sample must be randomly drawn from the entire population of
interest and sample elements must be independent. A draw is
random when every element has the same probability of being
selected. Two draws are independent if one result does not, in
any way, affect the other.
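A simple random sample of this kind can be sketched with the standard library's `random.sample`, which draws without replacement and gives every element the same chance of selection (the population below is simulated for illustration):

```python
import random

random.seed(7)

# A hypothetical population of 10,000 device lifetimes (hours).
population = [random.expovariate(1 / 100) for _ in range(10_000)]

# random.sample draws without replacement: every element has the same
# probability of selection, and one draw does not affect another.
sample = random.sample(population, k=50)

# A representative sample's mean should land near the population mean.
pop_mean = sum(population) / len(population)
sam_mean = sum(sample) / len(sample)
print(round(pop_mean, 1), round(sam_mean, 1))
```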
Finally, data collection is expensive and time consuming. On the
one hand, we strive to get as much data (information) as we can
afford: the more information we obtain (the larger the sample),
the smaller the margin of error and the more precise the
estimates. On the other hand, time and budget constraints force
us to work with samples much smaller than we might desire. Good
statistics helps us extract as much information as possible from
these samples, or define the optimal sample size to meet our
requirements.
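The trade-off between sample size and precision can be illustrated by simulation (all values below are simulated; the spread of the estimated mean shrinks roughly like 1/sqrt(n)):

```python
import random

random.seed(1)

# Spread (standard deviation) of the sample-mean estimate for an
# exponential population with true mean 100, at several sample sizes.
def estimate_spread(n, reps=2000):
    estimates = [sum(random.expovariate(1 / 100) for _ in range(n)) / n
                 for _ in range(reps)]
    m = sum(estimates) / reps
    return (sum((e - m) ** 2 for e in estimates) / reps) ** 0.5

# Quadrupling the sample size roughly halves the spread (~100/sqrt(n)).
spreads = {n: estimate_spread(n) for n in (10, 40, 160)}
for n, s in spreads.items():
    print(n, round(s, 1))
```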
Conclusions and Summary
Statistical analysis is more than just the mechanical application
of a set of fixed procedures and equations. In fact, many
statistical procedures and equations result from systematizing
the process of scientific experimentation, developed under
certain statistical assumptions and conditions. If those
underlying assumptions and conditions (e.g., normality,
independence, homogeneity of variances) are not met, then the
results obtained from the statistical procedures are not valid,
or will have a different statistical interpretation (i.e.,
different probabilities of occurrence).
This article, and those that follow in the series, provide
additional insight into the statistical thinking process. By
applying statistical thinking to their analyses, engineers will
improve their use of statistics as a reliability analysis tool
and will extract greater benefits from their data analysis work.
Bibliography
1. Mann, N., R.E. Schafer, and N. Singpurwalla, Methods for
Statistical Analysis of Reliability and Life Data, Wiley, New
York, 1974.
2. Reliability Analysis Center, Reliability Toolkit: Commercial
Practices Edition, Rome, NY, 1994.
3. Rohatgi, V.K., An Introduction to Probability Theory and
Mathematical Statistics, Wiley, New York, 1976.
4. Romeu, J.L. and C. Grethlein, Statistical Analysis of Material
Property Data, AMPTIAC, Rome, NY, 2000.
5. Romeu, J.L. and S. Gloss-Soler, "Some Measurement Problems
Detected in the Analysis of Software Productivity Data and Their
Statistical Consequences," Proceedings of the 1983 IEEE COMPSAC
Conference, pp. 17-24.
6. Ross, S.M., Introduction to Probability and Statistics for
Engineers and Scientists, Wiley, New York, 1987.