Practical Methods to Verify Normal Assumptions
In this section we discuss several empirical and practical methods for assessing the validity of two important and widely used distributions: the Normal and the Lognormal. We illustrate this validation process with the life test data in Table 3. This sample (n = 45) was taken from the Normal (20, 7.6) process that generated Figure 1, presented in Section 2.
Table 3. Large Sample Life Data Set (sorted)
In our data set, two distribution assumptions need to be verified or assessed: (1) that the data are independent and (2) that they are identically distributed as a Normal.
The assumption of independence implies that randomization (sampling) of the population of devices (and other influencing factors) must be performed before placing them on test. For example, device operators, times of operations, weather conditions, location of the devices in warehouses, etc. should be randomly selected so they become representative of these same characteristics and of the contexts in which devices will normally operate.
To assess the Normality of the data, we use informal methods, based on the properties of the Normal distribution. They seem appropriate for the practical engineer, since they are largely intuitive and easy to implement.
To assess data, we must first obtain their descriptive statistics (Table 4). Then, we analyze and plot the raw data in several ways, to check (empirically but efficiently) if the Normality assumption holds.
There are a number of useful and easy to implement procedures, based on well-known statistical properties of the Normal distribution, which help us to informally assess this assumption. These properties are summarized in Table 5.
Table 4. Descriptive Statistics of Data in Table 3
Where Mean is the average of the data, \bar{x}, and the Standard Deviation S is the square root of:
\[ S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
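As a minimal sketch of how these descriptive statistics can be computed, the following Python fragment evaluates the mean, median, and standard deviation (with the n - 1 divisor). The `sample` values are hypothetical and are used only for illustration; they are not the data of Table 3.

```python
import statistics

# Hypothetical life data for illustration only; these are NOT the Table 3 values.
sample = [6.2, 9.8, 12.5, 14.1, 15.3, 17.0, 18.4, 19.9, 21.6, 23.2, 25.7, 28.4, 31.0]

n = len(sample)
mean = statistics.mean(sample)

# Sample variance with the n - 1 divisor, written out to mirror the formula.
s2 = sum((x - mean) ** 2 for x in sample) / (n - 1)
std_dev = s2 ** 0.5

median = statistics.median(sample)

print(f"n = {n}, mean = {mean:.2f}, median = {median:.2f}, std dev = {std_dev:.2f}")
```

The library function `statistics.stdev` uses the same n - 1 divisor, so it can replace the explicit sum once the formula is understood.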
Table 5. Some Properties of the Normal Distribution
1. Mean, median, and mode coincide; hence, sample values should also be close.
2. Graphs should suggest that the distribution is symmetric about the mean.
3. About 68% of the data should be within one standard deviation of the mean.
4. About 95% of the data should be within two standard deviations of the mean.
5. Less than 1% of the data (about 0.3%) should be beyond three standard deviations of the mean.
6. Plots of the Normal probability and Normal scores should be close to linear.
7. Regressions of these probability and score plots should yield a unit slope.
First, from the descriptive statistics in Table 4, we observe that the sample Mean (19.5) and Median (18.56) are close, relative to a Standard Deviation of 7.05. This supports the Normality of the distribution by Property No. 1 in Table 5.
The distribution looks symmetric about the mean (19.5), as suggested by the Box Plot (a plot of the minimum, Q1, median, Q3, and maximum). Observe how the central 50% of the data (between Q1 = 15.06 and Q3 = 25.73) is dispersed about the mean.
The histogram (Figure 3) also suggests some symmetry about Mode = 18 (the center of the interval with the highest frequency in Figure 3). By Property No. 2 in Table 5, all of this supports the validity of the Normal distribution.
Figure 3. Histogram of the Normal Data Set (Mode is 18)
The interval defined by one standard deviation about the mean, (μ - σ, μ + σ) = (19.5 - 7.05, 19.5 + 7.05) = (12.45, 26.55), includes 28 values (ranks 7 to 34 of the sorted Table 3), representing 62% of the data set (close to the expected 68.27%). The interval (μ - 2σ, μ + 2σ) = (5.4, 33.6) includes the values in ranks 1 to 45 (i.e., all the data), representing 100% of the data set (compared with the expected 95%). No values fall beyond μ ± 3σ, which agrees with the expectation that well under 1% of the values (less than one point in a sample of 45) lie outside the interval (μ - 3σ, μ + 3σ). All these results support Properties 3 to 5 of Table 5.
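The coverage check described above can be sketched as follows. The function names `observed_coverage` and `expected_coverage` and the `sample` values are assumptions introduced for illustration; the data are hypothetical, not those of Table 3.

```python
from statistics import NormalDist, mean, stdev

def observed_coverage(data, k):
    """Fraction of the data within k sample standard deviations of the sample mean."""
    m, s = mean(data), stdev(data)
    low, high = m - k * s, m + k * s
    return sum(low <= x <= high for x in data) / len(data)

def expected_coverage(k):
    """Probability that a Normal variable falls within k standard deviations of its mean."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return std_normal.cdf(k) - std_normal.cdf(-k)

# Hypothetical life data for illustration only; these are NOT the Table 3 values.
sample = [6.2, 9.8, 12.5, 14.1, 15.3, 17.0, 18.4, 19.9, 21.6, 23.2, 25.7, 28.4, 31.0]

for k in (1, 2, 3):
    print(f"k = {k}: observed {observed_coverage(sample, k):.1%}, "
          f"expected {expected_coverage(k):.2%}")
```

With small samples the observed fractions can differ noticeably from the expected ones, so this comparison is an informal check rather than a formal test.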
In the Probability Plot, the Normal Probability (P_I) is plotted vs. I/(n + 1), where I is the rank of the data point in the sorted sample, i.e., I = 1, ..., 45. Each P_I is obtained by calculating the Normal probability of the corresponding failure time X_I, using the sample mean (19.5) and the standard deviation (7.05). For example, the first (I = 1) sorted (smallest) data point is 6.15, and with \Phi denoting the standard Normal cumulative distribution function:
\[ P_1 = P(X \le 6.15) = \Phi\left(\frac{6.15 - 19.5}{7.05}\right) = \Phi(-1.89) \approx 0.029 \]
This probability is then plotted against the corresponding I/(n + 1) value, 1/46 = 0.0217, and so on for all sorted sample elements I = 1, ..., 45.
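The construction of the Probability Plot coordinates can be sketched in Python as shown next. The helper name `probability_plot_points` and the `sample` values are assumptions for illustration (the data are hypothetical); only the mean 19.5, the standard deviation 7.05, and the smallest data point 6.15 come from the text.

```python
from statistics import NormalDist, mean, stdev

def probability_plot_points(data, mu, sigma):
    """Return (I/(n + 1), P_I) pairs for the sorted data under a Normal(mu, sigma)."""
    fitted = NormalDist(mu, sigma)
    ordered = sorted(data)
    n = len(ordered)
    return [(i / (n + 1), fitted.cdf(x)) for i, x in enumerate(ordered, start=1)]

# Worked value from the text: smallest data point 6.15, mean 19.5, std dev 7.05.
p1 = NormalDist(19.5, 7.05).cdf(6.15)
print(f"P_1 = {p1:.4f}, plotted against 1/46 = {1 / 46:.4f}")

# Hypothetical life data for illustration only; these are NOT the Table 3 values.
sample = [6.2, 9.8, 12.5, 14.1, 15.3, 17.0, 18.4, 19.9, 21.6, 23.2, 25.7, 28.4, 31.0]
points = probability_plot_points(sample, mean(sample), stdev(sample))
for rank_value, prob in points:
    print(f"{rank_value:.4f}  {prob:.4f}")
```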
When the population is Normal, the Probability Plot (Figure 4) follows an upward linear trend with unit slope. Hence, the linear regression of the Normal Probability on the Data Rank, I/(n + 1), should also reflect this one-to-one relation by yielding a slope close to one:
NormProb = -0.0228 + 1.01 NormRank
S = 0.03933
R-Sq = 98.2%
R-Sq(adj) = 98.2%
The regression Index of Fit (R-Sq = 98.2%) is very high (close to 100%). Also, the P-value (0.0) of the T-Test (48.54) for the NormRank regression coefficient is very small, which supports a linear trend. The regression coefficient (slope) itself (1.00783) is close to one, suggesting that the Normal is a reasonable distribution for the data. A regression slope close to one serves as the litmus test of this graphical approach to assessing Normality.
Figure 4. Plot of Normal Probability (P_I) vs. I/(n + 1); I = 1, ..., n; Close to Linear, as Expected from a Normal
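The regression of the Normal Probability on the Data Rank can be sketched with an ordinary least squares fit. The helper `fit_line` and the `sample` values are assumptions for illustration (the data are hypothetical, not those of Table 3), and the sketch reports only the intercept, slope, and R-Sq, not the T-Test or P-value of the regression output shown above.

```python
from statistics import NormalDist, mean, stdev

def fit_line(x, y):
    """Least squares fit y = intercept + slope * x; returns (intercept, slope, r_squared)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared

# Hypothetical life data for illustration only; these are NOT the Table 3 values.
sample = sorted([6.2, 9.8, 12.5, 14.1, 15.3, 17.0, 18.4, 19.9, 21.6, 23.2, 25.7, 28.4, 31.0])
n = len(sample)
fitted = NormalDist(mean(sample), stdev(sample))

norm_rank = [i / (n + 1) for i in range(1, n + 1)]  # I/(n + 1)
norm_prob = [fitted.cdf(x) for x in sample]         # P_I

intercept, slope, r_squared = fit_line(norm_rank, norm_prob)
print(f"NormProb = {intercept:.4f} + {slope:.4f} NormRank, R-Sq = {r_squared:.1%}")
```

A slope near one and an R-Sq near 100% are what the text treats as informal evidence of Normality; how near is near enough remains a judgment call in this graphical approach.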
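The Normal scores X_I are the percentiles corresponding to the values I/(n + 1), for I = 1, ..., n, calculated under the Normal distribution (using mean = 19.5, std-dev = 7.05). For our example, the first I/(n + 1) is 1/46 = 0.0217 and the smallest data point is 6.15. With \Phi denoting the standard Normal cumulative distribution function:
\[ P(X \le X_1) = \Phi\left(\frac{X_1 - 19.5}{7.05}\right) = 0.0217 \quad\Rightarrow\quad \frac{X_1 - 19.5}{7.05} = -2.02 \]
Solving the above equation for the score X_1, we get the first (I = 1) Normal score:
\[ X_1 = -2.02 \times 7.05 + 19.5 = -14.24 + 19.5 = 5.26 \]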
These Normal scores are then plotted vs. their corresponding sorted data values (Figure 5). In the above example, score 5.26 is plotted against 6.15 (the smallest data point) and so on, for I = 1, ..., n. When the data come from a Normal Distribution, the Normal Scores plot is close to a straight line (Property 6).
Figure 5. Plot of the Normal Scores vs. the Sorted Real Data Values, Close to Linear
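The Normal scores themselves can be obtained by inverting the Normal distribution at each I/(n + 1). The sketch below uses the sample size (45), mean (19.5), and standard deviation (7.05) given in the text; the function name `normal_scores` is an assumption for illustration.

```python
from statistics import NormalDist

def normal_scores(n, mu, sigma):
    """Percentiles of a Normal(mu, sigma) at the probabilities I/(n + 1), I = 1, ..., n."""
    fitted = NormalDist(mu, sigma)
    return [fitted.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]

scores = normal_scores(45, 19.5, 7.05)

# First score: percentile at 1/46 = 0.0217. The text rounds the standard
# Normal percentile to -2.02, so small differences in the second decimal
# place are to be expected.
print(f"first score = {scores[0]:.2f}, last score = {scores[-1]:.2f}")
```

Each score would then be paired with the sorted data value of the same rank, so that the first score is plotted against the smallest data point (6.15 in the text).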
We regress the Normal Scores on the corresponding data. If the data come from the Normal distribution, the regression should yield a unit slope:
NormScore = 0.487 + 1.00 NormSamp
S = 1.028
R-Sq = 98.0%
R-Sq(adj) = 97.9%
An Index of Fit R-Sq = 98.0% (97.9% adjusted) and a regression coefficient of 1.0042, together with the Normal Probability and Normal Scores plots, suggest that the assumption of a Normal distribution is reasonable.