Practical Methods to Verify Normal Assumptions
In this section we discuss several empirical and practical methods for assessing the validity of two important and widely used distributions: the Normal and Lognormal. We illustrate this validation process via the life test data in Table 3. This sample (n = 45) was taken from the Normal (20, 7.6) process that generated Figure 1, presented in Section 2.
Table 3. Large Sample Life Data Set (sorted)
6.1448
6.6921
6.7158
7.7342
9.6818
12.3317
12.5535
13.0973
13.6704
14.0077
14.7975
15.3237
15.5832
15.7808
15.7851
16.2981
16.3317
16.8147
16.8860
17.5166
17.5449
17.9186
18.5573
18.8098
19.2541
19.5172
19.7322
21.9602
23.2046
23.2625
23.7064
23.9296
24.8702
25.2669
26.1908
26.9989
27.4122
27.7297
28.0116
28.2206
28.5598
29.5209
30.0080
31.2306
32.5446
In our data set, two distribution assumptions need to be verified or assessed: (1) that the data are independent and (2) that they are identically distributed as a Normal.
The assumption of independence implies that randomization (sampling) of the population of devices (and other influencing factors) must be performed before placing them on test. For example, device operators, times of operations, weather conditions, location of the devices in warehouses, etc. should be randomly selected so they become representative of these same characteristics and of the contexts in which devices will normally operate.
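As a minimal sketch of the randomization step described above, the following Python snippet draws a simple random sample of devices before they are placed on test. The population size (500), the device identifiers, and the fixed seed are illustrative assumptions and do not come from the study.

```python
import random

# Hypothetical population of device IDs (size chosen only for illustration).
population = [f"device-{i:03d}" for i in range(1, 501)]

# Fixed seed so the draw can be reproduced (an assumption for this sketch).
rng = random.Random(2024)

# Simple random sample without replacement: every device is equally likely
# to be selected. The list comes back in random selection order, which can
# also serve as the order in which units are placed on test.
test_units = rng.sample(population, k=45)

print(test_units[:5])
```

The same idea extends to the other influencing factors (operators, time slots, storage locations): each can be assigned to the sampled units by an independent random draw.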
To assess the Normality of the data, we use informal methods, based on the properties of the Normal distribution. They seem appropriate for the practical engineer, since they are largely intuitive and easy to implement.
To assess data, we must first obtain their descriptive statistics (Table 4). Then, we analyze and plot the raw data in several ways, to check (empirically but efficiently) if the Normality assumption holds.
There are a number of useful, easy-to-implement procedures, based on well-known statistical properties of the Normal distribution, that help us informally assess this assumption. These properties are summarized in Table 5.
Table 4. Descriptive Statistics of Data in Table 3
Statistic    Normal Sample
N            45
Mean         19.50
Median       18.56
Std. Dev.    7.05
Minimum      6.14
Maximum      32.54
Q1           15.06
Q3           25.73
Here, the Mean is the average of the data and the Standard Deviation is the square root of the sample variance:
$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
where $\bar{x}$ is the sample mean.
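The descriptive statistics can be recomputed from the Table 3 values with Python's standard library. This is a sketch, not the software used for the original analysis. It assumes that the quartiles follow the (n + 1)p position rule, which is the default "exclusive" method of `statistics.quantiles`.

```python
import statistics

# Sorted life data from Table 3 (n = 45).
data = [
    6.1448, 6.6921, 6.7158, 7.7342, 9.6818, 12.3317, 12.5535, 13.0973, 13.6704,
    14.0077, 14.7975, 15.3237, 15.5832, 15.7808, 15.7851, 16.2981, 16.3317, 16.8147,
    16.8860, 17.5166, 17.5449, 17.9186, 18.5573, 18.8098, 19.2541, 19.5172, 19.7322,
    21.9602, 23.2046, 23.2625, 23.7064, 23.9296, 24.8702, 25.2669, 26.1908, 26.9989,
    27.4122, 27.7297, 28.0116, 28.2206, 28.5598, 29.5209, 30.0080, 31.2306, 32.5446,
]

n = len(data)
mean = statistics.mean(data)
median = statistics.median(data)
std_dev = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points

print(f"N = {n}, Mean = {mean:.2f}, Median = {median:.2f}, Std. Dev. = {std_dev:.2f}")
print(f"Minimum = {min(data):.2f}, Maximum = {max(data):.2f}, Q1 = {q1:.2f}, Q3 = {q3:.2f}")
```

Under that quartile assumption, the printed values should agree with Table 4 to two decimals.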
Table 5. Some Properties of the Normal Distribution
1. Mean, median, and mode coincide; hence, the corresponding sample values should also be close.
2. Graphs should suggest that the distribution is symmetric about the mean.
3. About 68% of the data should be within one standard deviation of the mean.
4. About 95% of the data should be within two standard deviations of the mean.
5. About 0.3% of the data should be beyond three standard deviations of the mean.
6. Plots of the Normal probability and Normal scores should be close to linear.
7. Regressions of these probability and score plots should yield a unit slope.
First, from the descriptive statistics in Table 4, we observe that the sample Mean (19.5) and Median (18.56) are close relative to the Standard Deviation (7.05). This supports the Normality of the distribution by Property No. 1 in Table 5.
The distribution looks symmetric about mean = 19.5, as suggested by the following Box Plot (plot of minimum, Q1, median, Q3, and maximum). Observe how the central 50% of the data (between Q1 = 15.06 and Q3 = 25.73) is dispersed about the mean.
The histogram (Figure 3) also suggests some symmetry about Mode = 18 (the center of the interval with the highest frequency in Figure 3). By Property No. 2 in Table 5, all of this suggests the validity of the Normal distribution.
Figure 3. Histogram of the Normal Data Set (Mode is 18)
The interval defined by one standard deviation about the mean, with the sample mean and standard deviation standing in for μ and σ, is:
(μ - σ, μ + σ) = (19.5 - 7.05, 19.5 + 7.05) = (12.45, 26.55). It includes 29 values (ranks 7 to 35 of the sorted Table 3), representing 64% of the total data set (close to the expected 68.27%). The interval (μ - 2σ, μ + 2σ) = (5.4, 33.6) includes the values in ranks 1 to 45 (i.e., all data), representing 100% of the data set (compared with the expected 95%). There are no values beyond μ ± 3σ, consistent with the expectation that only about 0.3% of the values (well under one point in a sample of 45) fall outside the interval (μ - 3σ, μ + 3σ). All these results support Properties 3 to 5 of Table 5.
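The interval counts above can be checked with a short script. This sketch uses the unrounded sample mean and standard deviation, and compares the observed fractions with the theoretical Normal probabilities P(|Z| < k) = 2Φ(k) - 1. The variable names are illustrative.

```python
from statistics import NormalDist, mean, stdev

# Sorted life data from Table 3 (n = 45).
data = [
    6.1448, 6.6921, 6.7158, 7.7342, 9.6818, 12.3317, 12.5535, 13.0973, 13.6704,
    14.0077, 14.7975, 15.3237, 15.5832, 15.7808, 15.7851, 16.2981, 16.3317, 16.8147,
    16.8860, 17.5166, 17.5449, 17.9186, 18.5573, 18.8098, 19.2541, 19.5172, 19.7322,
    21.9602, 23.2046, 23.2625, 23.7064, 23.9296, 24.8702, 25.2669, 26.1908, 26.9989,
    27.4122, 27.7297, 28.0116, 28.2206, 28.5598, 29.5209, 30.0080, 31.2306, 32.5446,
]

m, s = mean(data), stdev(data)
std_normal = NormalDist()  # standard Normal: mean 0, standard deviation 1

within = {}  # k -> number of observations inside (m - k*s, m + k*s)
for k in (1, 2, 3):
    low, high = m - k * s, m + k * s
    count = sum(low < x < high for x in data)
    expected = 2 * std_normal.cdf(k) - 1  # P(|Z| < k) under Normality
    within[k] = count
    print(f"k={k}: ({low:.2f}, {high:.2f}) holds {count}/{len(data)} "
          f"= {count / len(data):.1%}; Normal theory: {expected:.2%}")
```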
In the Probability Plot, the Normal Probability (P_I) is plotted vs. I/(n + 1), where I is the data sequence order, i.e., I = 1, ..., 45. Each P_I is obtained by calculating the Normal probability of the corresponding failure data X_I, using the sample mean (19.5) and the standard deviation (7.05). For example, for the first (I = 1) sorted (smallest) data point, 6.14:
$$P_1 = P(X \le 6.14) = \Phi\left(\frac{6.14 - 19.5}{7.05}\right) = \Phi(-1.895) \approx 0.029$$
This probability is then plotted against the corresponding I/(n + 1) value, 1/46 = 0.0217, and so on for all sorted sample elements I = 1, ..., 45.
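The plotting coordinates described above can be generated as follows. This is a sketch that uses `statistics.NormalDist` with the unrounded sample mean and standard deviation. The names `norm_rank` and `norm_prob` are chosen here to echo the regression labels and are otherwise arbitrary.

```python
from statistics import NormalDist, mean, stdev

# Sorted life data from Table 3 (n = 45).
data = [
    6.1448, 6.6921, 6.7158, 7.7342, 9.6818, 12.3317, 12.5535, 13.0973, 13.6704,
    14.0077, 14.7975, 15.3237, 15.5832, 15.7808, 15.7851, 16.2981, 16.3317, 16.8147,
    16.8860, 17.5166, 17.5449, 17.9186, 18.5573, 18.8098, 19.2541, 19.5172, 19.7322,
    21.9602, 23.2046, 23.2625, 23.7064, 23.9296, 24.8702, 25.2669, 26.1908, 26.9989,
    27.4122, 27.7297, 28.0116, 28.2206, 28.5598, 29.5209, 30.0080, 31.2306, 32.5446,
]

n = len(data)
fitted = NormalDist(mean(data), stdev(data))  # Normal with the sample mean and std. dev.

# Horizontal axis: plotting positions I/(n + 1) for I = 1, ..., n.
norm_rank = [i / (n + 1) for i in range(1, n + 1)]
# Vertical axis: fitted Normal probability of each sorted observation.
norm_prob = [fitted.cdf(x) for x in sorted(data)]

for rank, prob in list(zip(norm_rank, norm_prob))[:3]:
    print(f"I/(n+1) = {rank:.4f}  P_I = {prob:.4f}")
```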
When the population is Normal, the Probability Plot (Figure 4) follows an upward linear trend, with unit slope. Hence, the linear regression of the Normal Probability on the Data Rank should reflect this one-to-one relation by yielding a slope close to one:
NormProb = -0.0228 + 1.01 NormRank
Predictor    Coef        Std. Dev.    T        P
Constant     -0.02282    0.01192      -1.91    0.062
NormRank     1.00783     0.02076      48.54    0.000
S = 0.03933    R-Sq = 98.2%    R-Sq(adj) = 98.2%
The regression Index of Fit (R² = 98.2%) is very high (close to 100%). Also, the P-value (0.000) for the T-test of the NormRank regression coefficient (T = 48.54) is very small, thus suggesting a linear trend. The regression coefficient (slope) itself (1.00783) is close to one, suggesting the Normal as the statistical distribution of the data. A slope close to one serves as the litmus test of this graphical approach to assessing Normality.
Figure 4. Plot of Normal Probability (P_I) vs. I/(n + 1); I = 1, ..., n; Close to Linear, as Expected from a Normal
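The regression of Normal Probability on Data Rank can be sketched with a hand-written least-squares fit. The helper `fit_line` is a hypothetical name introduced for this illustration. The printed coefficients depend on whether rounded or unrounded sample statistics are used, so they need not match the quoted output to every digit.

```python
from statistics import NormalDist, mean, stdev

# Sorted life data from Table 3 (n = 45).
data = [
    6.1448, 6.6921, 6.7158, 7.7342, 9.6818, 12.3317, 12.5535, 13.0973, 13.6704,
    14.0077, 14.7975, 15.3237, 15.5832, 15.7808, 15.7851, 16.2981, 16.3317, 16.8147,
    16.8860, 17.5166, 17.5449, 17.9186, 18.5573, 18.8098, 19.2541, 19.5172, 19.7322,
    21.9602, 23.2046, 23.2625, 23.7064, 23.9296, 24.8702, 25.2669, 26.1908, 26.9989,
    27.4122, 27.7297, 28.0116, 28.2206, 28.5598, 29.5209, 30.0080, 31.2306, 32.5446,
]


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, R squared)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope, sxy ** 2 / (sxx * syy)


n = len(data)
fitted = NormalDist(mean(data), stdev(data))
norm_rank = [i / (n + 1) for i in range(1, n + 1)]   # predictor
norm_prob = [fitted.cdf(x) for x in sorted(data)]    # response

intercept, slope, r_squared = fit_line(norm_rank, norm_prob)
print(f"NormProb = {intercept:.4f} + {slope:.4f} NormRank, R-Sq = {r_squared:.1%}")
```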
The Normal scores X_I are the percentiles corresponding to the values I/(n + 1), for I = 1, ..., n, calculated under the Normal distribution (using mean = 19.5, std-dev = 7.05). For our example, the first I/(n + 1) is 1/46 = 0.0217, which corresponds to the smallest data point (6.14):
$$P(X \le X_1) = \Phi\left(\frac{X_1 - 19.5}{7.05}\right) = 0.0217 \quad\Rightarrow\quad \frac{X_1 - 19.5}{7.05} = -2.02$$
Solving the above equation for the score, we get the first (I = 1) Normal score:
$$X_1 = -2.02 \times 7.05 + 19.5 = -14.24 + 19.5 = 5.26$$
These Normal scores are then plotted vs. their corresponding sorted data values (Figure 5). In the above example, the score 5.26 is plotted against 6.14 (the smallest data point), and so on for I = 1, ..., n. When the data come from a Normal distribution, the Normal Scores plot is close to a straight line (Property 6).
Figure 5. Plot of the Normal Scores vs. the Sorted Real Data Values, Close to Linear
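The Normal scores can be obtained directly from the inverse Normal distribution function. The sketch below assumes the rounded parameters quoted in the text (mean 19.5, standard deviation 7.05) and uses `NormalDist.inv_cdf` from the standard library. The name `norm_score` is illustrative.

```python
from statistics import NormalDist

n = 45                            # sample size of Table 3
fitted = NormalDist(19.5, 7.05)   # rounded sample mean and std. dev. from the text

# Normal score for rank I: the percentile of the fitted Normal at I/(n + 1).
norm_score = [fitted.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]

print(f"first score  X_1  = {norm_score[0]:.2f}")
print(f"median score X_23 = {norm_score[22]:.2f}")
```

Each score is then paired with the sorted observation of the same rank to produce the points of the Normal scores plot.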
We then regress the Normal Scores on the corresponding data. If the data come from the Normal distribution, the regression should yield a slope close to one:
NormScore = 0.487 + 1.00 NormSamp
Predictor    Coef       Std. Dev.    T        P
Constant     0.4872     0.4554       1.07     0.291
NormSamp     1.00042    0.02199      45.50    0.000
S = 1.028    R-Sq = 98.0%    R-Sq(adj) = 97.9%
An Index of Fit of R² = 98.0% and a regression coefficient of 1.00042, together with the Normal Probability and Normal Scores plots, suggest that the assumption of a Normal distribution is reasonable.
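A comparable regression of the Normal scores on the sorted data can be sketched as follows. The helper `fit_line` is a hypothetical name, and the scores are computed from the unrounded sample mean and standard deviation with I/(n + 1) plotting positions. The text does not state every detail of how the quoted output was produced, so the coefficients printed here need not reproduce it. The point of the sketch is the procedure and the near-linear fit.

```python
from statistics import NormalDist, mean, stdev

# Sorted life data from Table 3 (n = 45).
data = [
    6.1448, 6.6921, 6.7158, 7.7342, 9.6818, 12.3317, 12.5535, 13.0973, 13.6704,
    14.0077, 14.7975, 15.3237, 15.5832, 15.7808, 15.7851, 16.2981, 16.3317, 16.8147,
    16.8860, 17.5166, 17.5449, 17.9186, 18.5573, 18.8098, 19.2541, 19.5172, 19.7322,
    21.9602, 23.2046, 23.2625, 23.7064, 23.9296, 24.8702, 25.2669, 26.1908, 26.9989,
    27.4122, 27.7297, 28.0116, 28.2206, 28.5598, 29.5209, 30.0080, 31.2306, 32.5446,
]


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, R squared)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope, sxy ** 2 / (sxx * syy)


n = len(data)
fitted = NormalDist(mean(data), stdev(data))
norm_samp = sorted(data)                                              # predictor
norm_score = [fitted.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]  # response

intercept, slope, r_squared = fit_line(norm_samp, norm_score)
print(f"NormScore = {intercept:.3f} + {slope:.3f} NormSamp, R-Sq = {r_squared:.1%}")
```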