NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.
![Page 1: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/1.jpg)
NUMERICAL ANALYSIS OF BIOLOGICAL AND
ENVIRONMENTAL DATA
Lecture 2. Exploratory Data
Analysis
![Page 2: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/2.jpg)
EXPLORATORY DATA ANALYSIS
Types of variables
Simple diagrams
Summary statistics
(i) Location
(ii) Dispersion
(iii) Skewness and kurtosis
Transformations
Density estimation
Graphical display
(i) Univariate data
(ii) Bivariate and multivariate data
Outliers
Leverage and influence
Software
![Page 3: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/3.jpg)
TYPES OF VARIABLES
1) discrete e.g. counts
2) continuous e.g. pH, elevation
Both are random variables or variates, with random variation.
TABULAR PRESENTATION
Raw data
Frequency tables

| Value | Range | Frequency | Cumulative frequency | % CF |
|---|---|---|---|---|
| 0 | 0 - 0.99 | 3 | 3 | 2 |
| 1 | 1 - 1.99 | 8 | 11 | 6 |
| 2 | 2 - 2.99 | 3 | 14 | 11 |
| ... | ... | ... | ... | ... |
![Page 4: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/4.jpg)
SIMPLE DIAGRAMS
Dot diagram Line diagram or profile
Histogram
Frequency graph or cumulative frequency graph
n/10 bins
CONTINUOUS VARIABLES
DISCRETE VARIABLES
DISCRETE OR CONTINUOUS VARIABLES
![Page 5: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/5.jpg)
HISTOGRAM BIN WIDTH Wand (1997) Amer. Statistician 51, 59-64
Histograms of the British Incomes Data based on (a) the bin width $\hat{h}_2$, (b) the bin width $\hat{h}_0$, and (c) the S-PLUS default bin width.
(a) Optimal solution
$\hat{h}_2 = \left[\dfrac{6}{-\hat{\psi}_2(g_{21})\,n}\right]^{1/3}$
where $g_{21}$ is band-width parameter, $\hat{\psi}_2$ is “normal scale” estimator
Solution of $\hat{\psi}_2$ and $g_{21}$ is iterative, to optimise a function: MEAN INTEGRATED SQUARED ERROR
(b) $\hat{h}_0 = 3.49\,\hat{\sigma}\,n^{-1/3}$
where $\hat{\sigma}$ = standard deviation, n = sample size
(c) S-PLUS DEFAULT
$\hat{h} = \dfrac{\text{range of data}}{\log_2 n + 1}$
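The two closed-form bin widths on this slide can be tried out with a short sketch. It assumes the formulas read as h0 = 3.49 × sd × n^(-1/3) and h = range / (log2 n + 1); the function names and the sample values are hypothetical and are not the British Incomes Data. The iterative rule for ĥ2 is not attempted here.

```python
import math
import statistics


def normal_scale_bin_width(values):
    """h0 = 3.49 * sd * n^(-1/3), with sd the sample standard deviation."""
    n = len(values)
    return 3.49 * statistics.stdev(values) * n ** (-1 / 3)


def log2_rule_bin_width(values):
    """Range of the data divided by (log2 n + 1)."""
    n = len(values)
    return (max(values) - min(values)) / (math.log2(n) + 1)


# Made-up values for illustration only.
incomes = [12, 15, 18, 21, 22, 25, 31, 40]
print(normal_scale_bin_width(incomes))
print(log2_rule_bin_width(incomes))
```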
![Page 6: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/6.jpg)
Histogram Bin Width
In R, a good option for the number of histogram bins is given by the Freedman-Diaconis rule, which is:
$\left\lceil \dfrac{n^{1/3}\,(\max - \min)}{2\,(Q_3 - Q_1)} \right\rceil$
where n is the number of observations, max-min is the range of the data, and Q3-Q1 is the inter-quartile range. The brackets represent the ceiling, which means that you round up to the next integer, thereby avoiding 4.2 bins!
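A minimal sketch of the Freedman-Diaconis calculation, assuming the rule reads ceiling of n^(1/3) × (max - min) / (2 × (Q3 - Q1)). The function name and data are hypothetical, and the quartiles use linear interpolation between order statistics, which is one of several possible quartile definitions.

```python
import math
import statistics


def fd_number_of_bins(values):
    """Number of histogram bins from the Freedman-Diaconis rule."""
    n = len(values)
    # Quartiles by linear interpolation (an assumption; other definitions exist).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("inter-quartile range is zero; the rule is undefined")
    data_range = max(values) - min(values)
    # Round up to the next integer so that a fractional number of bins is avoided.
    return math.ceil(n ** (1 / 3) * data_range / (2 * iqr))


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # made-up data
print(fd_number_of_bins(sample))
```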
![Page 7: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/7.jpg)
Exploratory Data Analysis
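1. Summary Statistics
(A) Measures of location ‘typical value’
(1) Arithmetic mean: $\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$
(2) Weighted mean: $\bar{x}_w = \sum_{i=1}^{n} w_i x_i \Big/ \sum_{i=1}^{n} w_i$
(3) Mode: ‘most frequent’ value
(4) Median: ‘middle value’. Robust statistic
(5) Trimmed mean: 1 or 2 extreme observations at both tails deleted
(6) Geometric mean: $GM = \sqrt[n]{x_1 x_2 x_3 \cdots x_n}$
$\log GM = \dfrac{1}{n}\sum_{i=1}^{n} \log x_i$, so $GM = \text{antilog}\left(\dfrac{1}{n}\sum_{i=1}^{n} \log x_i\right)$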
R
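The measures of location listed above can be compared on one small data-set. This is a sketch only: the helper names and the data are made up, and the trimmed mean here deletes k observations from each tail.

```python
import math
import statistics


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, k=1):
    """Mean after deleting the k smallest and k largest observations."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def geometric_mean(values):
    """Antilog of the mean of the logs; all values must be positive."""
    return math.exp(sum(math.log(x) for x in values) / len(values))


x = [2, 4, 4, 8, 64]               # made-up data with one extreme value
print(statistics.mean(x))          # arithmetic mean, pulled up by 64
print(statistics.median(x))        # robust to the extreme value
print(statistics.mode(x))
print(trimmed_mean(x, k=1))
print(geometric_mean(x))
print(weighted_mean([1, 3], [3, 1]))
```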
![Page 8: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/8.jpg)
(B) Measures of dispersion
A 13.99 14.15 14.28 13.93 14.30 14.13
B 14.12 14.1 14.15 14.11 14.17 14.17
B smaller scatter than A: ‘better precision’
Precision: random error, scatter (replicates)
Accuracy: systematic bias
(1) Range: A = 0.37, B = 0.07
(2) Interquartile range ‘percentiles’: the quartiles Q1, Q2 and Q3 divide the ordered data into four groups, each holding 25% of the values
(3) Mean absolute deviation (mean absolute difference): $\dfrac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$ (ignore negative signs)
Example: x = 1, 5, 8, 2; $\bar{x}$ = 4; $|x - \bar{x}|$ = 3, 1, 4, 2; $\sum |x - \bar{x}|$ = 10; 10/n = 2.5
![Page 9: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/9.jpg)
(B) Measures of dispersion (cont.)
(4) Variance and standard deviation
Variance = mean of squares of deviations from mean: $s^2 = \dfrac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
Standard deviation = root mean square value: $SD = s = \sqrt{s^2}$
(5) Coefficient of variation: $CV = \dfrac{s}{\bar{x}} \times 100$ (SD divided by mean)
Relative standard deviation; percentage relative SD (independent of units)
(6) Standard error of mean: $SEM = \sqrt{s^2 / n}$
R
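The dispersion measures (1) to (6) can be checked against the replicate sets A and B given on the earlier slide. The sketch below assumes the variance uses the divisor n - 1 and that quartiles are found by linear interpolation; the function names are hypothetical.

```python
import math
import statistics

A = [13.99, 14.15, 14.28, 13.93, 14.30, 14.13]
B = [14.12, 14.1, 14.15, 14.11, 14.17, 14.17]


def data_range(values):
    return max(values) - min(values)


def mean_absolute_deviation(values):
    """Mean of |x - mean|, i.e. deviations with negative signs ignored."""
    m = statistics.mean(values)
    return sum(abs(x - m) for x in values) / len(values)


def dispersion_summary(values):
    n = len(values)
    mean = statistics.mean(values)
    s2 = statistics.variance(values)            # divisor n - 1 (assumption)
    s = math.sqrt(s2)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "range": data_range(values),
        "iqr": q3 - q1,
        "mad": mean_absolute_deviation(values),
        "variance": s2,
        "sd": s,
        "cv_percent": 100 * s / mean,
        "sem": math.sqrt(s2 / n),
    }


for name, values in (("A", A), ("B", B)):
    print(name, dispersion_summary(values))
```

Every measure should come out smaller for B than for A, which is the 'better precision' noted on the slide.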
![Page 10: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/10.jpg)
(C) Measures of skewness and kurtosis
“moment statistics”
Central moment = $\dfrac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^r$
r = 1 deviation from mean = 0
r = 2 variance
g1 skewness measure, g2 kurtosis measure
Skewness - measure of how one tail of curve is drawn out
Kurtosis - measure of peakedness of curve
g1 skewness (r = 3): $g_1 = \dfrac{1}{n s^3}\sum_{i=1}^{n} (x_i - \bar{x})^3$ [third central moment divided by sd³]
g2 kurtosis (r = 4): $g_2 = \dfrac{1}{n s^4}\sum_{i=1}^{n} (x_i - \bar{x})^4 - 3$
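A small sketch of the moment statistics, assuming g1 is the third central moment divided by sd³ and g2 is the fourth central moment divided by sd⁴, minus 3. Here sd is taken as the square root of the second central moment (divisor n), which is an assumption; statistical packages often apply small-sample corrections that are not reproduced.

```python
def central_moment(values, r):
    """(1/n) * sum of (x - mean)^r."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** r for x in values) / n


def skewness_g1(values):
    s = central_moment(values, 2) ** 0.5
    return central_moment(values, 3) / s ** 3


def kurtosis_g2(values):
    s = central_moment(values, 2) ** 0.5
    return central_moment(values, 4) / s ** 4 - 3


symmetric = [1, 2, 3, 4, 5]        # made-up data
right_skewed = [1, 1, 1, 10]       # made-up data with a long right tail
print(skewness_g1(symmetric), kurtosis_g2(symmetric))
print(skewness_g1(right_skewed))
```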
![Page 11: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/11.jpg)
Skewness and kurtosis
negative g1: skewness to left
positive g1: skewness to right
negative g2: platykurtosis (flatter peak, lighter tails)
positive g2: leptokurtosis (taller peak, heavier tails)
![Page 12: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/12.jpg)
DATA TRANSFORMATIONS
(1) Comparability
(2) Better fit to model
Comparability
Data centring - deviations from mean: $x_i^* = x_i - \bar{x}$
Data standardisation - zero mean, unit variance: $x_i^* = (x_i - \bar{x}) / sd$
Division by range: $x_i^* = x_i / \text{range}$
Better fit
Normal distribution ($\bar{x}$, sd)
mean ± 1 sd = ca. 68% of values
mean ± 2 sd = ca. 95% of values
Often find data skewed to right (positive g1): log-normal distribution
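The comparability transformations can be sketched as below, assuming centring is x - mean, standardisation is (x - mean) / sd and the third option is x / range. Function names and the pH values are made up, and sd is the sample standard deviation (divisor n - 1).

```python
import statistics


def centre(values):
    """Deviations from the mean: x* = x - mean."""
    mean = statistics.mean(values)
    return [x - mean for x in values]


def standardise(values):
    """Zero mean and unit variance: x* = (x - mean) / sd."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)      # divisor n - 1 (assumption)
    return [(x - mean) / sd for x in values]


def divide_by_range(values):
    """x* = x / range."""
    data_range = max(values) - min(values)
    return [x / data_range for x in values]


ph = [4.5, 5.0, 5.5, 6.0, 7.0]         # made-up pH values
z = standardise(ph)
print(z)
```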
![Page 13: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/13.jpg)
LOG-NORMAL DISTRIBUTION PROPERTIES
Geometric mean = median of log-normal distribution
Antilog of the mean of log values = geometric mean
SD of log values ≈ CV of original values if SD is small
If SD is larger, $CV = [\exp(S^2) - 1]^{0.5}$, where S is the SD of the natural-log values
How to decide whether to log transform?
(1) Look at histograms. Right skewed (positive g1): log transform
(2) If sd > mean, or maximum value of variable > 20 times the smallest value: log xi or log (xi + 1)
(3) Improves normality
(4) Gives less weight to ‘dominants’. VARIANCE STABILISING
(5) Reflects linear response of many species to log of chemical variables, i.e. log response over certain ranges.
(6) In regression need normally distributed random errors; a log transformation can help to achieve this.
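Rule (2) and the CV relationship can be written down directly. This is a sketch under the assumption that the rule reads 'sd > mean, or maximum > 20 times the smallest value' and that CV = sqrt(exp(S²) - 1) for natural logs; the function names and test values are hypothetical.

```python
import math
import statistics


def suggest_log_transform(values):
    """Rule of thumb: sd > mean, or maximum > 20 times the smallest value."""
    sd = statistics.stdev(values)
    mean = statistics.mean(values)
    return sd > mean or max(values) > 20 * min(values)


def log_transform(values, c=1.0):
    """log10(x + c); c = 1 keeps zeros defined, c = 0 gives plain log x."""
    return [math.log10(x + c) for x in values]


def cv_from_log_sd(s):
    """CV of the original values from the SD (s) of their natural logs."""
    return math.sqrt(math.exp(s ** 2) - 1)


conductivity = [1, 2, 3, 50]           # made-up, strongly right-skewed values
print(suggest_log_transform(conductivity))
print(log_transform(conductivity))
print(cv_from_log_sd(0.1))             # close to 0.1 when the SD is small
```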
![Page 14: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/14.jpg)
NORMAL AND LOG-NORMAL DISTRIBUTIONS

| | Normal | Log-Normal |
|---|---|---|
| Effects | Additive | Multiplicative |
| Shape | Symmetric | Skewed |
| Mean | $\bar{x}$, arithmetic | $\bar{x}^*$, geometric |
| Standard deviation | s, additive | s*, multiplicative |
| Measure of dispersion | cv = s/$\bar{x}$ | s* |
| Confidence interval 68.3% | $\bar{x}$ ± s | $\bar{x}^*$ ×/ s* |
| 95.5% | $\bar{x}$ ± 2s | $\bar{x}^*$ ×/ (s*)² |
| 99.7% | $\bar{x}$ ± 3s | $\bar{x}^*$ ×/ (s*)³ |

×/ = times / divide (cf ± plus / minus); cv = coefficient of variation
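The 'times / divide' intervals can be illustrated with a short sketch. It assumes the geometric mean is the antilog of the mean of the natural logs and that s* is the antilog of the SD of the natural logs (divisor n - 1); the names and data are made up.

```python
import math
import statistics


def lognormal_summary(values):
    """Return (geometric mean, multiplicative standard deviation s*)."""
    logs = [math.log(x) for x in values]
    geometric_mean = math.exp(statistics.mean(logs))
    s_star = math.exp(statistics.stdev(logs))
    return geometric_mean, s_star


def multiplicative_interval(geometric_mean, s_star, k=1):
    """Interval from mean / (s*)^k to mean * (s*)^k; k = 1, 2, 3 as in the table."""
    return geometric_mean / s_star ** k, geometric_mean * s_star ** k


concentrations = [1, 10, 100]          # made-up values
gm, s_star = lognormal_summary(concentrations)
print(gm, s_star)
print(multiplicative_interval(gm, s_star, k=1))
```

Unlike mean ± s, the interval is asymmetric on the original scale and can never fall below zero.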
![Page 15: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/15.jpg)
METHODS FOR DESCRIBING LOG-NORMAL DISTRIBUTIONS
Graphical methods
Frequency plots, histograms, box plots
Parameters
Logarithm of x
Mean
Median
Standard deviation
Variance
Skewness and kurtosis of x
Problems
What logarithm base to use?
Parameters are not on the scale of the original data
Appear to be very common in the real world
Limpert, E, et al. 2001 BioScience 51 (5), 342-352
![Page 16: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/16.jpg)
DATA TRANSFORMATIONS
(1) Environmental variables
Skewed to right - log-normal distribution
If SD > mean or maximum value of x > 20 times the smallest, use log (x + c) transformation where c is a constant, usually 1.
(2) Biological data
- Stabilise variances
- Dampen effects of very abundant taxa
Choices
- No transformation
- Square root
- Log (y + 1)
- % data: square root
- Counts: log (y + 1)
![Page 17: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/17.jpg)
Other transformations:
(1) square root $\sqrt{x}$
(2) cubic root $\sqrt[3]{x}$
(3) fourth root $\sqrt[4]{x}$
(4) log2: log2 (x + 1)
(5) logp: logp (x + 1)
(6) Box-Cox transformation - most appropriate value for exponent λ
$x^* = \dfrac{x^{\lambda} - 1}{\lambda}$ where λ ≠ 0
$x^* = \log x$ where λ = 0
If λ = 1 no transformation
λ = 0.5 square root
λ = -1 reciprocal transformation
λ = 0 log transformation
If x = 0.0, add 0.5 or 1.0 as constant
Can also solve for best estimate of constant to add
Can calculate confidence limits for λ.
If these include 1, no need for a transformation!
TRANSFOR
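A minimal sketch of the Box-Cox family, assuming x* = (x^λ - 1) / λ for λ ≠ 0 and x* = log x for λ = 0. Estimating the best λ and its confidence limits is not attempted; the function name is hypothetical.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of a single positive value x with exponent lam."""
    if x <= 0:
        raise ValueError("x must be positive; add a constant such as 0.5 or 1 first")
    if lam == 0:
        return math.log(x)             # limiting case as lam approaches 0
    return (x ** lam - 1) / lam


for lam in (1, 0.5, 0, -1):
    print(lam, [round(box_cox(x, lam), 3) for x in (1, 4, 9)])
```

With λ = 1 the values are only shifted by 1, so the shape of the distribution is unchanged, which is why λ = 1 counts as 'no transformation'.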
![Page 18: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/18.jpg)
DENSITY ESTIMATION
A useful alternative to histograms is non-parametric density estimation which results in a smoothing of the histogram.
The kernel-density estimate at the value x of a variable X is given by
$\hat{f}(x) = \dfrac{1}{nb}\sum_{j=1}^{n} K\left(\dfrac{x - x_j}{b}\right)$
where xj are the n observations of X, K is a kernel function (such as the normal density), and b is a bandwidth parameter influencing the amount of smoothing. Small bandwidths produce rough density estimates, whereas large bandwidths produce smoother estimates.
Note that the histogram has been scaled to the density estimates, not the raw frequencies.
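The kernel-density estimate can be sketched in a few lines, assuming the estimator is (1 / (n b)) × sum of K((x - x_j) / b) with K the standard normal density. The function names, data and bandwidths are made up; choosing b automatically is not covered.

```python
import math


def gaussian_kernel(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)


def kernel_density(x, observations, b):
    """Kernel-density estimate at x with bandwidth b."""
    n = len(observations)
    return sum(gaussian_kernel((x - xj) / b) for xj in observations) / (n * b)


obs = [0.0, 1.0, 2.0]                   # made-up observations
grid = [-1 + 0.5 * i for i in range(9)]
rough = [kernel_density(x, obs, b=0.2) for x in grid]    # small bandwidth
smooth = [kernel_density(x, obs, b=1.0) for x in grid]   # large bandwidth
print(rough)
print(smooth)
```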
![Page 19: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/19.jpg)
Multiple approaches
1. Histogram with density scaling (areas of histogram bars sum to 1)
2. Density estimation (default) (thick line)
3. Density estimation (half the default bandwidth) (thin line)
4. One-dimensional scatter-plot ("rugplot") to show distribution of observations at the bottom
Fox, 2002
![Page 20: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/20.jpg)
QUANTILE-QUANTILE PLOTS
Quantile-quantile (Q-Q) plots are useful tools for determining if data are normally distributed. They show the relationship between the distribution of a variable and a reference or theoretical distribution.
Q-Q plot shows the relationship between the ordered data and the corresponding quantiles of the reference (in our case, normal) distribution.
If the data are normally distributed, they should plot on a straight line through the 1st and 3rd quartiles. If there is a break in slope of the plotted points, the data deviate from the reference distribution.
Note that quantiles are divisions of a frequency or probability distribution into equal, ordered subgroups (e.g. quartiles (4 parts) or percentiles (100 parts)).
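The pairs plotted in a normal Q-Q plot can be computed as below. This is a sketch: the plotting positions (i - 0.5) / n are an assumption, since several conventions exist, and the function name and data are hypothetical.

```python
from statistics import NormalDist


def normal_qq_points(values):
    """Pairs (theoretical normal quantile, ordered observation)."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()           # mean 0, sd 1
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


data = [5.1, 4.8, 6.3, 5.5, 5.0]        # made-up measurements
for q, x in normal_qq_points(data):
    print(round(q, 3), x)
```

If the data are close to normal, these pairs fall near a straight line when plotted.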
![Page 21: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/21.jpg)
EXPLORATORY DATA ANALYSIS
GRAPHICAL DISPLAY
Univariate data
J.W. Tukey
(1) Stem-and-leaf displays
Data: 55 62 73 78 79 78 81
STEM | LEAF
5 | 5
6 | 2
7 | 3 8 8 9
8 | 1
[Figure: “back-to-back” stem-and-leaf display]
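A stem-and-leaf display for the seven values on this slide can be produced with a short sketch. It assumes non-negative integers with the tens digit as stem and the units digit as leaf; the function name is hypothetical, and stems with no leaves are simply omitted.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (tens) to its sorted leaves (units)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return {stem: leaves for stem, leaves in sorted(stems.items())}


data = [55, 62, 73, 78, 79, 78, 81]
for stem, leaves in stem_and_leaf(data).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```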
![Page 22: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/22.jpg)
(2) Box-and-whisker plots - box plots
95% CI around median: Median ± 1.58 (Q3 - Q1) / (n)½, where Q1 and Q3 are the lower and upper quartiles
(3) Hanging histograms
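The numbers behind a box plot (item 2 above) can be sketched as follows. The interval around the median is assumed to be median ± 1.58 × (Q3 - Q1) / √n, as in notched box plots; quartiles use linear interpolation, and the function name and data are made up.

```python
import math
import statistics


def box_plot_summary(values):
    """Five-number summary plus an approximate 95% interval for the median."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    half_width = 1.58 * (q3 - q1) / math.sqrt(len(values))
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "median_ci": (median - half_width, median + half_width),
    }


wing_lengths = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # made-up values
print(box_plot_summary(wing_lengths))
```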
![Page 23: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/23.jpg)
![Page 24: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/24.jpg)
Variations of box plots
McGill et al. Amer. Stat. 32, 12-16
![Page 25: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/25.jpg)
![Page 26: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/26.jpg)
Useful to label extreme points
Fox, 2002
![Page 27: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/27.jpg)
Box plots for samples of more than ten wing lengths of adult male red-winged blackbirds taken in winter at 12 localities in the southern United States, and in order of generally increasing latitude. From James et al. (1984a). Box plots give the median, the range, and upper and lower quartiles of the data.
![Page 28: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/28.jpg)
Useful to apply several EDA tools to the same data.
![Page 29: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/29.jpg)
Bivariate and multivariate data
Simple scatter plot
[Figure: scatter plot of x2 against x1]
![Page 30: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/30.jpg)
![Page 31: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/31.jpg)
SCATTERPLOT MATRIX. The data are measurements of ozone, solar radiation, temperature, and wind speed on 111 days. Thus the measurements are 111 points in a four-dimensional space. The graphical method in this figure is a scatterplot matrix: all pairwise scatterplots of the variables are aligned into a matrix with shared scales.
![Page 32: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/32.jpg)
Triangular arrangement of all pairwise scatter plots for four variables. Variables describe length and width of sepals and petals for 150 iris plants, comprising 3 species of 50 plants.
Three-dimensional perspective view for the first three variables of the iris data. Plants of the three species are coded A,B and C.
![Page 33: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/33.jpg)
Can explore a scatter-plot by adding box-plots for each variable, a simple linear regression line, and a smoother (LOWESS – see Lecture 5), and by labelling particular points.
Fox, 2002
![Page 34: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/34.jpg)
Categorical variables can be encoded in a plot by using different symbols or colours for each category (e.g. type of occupation) and smoothers fitted for each category.
bc = blue collar, prof = professional, wc = white collar
Fox, 2002
![Page 35: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/35.jpg)
Jittering scatter-plots
Discrete quantitative variables usually result in uninformative scatter-plots (e.g. education (years) and vocabulary (score on 0-10 scale)).
Only 21 distinct education values and 11 scores, so only 21 x 11 = 231 plotting positions.
Jittering data adds a small random quantity to each value to try to separate over-plotted points. Can vary the amount of jittering and also plot a smoother.
Fox, 2002
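Jittering can be sketched as adding uniform random noise to each value before plotting. The amount of jitter, the seed and the function name are assumptions for illustration; the jittered values are for display only and should not replace the data in any analysis.

```python
import random


def jitter(values, amount=0.3, seed=None):
    """Add uniform noise in [-amount, +amount] to each value."""
    rng = random.Random(seed)           # seeded so that a plot can be reproduced
    return [v + rng.uniform(-amount, amount) for v in values]


education = [12, 12, 12, 16, 16, 20]    # made-up years of education
vocabulary = [5, 5, 5, 7, 7, 9]         # made-up scores on a 0-10 scale
x_plot = jitter(education, amount=0.3, seed=1)
y_plot = jitter(vocabulary, amount=0.3, seed=2)
print(list(zip(x_plot, y_plot)))
```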
![Page 36: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/36.jpg)
Bivariate density estimation and scatter-plots
Large data-sets and weak relationships between variables.
Improve plot by jittering and making symbols smaller and apply bivariate kernel-density estimate plus regression line and LOWESS smoother.
Fox, 2002
![Page 37: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/37.jpg)
coal-fired power station
oil-fired power station
Diagonal = density estimate for each variable
![Page 38: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/38.jpg)
The Bagplot: A Bivariate Boxplot
Peter J. Rousseeuw
The American Statistician November 1999, Vol. 53, No. 4, 382
Car weight and engine displacement of 60 cars.
![Page 39: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/39.jpg)
Part (a) shows the concentrations of cholesterol and triglycerides in the plasma of 320 patients. In part (b) logarithms are taken of both variables.
Part (a) shows the altitudinal range and abundance of butterflies. In part (b) the logarithm of the abundance is plotted.
![Page 40: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/40.jpg)
Bagplot matrix of the three-dimensional aquifer data with 85 data points.
![Page 41: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/41.jpg)
Conditioning plots (Co-plots)
Focus on relationship between response and a predictor variable, holding other predictors constant at particular values – conditionally fixing the values of other predictors. 'Statistical control'
Co-plots provide graphical statistical control.
Focus on particular predictor and set each other predictor to a relatively narrow range (if quantitative) or to a specific value (if categorical). Subranges for a quantitative predictor are typically set to overlap (called "shingles") rather than to partition data into disjoint subsets ("bins").
![Page 42: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/42.jpg)
For each combination of values of the conditioning predictors, construct scatter-plot to show response to the local predictor and arrange the plots in an array.
Can condition on more than one predictor (e.g. age, gender).
Six overlapping age classes, two genders (male upper, female lower), LOWESS, and linear fits
Fox, 2002
![Page 43: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/43.jpg)
EDA and Data-Transformations
Try to linearise non-linear relationships by trial-and-error.
Mosteller & Tukey's 'bulging rule':
when the bulge points down, transform y down the ladder of powers and roots;
when the bulge points up, transform y up;
when the bulge points left, transform x down;
when the bulge points right, transform x up.
Fox, 2002
![Page 44: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/44.jpg)
Infant mortality rate and GDP per capita for 193 countries
Bulge points down and to the left: try transforming down the ladder of powers and roots
Log transformation linearises the relationship and makes the variables more symmetric
Fox, 2002
![Page 45: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/45.jpg)
Profiles, Stars, Glyphs, Faces, and Boxes of Percentages of Republican Votes in Six Presidential Elections in Six Southern States. The circles in the Stars Are Drawn at 50%. The Assignment of Variables to Facial Features in the Faces is: 1932 – Shape of Face; 1936 – Length of nose; 1940 – Curvature of Mouth; 1960 – Width of Mouth; 1964 – Slant of Eyes; 1968 – Length of Eyebrows
Simple multivariate data
![Page 46: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/46.jpg)
Three types of shape for representing multivariate data. In these examples glyph, stars and faces represent five, six and twelve (!) variables respectively.
Frequency of the six commonest species on the Park Grass plots using star displays.
![Page 47: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/47.jpg)
Polygon plots
Labelled polygon plot
![Page 48: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/48.jpg)
Chernoff faces
CHERNOFF
![Page 49: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/49.jpg)
American city crime data

| City | Murder/Manslaughter | Rape | Robbery | Assault | Burglary | Larceny | Auto theft |
|---|---|---|---|---|---|---|---|
| Atlanta | 16.5 | 24.8 | 106 | 147 | 1112 | 905 | 494 |
| Boston | 4.2 | 13.3 | 122 | 90 | 982 | 669 | 954 |
| Chicago | 11.6 | 24.7 | 340 | 242 | 808 | 609 | 645 |
| Dallas | 18.1 | 34.2 | 184 | 293 | 1668 | 901 | 602 |
| Denver | 6.9 | 41.5 | 173 | 191 | 1534 | 1368 | 780 |
| Detroit | 13 | 35.7 | 477 | 220 | 1566 | 1183 | 788 |
| Hartford | 2.5 | 8.8 | 68 | 103 | 1017 | 724 | 468 |
| Honolulu | 3.6 | 12.7 | 42 | 28 | 1457 | 1102 | 637 |
| Houston | 16.8 | 26.6 | 289 | 186 | 1509 | 787 | 697 |
| Kansas City | 10.8 | 43.2 | 255 | 226 | 1494 | 955 | 765 |
| Los Angeles | 9.7 | 51.8 | 286 | 355 | 1902 | 1386 | 862 |
| New Orleans | 10.3 | 39.7 | 266 | 283 | 1056 | 1036 | 776 |
| New York | 9.4 | 19.4 | 522 | 267 | 1674 | 1392 | 848 |
| Portland | 5 | 23 | 157 | 144 | 1530 | 1281 | 488 |
| Tucson | 5.1 | 22.9 | 85 | 148 | 1206 | 756 | 483 |
| Washington | 1.5 | 27.6 | 524 | 217 | 1494 | 1003 | 739 |
![Page 50: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/50.jpg)
Faces representation of city crime data
1. Atlanta
2. Boston
3. Chicago
4. Dallas
5. Denver
6. Detroit
7. Hartford
8. Honolulu
9. Houston
10. Kansas City
11. Los Angeles
12. New Orleans
13. New York
14. Portland
15. Tucson
16. Washington
CHERNOFF
![Page 51: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/51.jpg)
Occurrence of seven vegetation groups at sites on cliffs of Snowdonia, from soils containing differing amounts of available phosphate and exchangeable calcium. The size of circles indicates the relative abundance of the vegetation.
![Page 52: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/52.jpg)
Percentage of Republican Votes in Presidential Elections in Six Southern States in the Years 1932-1940, 1960-68.

| State | 1932 | 1936 | 1940 | 1960 | 1964 | 1968 |
|---|---|---|---|---|---|---|
| Missouri | 35 | 38 | 48 | 50 | 36 | 45 |
| Maryland | 36 | 37 | 41 | 46 | 35 | 42 |
| Kentucky | 40 | 40 | 42 | 54 | 36 | 44 |
| Louisiana | 7 | 11 | 14 | 29 | 57 | 23 |
| Mississippi | 4 | 3 | 4 | 25 | 87 | 14 |
| South Carolina | 2 | 1 | 4 | 49 | 59 | 39 |

A) Schematic representation of the hierarchical clustering of years by complete link of republican vote data in six southern states. The numbers at the far left denote distances between clusters.
B) Tree for Missouri computed according to decisions (i) – (v)
Trees for republican vote data in six southern states.
![Page 53: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/53.jpg)
Tree of yearly yields of 15 transportation companies with all variables labelled
Tree of yearly yields of 15 transportation companies 1953-1977
![Page 54: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/54.jpg)
Complex multivariate data
FOURIER PLOTS Andrews (1972)
Plot multivariate data into a function
$f_x(t) = \dfrac{x_1}{\sqrt{2}} + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + x_5 \cos 2t + \dots$
where data are [x1, x2, x3, x4, x5... xm]. Plot over range -π ≤ t ≤ π. Each object is a curve. Function preserves distances between objects. Similar objects will be plotted close together.
MULTPLOT
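The Fourier (Andrews) function can be evaluated with a short sketch, assuming it reads f(t) = x1/√2 + x2 sin t + x3 cos t + x4 sin 2t + x5 cos 2t + ... The function name and the two example objects are hypothetical.

```python
import math


def andrews_curve(x, t):
    """Value at t of the Andrews function for one object x = [x1, x2, ..., xm]."""
    total = x[0] / math.sqrt(2)
    for j, value in enumerate(x[1:], start=1):
        harmonic = (j + 1) // 2         # 1, 1, 2, 2, 3, 3, ...
        if j % 2 == 1:
            total += value * math.sin(harmonic * t)
        else:
            total += value * math.cos(harmonic * t)
    return total


# Evaluate each object over -pi <= t <= pi; each object gives one curve.
ts = [-math.pi + 2 * math.pi * i / 100 for i in range(101)]
object_a = [1.0, 2.0, 0.5, 0.1, 0.3]    # made-up object with five variables
object_b = [1.1, 1.9, 0.6, 0.1, 0.2]    # a similar made-up object
curve_a = [andrews_curve(object_a, t) for t in ts]
curve_b = [andrews_curve(object_b, t) for t in ts]
print(max(abs(a - b) for a, b in zip(curve_a, curve_b)))
```

Because the two made-up objects are similar, their curves stay close together over the whole range of t.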
![Page 55: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/55.jpg)
Andrews' plot for artificial data
![Page 56: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/56.jpg)
Andrews’ plots for all twenty-two Indian tribes.
![Page 57: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/57.jpg)
Dieldrin residues in the livers of 227 kestrels and barn owls found dead during 1970-1973. Each bird is represented by a point on the map. (Reproduced with permission from Institute of Terrestrial Ecology Annual Report for 1974).
OTHER TYPES OF GRAPHICAL DISPLAY
![Page 58: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/58.jpg)
Map of aerial density of Sitobion avenae, 11-17 June 1984 produced using the SYMAP program. Darker areas represent higher densities on a logarithmic scale (×3 intervals). Numbers on map indicate positions of suction traps and their respective catch sizes (log3). (Reproduced with permission from Woiwod and Tatchell, 1984.)
![Page 59: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/59.jpg)
Contour map of the aerial density (using logarithmic intervals) of the hop aphid Phorodon humuli 28 September to 2 October 1983, produced by the program SURFACE II. Suction trap sites are marked with a +. (Reproduced with permission from Fig. 3 of Woiwod and Tatchell, 1984)
![Page 60: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/60.jpg)
Three dimensional perspective view of the aphid densities obtained using SURFACE II. (Reproduced from Woiwod and Tatchell, 1984)
![Page 61: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/61.jpg)
THE POWER OF GRAPHICAL DATA DISPLAY. Visualization provides insight that cannot be appreciated by any other approach to learning from data. On this graph, the top left panel displays monthly average CO2 concentrations from Mauna Loa, Hawaii. The remaining panels show frequency components of variation in the data. The heights of the five bars on the right sides of the panels portray the same changes in ppm on the five vertical scales.
![Page 62: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/62.jpg)
OUTLIERS
Identification of ‘outliers’ or ‘rogues’.
“Observation which is, in some sense, inconsistent with the rest of the observations in the data-set. An observation can be an outlier due to the response variable(s) or any one or more of the predictor variables having values outside their expected limits.”
Identify not for rejection at this stage but for investigation and evaluation.
Possible causes: incorrect measurement, incorrect data entry, transcription or recording error.
Concept of outlier is model dependent.
LEVERAGE: potential for influence resulting from unusual values, particularly of predictor variables
INFLUENCE: an observation is influential if its deletion substantially changes the results
![Page 63: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/63.jpg)
LEVERAGE MEASURES
Generalised distance of observation i plus 1/n.
$d_i^2 = (x_i - \bar{x})'\, S^{-1}\, (x_i - \bar{x}) + \dfrac{1}{n}$
Measures how extreme the observation i is from the mean vector $\bar{x}$ of the complete sample.
If leverage of an observation is more than three times the average leverage, observation has high leverage. Need to check it and try to explain why it has high leverage.
Alternatively, leverage of observation i (hi) equals the i-th diagonal element of the hat matrix H
$H = X (X'X)^{-1} X'$
where X is the n x k matrix of x values (k is the number of parameters in the model) and H is an n x n square matrix.
[Hat matrix so called because it puts “hat on Y”:
Ŷ = HY where Ŷ and Y are n x 1 vectors of predicted and observed Y values]
di² - two or more response variables (e.g. CANOCO)
hi - one response variable (e.g. linear or multiple regression)
![Page 64: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/64.jpg)
Leverage ranges from 1/n to 1
Sample mean: $\bar{h} = k/n$
Size-adjusted cut-off: $h_i > 2k/n$ (ca. extreme 5%)
Maximum (hi)
Max (hi) ≤ 0.2 Safe
0.2 < Max (hi) ≤ 0.5 Risky
Max (hi) > 0.5 Avoid if possible
k = number of parameters
As hi approaches 1, observation i may completely control the model.
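For a regression with an intercept and a single predictor (k = 2), the diagonal of the hat matrix has the closed form h_i = 1/n + (x_i - mean)² / Σ(x_j - mean)², so the cut-offs above can be tried without matrix algebra. The sketch below covers that special case only; the function names and data are made up.

```python
def leverages_simple_regression(x):
    """Hat-matrix diagonal for y = a + b*x (intercept plus one slope)."""
    n = len(x)
    mean = sum(x) / n
    sxx = sum((xi - mean) ** 2 for xi in x)
    return [1 / n + (xi - mean) ** 2 / sxx for xi in x]


def classify_max_leverage(h):
    """Safe / risky / avoid bands for the largest leverage."""
    largest = max(h)
    if largest <= 0.2:
        return "safe"
    if largest <= 0.5:
        return "risky"
    return "avoid if possible"


x = [1, 2, 3, 4, 20]                    # made-up predictor values; the last is unusual
h = leverages_simple_regression(x)
k, n = 2, len(x)                        # k = 2 parameters: intercept and slope
flagged = [i for i, hi in enumerate(h) if hi > 2 * k / n]
print(h, flagged, classify_max_leverage(h))
```

The leverages sum to k, so their mean is k/n as stated on the slide.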
![Page 65: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/65.jpg)
INFLUENCE MEASURES
DFBETAS - change in regression coefficient k, in units of its standard error, if observation i is deleted
$\mathrm{DFBETAS}_{ik} = \dfrac{b_k - b_{k(i)}}{s_{e(i)} / \sqrt{RSS_k}}$
where
$b_k$ = slope of regression
$b_{k(i)}$ = slope of regression when i deleted
$s_{e(i)}$ = residual standard deviation when i deleted
$RSS_k$ = residual sum of squares when i not deleted
If DFBETASik > 0, case i pulls bk up
If DFBETASik < 0, case i pulls bk down
If $|\mathrm{DFBETAS}_{ik}| > 2/\sqrt{n}$: influential case
DFBETAS identifies influence of observations on individual regression coefficients to model “LOCAL”
![Page 66: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/66.jpg)
COOK'S D
$D_i = \dfrac{z_i^2\, h_i}{k\,(1 - h_i)}$
where $z_i$ = standardised residual, k = number of parameters, $h_i$ = leverage measure from H
If Di > 1, observation influential
If Di > 4/n (size adjusted), observation influential
High leverage - potential outlier
Low influence - good outlier, non-discordant outlier
High influence - bad outlier, discordant outlier
COOK'S D assesses impact of observations on regression coefficients “GLOBAL”
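Both influence measures can be sketched for a straight-line regression by refitting with each case deleted. The sketch assumes D_i = z_i² h_i / (k (1 - h_i)) and DFBETAS for the slope equal to (b - b(i)) / (s_e(i) / √RSS_k), where RSS_k reduces to Σ(x - mean)² when there is a single predictor. Function names and data are made up.

```python
import math


def fit_line(x, y):
    """Least-squares intercept, slope and residual SD for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return a, b, math.sqrt(rss / (n - 2))


def influence_measures(x, y):
    """Leverage, Cook's D and slope DFBETAS for every case."""
    n, k = len(x), 2                    # k = 2 parameters: intercept and slope
    a, b, s_e = fit_line(x, y)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    results = []
    for i in range(n):
        h = 1 / n + (x[i] - mx) ** 2 / sxx
        residual = y[i] - a - b * x[i]
        z = residual / (s_e * math.sqrt(1 - h))     # standardised residual
        cooks_d = z ** 2 * h / (k * (1 - h))
        # Refit without case i to obtain the deleted slope and residual SD.
        x_del, y_del = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        _, b_del, s_del = fit_line(x_del, y_del)
        dfbetas = (b - b_del) / (s_del / math.sqrt(sxx))
        results.append({"h": h, "cooks_d": cooks_d, "dfbetas_slope": dfbetas})
    return results


x = [1, 2, 3, 4, 5, 20]                             # last case has high leverage
y_good = [1.1, 1.9, 3.2, 3.9, 5.1, 20.0]            # last case follows the trend
y_bad = [1.1, 1.9, 3.2, 3.9, 5.1, 5.0]              # last case departs from it
good = influence_measures(x, y_good)
bad = influence_measures(x, y_bad)
print(good[-1])
print(bad[-1])
```

The last case has the same leverage in both data-sets, but only in the second does it change the slope substantially.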
![Page 67: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/67.jpg)
‘Good’ (left) and ‘bad’ (right) outliers: ‘bad’ outliers influence the slope (artificial data)

| | ‘Good’ outlier (left) | ‘Bad’ outlier (right) |
|---|---|---|
| Leverage hi (depends on x values only) | 0.34 | 0.34 |
| Influence DFBETASi | 0.06 (much less than 2/√n = 0.2) | -9.1 (in absolute value much more than 2/√n = 0.2) |
| Summary | High leverage, low influence | High leverage, high influence |
| Type | Non-discordant outlier | Discordant outlier |

Both leverages are ‘risky’ (between 0.2 and 0.5) and well above the size-adjusted cut-off of 2k/n = 4/100 = 0.04.
![Page 68: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/68.jpg)
Robust leverage vs. Robust residuals plot
![Page 69: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 2. Exploratory Data Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022081515/56649d155503460f949eb03e/html5/thumbnails/69.jpg)
NEVER FORGET THE GRAPH!
“What is the use of a book, thought Alice, without pictures”
SOFTWARE FOR EXPLORATORY DATA ANALYSIS
R and S–PLUS
MINITAB
SYSTAT
AXUM