Univariate EDA
(Exploratory Data Analysis)
EDA
• John Tukey (1970s)
• data: two components
– smooth + rough
– patterned behaviour + random variation
• resistant measures/displays
– little influenced by changes in a small proportion of the total number of cases
– resistant to the effects of outliers
– emphasize smooth over rough components
• concepts apply to statistics and to graphical methods
Tree Ring dates (AD)
1255 1239 1162 1239 1240 1243 1241 1241 1271
• 9 dendrochronology dates
• what do they mean????
• usually helps to sort the data…
Stem-and-Leaf Diagram
1162 1239 1239 1240 1241 1241 1243 1255 1271
11|62
12|39,39,40,41,41,43,55,71
• original values preserved
• no rounding, no loss of information…
can simplify in various ways…
11|6
12|44444467
– ‘leaves’ rounded to nearest decade
– ‘stem’ based on centuries
1162 1239 1239 1240 1241 1241 1243 1255 1271
116|2
117|
118|
119|
120|
121|
122|
123|99
124|0113
125|5
126|
127|1
– ‘stem’ based on decades…
– highlights existence of gaps in the distribution of dates, groups of dates…
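The century and decade stems above can be reproduced with integer division. This is a minimal sketch in base R; the variable names (`dates`, `by_century`, `by_decade`) are illustrative and not from the slides.

```r
# Tree-ring dates (AD) from the slides, sorted
dates <- sort(c(1255, 1239, 1162, 1239, 1240, 1243, 1241, 1241, 1271))

# Century stems: stem = date %/% 100, leaf = the remaining two digits
by_century <- split(dates %% 100, dates %/% 100)

# Decade stems: stem = date %/% 10, leaf = the final digit
by_decade <- split(dates %% 10, dates %/% 10)

# Print one "stem|leaves" row per decade, keeping empty stems so gaps stay visible
for (s in seq(min(dates %/% 10), max(dates %/% 10))) {
  leaves <- by_decade[[as.character(s)]]  # NULL for an empty stem
  cat(s, "|", paste(leaves, collapse = ""), "\n", sep = "")
}
```

Printing every stem between the smallest and largest, even the empty ones, is what makes the gap between 1162 and the 1239 to 1271 group visible.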
R
• stem()
• vu <- round(runif(25, 0, 50), 0); stem(vu)
• vn <- round(rnorm(25, 25, 10), 0); stem(vn)
• stem(vn, scale=2)
Back-to-back stem-and-leaf plot
rim diameter data (cm)

unit 1   unit 2
12.6     16.2
11.6     16.4
16.3     13.8
13.1     13.2
12.1     11.3
26.9     14.0
 9.7      9.0
11.5     12.5
14.8     15.6
13.5     11.2
12.4     12.2
13.6     15.5
         11.7

unit 1        unit 2
     9 | 26 |
       | 25 |
       | 24 |
       | 23 |
       | 22 |
       | 21 |
       | 20 |
       | 19 |
       | 18 |
       | 17 |
     3 | 16 | 24
       | 15 | 56
     8 | 14 | 0
   651 | 13 | 28
   641 | 12 | 25
    65 | 11 | 237
       | 10 |
     7 |  9 | 0
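Base R's stem() draws one batch at a time, so a back-to-back display has to be assembled by hand. The following is a rough sketch, assuming one-decimal data with whole centimetres as stems; `leaves_by_stem` is a hypothetical helper written for this example, not an R built-in.

```r
# Rim diameters (cm) from the table above
unit1 <- c(12.6, 11.6, 16.3, 13.1, 12.1, 26.9, 9.7, 11.5, 14.8, 13.5, 12.4, 13.6)
unit2 <- c(16.2, 16.4, 13.8, 13.2, 11.3, 14, 9, 12.5, 15.6, 11.2, 12.2, 15.5, 11.7)

# Hypothetical helper: for each stem, paste together the sorted tenths digits
leaves_by_stem <- function(x, stems) {
  stem_of <- floor(x)
  leaf_of <- round(10 * (x - stem_of))
  sapply(stems, function(s) paste(sort(leaf_of[stem_of == s]), collapse = ""))
}

# Stems run from the largest whole centimetre down to the smallest
all_stems <- floor(c(unit1, unit2))
stems <- seq(max(all_stems), min(all_stems))

right <- leaves_by_stem(unit2, stems)
left <- leaves_by_stem(unit1, stems)
# Reverse the left-hand leaves so they read outward from the stem
left <- sapply(strsplit(left, ""), function(ch) paste(rev(ch), collapse = ""))

cat(sprintf("%6s | %2d | %s", left, stems, right), sep = "\n")
```

The single unit 1 leaf at stem 26 stands well clear of the rest of both batches, which is exactly the kind of feature the display is meant to expose.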
percentiles
• useful for constructing various kinds of EDA graphics
• don’t confuse percentile with percent or proportion
Note:
• frequency = count
• relative frequency = percent or proportion
“the pth percentile of a distribution: number such that approximately p percent of the values in the distribution are equal or less than that number…”
• can be calculated for numbers that actually exist in the distribution, and interpolated for numbers that don’t…
• sort the data so that x_1 is the smallest value, and x_n is the largest (where n = total number of cases)
• x_i is the p_i-th percentile of a dataset of n members where:
p_i = 100(i - 0.5) / n   [1]
original data:
5 1 9 3 14 9 7
sorted data:
x_i   1    3    5    7    9    9    14
i     1    2    3    4    5    6    7
p_i   (calculate, using equation [1], as shown below…)
p_1 = 100(1 - 0.5) / 7 = 7.1
p_2 = 100(2 - 0.5) / 7 = 21.4
p_3 = 100(3 - 0.5) / 7 = 35.7
p_4 = 100(4 - 0.5) / 7 = 50
etc…
x_i   1    3    5    7    9    9    14
i     1    2    3    4    5    6    7
p_i   7.1  21.4 35.7 50   64.3 78.6 92.9
Equation [1], p_i = 100(i - 0.5) / n, can be rearranged to give the position i that corresponds to a given percentile p_i:
i = n p_i / 100 + 0.5
• which values of x are the 25th, 50th and 85th percentiles?
50th percentile:
i = (7*50)/100 + 0.5
i = 4, x_i = 7
25th percentile:
i = (7*25)/100 + 0.5
i = 2.25, 3 < x_i < 5
if i is not an integer, then…
k = integer part of i; f = fractional part of i
x_int = interpolated value of x
x_int = (1 - f) x_k + f x_(k+1)
x_int = (1 - 0.25)*3 + 0.25*5
x_int = 3.5
use R!!
• test<-c(1,3,5,7,9,9,14)
• quantile(test, .25, type=5)
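The hand calculation above can be checked step by step. This is a minimal sketch: `pct_value` is a hypothetical helper written for illustration, and clamping i to the range 1..n for very small or very large percentiles is an assumption the slides do not address.

```r
test <- c(1, 3, 5, 7, 9, 9, 14)

# Percentile assigned to each sorted value: p_i = 100 (i - 0.5) / n
n <- length(test)
p <- 100 * (seq_len(n) - 0.5) / n
round(p, 1)

# Hypothetical helper: value at the p-th percentile by linear interpolation
pct_value <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  i <- n * p / 100 + 0.5      # position implied by the percentile
  i <- min(max(i, 1), n)      # assumption: clamp positions outside 1..n
  k <- floor(i)               # integer part
  f <- i - k                  # fractional part
  if (f == 0) return(x[k])
  (1 - f) * x[k] + f * x[k + 1]
}

pct_value(test, 50)   # position 4, no interpolation needed
pct_value(test, 25)   # position 2.25, interpolated between 3 and 5

# quantile() with type = 5 uses the same (i - 0.5) / n positions
quantile(test, c(0.25, 0.5, 0.85), type = 5)
```

R's default is `type = 7`, which places the percentiles differently, so `type = 5` has to be given explicitly to match the formula used here.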
boxplot
• percentiles: 25th (lower hinge), 50th, 75th (upper hinge)
• interquartile range (midspread): the distance between the lower and upper hinges
• inner fences: 1.5 x midspread beyond each hinge
[Figure: annotated boxplot of an example batch of values, with the data listed alongside]
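The pieces of a boxplot can be computed directly. The sketch below reuses the unit 1 rim diameters from the back-to-back stem-and-leaf slide as an example batch; that choice of data, and the variable names, are assumptions made for illustration.

```r
# Example batch: unit 1 rim diameters (cm) from the earlier slide
unit1 <- c(12.6, 11.6, 16.3, 13.1, 12.1, 26.9, 9.7, 11.5, 14.8, 13.5, 12.4, 13.6)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(unit1)
lower_hinge <- five[2]
upper_hinge <- five[4]

# Midspread and the inner fences 1.5 midspreads beyond each hinge
midspread <- upper_hinge - lower_hinge
inner_fences <- c(lower_hinge - 1.5 * midspread, upper_hinge + 1.5 * midspread)

# Cases beyond the inner fences are the ones a boxplot draws individually
outside <- unit1[unit1 < inner_fences[1] | unit1 > inner_fences[2]]

# boxplot.stats() applies the same hinges and 1.5 multiplier by default
stats <- boxplot.stats(unit1)
stats$out
```

Calling `boxplot(unit1)` draws the display itself; the whiskers stop at the most extreme cases that still lie inside the fences.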
Figure 6.25: Internal diversity of neighbourhoods used to define N-clusters, measured by the 'evenness' statistic H/Hmax on the basis of counts of various A-clusters, and broken down by N-cluster and phase. [Boxes encompass the midspread; lines inside boxes indicate the median, while whiskers show the range of cases that fall within 1.5-times the midspread, above or below the limits of the box.]
Cleveland, W. S. (1985) The Elements of Graphing Data.
Histograms
• divide a continuous variable into intervals called ‘bins’
• count the number of cases within each bin
• use bars to reflect counts
• intervals on the horizontal axis
• counts on the vertical axis
[Figure: histogram of an example batch of values, with the “bins” marked and the vertical axis scaled in counts and in percent]
• useful for illustrating the shape of the distribution of a batch of numbers
• may be helpful for identifying modes and modal behaviour
Histograms
[Figure: histogram with peaks annotated ‘mode’, ‘mode?’ and ‘mode!’]
• the distribution is clearly bimodal
• may be multimodal…
important variables in histogram construction:
• bin width
• bin starting point
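The effect of bin width and starting point is easy to see by tabulating the same batch several ways. This is a sketch only: the batch (both units of rim diameters from the earlier slide) and the particular break points are choices made for illustration.

```r
# Example batch: rim diameters (cm) for both units from the earlier slide
rim <- c(12.6, 11.6, 16.3, 13.1, 12.1, 26.9, 9.7, 11.5, 14.8, 13.5, 12.4, 13.6,
         16.2, 16.4, 13.8, 13.2, 11.3, 14, 9, 12.5, 15.6, 11.2, 12.2, 15.5, 11.7)

# Same bin width (2 cm), different starting points
h_start8 <- hist(rim, breaks = seq(8, 28, by = 2), plot = FALSE)
h_start7 <- hist(rim, breaks = seq(7, 29, by = 2), plot = FALSE)

# Wider bins (5 cm)
h_wide <- hist(rim, breaks = seq(5, 30, by = 5), plot = FALSE)

# Compare the bin counts: every case is counted once in each version,
# but the apparent shape changes with the binning
h_start8$counts
h_start7$counts
h_wide$counts
```

Dropping `plot = FALSE` draws each histogram, which makes the difference in apparent shape easier to judge.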
smoothing histograms
• may want to accentuate the ‘smooth’ in a data distribution…
• calculate “running averages” on bin counts
• level of smoothing is arbitrary…
bin counts:        1  3  5    2    4    2  0  1
running averages:  2  3  3.3  3.7  2.7  2  1  0.5
[Figure: the original bin counts and the smoothed values plotted as histograms]
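The smoothed values follow from a three-bin running average. The sketch below assumes, as the slide's end values suggest, that the first and last bins are averaged with their single neighbour only; `running_mean` is a hypothetical helper, not a built-in.

```r
counts <- c(1, 3, 5, 2, 4, 2, 0, 1)

# Hypothetical helper: centred running mean, with the window shortened
# at either end of the series (an assumption inferred from the slide)
running_mean <- function(v, half_width = 1) {
  n <- length(v)
  sapply(seq_len(n), function(i) {
    window <- max(1, i - half_width):min(n, i + half_width)
    mean(v[window])
  })
}

smoothed <- running_mean(counts)
round(smoothed, 1)
```

Increasing `half_width` widens the window and flattens the result further, which is the sense in which the level of smoothing is arbitrary.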
histogram / barchart variations
• 3d
• stacked
• dual
• frequency polygon
• kernel density methods
[Figure: counts (0 to 80) of FAUNA categories bear, caribou, muskox, seal and walrus, with bars for SITE 1 and SITE 2]
dual barchart
[Figure: separate panels for sites 1 and 2 showing counts (0 to 70) of FAUNA categories bear, caribou, muskox, seal and walrus; then Site 1 and Site 2 plotted together on count axes of 0 to 80]
‘mirror’ barchart
[Figure: counts (0 to 80) of bear, caribou, muskox, seal and walrus, with Site 1 and Site 2 in the legend]
stacked barchart
[Figure: counts (0 to 70) of bear, caribou, muskox, seal and walrus for Site 1 and Site 2]
3d barchart
frequency polygon
[Figure: ‘Histogram of vol’; x axis vol (100 to 400), y axis Density (0.000 to 0.008)]
kernel density model
[Figure: ‘Histogram of vol’; x axis vol (100 to 400), y axis Density (0.000 to 0.008)]
controlling kernel density plots…
• hd <- density(XX)
• hh <- hist(XX, plot=FALSE)
• maxD <- max(hd$y)
• maxH <- max(hh$density)
• Y <- c(0, max(c(maxD, maxH)))
• hist(XX, freq=FALSE, ylim=Y)
• lines(density(XX))
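The steps above only control the vertical axis so that the curve and the bars both fit. The amount of smoothing in the curve is controlled separately, for example through the `adjust` argument of density(). The sketch below uses the rim diameters from the earlier slide as a stand-in batch, since the `vol` data behind the figure are not given.

```r
# Stand-in batch: rim diameters (cm) for both units from the earlier slide
rim <- c(12.6, 11.6, 16.3, 13.1, 12.1, 26.9, 9.7, 11.5, 14.8, 13.5, 12.4, 13.6,
         16.2, 16.4, 13.8, 13.2, 11.3, 14, 9, 12.5, 15.6, 11.2, 12.2, 15.5, 11.7)

# adjust multiplies the automatically chosen bandwidth:
# values below 1 give a rougher curve, values above 1 a smoother one
d_default <- density(rim)
d_rough   <- density(rim, adjust = 0.5)
d_smooth  <- density(rim, adjust = 2)

# Bandwidths actually used by each estimate
c(default = d_default$bw, rough = d_rough$bw, smooth = d_smooth$bw)
```

Any of these objects can be passed to `lines()` after `hist(rim, freq=FALSE)` to overlay the curve on the histogram, as in the steps above.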
[Figure: counts (0 to 8) of VAR00003 values 1 to 10]
Dot Plot [R: dotchart()]
[Figure: counts (0 to 80) of FAUNA categories bear, caribou, muskox, seal and walrus for SITE 1 and SITE 2]
Dot Histogram [R: stripchart()]
[Figure: three panels of VAR00003 values 1 to 10]
method = "stack"
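A dot histogram of this kind can be sketched with stripchart(). The data here are simulated (the VAR00003 values behind the figure are not given), so the seed and sample size are arbitrary choices made for illustration.

```r
# Simulated stand-in for a variable taking whole values from 1 to 10
set.seed(42)
v <- sample(1:10, 30, replace = TRUE)

# Number of cases at each value, including values that never occur
counts <- table(factor(v, levels = 1:10))
counts

# method = "stack" piles repeated values on top of one another,
# so the height of each pile equals the count at that value
stripchart(v, method = "stack", pch = 16, offset = 0.5, xlab = "value")
```

The other `method` settings, "overplot" and "jitter", place every case on one line, so ties are hidden or scattered rather than counted.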
line plot
[Figure: categories cooking/service, service and ritual]
[Figure: counts (0 to 80) of FAUNA categories bear, caribou, muskox, seal and walrus for SITE 1 and SITE 2]
[Figure: pie chart with slices for bear, caribou, cat, elk and moose; slice labels 20%, 19%, 18%, 21%, 22%]
pie chart
[Figure: two pie charts, panels 1 and 2, each with slices for bear, caribou, cat, elk and moose]
Cumulative Percent Graph
[Figure: graphs with vertical axes labelled percent and cumulative percent, each running from 10 to 100]
• some useful statistical measures (ordinal or ratio scale)
• can be misleading when used with nominal data
• good for comparing data sets
Cumulative Percent Graph
Percentages
         Sites
Types    A    B    C
1        5    5    5
2       45    0   30
3        5   48    5
4        5    5    5
5        5    5    5
6        5    5    5
7       20    5   35
8        5   22    5
9        5    5    5
       100  100  100

Cumulative Percents
         Sites
Types    A    B    C
1        5    5    5
2       50    5   35
3       55   53   40
4       60   58   45
5       65   63   50
6       70   68   55
7       90   73   90
8       95   95   95
9      100  100  100
[Figure: cumulative percent curves (vertical axis 0 to 120) for sites A, B and C across types 1 to 9; shown twice]
[Figure: the same curves with the types in the order 1 5 3 4 2 6 7 8 9]
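The cumulative table, and the effect of reordering the types, can be reproduced with cumsum(). This is a minimal sketch using the percentages tabulated above; the object names are illustrative.

```r
# Percentages of types 1 to 9 at sites A, B and C (from the table above)
pct <- cbind(
  A = c(5, 45, 5, 5, 5, 5, 20, 5, 5),
  B = c(5, 0, 48, 5, 5, 5, 5, 22, 5),
  C = c(5, 30, 5, 5, 5, 5, 35, 5, 5)
)
rownames(pct) <- 1:9

# Cumulative percents, site by site, with types in their original order
cum_pct <- apply(pct, 2, cumsum)
cum_pct

# Same data with types 2 and 5 swapped, as in the last figure
new_order <- c(1, 5, 3, 4, 2, 6, 7, 8, 9)
cum_reordered <- apply(pct[new_order, ], 2, cumsum)
cum_reordered
```

Something like `matplot(cum_pct, type = "l")` would draw the three curves. Because the curves change shape when the categories are reordered, the graph can mislead when the types are nominal and have no natural order.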