Exploratory Data Analysis · Introduction to Data Science

Introduction to Data Science: Exploratory Data Analysis

Héctor Corrada Bravo

University of Maryland, College Park, USA

2020-03-04


Exploratory Data Analysis

What to do with a dataset before modeling it using Statistics or Machine Learning:

better understand the data at hand,
help us make decisions about appropriate modeling methods,
identify data transformations that may be helpful.

1 / 89

Exploratory Data Analysis

There are many instances where statistical data modeling is not required to tell a clear and convincing story with data.

Many times an effective visualization can lead to convincing conclusions.

2 / 89

Exploratory Data Analysis

Goal: perform an initial exploration of attributes/variables across entities/observations.

We will concentrate on exploration of single variables or pairs of variables.

Later on in the course we will see dimensionality reduction methods that are useful in exploration of more than two variables at a time.

3 / 89

Exploratory Data Analysis

Computing summary statistics: how to interpret them and how they help us understand properties of attributes.

Data transformations: change properties of variables to help in visualization or modeling.

First, how to use visualization for exploratory data analysis.

4 / 89

Exploratory Data Analysis

Ultimately, the purpose of EDA is to spot problems in data (as part of data wrangling) and understand variable properties like:

central trends (mean)
spread (variance)
skew
outliers

This will help us think of possible modeling strategies (e.g., probability distributions).

5 / 89

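Visualization of single variables

# assumption: flights is the table from the nycflights13 package;
# the slides do not show the setup code
library(tidyverse)
library(nycflights13)

# plot a 10% sample of departure delays, in row order
flights %>%
  sample_frac(.1) %>%
  rowid_to_column() %>%
  ggplot(aes(x=rowid, y=dep_delay)) +
    geom_point()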

6 / 89

Visualization of single variables

# same plot, with the sampled rows sorted by departure delay first
flights %>%
  sample_frac(.1) %>%
  arrange(dep_delay) %>%
  rowid_to_column() %>%
  ggplot(aes(x=rowid, y=dep_delay)) +
    geom_point()

7 / 89

Visualization of single variables

What can we make of that plot now? Start thinking of central tendency, spread and skew as you look at that plot.

Let's now create a graphical summary of that variable to incorporate observations made from this initial plot.

Let's start with a histogram: it divides the range of the dep_delay attribute into equal-sized bins, then plots the number of observations within each bin.
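The binning step can be sketched in base R. This is a minimal illustration on made-up delay values (not the flights data), assuming four bins; the names `delays`, `breaks` and `counts` are chosen here for the example.

```r
# Toy departure delays in minutes (made up for illustration)
delays <- c(-5, -2, 0, 3, 4, 10, 12, 25, 40, 95)

# Split the range of the data into 4 equal-width bins (5 break points)
breaks <- seq(min(delays), max(delays), length.out = 5)

# Count the observations in each bin;
# include.lowest = TRUE keeps the minimum value in the first bin
counts <- table(cut(delays, breaks = breaks, include.lowest = TRUE))
print(counts)

# hist() computes the same counts when given the same breaks
h <- hist(delays, breaks = breaks, plot = FALSE)
print(h$counts)
```

Most of the toy values fall in the first bin, which is the kind of right skew the histogram of dep_delay makes visible.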

8 / 89

Visualization of single variables

flights %>%
  ggplot(aes(x=dep_delay)) +
    geom_histogram()

9 / 89

Visualization of single variables

Density plot

We can (conceptually) make the bins as small as possible and get a smooth curve that describes the distribution of values of the dep_delay variable.
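In practice a smooth curve like this is usually obtained with a kernel density estimate rather than by literally shrinking bins. A minimal sketch with base R's `density()` on made-up values; the default Gaussian kernel and bandwidth are assumptions of this sketch, not choices stated on the slides.

```r
# Toy departure delays in minutes (made up for illustration)
delays <- c(-5, -2, 0, 3, 4, 10, 12, 25, 40, 95)

# Kernel density estimate with the default kernel and bandwidth
dens <- density(delays)

# dens$x is an evenly spaced grid, dens$y the estimated density on that grid.
# A Riemann sum over the grid approximates the area under the curve,
# which should be close to 1 for a density.
grid_step <- diff(dens$x[1:2])
area <- sum(dens$y) * grid_step
print(area)

plot(dens, main = "Kernel density estimate of toy delays")
```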

10 / 89

Visualization of single variables

flights %>%
  ggplot(aes(x=dep_delay)) +
    geom_density()

11 / 89

Visualization of single variables

Boxplot

A succinct graphical summary of the distribution of a variable.

12 / 89

Visualization of single variables

flights %>%
  ggplot(aes(x='', y=dep_delay)) +
    geom_boxplot()

13 / 89

Visualization of single variables

That's not very clear to see, so let's apply a logarithmic transformation to this data to see the distribution better.

14 / 89

Visualization of single variables

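# shift delays so the smallest value is 0, then take logs
# (the slide cuts off the ends of the two mutate() lines;
# they are completed here from context)
flights %>%
  mutate(min_delay = min(dep_delay, na.rm=TRUE)) %>%
  mutate(log_dep_delay = log(dep_delay - min_delay)) %>%
  ggplot(aes(x='', y=log_dep_delay)) +
    geom_boxplot()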

15 / 89

Visualization of single variables

So what does this represent?

(a) central tendency (using the median) is represented by the black line within the box,

(b) spread (using the inter-quartile range) is represented by the box and whiskers,

(c) outliers (data that is unusually far outside the spread of the data) are drawn as individual points beyond the whiskers.
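The three pieces of a boxplot can be computed directly. A minimal sketch in base R on made-up values; the 1.5 × IQR cutoff used for the fences is the common default for boxplots and is an assumption here, since the slides do not state the multiplier.

```r
# Toy data with one unusually large value (made up for illustration)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# (a) central tendency: the median; (b) spread: first and third quartiles
q <- unname(quantile(x, c(0.25, 0.5, 0.75)))
iqr <- q[3] - q[1]

# (c) outliers: points further than 1.5 * IQR beyond the box
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]

print(q)
print(outliers)
```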

16 / 89

Visualization of pairs of variables

How does each of the distributional properties we care about (central trend, spread and skew) of the values of an attribute change based on the value of a different attribute?

Suppose we want to see the relationship between dep_delay, a numeric variable, and origin, a categorical variable.

17 / 89

Visualization of pairs of variables

Previously, we used group_by-summarize operations to compute attribute summaries based on the value of another attribute.

We also called this conditioning. In visualization we can think about conditioning in the same way.

Here is how we can plot the distribution of departure delays conditioned on origin airport.
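For comparison, the numeric version of conditioning can be sketched with a group_by-summarize pipeline. This is a minimal example on a made-up table (`toy_flights` and its values are hypothetical, not the real flights data), assuming the dplyr package is installed.

```r
library(dplyr)

# Toy table: two flights per origin airport (values made up for illustration)
toy_flights <- tibble(
  origin    = c("EWR", "EWR", "JFK", "JFK", "LGA", "LGA"),
  dep_delay = c(5, 15, -2, 40, 0, 10)
)

# Summaries of dep_delay conditioned on origin
delay_by_origin <- toy_flights %>%
  group_by(origin) %>%
  summarize(median_delay = median(dep_delay),
            iqr_delay    = IQR(dep_delay))

print(delay_by_origin)
```

A boxplot per origin shows the same two summaries (median and inter-quartile range) graphically, one box for each group.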

18 / 89

Visualization of pairs of variables

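# same log transformation as before, one boxplot per origin airport
# (the slide cuts off the ends of the two mutate() lines;
# they are completed here from context)
flights %>%
  mutate(min_delay = min(dep_delay, na.rm=TRUE)) %>%
  mutate(log_dep_delay = log(dep_delay - min_delay)) %>%
  ggplot(aes(x=origin, y=log_dep_delay)) +
    geom_boxplot()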

19 / 89

Visualization of pairs of variables

For pairs of continuous variables, the most useful visualization is the scatter plot.

This gives an idea of how one variable varies (in terms of central trend,variance and skew) conditioned on another variable.

20 / 89

Visualization of pairs of variables

flights %>%
  sample_frac(.1) %>%
  ggplot(aes(x=dep_delay, y=arr_delay)) +
    geom_point()

21 / 89

EDA with the grammar of graphics

While we have seen a basic repertoire of graphics, it's easier to proceed if we have a somewhat more formal way of thinking about graphics and plots.

The central premise is to characterize the building pieces behind plots:

1. The data that goes into a plot (works best when data is tidy)
2. The mapping between data and aesthetic attributes
3. The geometric representation of these attributes
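The three building pieces can be kept apart explicitly in ggplot2. A minimal sketch on a made-up data frame (`toy`, `at_bats` and `runs` are hypothetical names and values), assuming the ggplot2 package is installed.

```r
library(ggplot2)

# 1. Data: a tidy table, one row per observation (values made up)
toy <- data.frame(at_bats = c(100, 250, 400, 550),
                  runs    = c(10, 30, 55, 80))

# 2. Mapping: which variable drives which aesthetic attribute
mapping <- aes(x = at_bats, y = runs)

# 3. Geometric representation: how the mapped values are drawn
p_points <- ggplot(toy, mapping) + geom_point()

# Swapping only the geometry gives a different plot of the same data and mapping
p_line <- ggplot(toy, mapping) + geom_line()

print(p_points)
print(p_line)
```

Because the pieces are independent, each can be changed on its own: a different table, a different mapping, or a different geometry.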

22 / 89

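EDA with the grammar of graphics

# assumes the batting table is already loaded; the slides do not show
# where it comes from (its columns match the Lahman Batting table)
batting %>%
  filter(yearID == "2010") %>%
  ggplot(aes(x=AB, y=R)) +
    geom_point()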

23 / 89

EDA with the grammar of graphics

Data: Batting table, filtered by year

Aesthetic attributes:
x-axis mapped to variable AB
y-axis mapped to variable R

Geometric Representation: points!

Now, you can cleanly distinguish the constituent parts of the plot.

24 / 89

EDA with the grammar of graphics

E.g., change the geometric representation

batting %>%
  filter(yearID == "2010") %>%
  ggplot(aes(x=AB, y=R, label=teamID)) +
    geom_text()

25 / 89

EDA with the grammar of graphics

E.g., change the data.

# scatter plot of at bats vs. runs for 1995
batting %>%
  filter(yearID == "1995") %>%
  ggplot(aes(x=AB, y=R)) +
    geom_point()

26 / 89

EDA with the grammar of graphics

E.g., change the aesthetic.

# scatter plot of at bats vs. hits for 2010
batting %>%
  filter(yearID == "2010") %>%
  ggplot(aes(x=AB, y=H)) +
    geom_point()

27 / 89

EDA with the grammar of graphics

Let's make a line plot

What do we change? (data, aesthetic or geometry?)

28 / 89

EDA with the grammar of graphics

batting %>%
  filter(yearID == "2010") %>%
  sample_n(100) %>%
  ggplot(aes(x=AB, y=H)) +
    geom_line()

29 / 89

EDA with the grammar of graphics

Let's add a regression line

What do we add? (data, aesthetic or geometry?)

30 / 89

EDA with the grammar of graphics

batting %>%
  filter(yearID == "2010") %>%
  ggplot(aes(x=AB, y=H)) +
    geom_point() +
    geom_smooth(method=lm)

What can we see about central trend, variation and skew with this plot?

31 / 89

EDA with the grammar of graphics

Using other aesthetics we can incorporate information from other variables.

Color: color by categorical variable

batting %>%
  filter(yearID == "2010") %>%
  ggplot(aes(x=AB, y=H, color=lgID)) +
    geom_point() +
    geom_smooth(method=lm)

32 / 89

EDA with the grammar of graphics

Size: size by (continuous) numeric variable

batting %>%
  filter(yearID == "2010") %>%
  ggplot(aes(x=AB, y=R, size=HR)) +
    geom_point() +
    geom_smooth(method=lm)

33 / 89

EDA with the grammar of graphics

Faceting

The last major component of exploratory analysis is called faceting in visualization. It corresponds to conditioning in statistical modeling, and we've seen it as the motivation for grouping when wrangling data.

34 / 89

EDA with the grammar of graphics

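# one panel per league (rows) and year (columns)
# (the slide cuts off the end of the filter() line; it is completed here from context)
batting %>%
  filter(yearID %in% c("1995", "2000", "2010")) %>%
  ggplot(aes(x=AB, y=R, size=HR)) +
    facet_grid(lgID~yearID) +
    geom_point() +
    geom_smooth(method=lm)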

35 / 89

Exploratory Data Analysis: Summary Statistics

Let's continue our discussion of Exploratory Data Analysis.

In the previous section we saw ways of visualizing attributes (variables) using plots to start understanding properties of how data is distributed.

In this section, we start discussing statistical summaries of data to quantify properties that we observed using visual summaries and representations.

36 / 89

Exploratory Data Analysis: Summary Statistics

Remember that one purpose of EDA is to spot problems in data (as part of data wrangling) and understand variable properties like:

central trends (mean)
spread (variance)
skew

These properties suggest possible modeling strategies (e.g., probability distributions).

37 / 89

Exploratory Data Analysis: Summary Statistics

One last note on EDA.

John W. Tukey was an exceptional scientist and mathematician who had a profound impact on statistics and computer science.

A lot of what we cover in EDA is based on his groundbreaking work.

https://www.stat.berkeley.edu/~brill/Papers/life.pdf.

38 / 89

Exploratory Data Analysis: Summary Statistics

Range

Part of our goal is to understand how variables are distributed in a given dataset.

Note, again, that we are not using "distributed" in a formal mathematical (or probabilistic) sense.

All statements we are making here are based on data at hand, so we could refer to this as the empirical distribution of data.

39 / 89

Exploratory Data Analysis: Summary Statistics

Let's use a dataset on diamond characteristics as an example.

40 / 89

Exploratory Data Analysis: Summary Statistics

Notation

We assume that we have data across $n$ entities (or observational units) for $p$ attributes.

In this dataset $n = 53940$ and $p = 10$.

However, let's consider a single attribute, and denote the data for that attribute (or variable) as $x_1, x_2, \ldots, x_n$.

41 / 89

Exploratory Data Analysis: Summary Statistics

Since we want to understand how data is distributed across a range, we should first define the range.

diamonds %>%
  summarize(min_depth = min(depth), max_depth = max(depth))
## # A tibble: 1 x 2
##   min_depth max_depth
##       <dbl>     <dbl>
## 1        43        79

42 / 89

Exploratory Data Analysis: Summary Statistics

We use notation $x_{(1)}$ and $x_{(n)}$ to denote the minimum and maximum statistics.

In general, we use notation $x_{(q)}$ for the rank statistics, e.g., the $q$th smallest value in the data.
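Rank statistics can be read off the sorted data. A minimal sketch in base R on made-up values, taking $x_{(q)}$ to be the $q$th value in ascending order so that $x_{(1)}$ is the minimum; `rank_stat` is a helper name introduced here for illustration.

```r
# Toy depth values (made up for illustration)
x <- c(61.5, 59.8, 56.9, 62.4, 63.3)

# x_(q): the q-th value after sorting in ascending order
rank_stat <- function(x, q) sort(x)[q]

x_min <- rank_stat(x, 1)          # x_(1), the minimum
x_max <- rank_stat(x, length(x))  # x_(n), the maximum
x_2   <- rank_stat(x, 2)          # x_(2), the second smallest value

print(c(x_min, x_2, x_max))
```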

43 / 89

Exploratory Data Analysis: Summary Statistics

Central Tendency

Now that we know the range over which data is distributed, we can figure out a first summary of how data is distributed across this range.

Let's start with the center of the data: the median is a statistic defined such that half of the data has a smaller value.

We can use notation $x_{(n/2)}$ (a rank statistic) to represent the median.
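The notation $x_{(n/2)}$ glosses over whether $n$ is odd or even. A minimal sketch of one common convention (middle value for odd $n$, average of the two middle values for even $n$), which is also what base R's `median()` returns; `rank_median` is a name introduced here for illustration.

```r
# Median computed from rank statistics
rank_median <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                  # odd n: the single middle value
  } else {
    mean(s[c(n / 2, n / 2 + 1)])    # even n: average the two middle values
  }
}

odd_example  <- rank_median(c(3, 1, 2))
even_example <- rank_median(c(4, 1, 3, 2))
print(c(odd_example, even_example))
```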

44 / 89

Exploratory Data Analysis: Summary Statistics

45 / 89

Exploratory Data Analysis: Summary Statistics

Derivation of the mean as central tendency statistic

The best known statistic for central tendency is the mean, or average of the data: $\overline{x} = \frac{1}{n} \sum_{i=1}^n x_i$. It turns out that in this case we can be a bit more formal about what "center" means.

Let's say that the center of a dataset is a point in the range of the data that is close to the data.

To say that something is close we need a measure of distance.

46 / 89

Exploratory Data Analysis: Summary Statistics

So for two points $x_1$ and $x_2$, what should we use for distance?

The distance between data points $x_1$ and $x_2$ is $(x_1 - x_2)^2$.

47 / 89

Exploratory Data Analysis: Summary Statistics

So, to define the center, let's build a criterion based on this distance by adding this distance across all points in our dataset:

$$RSS(\mu) = \frac{1}{2} \sum_{i=1}^n (x_i - \mu)^2$$

Here RSS means residual sum of squares, and we use $\mu$ to stand for candidate values of the center.

48 / 89

We can plot RSS for different values of $\mu$.

Now, what should our "center" estimate be?

We want a value that is close to the data based on RSS!

So we need to find the value in the range that minimizes RSS.
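To make the idea of "finding the value in the range that minimizes RSS" concrete, here is a rough sketch that evaluates RSS over a grid of candidate centers. The data vector, the grid step of 0.1, and the names `rss`, `mu_grid` and `mu_best` are all assumptions made for illustration.

```r
# Toy data: illustrative values only
x <- c(2, 4, 5, 7, 9)

# RSS criterion for a candidate center mu (with the 1/2 constant)
rss <- function(mu, x) 0.5 * sum((x - mu)^2)

# Evaluate RSS on a grid of candidate centers spanning the range of the data
mu_grid <- seq(min(x), max(x), by = 0.1)
rss_values <- sapply(mu_grid, rss, x = x)

# The grid point with the smallest RSS
mu_best <- mu_grid[which.min(rss_values)]

print(mu_best)
print(mean(x))
```

On this toy vector the grid minimizer lands on the sample mean, up to the resolution of the grid.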

From calculus, we know that a necessary condition for the minimizer $\hat{\mu}$ of RSS is that the derivative of RSS is zero at that point.

So, the strategy to minimize RSS is to compute its derivative, and find the value of $\mu$ where it equals zero.

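$$
\begin{aligned}
\frac{\partial}{\partial \mu} \frac{1}{2} \sum_{i=1}^n (x_i - \mu)^2 &= \frac{1}{2} \sum_{i=1}^n \frac{\partial}{\partial \mu} (x_i - \mu)^2 \quad \text{(sum rule)} \\
&= \frac{1}{2} \sum_{i=1}^n 2 (x_i - \mu)(-1) \quad \text{(power and chain rules)} \\
&= \sum_{i=1}^n \mu - \sum_{i=1}^n x_i \\
&= n\mu - \sum_{i=1}^n x_i
\end{aligned}
$$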

Next, we set that equal to zero and find the value of $\mu$ that solves that equation:

$$
\begin{aligned}
\frac{\partial}{\partial \mu} RSS(\mu) = 0 &\Rightarrow n\mu = \sum_{i=1}^n x_i \\
&\Rightarrow \mu = \frac{1}{n} \sum_{i=1}^n x_i
\end{aligned}
$$
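The derivation can be sanity-checked numerically. The sketch below, on made-up data, compares the closed-form derivative $n\mu - \sum_i x_i$ with a central finite difference of RSS; the step size `h` and the function names are assumptions for illustration.

```r
# Toy data: illustrative values only
x <- c(2, 4, 5, 7, 9)
n <- length(x)

# RSS criterion and its derivative with respect to mu, as derived above
rss   <- function(mu, x) 0.5 * sum((x - mu)^2)
d_rss <- function(mu, x) length(x) * mu - sum(x)

# Solving d_rss(mu) = 0 gives mu = sum(x) / n
mu_hat <- sum(x) / n

# Central finite difference approximation of the derivative at a point
h <- 1e-6
numeric_d <- function(mu) (rss(mu + h, x) - rss(mu - h, x)) / (2 * h)

print(d_rss(mu_hat, x))    # derivative at the solution
print(numeric_d(mu_hat))   # finite-difference check at the solution
print(c(d_rss(6, x), numeric_d(6)))  # both forms at another candidate, mu = 6
```

Both versions of the derivative vanish (up to floating point error) at `mu_hat`, which coincides with `mean(x)`.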

The fact you should remember:

The mean is the value that minimizes RSS for a vector of attribute values.

It equals the value where the derivative of RSS is 0.

It is the value that minimizes RSS.

And it serves as an estimate of the central tendency of the dataset.

Note that in this dataset the mean and median are not exactly equal, but are very close:

diamonds %>%
  summarize(mean_depth = mean(depth), median_depth = median(depth))

## # A tibble: 1 x 2
##   mean_depth median_depth
##        <dbl>        <dbl>
## 1       61.7         61.8

There is a similar argument to define the median as a measure of center.

In this case, instead of using RSS we use a different criterion: the sum of absolute deviations

$$SAD(m) = \sum_{i=1}^n |x_i - m|.$$

The median is the minimizer of this criterion.
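As with RSS, the claim about SAD can be checked on a grid. This is only a sketch on a made-up vector with one extreme value; the grid step of 0.5 and the names `sad`, `m_grid` and `m_best` are illustrative assumptions.

```r
# Toy data with one extreme value: illustrative values only
x <- c(1, 2, 3, 4, 100)

# Sum of absolute deviations for a candidate center m
sad <- function(m, x) sum(abs(x - m))

# Evaluate SAD on a grid of candidate centers
m_grid <- seq(min(x), max(x), by = 0.5)
sad_values <- sapply(m_grid, sad, x = x)
m_best <- m_grid[which.min(sad_values)]

print(c(sad_minimizer = m_best, median = median(x), mean = mean(x)))
```

Here the SAD minimizer matches the median, while the mean is pulled toward the extreme value.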

Spread

Now that we have a measure of center, we can discuss how data is spread around that center.

Variance

For the mean, we have a convenient way of describing this: the average distance (using squared difference) from the mean. We call this the variance of the data:

$$\mathrm{var}(x) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2$$

You will also see it with a slightly different constant in the front for technical reasons that we may discuss later on:

$$\mathrm{var}(x) = \frac{1}{n - 1} \sum_{i=1}^n (x_i - \bar{x})^2$$
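The two constants can be compared directly. In the sketch below (toy data, hypothetical variable names), both versions are computed by hand; R's built-in `var()` uses the $n - 1$ constant.

```r
# Toy data: illustrative values only
x <- c(2, 4, 5, 7, 9)
n <- length(x)

# Sum of squared deviations from the mean
ss <- sum((x - mean(x))^2)

var_n  <- ss / n        # variance with the 1/n constant
var_n1 <- ss / (n - 1)  # variance with the 1/(n - 1) constant

print(c(var_n = var_n, var_n1 = var_n1, builtin = var(x)))
```

The two versions differ by the factor $(n - 1)/n$, which gets close to 1 as $n$ grows.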

Variance is a commonly used statistic for spread but it has the disadvantage that its units are not easy to conceptualize (e.g., squared diamond depth).

A spread statistic that is in the same units as the data is the standard deviation, which is just the square root of variance:

$$\mathrm{sd}(x) = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2}$$

We can also use standard deviations as an interpretable unit of how far a given data point is from the mean.
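One way to express "how far a point is from the mean, in standard deviations" is to standardize the data. The sketch below uses made-up values; note that R's `sd()` uses the $n - 1$ constant, and the name `z` is just an illustrative choice.

```r
# Toy data: illustrative values only
x <- c(2, 4, 5, 7, 9)

# Distance of each point from the mean, measured in standard deviations
# (sd() in R uses the n - 1 version of the variance)
z <- (x - mean(x)) / sd(x)

print(round(z, 2))

# The built-in scale() function performs the same centering and scaling
z_builtin <- as.numeric(scale(x))
```

A value of `z` near 0 is close to the mean, while a large absolute value marks a point far from it.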

As a rough guide, we can use "standard deviations away from the mean" as a measure of spread as follows (these proportions hold for data that is approximately normally distributed):

| SDs | proportion | Interpretation |
|-----|------------|----------------|
| 1 | 0.68 | 68% of the data is within $\pm$ 1 sds |
| 2 | 0.95 | 95% of the data is within $\pm$ 2 sds |
| 3 | 0.9973 | 99.73% of the data is within $\pm$ 3 sds |
| 4 | 0.999937 | 99.9937% of the data is within $\pm$ 4 sds |
| 5 | 0.9999994 | 99.999943% of the data is within $\pm$ 5 sds |
| 6 | 1 | 99.9999998% of the data is within $\pm$ 6 sds |
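The proportions in this guide can be reproduced from the normal distribution function. This sketch assumes normally distributed data, which is where these numbers come from; `pnorm()` is base R's standard normal CDF, and the column names are illustrative.

```r
# Number of standard deviations away from the mean
k <- 1:6

# Proportion of a normal distribution within +/- k standard deviations
proportion <- pnorm(k) - pnorm(-k)

print(data.frame(sds = k, proportion = signif(proportion, 10)))
```

For data that is far from normal (for instance, heavily skewed), the observed proportions can differ noticeably from this guide.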

Spread estimates using rank statistics

Just like we saw how the median is a rank statistic used to describe central tendency, we can also use rank statistics to describe spread.

For this we use two more rank statistics: the first and third quartiles, $x_{(n/4)}$ and $x_{(3n/4)}$ respectively.

Note, the five order statistics we have seen so far (minimum, maximum, median, and first and third quartiles) are so frequently used that this is exactly what R uses by default as a summary of a numeric vector of data (along with the mean):

summary(diamonds$depth)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   43.00   61.00   61.80   61.75   62.50   79.00

This five-number summary also contains all of the statistics used to construct a boxplot to summarize data distribution.

In particular, the inter-quartile range, which is defined as the difference between the third and first quartiles, $IQR(x) = x_{(3n/4)} - x_{(n/4)}$, gives a measure of spread.

The interpretation here is that half of the data lies between the first and third quartiles, an interval of width IQR that contains the median.

diamonds %>%
  summarize(sd_depth = sd(depth), iqr_depth = IQR(depth))

## # A tibble: 1 x 2
##   sd_depth iqr_depth
##      <dbl>     <dbl>
## 1     1.43       1.5

Outliers

We can use estimates of spread to identify outlier values in a dataset. Given an estimate of spread based on the techniques we've just seen, we can identify values that are unusually far away from the center of the distribution.

One often cited rule of thumb is based on using standard deviation estimates. We can identify outliers as the set

$$\mathrm{outliers}_{sd}(x) = \{x_j \mid |x_j - \bar{x}| > k \times \mathrm{sd}(x)\}$$

where $\bar{x}$ is the sample mean of the data and $\mathrm{sd}(x)$ its standard deviation.

Multiplier $k$ determines if we are identifying (in Tukey's nomenclature) outliers or points that are far out.
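A minimal sketch of this rule, reading it as "flag values more than $k$ standard deviations away from the mean". The function name `outliers_sd`, the default `k = 3`, and the toy data are assumptions for illustration, not part of the original material.

```r
# Flag values more than k standard deviations away from the mean.
# outliers_sd is a hypothetical helper name; k = 3 is an illustrative default.
outliers_sd <- function(x, k = 3) {
  x[abs(x - mean(x)) > k * sd(x)]
}

# Toy data: nine values near 10 and one extreme value
x <- c(10, 11, 9, 10, 12, 10, 11, 9, 10, 50)

flagged_k2 <- outliers_sd(x, k = 2)
flagged_k3 <- outliers_sd(x, k = 3)

print(flagged_k2)
print(flagged_k3)
```

On this toy vector the extreme value is flagged with `k = 2` but not with `k = 3`, because the extreme value itself enlarges the standard deviation used as the yardstick.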

While this method works relatively well in practice, it presents a fundamental problem.

Severe outliers can significantly affect spread estimates based on standard deviation.

Specifically, spread estimates will be inflated in the presence of severe outliers.

To circumvent this problem, we use rank-based estimates of spread to identify outliers as:

$$
\begin{aligned}
\mathrm{outliers}_{IQR}(x) = \{x_j \mid \; & x_j < x_{(n/4)} - k \times IQR(x) \text{ or } \\
& x_j > x_{(3n/4)} + k \times IQR(x)\}
\end{aligned}
$$

This is usually referred to as the Tukey outlier rule, with multiplier $k$ serving the same role as before.
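The Tukey rule can be sketched in a few lines of base R. The helper name `outliers_iqr` and the toy data are assumptions; the multipliers 1.5 and 3 used below are the values conventionally associated with "outliers" and "far out" points, not values stated in the text above. Quartiles come from `quantile()` with its default interpolation.

```r
# Flag values more than k IQRs below the first quartile or above the third.
# outliers_iqr is a hypothetical helper name; k = 1.5 is a conventional choice.
outliers_iqr <- function(x, k = 1.5) {
  q1 <- quantile(x, 1/4, names = FALSE)
  q3 <- quantile(x, 3/4, names = FALSE)
  spread <- q3 - q1  # the inter-quartile range
  x[x < q1 - k * spread | x > q3 + k * spread]
}

# Toy data: nine values near 10 and one extreme value
x <- c(10, 11, 9, 10, 12, 10, 11, 9, 10, 50)

flagged_k15 <- outliers_iqr(x, k = 1.5)
flagged_k3  <- outliers_iqr(x, k = 3)

print(flagged_k15)
print(flagged_k3)
```

The extreme value is flagged for both multipliers, since the quartiles (and hence the IQR) barely move when a single extreme value is present.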

We use the IQR here because it is less susceptible to inflation by severe outliers in the dataset.

It also works better for skewed data than the method based on standard deviation.

Skew

The five-number summary can be used to understand if data is skewed.

Consider the differences between the median and the first and third quartiles.

diamonds %>%
  summarize(med_depth = median(depth),
            q1_depth = quantile(depth, 1/4),
            q3_depth = quantile(depth, 3/4)) %>%
  mutate(d1_depth = med_depth - q1_depth,
         d2_depth = q3_depth - med_depth) %>%
  select(d1_depth, d2_depth)

## # A tibble: 1 x 2
##   d1_depth d2_depth
##      <dbl>    <dbl>
## 1    0.800      0.7

If one of these differences is larger than the other, then that indicates that this dataset might be skewed.

The range of data on one side of the median is longer (or shorter) than the range of data on the other side of the median.
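The same comparison can be sketched in base R on a small made-up vector with a long right tail. The data and the names `d1` and `d2` are illustrative assumptions; quartiles use `quantile()` with its default interpolation.

```r
# Toy data with a long right tail: illustrative values only
x <- c(1, 2, 2, 3, 3, 3, 4, 5, 8, 15)

# First quartile, median and third quartile
q <- quantile(x, c(1/4, 1/2, 3/4), names = FALSE)

d1 <- q[2] - q[1]  # distance from first quartile up to the median
d2 <- q[3] - q[2]  # distance from the median up to the third quartile

print(c(d1 = d1, d2 = d2))
```

Here `d2` is larger than `d1`, suggesting that the data stretches further above the median than below it.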

Covariance and correlation

The scatter plot is a visual way of observing relationships between pairs of variables.

Like descriptions of distributions of single variables, we would like to construct statistics that summarize the relationship between two variables quantitatively.

To do this we will extend our notion of spread (or variation of data around the mean) to the notion of co-variation: do pairs of variables vary around the mean in the same way?

Consider now data for two variables over the same $n$ entities: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.

For example, for each diamond, we have carat and price as two variables.

We want to capture the relationship: does $x_i$ vary in the same direction and scale away from its mean as $y_i$?

This leads to covariance

$$\mathrm{cov}(x, y) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$$
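A small sketch of the covariance formula on made-up paired data. The vectors and variable names are illustrative assumptions; note that R's built-in `cov()` divides by $n - 1$ rather than $n$.

```r
# Toy paired data: illustrative values only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)

# Covariance with the 1/n constant, following the formula above
cov_n <- sum((x - mean(x)) * (y - mean(y))) / n

# R's built-in cov() uses the 1/(n - 1) constant instead
cov_builtin <- cov(x, y)

print(c(cov_n = cov_n, cov_builtin = cov_builtin))
```

A positive value indicates that `x` and `y` tend to be above (or below) their means together.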

Just like variance, we have an issue with units and interpretation for covariance, so we introduce correlation (formally, Pearson's correlation coefficient) to summarize this relationship in a unit-less way:

$$\mathrm{cor}(x, y) = \frac{\mathrm{cov}(x, y)}{\mathrm{sd}(x)\,\mathrm{sd}(y)}$$
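The ratio can be checked against R's built-in `cor()`. This sketch uses made-up data and hypothetical variable names; it also rescales `x` to illustrate the unit-less property.

```r
# Toy paired data: illustrative values only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Pearson correlation as covariance divided by the product of standard deviations
cor_manual <- cov(x, y) / (sd(x) * sd(y))
cor_builtin <- cor(x, y)

# Changing the units of x (e.g., multiplying by 100) leaves correlation unchanged
cor_rescaled <- cor(100 * x, y)

print(c(manual = cor_manual, builtin = cor_builtin, rescaled = cor_rescaled))
```

As long as `cov()` and `sd()` use the same constant (both use $n - 1$ in R), the constant cancels in the ratio.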

As before, we can also use rank statistics to define a measure of how two variables are associated.

One of these, Spearman correlation, is commonly used.

It is defined as the Pearson correlation coefficient of the ranks (rather than actual values) of pairs of variables.
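This definition can be sketched directly: rank each variable, then compute the Pearson correlation of the ranks. The data below is made up (a monotone but non-linear relationship), and the variable names are illustrative.

```r
# Toy paired data: y increases with x, but not linearly
x <- c(1, 2, 3, 4, 5)
y <- c(1, 8, 27, 64, 1000)

# Spearman correlation as the Pearson correlation of the ranks
spearman_manual <- cor(rank(x), rank(y))

# The same quantity using the built-in method argument of cor()
spearman_builtin <- cor(x, y, method = "spearman")

# Pearson correlation on the raw values, for comparison
pearson <- cor(x, y)

print(c(spearman = spearman_manual, pearson = pearson))
```

Because the ranks of `x` and `y` agree exactly, the Spearman correlation is 1 here, while the Pearson correlation on the raw values is lower since the relationship is not linear.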

Summary

EDA: visual and computational methods to describe the distribution of data attributes over a range of values

Grammar of graphics as effective tool for visual EDA

Statistical summaries that directly establish properties of data distribution

Introduction to Data Science: Exploratory Data Analysis

Héctor Corrada Bravo

University of Maryland, College Park, USA 2020-03-04

Page 42: Exploratory Data Analysis Introduction to Data Science · First, how to use visualization for exploratory data analysis. 4 / 89 Ultimately, the purpose of EDA is to spot problems

Exploratory Data Analysis: Summary Statistics

Notation

We assume that we have data across   entitites (or observational units)for   attributes.

In this dataset   and  .

However, let's consider a single attribute, and denote the data for thatattribute (or variable) as  .

n

p

n = 53940 p = 10

x1, x2, … , xn

41 / 89

Page 43: Exploratory Data Analysis Introduction to Data Science · First, how to use visualization for exploratory data analysis. 4 / 89 Ultimately, the purpose of EDA is to spot problems

Exploratory Data Analysis: Summary StatisticsSince we want to understand how data is distributed across a range, weshould first define the range.

diamonds %>%

summarize(min_depth = min(depth), max_depth = max(depth))

## # A tibble: 1 x 2

## min_depth max_depth

## <dbl> <dbl>

## 1 43 79

42 / 89

Page 44: Exploratory Data Analysis Introduction to Data Science · First, how to use visualization for exploratory data analysis. 4 / 89 Ultimately, the purpose of EDA is to spot problems

Exploratory Data Analysis: Summary StatisticsWe use notation   and   to denote the minimum and maximumstatistics.

In general, we use notation   for the rank statistics, e.g., the  thlargest value in the data.

x(1) x(n)

x(q) q

43 / 89

Page 45: Exploratory Data Analysis Introduction to Data Science · First, how to use visualization for exploratory data analysis. 4 / 89 Ultimately, the purpose of EDA is to spot problems

Exploratory Data Analysis: Summary Statistics

Central Tendency

Now that we know the range over which data is distributed, we can figureout a first summary of data is distributed across this range.

Let's start with the center of the data: the median is a statistic definedsuch that half of the data has a smaller value.

We can use notation   (a rank statistic) to represent the median.x(n/2)

44 / 89

Page 46: Exploratory Data Analysis Introduction to Data Science · First, how to use visualization for exploratory data analysis. 4 / 89 Ultimately, the purpose of EDA is to spot problems

Exploratory Data Analysis: Summary Statistics

45 / 89

Page 47: Exploratory Data Analysis Introduction to Data Science · First, how to use visualization for exploratory data analysis. 4 / 89 Ultimately, the purpose of EDA is to spot problems

Exploratory Data Analysis: Summary Statistics

Derivation of the mean as central tendency statistic

Best known statistic for central tendency is the mean, or average of thedata:  . It turns out that in this case, we can be a bit moreformal about "center" means in this case.

Let's say that the center of a dataset is a point in the range of the datathat is close to the data.

To say that something is close we need a measure of distance.

¯̄¯x = ∑n

i=1xi

1

n

46 / 89

Page 48: Exploratory Data Analysis Introduction to Data Science · First, how to use visualization for exploratory data analysis. 4 / 89 Ultimately, the purpose of EDA is to spot problems

Exploratory Data Analysis: Summary StatisticsSo for two points   and   what should we use for distance?

The distance between data point   and   is  .

x1 x2

x1 x2 (x1 − x2)2

47 / 89

Page 49: Exploratory Data Analysis Introduction to Data Science · First, how to use visualization for exploratory data analysis. 4 / 89 Ultimately, the purpose of EDA is to spot problems

Exploratory Data Analysis: Summary StatisticsSo, to define the center, let's build a criterion based on this distance byadding this distance across all points in our dataset:

Here RSS means residual sum of squares, and we   to stand forcandidate values of center.

RSS(μ) =n

∑i=1

(xi − μ)21

2

μ

48 / 89

Page 50: Exploratory Data Analysis Introduction to Data Science · First, how to use visualization for exploratory data analysis. 4 / 89 Ultimately, the purpose of EDA is to spot problems

Exploratory Data Analysis: Summary StatisticsWe can plot RSS for different values of  :μ

49 / 89

Page 51: Exploratory Data Analysis Introduction to Data Science · First, how to use visualization for exploratory data analysis. 4 / 89 Ultimately, the purpose of EDA is to spot problems

Exploratory Data Analysis: Summary StatisticsNow, what should our "center" estimate be?

We want a value that is close to the data based on RSS!

So we need to find the value in the range that minimizes RSS.

50 / 89

Page 52: Exploratory Data Analysis Introduction to Data Science · First, how to use visualization for exploratory data analysis. 4 / 89 Ultimately, the purpose of EDA is to spot problems

Exploratory Data Analysis: Summary StatisticsFrom calculus, we know that a necessary condition for the minimizer of RSS is that the derivative of RSS is zero at that point.

So, the strategy to minimize RSS is to compute its derivative, and findthe value of   where it equals zero.

μ̂

μ

51 / 89

Page 53: Exploratory Data Analysis Introduction to Data Science · First, how to use visualization for exploratory data analysis. 4 / 89 Ultimately, the purpose of EDA is to spot problems

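$$
\begin{aligned}
\frac{\partial}{\partial \mu} \frac{1}{2} \sum_{i=1}^{n} (x_i - \mu)^2
  &= \frac{1}{2} \sum_{i=1}^{n} \frac{\partial}{\partial \mu} (x_i - \mu)^2 \quad \text{(sum rule)} \\
  &= \frac{1}{2} \sum_{i=1}^{n} 2 (x_i - \mu) \times (-1) \quad \text{(power rule and chain rule)} \\
  &= \sum_{i=1}^{n} \mu - \sum_{i=1}^{n} x_i \\
  &= n\mu - \sum_{i=1}^{n} x_i
\end{aligned}
$$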


Next, we set that equal to zero and find the value of $\mu$ that solves that equation:

$$
\begin{aligned}
\frac{\partial}{\partial \mu} RSS(\mu) = 0 &\Rightarrow n\mu = \sum_{i=1}^{n} x_i \\
  &\Rightarrow \mu = \frac{1}{n} \sum_{i=1}^{n} x_i
\end{aligned}
$$
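The result can be checked numerically. The sketch below uses hypothetical toy data and base R's `optimize()` to minimize RSS directly, then compares the numerical minimizer with `mean(x)`. The helper names `rss` and `drss` are assumptions made for this illustration.

```r
# Hypothetical toy data.
x <- c(2, 4, 4, 5, 7, 9)

# RSS and its derivative with respect to mu: n * mu - sum(x).
rss  <- function(mu, x) 0.5 * sum((x - mu)^2)
drss <- function(mu, x) length(x) * mu - sum(x)

# Numerical minimizer of RSS over the range of the data.
fit <- optimize(rss, interval = range(x), x = x)
mu_hat <- fit$minimum

# The minimizer should agree with the mean up to numerical tolerance,
# and the derivative should vanish at the mean.
c(numerical = mu_hat, mean = mean(x), derivative_at_mean = drss(mean(x), x))
```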


The fact you should remember:

The mean is the value that minimizes RSS for a vector of attribute values.


It equals the value where the derivative of RSS is 0.


It is the value that minimizes RSS.


And it serves as an estimate of central tendency of the dataset.


Note that in this dataset the mean and median are not exactly equal, but are very close:

diamonds %>%
  summarize(mean_depth = mean(depth), median_depth = median(depth))

## # A tibble: 1 x 2
##   mean_depth median_depth
##        <dbl>        <dbl>
## 1       61.7         61.8


There is a similar argument to define the median as a measure of center.

In this case, instead of using RSS we use a different criterion: the sum of absolute deviations

$$SAD(m) = \sum_{i=1}^{n} |x_i - m|.$$

The median is the minimizer of this criterion.
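As a small check of this claim, the sketch below evaluates the sum of absolute deviations at every data point of a hypothetical odd-length vector and picks the candidate with the smallest value. With an odd number of points the median is itself a data point, so searching over the data points is enough here; that restriction is an assumption of this example, not a general algorithm.

```r
# Hypothetical toy data with an odd number of points.
x <- c(3, 1, 10, 4, 7)

# SAD(m) = sum(|x_i - m|)
sad <- function(m, x) sum(abs(x - m))

# Evaluate SAD using each data point as the candidate center.
sad_at_points <- sapply(x, function(m) sad(m, x))

# The data point with the smallest SAD.
best_m <- x[which.min(sad_at_points)]

c(best_m = best_m, median = median(x))
```

The mean of the same vector gives a SAD that is at least as large, which is one way to see that the two criteria can prefer different centers.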


Spread

Now that we have a measure of center, we can discuss how data is spread around that center.


Variance

For the mean, we have a convenient way of describing this: the average distance (using squared difference) from the mean. We call this the variance of the data:

$$var(x) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$


You will also see it with a slightly different constant in the front for technical reasons that we may discuss later on:

$$var(x) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
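The two constants can be compared directly. In the sketch below (hypothetical toy data, illustrative variable names), both versions are computed by hand. Note that base R's `var()` uses the $n - 1$ version.

```r
# Hypothetical toy data.
x <- c(2, 4, 4, 5, 7, 9)
n <- length(x)

# Sum of squared deviations from the mean.
ss <- sum((x - mean(x))^2)

var_n  <- ss / n        # divides by n
var_n1 <- ss / (n - 1)  # divides by n - 1, as R's var() does

c(var_n = var_n, var_n1 = var_n1, builtin = var(x))
```

For large $n$ the two versions are nearly identical, since they differ only by the factor $(n - 1)/n$.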


Variance is a commonly used statistic for spread but it has the disadvantage that its units are not easy to conceptualize (e.g., squared diamond depth).

A spread statistic that is in the same units as the data is the standard deviation, which is just the square root of variance:

$$sd(x) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$


We can also use standard deviations as an interpretable unit of how far a given data point is from the mean.
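One way to express this in code is to subtract the mean and divide by the standard deviation, which is often called standardizing (a term not used in these slides). The sketch below uses hypothetical toy data; note that base R's `sd()` uses the $n - 1$ constant.

```r
# Hypothetical toy data.
x <- c(2, 4, 4, 5, 7, 9)

# Distance of each point from the mean, in units of standard deviations.
z <- (x - mean(x)) / sd(x)

# Base R's scale() performs the same centering and scaling.
z_scaled <- as.numeric(scale(x))

round(z, 2)
```

A value of `z` equal to 2 means that the point lies two standard deviations above the mean.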


As a rough guide, for data that is approximately normally distributed, we can use "standard deviations away from the mean" as a measure of spread as follows:

SDs   Proportion    Interpretation
1     0.68          68% of the data is within ±1 sds
2     0.95          95% of the data is within ±2 sds
3     0.9973        99.73% of the data is within ±3 sds
4     0.999937      99.9937% of the data is within ±4 sds
5     0.9999994     99.999943% of the data is within ±5 sds
6     0.999999998   99.9999998% of the data is within ±6 sds
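Proportions of this kind can be computed from the normal distribution. The sketch below assumes the data follows a normal distribution exactly, which is an idealization, and uses base R's `pnorm()` to get the probability of falling within $k$ standard deviations of the mean.

```r
# Number of standard deviations away from the mean.
k <- 1:6

# P(-k < Z < k) for a standard normal variable Z.
prop_within <- pnorm(k) - pnorm(-k)

# Tabulate the result.
data.frame(sds = k, proportion = prop_within)
```

For real data these proportions are only a rough guide; skewed or heavy-tailed data can deviate from them substantially.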


Spread estimates using rank statistics

Just like we saw how the median is a rank statistic used to describe central tendency, we can also use rank statistics to describe spread.

For this we use two more rank statistics: the first and third quartiles, $x_{(n/4)}$ and $x_{(3n/4)}$ respectively.


Note, the five order statistics we have seen so far (minimum, maximum, median, and first and third quartiles) are so frequently used that this is exactly what R uses by default as a summary of a numeric vector of data (along with the mean):

summary(diamonds$depth)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   43.00   61.00   61.80   61.75   62.50   79.00


Exploratory Data Analysis: Summary StatisticsThis five­number summary are also all of the statistics used to constructa boxplot to summarize data distribution.

In particular, the inter­quartile range, which is defined as the differencebetween the third and first quartile:   gives ameasure of spread.

IQR(x) = x(3n/4) − x(1n/4)
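A short sketch of this definition in base R, on hypothetical toy data. One caveat: R's `quantile()` interpolates between order statistics by default (its `type = 7` rule), so its quartiles can differ slightly from the pure rank definition used here.

```r
# Hypothetical toy data.
x <- c(1, 1, 2, 2, 3, 4, 6, 9, 15)

# First and third quartiles (R's default interpolation rule).
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)

# Inter-quartile range as the difference between the quartiles.
iqr_manual <- q[2] - q[1]

c(q1 = q[1], q3 = q[2], iqr_manual = iqr_manual, builtin = IQR(x))
```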


The interpretation here is that half the data lies between the first and third quartiles, an interval of width IQR that contains the median.

diamonds %>%
  summarize(sd_depth = sd(depth), iqr_depth = IQR(depth))

## # A tibble: 1 x 2
##   sd_depth iqr_depth
##      <dbl>     <dbl>
## 1     1.43       1.5


Outliers

We can use estimates of spread to identify outlier values in a dataset. Given an estimate of spread based on the techniques we've just seen, we can identify values that are unusually far away from the center of the distribution.


One often cited rule of thumb is based on using standard deviation estimates. We can identify outliers as the set

$$outliers_{sd}(x) = \{x_j \mid |x_j - \bar{x}| > k \times sd(x)\}$$

where $\bar{x}$ is the sample mean of the data and $sd(x)$ its standard deviation.

Multiplier $k$ determines if we are identifying (in Tukey's nomenclature) outliers or points that are far out.
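A minimal sketch of this rule in base R, flagging values whose distance from the sample mean exceeds $k$ standard deviations. The function name `outliers_sd`, the toy data, and the choices of $k$ are hypothetical and only for illustration.

```r
# Flag values more than k standard deviations away from the mean.
outliers_sd <- function(x, k) {
  x[abs(x - mean(x)) > k * sd(x)]
}

# Hypothetical toy data with one extreme value.
x <- c(10, 11, 9, 10, 12, 10, 11, 9, 10, 50)

flagged_k2 <- outliers_sd(x, k = 2)
flagged_k3 <- outliers_sd(x, k = 3)
```

On this toy data the value 50 is flagged with `k = 2` but not with `k = 3`, because the extreme value itself enlarges the standard deviation used in the threshold.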


While this method works relatively well in practice, it presents a fundamental problem.

Severe outliers can significantly affect spread estimates based on standard deviation.

Specifically, spread estimates will be inflated in the presence of severe outliers.


To circumvent this problem, we use rank-based estimates of spread to identify outliers as:

$$outliers_{IQR}(x) = \{x_j \mid x_j < x_{(n/4)} - k \times IQR(x) \text{ or } x_j > x_{(3n/4)} + k \times IQR(x)\}$$

This is usually referred to as the Tukey outlier rule, with multiplier $k$ serving the same role as before.
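The Tukey rule can be sketched in base R as follows. The function name `outliers_iqr` and the toy data are hypothetical. The default `k = 1.5` is a commonly used convention (with `k = 3` for points that are far out) and is an assumption here, since the slides do not fix a value.

```r
# Flag values beyond k * IQR below the first or above the third quartile.
outliers_iqr <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  spread <- q[2] - q[1]
  x[x < q[1] - k * spread | x > q[2] + k * spread]
}

# Hypothetical toy data with one extreme value.
x <- c(10, 11, 9, 10, 12, 10, 11, 9, 10, 50)

flagged     <- outliers_iqr(x)         # conventional outliers
flagged_far <- outliers_iqr(x, k = 3)  # points that are far out
```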


We use the IQR here because it is less susceptible to being inflated by severe outliers in the dataset.

It also works better for skewed data than the method based on standard deviation.
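The difference in sensitivity is easy to see on a small example. The sketch below, using hypothetical toy data, compares how much the standard deviation and the IQR change when a single extreme value is appended.

```r
# Hypothetical toy data without and with one extreme value.
x_clean <- c(10, 11, 9, 10, 12, 10, 11, 9, 10)
x_dirty <- c(x_clean, 50)

# How much each spread estimate grows after adding the extreme value.
sd_ratio  <- sd(x_dirty) / sd(x_clean)
iqr_ratio <- IQR(x_dirty) / IQR(x_clean)

c(sd_ratio = sd_ratio, iqr_ratio = iqr_ratio)
```

In this example the standard deviation grows by more than a factor of ten, while the IQR is unchanged.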


Skew

The five-number summary can be used to understand if data is skewed.

Consider the distances from the first and third quartiles to the median.


diamonds %>%
  # median and quartiles of depth
  summarize(med_depth = median(depth),
            q1_depth = quantile(depth, 1/4),
            q3_depth = quantile(depth, 3/4)) %>%
  # distance from each quartile to the median
  mutate(d1_depth = med_depth - q1_depth,
         d2_depth = q3_depth - med_depth) %>%
  select(d1_depth, d2_depth)

## # A tibble: 1 x 2
##   d1_depth d2_depth
##      <dbl>    <dbl>
## 1    0.800      0.7


If one of these differences is larger than the other, then that indicates that this dataset might be skewed.

The range of data on one side of the median is longer (or shorter) than the range of data on the other side of the median.
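The same comparison can be sketched in base R on a hypothetical vector with a long right tail. The names `d1` and `d2` are illustrative: they hold the distance from the first quartile to the median and from the median to the third quartile.

```r
# Hypothetical toy data with a long right tail.
x <- c(1, 1, 2, 2, 3, 4, 6, 9, 15)

# First quartile, median, third quartile.
q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)

d1 <- q[2] - q[1]  # median minus first quartile
d2 <- q[3] - q[2]  # third quartile minus median

c(d1 = d1, d2 = d2)
```

Here `d2` is larger than `d1`, which suggests the data extends further above the median than below it.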


Covariance and correlation

The scatter plot is a visual way of observing relationships between pairs of variables.

Like descriptions of distributions of single variables, we would like to construct statistics that summarize the relationship between two variables quantitatively.

To do this we will extend our notion of spread (or variation of data around the mean) to the notion of co-variation: do pairs of variables vary around the mean in the same way?


Consider now data for two variables over the same $n$ entities: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.

For example, for each diamond, we have carat and price as two variables.


We want to capture the relationship: does $x_i$ vary in the same direction and scale away from its mean as $y_i$?

This leads to covariance

$$cov(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
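A direct translation of this formula into base R, on hypothetical paired toy data. As with variance, R's built-in `cov()` divides by $n - 1$ rather than $n$, so the two results differ by the factor $(n - 1)/n$.

```r
# Hypothetical paired toy data.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 6)
n <- length(x)

# Covariance with the 1/n constant.
cov_n <- sum((x - mean(x)) * (y - mean(y))) / n

c(cov_n = cov_n, builtin = cov(x, y))
```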


Just like variance, we have an issue with units and interpretation for covariance, so we introduce correlation (formally, Pearson's correlation coefficient) to summarize this relationship in a unit-less way:

$$cor(x, y) = \frac{cov(x, y)}{sd(x) \, sd(y)}$$
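The sketch below computes the correlation by hand on hypothetical toy data and compares it with base R's `cor()`. The helper `sd_n` is an assumption introduced here: it uses the $1/n$ constant so that it matches the $1/n$ covariance. Any constant gives the same correlation as long as it is used consistently, because it cancels in the ratio.

```r
# Hypothetical paired toy data.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 6)

# Covariance and standard deviation, both with the 1/n constant.
cov_n <- mean((x - mean(x)) * (y - mean(y)))
sd_n  <- function(v) sqrt(mean((v - mean(v))^2))

# Pearson correlation as covariance divided by the two spreads.
cor_manual <- cov_n / (sd_n(x) * sd_n(y))

c(cor_manual = cor_manual, builtin = cor(x, y))
```

Because the units cancel, rescaling either variable (for example, `cor(10 * x, y)`) leaves the correlation unchanged.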


As before, we can also use rank statistics to define a measure of how two variables are associated.

One of these, Spearman correlation, is commonly used.

It is defined as the Pearson correlation coefficient of the ranks (rather than actual values) of pairs of variables.
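This definition can be checked directly in base R. The sketch below uses hypothetical toy data in which `y` increases with `x` but not linearly, and compares the Pearson correlation of the ranks with `cor(..., method = "spearman")`.

```r
# Hypothetical toy data: y increases with x, but not linearly.
x <- c(1, 2, 3, 4, 5)
y <- c(1, 8, 27, 64, 1000)

# Spearman correlation as the Pearson correlation of the ranks.
spearman_manual  <- cor(rank(x), rank(y))
spearman_builtin <- cor(x, y, method = "spearman")

# Pearson correlation of the raw values, for comparison.
pearson <- cor(x, y)

c(spearman_manual = spearman_manual,
  spearman_builtin = spearman_builtin,
  pearson = pearson)
```

The rank-based measure equals 1 here because the ordering of `y` matches the ordering of `x` exactly, while the Pearson correlation of the raw values is lower.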


Summary

EDA: visual and computational methods to describe the distribution of data attributes over a range of values

Grammar of graphics as effective tool for visual EDA

Statistical summaries that directly establish properties of data distribution
