Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or...

49
Exploratory data analysis November 29, 2017 Dr. Khajonpong Akkarajitsakul Department of Computer Engineering, Faculty of Engineering King Mongkut’s University of Technology Thonburi

Transcript of Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or...

Page 1: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Exploratory data analysis

November 29, 2017

Dr. Khajonpong Akkarajitsakul

Department of Computer Engineering, Faculty of EngineeringKing Mongkut’s University of Technology Thonburi

Page 2: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Module III – OverviewLearning Outcome• Understand the basic

concepts of exploratory data analysis

• Understand the role of statistics in data exploration

• Choose appropriate data analysis techniques to explore and analyze data

Agenda• Basic concepts of

exploratory data analysis (EDA)

• EDA techniques

Page 3: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Introduction to data

• Process of scientific inquiry:1. Identify a question or problem2. Collect relevant data on the topic3. Analyze the data4. Form a conclusion

• Statistics focuses on making stages (2) – (4) objective, rigorous, and efficient.

• Statistics is the study of how best to collect, analyze and draw conclusion from data.

Page 4: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Exploratory Data Analysis

• Exploratory data analysis or “EDA” is a critical step in analyzing the data

• The main reasons are• detection of mistakes, outliers or abnormalities• checking of assumptions• preliminary selection of appropriate models• determining relationships among the explanatory variables• assessing the direction and rough size of relationships between

explanatory and outcome variables

Page 5: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Exploratory Data Analysis

Page 6: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Data format

• Data from either experiments or operations are generally collected in databases (e.g. spreadsheet)

• One row per record and one column for each identifiers, outcome variables, and explanatory variables

• Each column contains the numerical value of a particular quantitative variable (aka measure) or the levels for a categorical variable (aka dimension)

Page 7: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Data: readingimport pandas as pddf = pd.read_csv(‘mtcars.csv’, header=0)

Page 8: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Types of EDA

• Graphical or Non-graphical• Non-graphical methods usually involve with calculation of summary

statistics• Graphical methods obviously summarize the data in a diagrammatic or

pictorial way

• Univariate or Multivariate• Univariate methods look at one variable (column) at a time while

multivariate methods look at two or more variables at a time• Usually it is a good idea to perform univariate EDA first for each of a

component of multivariate EDA

Page 9: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Four basic types of EDA

• Univariate non-graphical• Multivariate non-graphical• Univariate graphical• Multivariate graphical

Page 10: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Univariate non-graphical EDA

• This is to measure certain characteristics (e.g. age, gender, speed at a task, or response to a stimulus) of data of all subjects/records

• We should think of measurements as representations of a “sample distribution”, which in turn more or less representing the “population distribution”

• The goal is to better understand the “sample distribution” and make some conclusion about the “population distribution”

Page 11: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Univariate non-graphical EDACategorical data• The characteristics of interest for a categorical variable are

simply the range of values and the frequency (or relative frequency) of occurrence for each value.

• Therefore the only useful univariate non-graphical techniques for categorical variables is some form of tabulation of the frequencies, usually along with calculation of the fraction (or percent) of data that falls in each category.

Page 12: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Pandas: creating frequency tableimport pandas as pddf = pd.read_csv(‘iris.csv’, header=0)df.Species.value_counts()

Page 13: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

• Univariate EDA for a quantitative variable is a way to make preliminary assessments about the population distribution of the variable using the data of the observed sample.

• The characteristics of the population distribution of a quantitative variable are its center, spread, modality (number of peaks in the pdf), shape (including “heaviness of the tails"), and outliers.

Univariate non-graphical EDAQuantitative data

Page 14: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Histogram

What can you see from this histogram?

Central tendencySpreadSkewnessEtc.

Page 15: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Central tendencyMean

Page 16: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Central tendencyMedian

Page 17: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Central tendencyWhich location measure is the best?

Page 18: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Mean vs Median

Min. 1st Qu. Median Mean 3rd Qu. Max. 1000 8500 14000 15244 20000 35000

Median < Mean

What happen?

Page 19: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Scale: Variance

Page 20: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Why squared deviations?

• Adding deviations will yield a sum of ?• Absolute values do not have nice mathematical properties

(non-linear)• Squares eliminate the negatives

• Results:• Increasing contribution to the variance as you go farther from the

mean

Page 21: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Scale: Variance

• Variance is somewhat arbitrary• What does it mean to have a variance of 8.9? Or 1.5? Or

1245.34? Or 0.00001?• Nothing. But if you could “standardize” that value, you could

talk or compare about any variance (i.e. deviation) in equivalent terms

• Standard deviations are simply the square root of the variance

Page 22: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Scale: Standard deviation

Page 23: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Scale: Quartiles and IQR

Page 24: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Percentiles (aka Quantiles)

Page 25: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Common distribution shapes

Page 26: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Other distribution shapes

Page 27: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Skewness IPositively skewed• Longer tail in the high value• Mean > Median > Mode

Page 28: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Skewness IINegatively skewed• Longer tail in the low value• Mode > Median > Mean

Page 29: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Univariate Graphical EDAHistogram• Histogram is a graphical representation of

the distribution of numerical data

• It provides a view of data density and the shape of data distribution

• To construct a histogram, the first step is to • bin the range of values• count how many values fall into each interval

• The bins are usually specified as consecutive, non-overlapping intervals of a variable.

• The bins (intervals) must be adjacent, and are usually equal size.

Page 30: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Univariate Graphical EDAEffects of Histogram Bin

Page 31: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Univariate Graphical EDAEffects of Number of Samples

Page 32: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Matplotlib: Histogram• import matplotlib.pyplot as plt• import numpy as np

• data = np.random.randn(1000)• plt.hist(data)• plt.title("Histogram")• plt.xlabel("Value")• plt.ylabel("Frequency")• plt.show()

Page 33: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Boxplot

• The box in boxplot represents the middle 50% of the data

• The middle line indicates median

• Whiskers can be designated as either

• Max/Min• Outlier boundaries

• Upper = Q3 + 1.5*IQR• Lower = Q1 – 1.5*IQR

Page 34: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Matplotlib: Box plots• import matplotlib.pyplot as plt• import numpy as np

• spread = np.random.rand(50) * 100• center = np.ones(25) * 50• flier_high = np.random.rand(10) * 100 + 100• flier_low = np.random.rand(10) * -100• data = np.concatenate((spread, center, • flier_high, flier_low), 0)• plt.boxplot(data)• plt.show()

Page 35: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Multivariate Non-Graphical EDA

• Multivariate non-graphical EDA techniques show the relationship between two or more variables in the form of either cross-tabular or statistics

Page 36: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Cross-Tabulation

Page 37: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Pandas: Cross-Tabulationimport pandas as pd

raw_data = {'SUBJECT_ID': ['GW','JA','TJ','JMA','JMO','JQA','AJ','MVB','WHH','JT','JKP'],'AGE_GROUP': ['Y','M','Y','Y','M','O','O','Y','O','Y','M'],'SEX': ['F','F','M','M','F','F','F','M','F','F','M']}

df = pd.DataFrame(raw_data, columns = ['SUBJECT_ID', 'AGE_GROUP', 'SEX'])print dfprint pd.crosstab(df.AGE_GROUP, df.SEX, margins=True)

Page 38: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Multivariate Graphical EDAScatter plot

• A scatter plot is a graph of the ordered pairs (x,y) of numbers consisting of the independent variable x and the dependent variable y.

Subject Age x Pressure yA 43 128B 48 120C 56 135D 61 143E 67 141F 70 152

120

125

130

135

140

145

150

155

40 50 60 70 80

Pres

sure

Age

Page 39: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Positive and negative relationship

• Simple relationships can also be positive or negative

• A positive relationship exists when both variables increase or decrease at the same time.

• In a negative relationship, as one variable increases, the other variable decreases, and vice versa.

Page 40: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Pearson product moment correlation

• Given n pairs of observations (x1,y1), (x2,y2), …,(xn,yn)

• It is natural to speak of x and y having a positive relationship if large x’s are paired with large y’s and small x’s with small y’s

• On the contrary, if large x’s are paired with small y’s and small x’s with large y’s, then a negative relationship between the variable is implied

Page 41: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Pearson product moment correlation

• Consider the quantity

• Then, if the relationship is strongly positive, an xiabove the mean will tend to be paired with a yi above the mean, so that and this product will also be positive whenever both xiand yi are below their means

( )( )1

n

xy i ii

s x x y y=

= − −∑

( )( ) 0i ix x y y− − >

Page 42: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Pearson product moment correlation

• To make this measure dimensionless, we divide as follow

• A more convenient form for this equation is

( )( )( ) ( )

12 2

1 1

ni ixy i

n nxx yy i ii i

x x y ysr

s s x x y y=

= =

− −= =

− −

∑∑ ∑

( ) ( )( )( ) ( ) ( ) ( )2 22 2

i i i i

i i i i

n x y x yr

n x x n y y

−=

− −

∑ ∑ ∑∑ ∑ ∑ ∑

Page 43: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Correlation coefficient and scatter plot

• The range of the correlation coefficient is from -1 to 1.• If there is a strong positive linear relationship between the variables,

the value of r will be close to 1.• If there is a strong negative linear relationship, the value of r will be

close to -1.• When there is no linear relationship between the variables or only a

weak one, the value of r will be close to 0

Page 44: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Strong correlation?

• A frequently asked question is: “what can it be said that there is a strong correlation between variables, and when is the correlation weak?”

• A reasonable rule of thumb is to say that the correlation is

• weak if 0≤|r| ≤0.5• strong 0.8≤|r|≤1• moderate otherwise

Page 45: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Correlation and shape

Page 46: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Remarks on correlation• When the null hypothesis has been rejected for a

specific value, any of the following five relationships between variables can exist

1. There is a direct cause-and-effect relationship between the variables. That is x causes y.

2. There is a reverse cause-and-effect relationship between the variables. That is y causes x.

3. The relationship between the variables may be caused by a third variable

4. The may be a complexity of interrelationships among many variables

5. The relationship may be coincidental

Page 47: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Pandas: Correlationimport matplotlib.pyplot as pltimport pandas as pdfrom pandas.tools.plotting import scatter_matrixurl = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"df = pd.read_csv(url)df[['mpg', 'disp', 'drat', 'wt']].corr(method='pearson', min_periods=1)

Page 48: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

Pandas: Scatter Matrix• scatter_matrix(df[['mpg', 'disp', 'drat', 'wt']], alpha=0.2, figsize=(6, 6), diagonal='kde')• plt.show()

Page 49: Exploratory data analysis - OCSC · Exploratory Data Analysis • Exploratory data analysis or “EDA” is a critical step in analyzing the data • The main reasons are • detection

End of Lecture

Thank you