Introduction to Big Data - Harvard University… · Introduction to Big Data Chapter 10 (Week 6)...

Post on 27-Jul-2020


Introduction to Big Data

Chapter 10 (Week 6): Exploratory Data Analysis (Visualization)

DCCS208(02) Korea University 2019 Fall

Asst. Prof. Minseok Seo (mins@korea.ac.kr)

Contents

1. Exploratory Data Analysis: Summary Statistic, Visualization

2. Single Variable Visualization: Diverse plots

3. Visualization for Two Variables: Diverse plots

4. Visualization for More than Two Variables: Diverse plots

01 Exploratory Data Analysis - Summary Statistic & Visualization

copyrightⓒ 2018 All rights reserved by Korea University 4

EDA and Visualization - Exploratory Data Analysis

Exploratory Data Analysis (EDA) and Visualization are very important steps in any analysis task.

Get to know your data!
• Distributions (symmetric, normal, skewed)
• Data quality problems
• Outliers
• Correlations and inter-relationships
• Subsets of interest
• Suggested functional relationships

Sometimes EDA or visualization is itself the goal!


Exploratory Data Analysis - Definition of EDA

Goal: get a general sense of the data
• Means, medians, quantiles, histograms, boxplots
• You should always look at every variable; you will learn something!

Think interactive and visual
• Humans are the best pattern recognizers
• You can use more than 2 dimensions: x, y, z, space, color, time, ...

Especially useful in early stages of data mining
• Detect outliers (e.g., assess data quality)
• Test assumptions (e.g., normal distributions or skewed?)
• Identify useful raw data and transforms (e.g., log(x))

Bottom line: it is always well worth looking at your data!


Exploratory Data Analysis - Summary Statistic

Summary statistics are not visualization. Sample statistics of data X:

• Mean: $\bar{x} = \sum_i X_i / n$
• Mode: most common value in X
• Median: with X sorted, median $= X_{n/2}$ (half below, half above)
• Quartiles of sorted X: $Q_1 = X_{0.25n}$, $Q_3 = X_{0.75n}$
• Interquartile range: $Q_3 - Q_1$
• Range: $\max(X) - \min(X) = X_n - X_1$ (with X sorted)
• Variance: $\sigma^2 = \sum_i (X_i - \bar{x})^2 / n$
• Skewness: $\frac{\sum_i (X_i - \bar{x})^3 / n}{\left(\sum_i (X_i - \bar{x})^2 / n\right)^{3/2}}$; zero if symmetric; right-skewed data are more common
• Number of distinct values for a variable
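The statistics on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `summary_statistics`, the sample values, and the order-statistic convention (take the ceil(p·n)-th smallest value) are assumptions made here. Skewness is computed as the third central moment over the 3/2 power of the second, with both moments divided by n.

```python
import math
from collections import Counter


def summary_statistics(values):
    """Sample statistics from the slide, computed on a list of numbers."""
    x = sorted(values)
    n = len(x)
    mean = sum(x) / n

    def order_stat(p):
        # Assumed convention: the ceil(p * n)-th smallest value (1-based).
        # For even n this picks the lower of the two middle values as median.
        k = max(1, math.ceil(p * n))
        return x[k - 1]

    # Central moments divided by n, as in the slide's variance formula.
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    q1, q3 = order_stat(0.25), order_stat(0.75)
    return {
        "mean": mean,
        "mode": Counter(x).most_common(1)[0][0],
        "median": order_stat(0.5),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "range": x[-1] - x[0],
        "variance": m2,
        # Skewness is undefined when all values are equal; report 0.0 then.
        "skewness": m3 / m2 ** 1.5 if m2 > 0 else 0.0,
        "distinct": len(set(x)),
    }


# Made-up values for illustration only.
stats = summary_statistics([1, 2, 2, 3, 4, 7, 9])
for name, value in stats.items():
    print(f"{name:>9}: {value}")
```

The positive skewness for this sample reflects its long right tail (the values 7 and 9). Statistical packages often interpolate between order statistics, so their quartiles can differ slightly from this simple convention.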


Exploratory Data Analysis - Information Visualization

Information visualization: concerned with data that does not have a well-defined representation in 2D or 3D space (i.e., “abstract data”).

Visualization: converting raw data to a form that is viewable and understandable to humans.


Visual Encoding Variables - Important components in visualization

Position, length, area, volume, value, texture, color, shape, transparency, blur / focus, ...


Information in Hue and Value - Important components in visualization

Value is perceived as ordered
• Encode ordinal variables (O)
• Encode continuous variables (Q)

Hue is normally perceived as unordered
• Encode nominal variables (N) using color


Bertin’s Levels of Organization - Important components in visualization


Effectiveness Ranking - Important components in visualization


By using these key elements of visualization, you can accelerate the transformation of information into knowledge.

02 Single Variable Visualization - Single!!!


Histogram - Single Variable Visualization

Shows center, variability, skewness, modality, outliers, or strange patterns.

Bin width and position matter

Beware of real zeros


Pictures of Data: Continuous Variables - How to make a Histogram

Consider the following data collected from the 1995 Statistical Abstract of the United States

• For each of the 50 US states, the proportion of individuals over 65 years of age has been recorded


Let’s find the maximum and minimum values


Break the data range into mutually exclusive, equally sized “bins”. This example uses bins 1% wide.

Let’s count the number of observations in each bin
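The histogram steps (find the minimum and maximum, break the range into equal bins, count per bin) can be sketched as follows. The function name `histogram_counts` and the listed percentages are made up for illustration; they are not the Statistical Abstract figures.

```python
import math


def histogram_counts(values, bin_width):
    """Count observations in equally sized, mutually exclusive bins.

    Bin i covers [start + i * bin_width, start + (i + 1) * bin_width).
    """
    lo, hi = min(values), max(values)               # find min and max
    start = math.floor(lo / bin_width) * bin_width  # left edge of first bin
    n_bins = math.floor((hi - start) / bin_width) + 1
    counts = [0] * n_bins
    for v in values:                                # count per bin
        counts[math.floor((v - start) / bin_width)] += 1
    return start, counts


# Made-up percentages for illustration; NOT the Statistical Abstract figures.
pct_over_65 = [4.6, 10.1, 11.4, 11.9, 12.5, 12.7, 12.9, 13.3, 13.8, 15.2, 18.6]
width = 1
start, counts = histogram_counts(pct_over_65, bin_width=width)

# Text rendering of the histogram: one row per bin, one '#' per observation.
for i, c in enumerate(counts):
    left = start + i * width
    print(f"{left:>2}-{left + width:<2}% | {'#' * c}")
```

Because each bin is half-open, every observation falls into exactly one bin, which is what "mutually exclusive" requires.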


Pictures of Data: Continuous Variables - Drawing the histogram based on this information


Pictures of Data: Histograms - Another example

Suppose we have blood pressure data on a sample of 113 men

Sample mean ($\bar{x}$): 123.6 mmHg

Sample median (Med): 123.0 mmHg

Sample sd (s): 12.9 mmHg


Pictures of Data: Continuous Variables - Drawing the histogram based on this information


Pictures of Data: Continuous Variables - Different bins?


Importance of Intervals - Bin size

How many intervals (bins) should you have in a histogram?

• There is no perfect answer to this

• Depends on sample size n

• Rough rule of thumb: # Intervals ≈ $\sqrt{n}$
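As a quick arithmetic check of the square-root rule of thumb, the sketch below applies it to the two sample sizes used in the earlier examples. The helper names `sqrt_rule_bins` and `suggested_bin_width` are hypothetical, and rounding to the nearest integer is an assumption; the slide only says "roughly".

```python
import math


def sqrt_rule_bins(n):
    """Rule of thumb: number of intervals is roughly sqrt(n)."""
    return max(1, round(math.sqrt(n)))


def suggested_bin_width(lo, hi, n):
    """Equal bin width that covers [lo, hi] with about sqrt(n) intervals."""
    return (hi - lo) / sqrt_rule_bins(n)


print(sqrt_rule_bins(113))  # sample size of the blood pressure example -> 11
print(sqrt_rule_bins(50))   # one observation per state -> 7
```

The rule is only a starting point; it is still worth trying a few nearby bin counts and comparing the pictures.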


Issues with Histogram - Single Variable Visualization

For small data sets, histograms can be misleading. Small changes in the data, the bin width, or the anchor (where the first bin starts) can change the apparent shape.

For large data sets, histograms can be quite effective at illustrating general properties of the distribution.

Histograms effectively only work with 1 variable at a time.
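The sensitivity to the anchor can be seen on a tiny, made-up data set. In the sketch below (the function name `bin_counts` and the six values are assumptions for illustration), the same data and the same bin width give a peaked picture with one anchor and a flat one with another.

```python
import math


def bin_counts(values, width, anchor):
    """Counts per bin when bins are [anchor + k*width, anchor + (k+1)*width)."""
    counts = {}
    for v in values:
        k = math.floor((v - anchor) / width)
        counts[k] = counts.get(k, 0) + 1
    # Return counts for consecutive bins, including empty ones in between.
    return [counts.get(k, 0) for k in range(min(counts), max(counts) + 1)]


data = [1.9, 2.1, 2.9, 3.1, 3.9, 4.1]  # made-up small sample

print(bin_counts(data, width=2, anchor=0))  # [1, 4, 1]: looks peaked
print(bin_counts(data, width=2, anchor=1))  # [3, 3]: looks flat
```

With only six observations, moving the bin edges by one unit is enough to change the apparent shape, which is why small-sample histograms deserve caution.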


Boxplots - Single Variable Visualization

Shows a lot of information about a variable in one plot
• Median
• IQR
• Outliers
• Range
• Skewness

Limitations
• Overplotting
• It is hard to tell distributional shape
• No standard implementation in software (many options for whiskers, outliers)
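The numbers behind a boxplot can be computed directly. Since there is no single standard, the sketch below picks one common convention as an assumption: quartiles from Python's `statistics.quantiles` with the "inclusive" method, and Tukey's rule that points beyond 1.5 × IQR from the box are outliers. The function name `boxplot_stats` and the data are made up for illustration.

```python
import statistics


def boxplot_stats(values, whisker=1.5):
    """Numbers behind a boxplot, using Tukey's 1.5 * IQR rule (an assumption)."""
    x = sorted(values)
    q1, median, q3 = statistics.quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Whiskers extend to the most extreme observations inside the fences.
    inside = [v for v in x if low_fence <= v <= high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in x if v < low_fence or v > high_fence],
    }


# Made-up values for illustration only.
sample = [1, 2, 2, 3, 3, 4, 5, 6, 7, 30]
box = boxplot_stats(sample)
print(box)
```

Other software may use different quartile definitions or whisker rules, so the same data can produce slightly different boxplots.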

[Figure: annotated boxplot marking the sample median, the 25th and 75th percentiles, and the smallest and largest observations]

Example) Hospital length of stay data - Boxplot

[Figure: boxplot of hospital length of stay, with the large outliers marked]


Text cloud - Single Categorical Variable Visualization


Sequence Logo - Single Categorical Variable Visualization


Network plot between words - Single Categorical Variable Visualization

03 Visualization for Two Variables - Two variables


Scatterplots - For two continuous variables


Standard tool to display the relation between two continuous variables.

Useful for answering:
• Are X and Y related to each other? (linear, quadratic, other)
• Does the variance of Y depend on X?
• Are outliers present?
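One reason to look at a scatterplot rather than rely on a single number: a correlation coefficient captures only linear association. The sketch below uses the Pearson correlation (a choice made here; the slide does not name a specific measure) on made-up data where Y depends on X linearly and then quadratically. The function name `pearson_r` is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation; assumes neither variable is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # Y is a linear function of X
quadratic = [x * x for x in xs]    # Y is a quadratic function of X

r_linear = pearson_r(xs, linear)
r_quadratic = pearson_r(xs, quadratic)
print(r_linear, r_quadratic)
```

The quadratic case has a correlation of essentially zero even though Y is completely determined by X; a scatterplot reveals the relationship immediately.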


Is there a relationship between X and Y variables?


Variation in Y differs depending on the value of X.


Limitation

It is very difficult to represent a lot of data at once.


Contour plots - For two continuous variables (Large scale data)

Contour plots are great for representing relationships between two continuous variables.

A contour plot does not give you the exact location of each value, but you can see the relationship and density between the two variables.
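The density that a contour plot shows has to be estimated from the points first. One simple way, assumed here for illustration (the slide does not say how its contours were computed), is to count points in the cells of a regular grid; contour lines are then drawn through cells of equal count. The function name `grid_density` and the points are made up.

```python
import math


def grid_density(points, cell):
    """Count points per square grid cell of side `cell`.

    Returns a dict mapping (column, row) cell indices to counts.
    """
    counts = {}
    for x, y in points:
        key = (math.floor(x / cell), math.floor(y / cell))
        counts[key] = counts.get(key, 0) + 1
    return counts


# Made-up points for illustration only.
points = [(0.1, 0.2), (0.4, 0.3), (0.6, 0.9), (1.2, 1.1),
          (1.4, 1.3), (1.6, 1.8), (1.1, 1.9), (2.5, 2.5)]
density = grid_density(points, cell=1)

for cell_index in sorted(density):
    print(cell_index, density[cell_index])
```

Individual locations are lost in the grid, which mirrors the trade-off described above: exact positions are given up in exchange for a readable picture of where the data are dense.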


Two techniques in visualization - For large scale data

Transparent plotting

Jittering
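Jittering adds a small amount of random noise to each value so that points with identical coordinates no longer sit exactly on top of each other. A minimal sketch, assuming uniform noise and a fixed seed for repeatability (the function name `jitter`, the noise amount, and the data are illustrative choices):

```python
import random


def jitter(values, amount, seed=0):
    """Add uniform noise in [-amount, +amount] to each value."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    return [v + rng.uniform(-amount, amount) for v in values]


# Made-up ratings with many ties, for illustration only.
ratings = [1, 1, 1, 2, 2, 3]
jittered = jitter(ratings, amount=0.2)

print(len(set(ratings)), "distinct positions before jittering")
print(len(set(jittered)), "distinct positions after jittering")
```

The noise amount should stay small relative to the spacing of the data, so that the plot separates tied points without misplacing them.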


Histogram with different colors - One continuous and one categorical variable

If one variable is categorical, we can use small multiples.

Color and shape can visually distinguish the categories!


Side-by-side boxplot - One continuous and one categorical variable

Boxplots can likewise show each level of a categorical variable as a separate box.
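Before drawing side-by-side boxplots, the continuous values have to be split by category. The sketch below shows that grouping step and prints one summary number per group; the function name `group_by_category` and the records are made up for illustration.

```python
import statistics
from collections import defaultdict


def group_by_category(records):
    """Split (category, value) pairs into one list of values per category."""
    groups = defaultdict(list)
    for category, value in records:
        groups[category].append(value)
    return dict(groups)


# Made-up (category, value) records for illustration only.
records = [("A", 3), ("B", 7), ("A", 5), ("B", 9), ("A", 4), ("C", 1)]
groups = group_by_category(records)
medians = {name: statistics.median(vals) for name, vals in groups.items()}

for name in sorted(groups):
    print(name, sorted(groups[name]), "median:", medians[name])
```

Each list in `groups` would become one box in the side-by-side plot.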


Barcharts and Spineplots - One continuous and one categorical variable

Stacked barcharts can be used to compare continuous values across two or more categorical variables.


Pie charts - One continuous and one categorical variable

A very popular form of visualization.

This is good for showing simple relations of proportions.

Bar plots and histograms are usually better (but less pretty).

04 Visualization - More than two variables


Pairwise scatterplots - More than two variables

Pairwise scatterplots can represent multiple variables at once.

However, there can be difficulties in expressing categorical variables.


Multivariate visualization - More than two variables

Creative thinking is required to visualize multiple variables at the same time.

Conditioning on variables
• Trellis or lattice plots
• Different colors and shapes
• Infinite possibilities


Simple question - More than two variables

How many dimensions are represented here?


Parallel Coordinate plot - More than two variables


Network plot - More than two variables

Visualizing networks is helpful, even when it is not obvious that a network exists.


Heatmap - More than two variables

Heatmaps are among the most widely used visualization methods.


Interactivity - Important visualization approaches these days

As the World Wide Web becomes ever more common, interaction is growing in importance in the field of visualization.

Demo


Multi-dimensional plot - More than two variables

One variable represents one dimension.

To represent three variables at once, three-dimensional space is required.


A multi-dimensional plot looks fancy.

But there is a practical issue due to the ‘viewpoint’.

We end up visualizing in a 2D space such as paper or a web page.

https://plot.ly/r/3d-scatter-plots/


Dimension reduction - Visualization

What if you need to visualize more than 1000 variables?

One possible way to visualize high-dimensional data is to reduce it to 2 or 3 dimensions.

• Variable selection (feature selection)
• Principal components analysis (PCA)
• Multi-dimensional scaling

This is the next topic!
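As a small preview of the idea, the sketch below finds the first principal component of a made-up data set with plain power iteration on the covariance matrix. This is one simple way to obtain the leading component, assumed here for illustration; the function name `first_principal_component` is hypothetical, and real analyses would normally use a numerical library and keep two or three components rather than one.

```python
import math


def first_principal_component(rows, iterations=100):
    """Leading principal direction and the 1-D scores of each row.

    Uses power iteration on the covariance matrix (divided by n).
    Assumes the all-ones start vector is not orthogonal to the answer.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
           for i in range(d)]

    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # no variance left in this direction
            break
        v = [x / norm for x in w]

    # Project each centered row onto the direction: 3 variables -> 1 number.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Made-up data: three variables that all move together along (1, 2, -1).
rows = [(x, 2 * x, -x) for x in range(5)]
direction, scores = first_principal_component(rows)
print(direction)
print(scores)
```

Here three perfectly related variables collapse onto a single axis with no loss of information; with real data the first few components keep most, but not all, of the variation.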

End of Slide