Page 1

Introduction to exploratory statistics

Jean Paul Maalouf
jpmaalouf@xlstat.com
linkedin.com/in/jean-paul-maalouf

Illustrated with XLSTAT
www.xlstat.com

Oct. 19, 2016

Page 2

PLAN

• XLSTAT: who are we?

• Statistics: categories

• Reminder: Variables, individuals, Descriptive Statistics

• Toward exploratory data analysis: scatter plot colored by group

• Exploratory statistics & Data Mining

• Principal Component Analysis (PCA): concept and practice

• Agglomerative Hierarchical Clustering (AHC): concept and practice

All the data in this class were made up unless otherwise specified

Page 3

XLSTAT: Who are we?

XLSTAT is a user-friendly statistical add-on software for Microsoft Excel®

Page 4

XLSTAT: A growing software and team

Timeline (years shown on the slide: 1993, 1996, 2000, 2006, 2009, 2015, 2016):

• Thierry Fahmy develops a user-friendly solution for data analysis: XLSTAT is born
• XLSTAT makes its first sale on the Internet
• New version, VBA interface, C++ computations, 7 languages
• New products, new website, growing and dynamic team
• The company Addinsoft is created
• New offers adapted to business needs
• XLSTAT 365: Cloud version of XLSTAT for Excel 365
• XLSTAT Free: Free limited edition

Page 5

XLSTAT in a few numbers

200+ statistical features

General or field-oriented solutions

50k users

Across the world. Companies, education, research

16 employees

Always receptive to the needs of users

120k visits/month on the website

Easy tutorials available in 5 languages

7 languages

400 downloads/day

Page 6

Statistics: 4 categories

Page 7

Statistics: 4 categories

• Description: I want to summarize small data sets (1-3 variables) using simple statistics or charts (mean, standard deviation, boxplots...)
• Exploration: I want to easily extract information from a large data set without necessarily having a precise question to answer. (PCA, AHC...)
• Tests: I want to accept / reject a very precise hypothesis assuming error risks. (t tests, ANOVA, correlation tests, chi-square...)
• Modeling: I want to understand the way a phenomenon evolves according to a set of parameters. (regression, ANOVA, ANCOVA...)

Dates shown on the slide: Nov. 9, Nov. 30; Recording (valid until Oct. 21)

Page 8

Reminder: variables, individuals, descriptive statistics

Page 9

Variables, individuals

Variable
An element that can take different values

Qualitative variable
A variable that cannot be quantified. Examples: socioprofessional category, geographical origin, type of licence, blood type...

Quantitative variable
A variable that can be quantified. Examples: invoice amount, number of likes on Facebook, sugar concentration, height...

Individual
Elementary statistical unit. Can be described with variables. Examples: customers, surveyed people, patients, laboratory mice...

Page 10

Data set: online shoe selling platform

[Figure: data table annotated with "Variables" and "Individuals"]

Page 11

Descriptive statistics
Commonly used tools according to the situation

• 1 qual. variable: Frequency table (flat sorting), mode, pie charts
• 1 quant. variable: Center (mean / median); dispersion (variance / std. deviation / quartiles); box plot
• 1 qual. variable x 1 qual. variable: Cross tabulation (contingency table)
• 1 quant. variable x 1 quant. variable: Scatter plot
• 1 quant. variable x 1 qual. variable: Quantitative descriptive statistics per category of the qualitative variable; multiple box plot chart
• 1 quant. variable x 1 quant. variable x 1 qual. variable: Scatter plot with points colored according to the categories of the qualitative variable
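As a minimal sketch of the tools listed above, the Python snippet below computes a frequency table and mode for one qualitative variable, center and dispersion for one quantitative variable, a cross tabulation of two qualitative variables, and a mean per category. The customer records and field names (`origin`, `gender`, `invoice`) are made up for illustration and are not the data set shown in the slides; only the standard library is used.

```python
from collections import Counter
from statistics import mean, median, mode, quantiles, stdev

# Hypothetical customers of an online shoe shop (made-up values).
customers = [
    {"origin": "Mars", "gender": "F", "invoice": 120.0},
    {"origin": "Mars", "gender": "M", "invoice": 80.0},
    {"origin": "Pluto", "gender": "F", "invoice": 300.0},
    {"origin": "Earth", "gender": "M", "invoice": 100.0},
    {"origin": "Pluto", "gender": "M", "invoice": 260.0},
    {"origin": "Mars", "gender": "F", "invoice": 100.0},
]

# 1 qualitative variable: frequency table and mode.
origins = [c["origin"] for c in customers]
frequency_table = Counter(origins)
most_common_origin = mode(origins)

# 1 quantitative variable: center and dispersion.
invoices = [c["invoice"] for c in customers]
center = {"mean": mean(invoices), "median": median(invoices)}
dispersion = {"std": stdev(invoices), "quartiles": quantiles(invoices, n=4)}

# 2 qualitative variables: cross tabulation (contingency table).
cross_tab = Counter((c["origin"], c["gender"]) for c in customers)

# 1 quantitative x 1 qualitative variable: statistics per category.
invoices_by_origin = {}
for c in customers:
    invoices_by_origin.setdefault(c["origin"], []).append(c["invoice"])
mean_by_origin = {origin: mean(v) for origin, v in invoices_by_origin.items()}

print(frequency_table, center, cross_tab, mean_by_origin, sep="\n")
```

The charts in the list (pie chart, box plot, scatter plot) are left out on purpose; they would need a plotting library.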

Page 12

Toward exploratory data analysis: scatter plot colored by group

Page 13

Toward exploratory data analysis: scatter plot colored by group

- Invoice amount decreases with time spent on the website.
- Plutonians spend more money on the website compared to others.
- Martians and humans form a relatively homogeneous group
- ...
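Readings like these can be backed by simple numbers. The sketch below, on made-up records rather than the data behind the slide, computes the Pearson correlation between time on site and invoice amount, and the mean invoice per group. The record layout and the values are assumptions chosen for illustration.

```python
from statistics import mean

# Made-up records: (origin, minutes spent on site, invoice amount).
records = [
    ("Martian", 10, 90.0), ("Martian", 30, 70.0),
    ("Human", 12, 95.0), ("Human", 28, 65.0),
    ("Plutonian", 8, 200.0), ("Plutonian", 20, 180.0),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


times = [t for _, t, _ in records]
amounts = [a for _, _, a in records]
r = pearson(times, amounts)  # negative when amount decreases with time

# Mean invoice amount per group (the "color" of the scatter plot).
groups = {}
for origin, _, amount in records:
    groups.setdefault(origin, []).append(amount)
mean_amount = {origin: mean(values) for origin, values in groups.items()}

print(round(r, 2), mean_amount)
```

A negative `r` matches the first bullet, and comparing the entries of `mean_amount` matches the second and third.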

Page 14

Imagine applying the same kind of reasoning to a larger number of variables...

Time for exploratory statistics (or Exploratory Data Analysis)

Page 15

Example: Principal Component Analysis (PCA)

We want to analyze multiple variables (dimensions) at a time, the same way we did with the 2D scatter plot.

Page 16

Exploratory statistics

I want to easily extract information from a large data set without necessarily having a precise question to answer.

Page 17

Exploratory statistics: a few words

Exploratory statistics
Look for information in a multivariate data set, without having very precise expectations. Exploratory tools are part of Data Mining.

First thing you can do: concentrate the information of big datasets in a few dimensions
Examples: Principal Component Analysis, Correspondence Analysis…

Second thing you can do: classification ( = clustering = segmentation)
Examples: Agglomerative Hierarchical Clustering, k-means…

Page 18

Principal Component Analysis (PCA)

I’d like to summarize a big data set in a few simple charts

We’ll be able to investigate:

- Relationships among variables
- Proximity among individuals
- How individuals relate to variables

Page 19

PCA: concept

[Figure: the initial dataset next to the artificial data set synthesized by PCA, with the amount of information per dimension ranging from + to -]

The information is redistributed so as to concentrate most of it on a few dimensions.

PCA jargon:

- dimension = axis = factor
- information = variability = inertia
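One common way to perform this redistribution is to standardize the variables and extract the leading eigenvectors of their correlation matrix. The sketch below does that in plain Python with power iteration and deflation. It is an illustration under those assumptions, not XLSTAT's implementation; the data (weight, height, time on site, shoe size) are made up and the function names are hypothetical.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Center each column and scale it to unit (population) variance."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # assumes no constant column
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def correlation_matrix(z):
    """Correlation matrix of already standardized data."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / n for j in range(p)] for i in range(p)]


def leading_eigenpair(matrix, iterations=500):
    """Power iteration: dominant eigenvalue and its unit eigenvector."""
    p = len(matrix)
    vec = [1.0 / (k + 1) for k in range(p)]  # arbitrary non-zero start
    for _ in range(iterations):
        new = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = sum(v * v for v in new) ** 0.5
        vec = [v / norm for v in new]
    value = sum(vec[i] * matrix[i][j] * vec[j] for i in range(p) for j in range(p))
    return value, vec


def pca(rows, n_axes=2):
    """Return explained inertia shares, (eigenvalue, eigenvector) pairs, scores."""
    z = standardize(rows)
    matrix = correlation_matrix(z)
    n_vars = len(matrix)  # total inertia of standardized data
    axes = []
    for _ in range(n_axes):
        value, vec = leading_eigenpair(matrix)
        axes.append((value, vec))
        # Deflation: remove the axis just found before looking for the next.
        matrix = [[matrix[i][j] - value * vec[i] * vec[j] for j in range(n_vars)]
                  for i in range(n_vars)]
    explained = [value / n_vars for value, _ in axes]
    scores = [[sum(x * u for x, u in zip(row, vec)) for _, vec in axes] for row in z]
    return explained, axes, scores


# Made-up individuals: (weight, height, minutes on site, shoe size).
data = [(80, 185, 10, 42), (75, 180, 12, 38), (85, 190, 8, 44),
        (55, 160, 30, 43), (50, 155, 35, 39), (60, 165, 28, 41)]
explained, axes, scores = pca(data)
print([round(share, 2) for share in explained])
```

Here `explained` holds the share of information (inertia) carried by each axis, and `scores` holds the coordinates of the individuals on those axes. The sign of each axis is arbitrary.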

Page 20

What PCA looks like in reality
Chart 1: correlation circle

- Acute angle: positively-linked variables (e.g. weight & height)
- Right angle: uncorrelated variables (e.g. height & shoe size)
- Obtuse angle: negatively-linked variables (e.g. weight & time spent on site)

Vector length reflects representativeness in the selected plane (F1/F2 here)
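The reading rules above can be tied to standard identities. This assumes a normalized PCA (variables standardized, so the analysis runs on the correlation matrix), which is the usual setting for a correlation circle. The symbols are introduced here for illustration: lambda_k is the k-th eigenvalue of the correlation matrix, u_jk is the j-th component of its k-th unit eigenvector, and p is the number of variables.

```latex
\[
r(x_j, F_k) = \sqrt{\lambda_k}\, u_{jk},
\qquad
\sum_{k=1}^{p} r(x_j, F_k)^2 = 1,
\qquad
r(x_j, x_l) = \sum_{k=1}^{p} r(x_j, F_k)\, r(x_l, F_k)
\]
```

The first identity gives the coordinates of a variable on the circle. The second shows that its squared length in the F1/F2 plane, r(x_j, F_1)^2 + r(x_j, F_2)^2, cannot exceed 1. The third shows that when two vectors both reach close to the circle, the cosine of the angle between them approximates their correlation, which is why angles are only reliable for long vectors.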

Page 21

Interpreting the axes
Chart 1: correlation circle

- F1 reflects:
  - High weight & height (right)
  - Long time spent on site (left)
- F2 is strongly related to shoe size:
  - Big shoes (top)
  - Small shoes (bottom)

Page 22

What PCA looks like in reality
Chart 1: correlation circle; chart 2: observations

[Figure: observations chart. Right: Weight+, Height+, time on site-. Left: Weight-, Height-, time on site+]

Page 23

PCA: explorations ...

- Weight increases with height
- Shoe size is unrelated to weight / height
- Time spent on site decreases with weight & height
- Derrick has big feet. Shaun has small feet.
- Looks like there are two clusters in the data
- And so on...

PCA tutorial link

PCA works only with quantitative data. Other exploratory methods are available for other situations.

Page 24

It was easy to detect two clusters of customers. Nice for marketing!

According to our PCA, customers can be split into two clusters characterized by height, weight and time spent on site. This may help us define tailored marketing campaigns.

[Figure: observations chart. Right: Weight+, Height+, time on site-. Left: Weight-, Height-, time on site+]

But what if groups were not that easy to define visually?

Page 25

Agglomerative Hierarchical Clustering (AHC)

I want to cluster ( = classify = segment) individuals into homogeneous groups ( = segments = clusters = classes)

Page 26

Agglomerative Hierarchical Clustering (AHC)

How to cluster consumers into different groups?

Illustration with 2 variables
EXAMPLE: sensory analysis, chocolate consumers survey

Page 27

AHC: how it works on 2 variables

[Animation: individuals plotted on two variables (one axis is Age) are merged step by step: 19 groups, 18 groups, ..., 2 groups, 1 group]

Choosing a “cutting” level

Segments are now defined

This can obviously be generalized to more than 2 variables
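The step-by-step merging can be sketched in a few lines of Python. This is a naive illustration, not XLSTAT's algorithm: it assumes Euclidean distance and average linkage (both are choices made here, and other dissimilarities and linkage rules exist), and the seven points are made up as (age, score) pairs. The function name `ahc` is hypothetical.

```python
from math import dist


def ahc(points, n_clusters):
    """Merge the two closest clusters until only n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start: one group per individual
    merges = []  # (dissimilarity of the merge, number of groups left)

    def linkage(a, b):
        # Average linkage: mean distance over all cross-cluster pairs.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        height, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is still valid
        merges.append((height, len(clusters)))
    return clusters, merges


# Made-up individuals: (age, score on some second variable).
points = [(20, 8), (22, 9), (24, 7), (45, 3), (47, 2), (70, 5), (72, 6)]
clusters, merges = ahc(points, n_clusters=3)
print(clusters)
for height, groups_left in merges:
    print(f"{groups_left} groups after merging at dissimilarity {height:.2f}")
```

Stopping at `n_clusters` plays the role of the “cutting” level. The recorded merge dissimilarities are the heights a dendrogram would display. In practice the variables are often standardized first so that no single unit dominates the distances.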

Page 28

Agglomerative Hierarchical Clustering (AHC)
What it looks like in XLSTAT:

The higher the “vertical distance” between two individuals (or groups), the more different they are.

Here we could split the individuals into 3 or 4 homogeneous groups

[Figure: Dendrogram. Vertical axis: Dissimilarity (0 to 250). Horizontal axis: names of the individuals]

Page 29

Agglomerative Hierarchical Clustering (AHC)
3-cluster split:

Okay. And now what?

Let’s describe the 3 groups to see how we could take action on a marketing scale

AHC tutorial link

[Figure: Dendrogram with the 3-cluster split. Vertical axis: Dissimilarity (0 to 250). Horizontal axis: names of the individuals]

Page 30

How can I describe segments?

Things become quite straightforward when you extract class membership from the AHC results

Page 31

Describing the segments

Things you can do:

- Split individuals into classes and run descriptive statistics on each segment
- Use class membership as a supplementary variable in a PCA
- Use parallel coordinates plots

Page 32

Describing clusters: descriptive statistics

Consumers from clusters 1 & 3 are more loyal to brands than those from cluster 2

Consumers from cluster 2 are younger
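Once class membership is available as a column, describing the clusters amounts to computing descriptive statistics per class. A minimal sketch follows, on made-up survey rows; the column layout (cluster label, age, brand loyalty score) and the values are assumptions, not the data behind the slide.

```python
from statistics import mean

# Made-up survey rows: (cluster label, age, brand loyalty score from 0 to 10).
rows = [
    (1, 52, 8), (1, 48, 7),
    (2, 23, 3), (2, 27, 4),
    (3, 60, 9), (3, 64, 8),
]


def describe_by_cluster(rows):
    """Group rows by cluster label and summarize each group."""
    grouped = {}
    for label, age, loyalty in rows:
        grouped.setdefault(label, []).append((age, loyalty))
    summary = {}
    for label, members in grouped.items():
        ages, loyalties = zip(*members)
        summary[label] = {
            "n": len(members),
            "mean_age": mean(ages),
            "mean_loyalty": mean(loyalties),
        }
    return summary


summary = describe_by_cluster(rows)
for label, stats in sorted(summary.items()):
    print(label, stats)
```

Comparing `mean_age` and `mean_loyalty` across labels gives the kind of statements made on this slide.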

Page 33

Describing clusters: parallel coordinates plot

Cluster 3: older consumers, loyal to brands, who prefer bitter chocolate and are not online buyers...

Cluster 2: younger consumers, prefer frozen chocolate, are sensitive to prices and care less about brands

Consequences:

- Promote branded bitter chocolate to older consumers
- Promote cheaper chocolates to younger consumers
- …

Tutorial link

[Figure: Parallel coordinates plot. Axes: Brand loyalty, Price sensitivity, Online buyer, Bitter, Frozen, Crunchy, Age. Vertical scale: 0 to 1. Legend: clusters 1, 2, 3]
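The 0 to 1 scale of the plot suggests that each variable was rescaled before plotting. The sketch below assumes min-max rescaling, which is a guess about the chart and not something the slide states. It computes one mean profile per cluster, which is what each line of a parallel coordinates plot would trace. The values and the reduced set of three variables are made up.

```python
from statistics import mean


def min_max_scale(column):
    """Rescale a column to the 0..1 range (assumes it is not constant)."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


variables = ["Brand loyalty", "Price sensitivity", "Age"]
labels = [2, 2, 3, 3]  # cluster membership of each row
table = [
    [2, 9, 22],
    [4, 7, 28],
    [9, 3, 61],
    [8, 2, 67],
]

# Rescale column by column, then go back to one tuple per individual.
scaled_columns = [min_max_scale(col) for col in zip(*table)]
scaled_rows = list(zip(*scaled_columns))

# Mean profile of each cluster across the rescaled variables.
profiles = {}
for label in sorted(set(labels)):
    members = [row for row, lab in zip(scaled_rows, labels) if lab == label]
    profiles[label] = [mean(col) for col in zip(*members)]

for label, profile in profiles.items():
    print(label, dict(zip(variables, (round(v, 2) for v in profile))))
```

Rescaling puts variables with different units, such as age and a loyalty score, on a common axis so the cluster profiles can be compared at a glance.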

Page 34

In summary...

• Description: I want to summarize small data sets (1-3 variables) using simple statistics or charts. Leads to hypotheses.
• Exploration: I want to easily extract information from a large data set without necessarily having a precise question to answer. Leads to hypotheses.
• Tests: I want to validate / reject a very precise hypothesis assuming error risks. (t tests, ANOVA, correlation tests, chi-square...)
• Modeling: I want to understand the way a phenomenon evolves according to a set of parameters. (regression, ANOVA, ANCOVA...)

Dates shown on the slide: Nov. 9, Nov. 30; Recording (valid until Oct. 21)

Page 35

Exploratory statistics: Take Home Message

Exploratory statistics allow you to gain insight into large data sets

They give a synthetic view of large data sets
Examples: Principal Component Analysis, Correspondence Analysis, MDS…

They allow clustering data sets
Examples: Agglomerative Hierarchical Clustering, k-means

Choose an appropriate exploratory data analysis tool according to your situation

Page 36

Data exploration inspired many hypotheses. Are they valid?

Statistical tests

See you on Nov. 9!

www.xlstat.com/fr/formation

Page 37

Thanks for attending!

All the tools we saw are available in all XLSTAT solutions

Survey time…