
Post on 26-Aug-2018


1

Introduction to exploratory statistics

Jean Paul Maalouf

jpmaalouf@xlstat.com

linkedin.com/in/jean-paul-maalouf

Illustrated with XLSTAT

www.xlstat.com

Oct. 19, 2016

2

PLAN

• XLSTAT: who are we?

• Statistics: categories

• Reminder: Variables, individuals, Descriptive Statistics

• Toward exploratory data analysis: scatter plot colored by group

• Exploratory statistics & Data Mining

• Principal Component Analysis (PCA): concept and practice

• Agglomerative Hierarchical Clustering (AHC): concept and practice

All the data in this class were made up unless otherwise specified

3

XLSTAT: Who are we?

XLSTAT is a user-friendly statistical add-on software for Microsoft Excel®

4

XLSTAT

A growing software and team

Years shown on the timeline: 1993, 1996, 2000, 2006, 2009, 2015, 2016

Milestones shown on the timeline:

- Thierry Fahmy develops a user-friendly solution for data analysis: XLSTAT is born
- XLSTAT makes its first sale on the Internet
- New version, VBA interface, C++ computations, 7 languages
- New products, new website, growing and dynamic team
- The company Addinsoft is created
- New offers adapted to business needs
- XLSTAT 365: cloud version of XLSTAT for Excel 365
- XLSTAT Free: free limited edition

5

XLSTAT in a few numbers

- 200+ statistical features: general or field-oriented solutions
- 50k users: across the world. Companies, education, research
- 16 employees: always receptive to the needs of users
- 120k visits/month on the website: easy tutorials available in 5 languages
- 7 languages
- 400 downloads/day

6

Statistics: 4 categories

7

Statistics: 4 categories

Description: I want to summarize small data sets (1-3 variables) using simple statistics or charts (mean, standard deviation, boxplots...).

Exploration: I want to easily extract information from a large data set without necessarily having a precise question to answer (PCA, AHC...).

Tests: I want to accept / reject a very precise hypothesis assuming error risks (t tests, ANOVA, correlation tests, chi-square...).

Modeling: I want to understand the way a phenomenon evolves according to a set of parameters (regression, ANOVA, ANCOVA...).

Session labels shown on the slide: Nov. 9, Nov. 30, Recording (valid until Oct. 21)

8

Reminder: variables, individuals, descriptive statistics

9

Variables, individuals

Variable

An element that can take different values.

Qualitative variable

A variable that cannot be quantified. Examples: socioprofessional category, geographical origin, type of licence, blood type...

Quantitative variable

A variable that can be quantified. Examples: invoice amount, number of likes on Facebook, sugar concentration, height...

Individual

Elementary statistical unit. Can be described with variables. Examples: customers, surveyed people, patients, laboratory mice...

10

Data set: online shoe selling platform

[Table: example data set; columns are variables, rows are individuals]

11

Descriptive statistics

Commonly used tools according to the situation

- 1 qual. variable: flat sorting (frequency table), mode, pie charts
- 1 quant. variable: center (mean / median); dispersion (variance / std. deviation / quartiles); box plot
- 1 qual. variable x 1 qual. variable: cross tabulation (contingency table)
- 1 quant. variable x 1 quant. variable: scatter plot
- 1 quant. variable x 1 qual. variable: quantitative descriptive statistics per category of the qualitative variable; multiple box plot chart
- 1 quant. variable x 1 quant. variable x 1 qual. variable: scatter plot with points colored according to the categories of the qualitative variable
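As a rough illustration of the tools listed on this slide, here is a minimal Python sketch that uses only the standard library. The customer records are made up for the example, and the variable names (`origin`, invoice amount) are assumptions chosen to echo the shoe-selling data set; this is not how XLSTAT computes these statistics.

```python
from collections import Counter
from statistics import mean, median, quantiles, stdev

# Made-up records for illustration: (origin, invoice amount)
customers = [
    ("Human", 40.0),
    ("Human", 55.0),
    ("Human", 46.0),
    ("Martian", 50.0),
    ("Martian", 45.0),
    ("Plutonian", 90.0),
    ("Plutonian", 110.0),
]

origins = [origin for origin, _ in customers]
invoices = [invoice for _, invoice in customers]

# 1 qualitative variable: frequency table (flat sorting) and mode
frequency_table = Counter(origins)
most_common_origin, _ = frequency_table.most_common(1)[0]

# 1 quantitative variable: center and dispersion
summary = {
    "mean": mean(invoices),
    "median": median(invoices),
    "std_dev": stdev(invoices),             # sample standard deviation (n - 1)
    "quartiles": quantiles(invoices, n=4),  # three cut points: Q1, Q2, Q3
}

# 1 quantitative x 1 qualitative variable: statistics per category
invoices_by_origin = {}
for origin, invoice in customers:
    invoices_by_origin.setdefault(origin, []).append(invoice)
mean_by_origin = {o: mean(v) for o, v in invoices_by_origin.items()}

print(frequency_table)
print(summary)
print(mean_by_origin)
```

The per-category means are the numeric counterpart of the multiple box plot chart: one summary per category of the qualitative variable.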

12

Toward exploratory data analysis: scatter plot colored by group

13

Toward exploratory data analysis: scatter plot colored by group

- Invoice amount decreases with time spent on the website.
- Plutonians spend more money on the website compared to others.
- Martians and humans form a relatively homogeneous group.

- ...

14

Imagine applying the same kind of reasoning to a larger number of variables...

Time for exploratory statistics (or Exploratory Data Analysis)

15

Example: Principal Component Analysis (PCA)

We want to analyze multiple variables (dimensions) at a time, the same way we did with the 2D scatter plot.

16

Exploratory statistics

I want to easily extract information from a large data set without necessarily having a precise question to answer.

17

Exploratory statistics: a few words

Exploratory statistics

Look for information in a multivariate data set without having very precise expectations. Exploratory tools are part of Data Mining.

First thing you can do: concentrate the information of big data sets in a few dimensions

Examples: Principal Component Analysis, Correspondence Analysis…

Second thing you can do: classification ( = clustering = segmentation)

Examples: Agglomerative Hierarchical Clustering, k-means…

18

Principal Component Analysis (PCA)

I’d like to summarize a big data set in a few simple charts

We’ll be able to investigate:

- Relationships among variables
- Proximity among individuals
- How individuals relate to variables

19

PCA: concept

[Figure: the initial data set next to the artificial data set synthesized by PCA, with a scale running from + to - amount of information]

The information is redistributed so that most of it is concentrated on a few dimensions.

PCA jargon:

- dimension = axis = factor
- information = variability = inertia
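To make the idea of redistributed information concrete, the sketch below computes the share of inertia carried by each axis for a small made-up data set. It assumes a PCA on the correlation matrix (standardized variables) and finds the eigenvalues with a plain Jacobi rotation method so that it runs without third-party libraries. The data, the function names and the choice of algorithm are all assumptions for illustration, not a description of XLSTAT's implementation.

```python
import math
from statistics import mean, pstdev

# Made-up data for illustration: one row per customer,
# columns = weight (kg), height (cm), time on site (min), shoe size.
data = [
    [82.0, 185.0, 4.0, 44.0],
    [78.0, 180.0, 5.0, 38.0],
    [85.0, 188.0, 3.0, 41.0],
    [60.0, 165.0, 11.0, 41.0],
    [55.0, 160.0, 12.0, 44.0],
    [63.0, 168.0, 10.0, 38.0],
]

def standardize(rows):
    """Center each column and scale it to unit (population) variance."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def correlation_matrix(rows):
    """Correlation matrix, computed as the covariance of standardized columns."""
    z = standardize(rows)
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / n for j in range(p)] for i in range(p)]

def jacobi_eigenvalues(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    p = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(p) for j in range(p) if i != j)
        if off < tol:
            break
        for i in range(p - 1):
            for j in range(i + 1, p):
                if abs(a[i][j]) < 1e-15:
                    continue
                # Rotation angle that zeroes the (i, j) entry
                theta = 0.5 * math.atan2(2 * a[i][j], a[i][i] - a[j][j])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(p):  # rotate columns i and j
                    aki, akj = a[k][i], a[k][j]
                    a[k][i] = c * aki + s * akj
                    a[k][j] = -s * aki + c * akj
                for k in range(p):  # rotate rows i and j
                    aik, ajk = a[i][k], a[j][k]
                    a[i][k] = c * aik + s * ajk
                    a[j][k] = -s * aik + c * ajk
    return sorted((a[i][i] for i in range(p)), reverse=True)

eigenvalues = jacobi_eigenvalues(correlation_matrix(data))
total_inertia = sum(eigenvalues)  # equals the number of variables here
inertia_share = [ev / total_inertia for ev in eigenvalues]

for axis, share in enumerate(inertia_share, start=1):
    print(f"F{axis}: {share:.1%} of the inertia")
```

With strongly related variables, the first one or two axes carry most of the inertia, which is why a single F1/F2 chart can summarize the whole table.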

20

What PCA looks like in practice

Chart 1: correlation circle

- Acute angle: positively linked variables (e.g. weight & height)
- Right angle: uncorrelated variables (e.g. height & shoe size)
- Obtuse angle: negatively linked variables (e.g. weight & time spent on site)

Vector length reflects how well a variable is represented in the selected plane (F1/F2 here)
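The angle rule comes from a general fact: the Pearson correlation between two variables is the cosine of the angle between their centered data vectors. The sketch below checks this on made-up values (the variable names and numbers are assumptions). On the correlation circle the angles are measured between projections onto F1/F2, so the reading is only approximate, and it is most reliable for long vectors.

```python
import math

def centered(values):
    """Subtract the mean from every value."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def correlation_and_angle(x, y):
    """Return Pearson's r and the angle (degrees) between the centered vectors."""
    cx, cy = centered(x), centered(y)
    dot = sum(a * b for a, b in zip(cx, cy))
    norm = math.sqrt(sum(a * a for a in cx)) * math.sqrt(sum(b * b for b in cy))
    r = dot / norm
    # Clamp to [-1, 1] to guard against floating-point overshoot
    angle = math.degrees(math.acos(max(-1.0, min(1.0, r))))
    return r, angle

# Made-up measurements for six customers
weight = [82.0, 78.0, 85.0, 60.0, 55.0, 63.0]
height = [185.0, 180.0, 188.0, 165.0, 160.0, 168.0]
time_on_site = [4.0, 5.0, 3.0, 11.0, 12.0, 10.0]
shoe_size = [44.0, 38.0, 41.0, 41.0, 44.0, 38.0]

results = {
    "weight & height": correlation_and_angle(weight, height),
    "height & shoe size": correlation_and_angle(height, shoe_size),
    "weight & time on site": correlation_and_angle(weight, time_on_site),
}

for pair, (r, angle) in results.items():
    print(f"{pair}: r = {r:+.2f}, angle = {angle:.0f} degrees")
```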

21

Interpreting the axes

Chart 1: correlation circle

- F1 reflects:
  - High weight & height (right)
  - Long time spent on site (left)
- F2 is strongly related to shoe size:
  - Big shoes (top)
  - Small shoes (bottom)

22

What PCA looks like in practice

Chart 1: correlation circle; chart 2: observations

[Figure: observations chart; one end of the axis is annotated Weight+, Height+, time on site-, the other end Weight-, Height-, time on site+]

23

PCA: explorations ...

- Weight increases with height
- Shoe size is unrelated to weight / height
- Time spent on site decreases with weight & height
- Derrick has big feet. Shaun has small feet.
- Looks like there are two clusters in the data
- And so on...

PCA tutorial link

PCA works only with quantitative data. Click here to check out other exploratory methods.

24

It was easy to detect two clusters of customers. Nice for marketing!

[Figure: observations chart; one end of the axis is annotated Weight+, Height+, time on site-, the other end Weight-, Height-, time on site+]

According to our PCA, customers can be split into two clusters characterized by height, weight and time spent on site. This may help us define tailored marketing campaigns.

But what if groups were not that easy to define visually?

25

Agglomerative Hierarchical Clustering (AHC)

I want to cluster ( = classify = segment) individuals into homogeneous groups ( = segments = clusters = classes)

26

Agglomerative Hierarchical Clustering (AHC)

How to cluster consumers into different groups?

Illustration with 2 variables

EXAMPLE: sensory analysis, chocolate consumers survey

27

AHC – how it works on 2 variables

[Animation: scatter plot of individuals, with Age as one of the two variables; the closest individuals or groups are merged step by step, from 19 groups down to 1 group]

Choosing a “cutting” level: segments are now defined

This can obviously be generalized to more than 2 variables
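The merging process can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: Euclidean distances, average linkage between clusters (one option among several; the slides do not say which one is used), made-up consumers, and a hypothetical second variable next to age.

```python
import math
from itertools import combinations

# Made-up consumers for illustration: (age, hypothetical liking score)
points = {
    "A": (22.0, 8.0), "B": (25.0, 9.0), "C": (24.0, 7.5),
    "D": (45.0, 3.0), "E": (48.0, 2.5),
    "F": (70.0, 6.0), "G": (68.0, 5.0),
}

def average_linkage(c1, c2):
    """Mean Euclidean distance over all pairs drawn from the two clusters."""
    total = sum(math.dist(points[a], points[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

def agglomerate(names):
    """Merge the two closest clusters until one remains; return the history."""
    clusters = [(name,) for name in names]
    history = []  # (dissimilarity at which the merge happened, clusters after it)
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        height = average_linkage(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        history.append((height, list(clusters)))
    return history

def cut(history, n_groups, names):
    """Return the partition with exactly n_groups clusters (the cutting level)."""
    if n_groups == len(names):
        return [(name,) for name in names]
    for _, clusters in history:
        if len(clusters) == n_groups:
            return clusters
    raise ValueError("n_groups must be between 1 and the number of individuals")

names = sorted(points)
history = agglomerate(names)
segments = cut(history, 3, names)

for height, clusters in history:
    print(f"merge at dissimilarity {height:.2f} -> {len(clusters)} groups")
print(segments)
```

The merge heights recorded in `history` are what a dendrogram draws on its vertical axis; a large jump between two successive heights suggests a natural cutting level.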

28

Agglomerative Hierarchical Clustering (AHC)

What it looks like in XLSTAT:

The greater the “vertical distance” between two individuals (or groups), the more different they are.

Here we could split the individuals into 3 or 4 homogeneous groups

[Figure: dendrogram; one leaf per surveyed consumer (labeled by first name); vertical axis: Dissimilarity, from 0 to 250]

29

Agglomerative Hierarchical Clustering (AHC)

3-cluster split:

Okay. And now what?

Let’s describe the 3 groups to see how we could take action on a marketing scale

AHC tutorial link

[Figure: the same dendrogram, split into 3 clusters; vertical axis: Dissimilarity, from 0 to 250]

30

How can I describe segments?

Things become quite straightforward when you extract class membership from the AHC results

31

Describing the segments

Things you can do

- Split individuals into classes and run descriptive statistics on each segment
- Use class membership as a supplementary variable in a PCA
- Use parallel coordinates plots
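The first and third options can be sketched with the standard library alone. The example below assumes that class membership has already been extracted into a `cluster` field; the consumers, the field names and the values are made up. It computes raw means per cluster, then rescales each variable to the 0-1 range and averages it per cluster, which gives one profile per cluster of the kind a parallel coordinates plot draws.

```python
from statistics import mean

# Made-up consumers for illustration, with the cluster label already attached
consumers = [
    {"cluster": 1, "age": 40, "brand_loyalty": 8, "price_sensitivity": 5},
    {"cluster": 1, "age": 44, "brand_loyalty": 7, "price_sensitivity": 4},
    {"cluster": 2, "age": 20, "brand_loyalty": 2, "price_sensitivity": 9},
    {"cluster": 2, "age": 24, "brand_loyalty": 3, "price_sensitivity": 8},
    {"cluster": 3, "age": 60, "brand_loyalty": 9, "price_sensitivity": 2},
    {"cluster": 3, "age": 64, "brand_loyalty": 8, "price_sensitivity": 3},
]
variables = ["age", "brand_loyalty", "price_sensitivity"]

def min_max_scale(rows, variable):
    """Rescale one variable to the 0-1 range across all individuals."""
    values = [row[variable] for row in rows]
    low, high = min(values), max(values)
    return [(value - low) / (high - low) for value in values]

def cluster_means(rows, columns):
    """Mean of each column within each cluster; columns maps name -> list of values."""
    labels = sorted({row["cluster"] for row in rows})
    result = {}
    for label in labels:
        members = [i for i, row in enumerate(rows) if row["cluster"] == label]
        result[label] = {
            name: mean(values[i] for i in members)
            for name, values in columns.items()
        }
    return result

# Descriptive statistics per segment, in the original units
raw_columns = {v: [row[v] for row in consumers] for v in variables}
raw_means = cluster_means(consumers, raw_columns)

# Profiles on a common 0-1 scale, comparable across variables
scaled_columns = {v: min_max_scale(consumers, v) for v in variables}
profiles = cluster_means(consumers, scaled_columns)

for label in profiles:
    print(label, raw_means[label], profiles[label])
```

Rescaling matters only for the plot: it lets variables with different units (years, scores) share one vertical axis.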

32

Describing clusters: descriptive statistics

- Consumers from clusters 1 & 3 are more loyal to brands than those from cluster 2
- Consumers from cluster 2 are younger

33

Describing clusters: parallel coordinates plot

Cluster 3: older consumers, loyal to brands, who prefer bitter chocolate and are not online buyers...

Cluster 2: younger consumers who prefer frozen chocolate, are sensitive to prices and care less about brands

Consequences:

- Promote branded bitter chocolate to older consumers
- Promote cheaper chocolates to younger consumers
- …

Tutorial link

[Figure: parallel coordinates plot; axes: Brand loyalty, Price sensitivity, Online buyer, Bitter, Frozen, Crunchy, Age; vertical scale from 0 to 1; legend: 1, 2, 3]

34

In summary...

Description: I want to summarize small data sets (1-3 variables) using simple statistics or charts. Leads to hypotheses.

Exploration: I want to easily extract information from a large data set without necessarily having a precise question to answer. Leads to hypotheses.

Tests: I want to validate / reject a very precise hypothesis assuming error risks (t tests, ANOVA, correlation tests, chi-square...).

Modeling: I want to understand the way a phenomenon evolves according to a set of parameters (regression, ANOVA, ANCOVA...).

Session labels shown on the slide: Nov. 9, Nov. 30, Recording (valid until Oct. 21)

35

Exploratory statistics: Take Home Message

Exploratory statistics

They help you gain insight into large data sets

They give a synthetic view of large data sets

Examples: Principal Component Analysis, Correspondence Analysis, MDS…

They make it possible to cluster data sets

Examples: Agglomerative Hierarchical Clustering, k-means

Click here to choose an appropriate exploratory data analysis tool according to your situation

36

Data exploration inspired many hypotheses. Are they valid?

Statistical tests

See you on Nov. 9!

www.xlstat.com/fr/formation

37

Thanks for attending!

All the tools we saw are available in all XLSTAT solutions

Survey time…
