
Discrete Data Analysis: A Friendly Guide to Visualising Categorical Data

Julian Hatwell

04 July, 2019


Introduction


About Me

- Over 15 years in enterprise systems, management information systems and BI in the for-profit education sector, UK and Singapore

- MSc (Distinction) Business Intelligence, Birmingham City University. Master’s Dissertation: An Association Rules Based Method for Imputing Missing Likert Scale Data

- Research to PhD (in progress): Designing Explanation Systems for Decision Forests, Data Analytics and Artificial Intelligence research group, Birmingham City University

Academic Analytics

Monitor, Evaluate and Optimise

- Typical BI analysis of sales and marketing (student recruitment), financial, HR, operations, etc.

- student life-cycle: recruitment to alumni services
- class room utilisation and exam planning
- lecturer management (part-time and adjunct)

Credits

- Michael Friendly is a pioneer in this field and has contributed to the development of modules and libraries for SAS and R.
- This book (Figure 1) is the bible.
- Shorter, valuable tutorial on these topics:
  - R> library("vcd")
  - R> vignette("vcd")
- See https://www.youtube.com/watch?v=qfNsoc7Tf60 for an in-depth lecture by Michael Friendly.

Figure 1: Discrete Data Analysis with R: Visualization and Modeling Techniques for Categorical and Count Data

Visualising Categorical Data

Rigorous statistical analysis of categorical data is not only supported by visual tools, it is best performed visually.

The main theme of Michael Friendly’s work.


Case Study


Analysis of Negative Trend in Student Outcomes

This is a synthetic data set, designed to replicate and exaggerate some problems found in a real life scenario.

Academic life-cycle data for undergraduates. 2014 and 2015 intakes, graduating in 2017 and 2018 respectively. In 2018, rates of non-graduation (fails and withdrawals) were found to be significantly higher than the previous year.

An analysis was conducted to determine the factors associated with this trend.

Academic Data

## Total Students

## [1] 7193

##    year        faculty          hqual
##  2014:3192   fin :2129   dip      :2784
##  2015:4001   law :2188   dip-other: 616
##              mgmt:2876   hs       :2404
##                          hs-equiv :1389

##   outcome        grad
##  dist : 423   FALSE:1186
##  fail : 793   TRUE :6007
##  merit:1212
##  pass :4372
##  wthdr: 393


Limitations of Working with Tables Directly

##        year
## grad    2014 2015
##   FALSE  481  705
##   TRUE  2711 3296

loddsratio(with(sts, table(grad, year)))

## log odds ratios for grad and year
##
## [1] -0.1869385

confint(loddsratio(with(sts, table(grad, year))))

##                           2.5 %     97.5 %
## FALSE:TRUE/2014:2015 -0.3134998 -0.0603772
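The log odds ratio and its interval can also be worked out by hand, which makes it clearer what loddsratio() is reporting. The sketch below uses only base R and the four counts printed above; it assumes the usual Wald interval (estimate plus or minus 1.96 standard errors), which appears to reproduce the confint() output. The names tab, lor, se and ci are illustrative only.

```r
# 2 x 2 table of graduation status by intake year (counts from the table above)
tab <- matrix(c(481, 2711, 705, 3296), nrow = 2,
              dimnames = list(grad = c("FALSE", "TRUE"),
                              year = c("2014", "2015")))

# Log odds ratio: log((n11 * n22) / (n12 * n21))
lor <- log((tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1]))

# Asymptotic standard error: square root of the sum of reciprocal cell counts
se <- sqrt(sum(1 / tab))

# 95% Wald confidence interval (an assumption about the method used by confint)
ci <- lor + c(-1, 1) * qnorm(0.975) * se

round(c(lor = lor, lower = ci[1], upper = ci[2]), 4)
```

An interval that excludes zero corresponds to an odds ratio that differs significantly from 1, which is the same information the fourfold plot shows graphically.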


Limitations of Working with Tables Directly

##        hqual
## faculty  dip dip-other   hs hs-equiv
##    fin   548       101 1021      459
##    law   710       171  788      519
##    mgmt 1526       344  595      411

Naive Approach: Barplots

Which is the right grouping variable?

Can we succinctly describe the relationship?

[Figure: two grouped barplots of Freq (y-axis, 0 to 1500). Left: hqual on the x-axis (dip, dip-other, hs, hs-equiv), bars grouped by faculty (fin, law, mgmt). Right: faculty on the x-axis (fin, law, mgmt), bars grouped by hqual (dip, dip-other, hs, hs-equiv).]

Fourfold Plots

fourfold(with(sts, table(grad, year)))

[Figure: fourfold plot of grad (FALSE, TRUE) by year (2014, 2015). Cell counts: grad FALSE 481 (2014) and 705 (2015); grad TRUE 2711 (2014) and 3296 (2015).]

A specialised plot for 2 × 2 contingency tables which exposes the log odds ratio. A significant difference in the odds shows up as non-overlapping confidence rings between adjacent quadrants.

Fourfold Plots

Fourfold plots are automatically stratified over the levels of a third variable when given a 2 × 2 × k table.

fourfold(with(sts, table(grad, year, faculty)))

[Figure: fourfold plots of grad by year, one panel per faculty. Cell counts:]

faculty  grad    2014  2015
fin      FALSE    165   303
fin      TRUE     704   957
law      FALSE    154   268
law      TRUE     716  1050
mgmt     FALSE    162   134
mgmt     TRUE    1291  1289
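The stratified panels can be checked numerically by computing one log odds ratio per faculty. This is a minimal base R sketch using the counts read off the three panels above; the helper lor_one() and the object names are hypothetical, and the Wald interval is an assumption about how significance is judged.

```r
# grad x year x faculty counts from the three fourfold panels
tab3 <- array(c(165, 704, 303, 957,      # fin
                154, 716, 268, 1050,     # law
                162, 1291, 134, 1289),   # mgmt
              dim = c(2, 2, 3),
              dimnames = list(grad = c("FALSE", "TRUE"),
                              year = c("2014", "2015"),
                              faculty = c("fin", "law", "mgmt")))

# Log odds ratio and Wald standard error for a single 2 x 2 slice
lor_one <- function(m) {
  c(lor = log((m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1])),
    se  = sqrt(sum(1 / m)))
}

# One column per faculty; rows are "lor" and "se"
res <- apply(tab3, 3, lor_one)
lor <- res["lor", ]
ci  <- rbind(lower = lor - qnorm(0.975) * res["se", ],
             upper = lor + qnorm(0.975) * res["se", ])

round(rbind(lor, ci), 3)
```

A faculty whose interval lies entirely below zero is one where the odds moved significantly between the two intakes; an interval that straddles zero indicates no detectable change for that stratum.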


Fourfold Plots

fourfold(with(sts, table(grad, year, hqual)))

[Figure: fourfold plots of grad by year, one panel per hqual. Cell counts:]

hqual      grad    2014  2015
dip        FALSE    248   388
dip        TRUE     998  1150
dip-other  FALSE     84   121
dip-other  TRUE     199   212
hs         FALSE     89   108
hs         TRUE     995  1212
hs-equiv   FALSE     60    88
hs-equiv   TRUE     519   722

Correspondence Analysis

An unsupervised/clustering method for exploring per-level associations, available in the R package "ca".

Multiple/Joint ca, the mjca() function, operates simultaneously on two or more variables. mjca is harder to interpret and not shown here.

Simple ca, the ca() function, operates on just two variables.

sts_ca_hq_out <- ca(with(sts, table(hqual, outcome)))

sts_ca_fc_out <- ca(with(sts, table(faculty, outcome)))

Correspondence Analysis Plot

# Generate the plot (x is a fitted ca object, e.g. sts_ca_hq_out)
res.ca <- plot(x)
# add some segments from the origin to make things clearer
segments(0, 0, res.ca$cols[, 1], res.ca$cols[, 2], col = "red", lwd = 1)
segments(0, 0, res.ca$rows[, 1], res.ca$rows[, 2], col = "blue", lwd = 1.5, lty = 3)

Correspondence Analysis

[Figure: correspondence analysis plot of hqual and outcome. Dimension 1 (97%), Dimension 2 (2.8%). Row points: dip, dip-other, hs, hs-equiv. Column points: dist, fail, merit, pass, wthdr.]

Correspondence Analysis

[Figure: correspondence analysis plot of faculty and outcome. Dimension 1 (78%), Dimension 2 (22%). Row points: fin, law, mgmt. Column points: dist, fail, merit, pass, wthdr.]

Extending Simple Correspondence Analysis

A pivot table creates rows for combinations of more than one variable.

structable(outcome ~ hqual + faculty, data = sts)

##                   outcome dist fail merit pass wthdr
## hqual     faculty
## dip       fin                2  227    10  280    29
##           law                6  168    32  463    41
##           mgmt              54   95   224 1077    76
## dip-other fin                0   52     5   39     5
##           law                2   79     4   76    10
##           mgmt              10   36    36  239    23
## hs        fin               92   22   274  566    67
##           law               45   30   146  528    39
##           mgmt             108    8   176  272    31
## hs-equiv  fin               28   42    94  271    24
##           law               17   26    84  363    29
##           mgmt              59    8   127  198    19
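The stacked layout is just a two-way table whose rows are hqual and faculty combinations, so simple correspondence analysis still applies. The sketch below rebuilds that 12 × 5 matrix in base R from the counts printed above. The row labels of the form hqual_faculty and the object names are illustrative choices, not the author's code.

```r
outcomes <- c("dist", "fail", "merit", "pass", "wthdr")
hqual    <- c("dip", "dip-other", "hs", "hs-equiv")
faculty  <- c("fin", "law", "mgmt")

# One row per hqual/faculty combination, in the order printed by structable()
counts <- c(  2, 227,  10,  280, 29,   # dip, fin
              6, 168,  32,  463, 41,   # dip, law
             54,  95, 224, 1077, 76,   # dip, mgmt
              0,  52,   5,   39,  5,   # dip-other, fin
              2,  79,   4,   76, 10,   # dip-other, law
             10,  36,  36,  239, 23,   # dip-other, mgmt
             92,  22, 274,  566, 67,   # hs, fin
             45,  30, 146,  528, 39,   # hs, law
            108,   8, 176,  272, 31,   # hs, mgmt
             28,  42,  94,  271, 24,   # hs-equiv, fin
             17,  26,  84,  363, 29,   # hs-equiv, law
             59,   8, 127,  198, 19)   # hs-equiv, mgmt

# faculty varies fastest within hqual, matching the printed table
row_labels <- paste(rep(hqual, each = 3), rep(faculty, times = 4), sep = "_")

stacked <- matrix(counts, nrow = 12, byrow = TRUE,
                  dimnames = list(hqual_faculty = row_labels,
                                  outcome = outcomes))

# Sanity checks against the marginal totals shown earlier
rowSums(stacked)
colSums(stacked)
```

A matrix like stacked could then be handed to ca() in the same way as the two-variable tables, giving one row point per combination.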


Correspondence Analysis

[Figure: correspondence analysis plot of the stacked hqual and faculty rows against outcome. Dimension 1 (81.7%), Dimension 2 (17%). Row points: dip_fin, dip_law, dip_mgmt, dip-other_fin, dip-other_law, dip-other_mgmt, hs_fin, hs_law, hs_mgmt, hs-equiv_fin, hs-equiv_law, hs-equiv_mgmt. Column points: dist, fail, merit, pass, wthdr.]

Introducing the Strucplot Framework

- Purpose built for handling tables - no dependent/independent variables.
- Respects the table structure given.
- Table is represented as rectangular tiles.
- Tile area ∝ cell count.
- Several variants: mosaic, sieve, tile, assoc, spine, doubledecker
- Mosaic is the most versatile and supports higher dimensional data and models.
- Fourfold is also part of the strucplot framework but uniquely specialised for 2 × 2 contingency tables and log odds ratio tests.

faculty Marginal Totals

[Figure: mosaic of the faculty marginal totals: fin 2129, law 2188, mgmt 2876. Pearson residuals are all 0.]

hqual Marginal Totals

[Figure: mosaic of the hqual marginal totals: dip 2784, dip-other 616, hs 2404, hs-equiv 1389. Pearson residuals are all 0.]

Expected Frequencies (Assume Independence)

[Figure: mosaic of faculty by hqual showing the expected frequencies under independence. Pearson residuals are numerically zero (legend range −3.7e−15 to 4.3e−15).]

faculty  dip     dip-other  hs     hs-equiv
fin      824     182.3      711.5  411.1
law      846.8   187.4      731.3  422.5
mgmt     1113.1  246.3      961.2  555.4

Observed Frequencies

[Figure: mosaic of faculty by hqual showing the observed frequencies, shaded by Pearson residuals (legend range −12 to 12).]

faculty  dip   dip-other  hs    hs-equiv
fin      548   101        1021  459
law      710   171        788   519
mgmt     1526  344        595   411
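The expected frequencies and the Pearson residuals that drive the shading can be reproduced with a few lines of base R. This is a sketch from the observed counts above; the object names obs, expected and pearson are illustrative.

```r
# Observed faculty x hqual counts (filled column by column)
obs <- matrix(c( 548,  710, 1526,    # dip
                 101,  171,  344,    # dip-other
                1021,  788,  595,    # hs
                 459,  519,  411),   # hs-equiv
              nrow = 3,
              dimnames = list(faculty = c("fin", "law", "mgmt"),
                              hqual = c("dip", "dip-other", "hs", "hs-equiv")))

n <- sum(obs)

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / n

# Pearson residual: (observed - expected) / sqrt(expected)
pearson <- (obs - expected) / sqrt(expected)

round(expected, 1)
round(pearson, 1)

# The squared residuals sum to the Pearson chi-squared statistic
chisq <- sum(pearson^2)
```

Cells with large positive residuals are over-represented relative to independence and cells with large negative residuals are under-represented, which is what the shading in the mosaic encodes.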


Scaling to Higher Dimensions

[Figure: four-way mosaic of faculty (fin, law, mgmt), hqual (dip, dip-other, hs, hs-equiv), year (2014, 2015) and grad (TRUE, FALSE), shaded by deviance residuals (legend range −13 to 12).]

Example Code

mosaic(my_table
       # shading function
       , gp = shading_Friendly2
       # tile spacing
       , spacing = spacing_equal(sp = unit(0.4, "lines"))
       # label positioning
       , rot_labels = c(0, -45, -45, 90)
       , rot_varnames = c(0, -90, 0, 90)
       , offset_labels = c(0, 0.5, 0, 0)
       , offset_varnames = c(0, 1, 0, 0.5))

Loglinear Models

Two discrete variables:

$$A = \{A_1, \ldots, A_I\}, \quad B = \{B_1, \ldots, B_J\}$$

Represented as an I × J contingency table. Under independence, each cell count $n_{ij}$ in the table is assumed to be a Poisson distributed random variable:

$$n_{ij} \sim \mathrm{Pois}\left(\frac{n_{i+} \cdot n_{+j}}{n_{++}}\right)$$

$n_{i+}$ = row total, $n_{+j}$ = column total, $n_{++}$ = table total

Loglinear Models

Rearranging, we get a model that is linear in all the logs:

$$\log\left(\frac{n_{i+} \cdot n_{+j}}{n_{++}}\right) = \log(n_{i+}) + \log(n_{+j}) - \log(n_{++})$$

This is usually represented as:

$$\mu + \lambda_i^A + \lambda_j^B$$

where $\mu$ is an “intercept” term that is a property of the total count and the $\lambda$ terms are the “main” effects.

Under independence, the main effects can be calculated directly from the marginal totals.
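To make the last point concrete, the sketch below fits the independence model to the faculty by hqual counts with glm() and compares its coefficients with values computed directly from the margins. It assumes sum-to-zero coding (contr.sum) so that the glm coefficients line up with $\mu$, $\lambda_i^A$ and $\lambda_j^B$; the object names are illustrative.

```r
obs <- matrix(c( 548,  710, 1526,
                 101,  171,  344,
                1021,  788,  595,
                 459,  519,  411),
              nrow = 3,
              dimnames = list(faculty = c("fin", "law", "mgmt"),
                              hqual = c("dip", "dip-other", "hs", "hs-equiv")))

# Long format: one row per cell, columns faculty, hqual, Freq
df <- as.data.frame(as.table(obs))

# Independence model with sum-to-zero constraints on the main effects
fit <- glm(Freq ~ faculty + hqual, data = df, family = poisson,
           contrasts = list(faculty = "contr.sum", hqual = "contr.sum"))

# The same parameters computed directly from the marginal totals
log_row  <- log(rowSums(obs))
log_col  <- log(colSums(obs))
mu       <- mean(log_row) + mean(log_col) - log(sum(obs))
lambda_A <- log_row - mean(log_row)   # faculty main effects, sum to zero
lambda_B <- log_col - mean(log_col)   # hqual main effects, sum to zero

# contr.sum reports all but the last level of each factor
cbind(glm = coef(fit),
      direct = c(mu, lambda_A[-3], lambda_B[-4]))
```

The two columns should agree up to numerical tolerance, and the omitted last-level effects are minus the sum of the others.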


Loglinear Models

Deviation from independence implies an interaction term:

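$$\mu + \lambda_i^A + \lambda_j^B + \lambda_{ij}^{AB}$$

To allow for estimation, we constrain:

$$\sum_{i=1}^{I} \lambda_i^A = 0, \quad \sum_{j=1}^{J} \lambda_j^B = 0, \quad \sum_{i=1}^{I} \lambda_{ij}^{AB} = \sum_{j=1}^{J} \lambda_{ij}^{AB} = 0.$$

This scales to more variables as follows:

$$\mu + \lambda_i^A + \lambda_j^B + \lambda_k^C + \lambda_{ij}^{AB} + \lambda_{jk}^{BC} + \lambda_{ik}^{AC} + \lambda_{ijk}^{ABC}$$

The model with all parameters (one for each table cell) included is referred to as the saturated model - we can always fit the data, but learn nothing.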
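A quick way to see why the saturated model teaches nothing is to fit it: it reproduces every cell exactly and leaves no residual degrees of freedom. The following is a minimal base R sketch on the faculty by hqual counts; the object names are illustrative.

```r
obs <- matrix(c( 548,  710, 1526,
                 101,  171,  344,
                1021,  788,  595,
                 459,  519,  411),
              nrow = 3,
              dimnames = list(faculty = c("fin", "law", "mgmt"),
                              hqual = c("dip", "dip-other", "hs", "hs-equiv")))
df <- as.data.frame(as.table(obs))

# Saturated model: main effects plus the faculty:hqual interaction
sat <- glm(Freq ~ faculty * hqual, data = df, family = poisson)

length(coef(sat))    # one parameter per cell of the 3 x 4 table
df.residual(sat)     # no degrees of freedom left
deviance(sat)        # effectively zero
```

Every fitted value equals the observed count, so there are no residuals left to inspect.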


Loglinear Models

When modelling, we seek the most parsimonious model that still fits well. Find a model that:

- is as “good” as the saturated model - non-significant residuals
- does not use all the parameters / degrees of freedom

Recommended Methodology:

- generalised linear models; glm(family = poisson)
- add interaction terms one at a time and hierarchically, according to domain knowledge
- visual inspection of mosaic plots for significant residuals
- repeat until residuals are non-significant and the mosaic is “cleaned up”

Diagnostics and model selection using anova and AIC/BIC.
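The workflow above can be sketched on a smaller table. The example below uses the grad by year by faculty counts from the stratified fourfold plots rather than the full four-way data; the data frame sts3, the model names m0 to m2 and the order in which interactions are added are illustrative assumptions, not the case study's models.

```r
# 2 x 2 x 3 table in long format; expand.grid varies grad fastest, then year
sts3 <- expand.grid(grad = c("FALSE", "TRUE"),
                    year = c("2014", "2015"),
                    faculty = c("fin", "law", "mgmt"))
sts3$Freq <- c(165, 704, 303, 957,      # fin
               154, 716, 268, 1050,     # law
               162, 1291, 134, 1289)    # mgmt

# Start from mutual independence, then add interactions hierarchically
m0 <- glm(Freq ~ faculty + grad + year, data = sts3, family = poisson)
m1 <- update(m0, . ~ . + faculty:year)
m2 <- update(m1, . ~ . + faculty:grad)

# Sequential likelihood ratio tests and information criteria
anova(m0, m1, m2, test = "Chisq")
AIC(m0, m1, m2)

# Standardised residuals: the quantity labelled rstandard in the mosaic legends
round(rstandard(m2), 2)
```

Residuals that remain well outside roughly plus or minus 2 suggest that a further interaction term is still missing.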


Null Model [F][H][G][Y]

glm0 <- glm(Freq ~ faculty + hqual + grad + year
            , data = sts_df, family = poisson)

[Figure: mosaic of faculty, hqual, year and grad for glm0, shaded by standardised residuals (rstandard); legend range −16 to 15.]

Model 1 [FH][G][Y]

glm1 <- glm(Freq ~ faculty * hqual + grad + year
            , data = sts_df, family = poisson)

[Figure: mosaic of faculty, hqual, year and grad for glm1, shaded by standardised residuals (rstandard); legend range −9.9 to 15.0.]

Model 2 [FH][FG][Y]

glm2 <- glm(Freq ~ faculty * (hqual + grad) + year
            , data = sts_df, family = poisson)

[Figure: mosaic of faculty, hqual, year and grad for glm2, shaded by standardised residuals (rstandard); legend range −10 to 12.]

Model 3 [FH][FG][FY]

glm3 <- glm(Freq ~ faculty * (hqual + grad + year)
            , data = sts_df, family = poisson)

[Figure: mosaic of faculty, hqual, year and grad for glm3, shaded by standardised residuals (rstandard); legend range −11 to 12.]

Model 4 [FHG][FY]

glm4 <- glm(Freq ~ (faculty * hqual * grad) + (faculty * year)
            , data = sts_df, family = poisson)

[Figure: mosaic of faculty, hqual, year and grad for glm4, shaded by standardised residuals (rstandard); legend range −3.7 to 3.5.]

Model 5 [FHG][FHY]

glm5 <- glm(Freq ~ (faculty * hqual * grad) + (faculty * hqual * year)
            , data = sts_df, family = poisson)

[Figure: mosaic of faculty, hqual, year and grad for glm5, shaded by standardised residuals (rstandard); legend range −2.3 to 2.2.]

Case Study Conclusions

A policy change between 2014 and 2015 to increase student numbers led to lowering of entry level thresholds.

The foundation diploma appears not to have met the needs of so many additional students on the more challenging finance degree.

Recommend a review of curriculum and resources for the diploma to better support finance students, as well as interventions and support for students who are already one or two years into their course.

Diagnostic Tests

as.data.frame(anova(glm0, glm1, glm2, glm3, glm4, glm5))

##   Resid. Df Resid. Dev Df  Deviance
## 1        40 1580.34522 NA        NA
## 2        34  884.04489  6 696.30033
## 3        32  738.68488  2 145.36001
## 4        30  664.95392  2  73.73096
## 5        21   37.10138  9 627.85254
## 6        12   13.88574  9  23.21564

Diagnostic Tests

LRstats(glm4)

## Likelihood summary table:
##         AIC    BIC LR Chisq Df Pr(>Chisq)
## glm4 391.54 442.06   37.101 21    0.01639 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

LRstats(glm5)

## Likelihood summary table:
##         AIC    BIC LR Chisq Df Pr(>Chisq)
## glm5 386.32 453.69   13.886 12     0.3081
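The p-values in these tables are upper-tail chi-squared probabilities of the residual deviance, so they can be checked from the numbers already shown. The sketch below uses only base R and the deviances and degrees of freedom printed in the anova table; the variable names are illustrative.

```r
# Goodness of fit against the saturated model: P(chi-squared_df > residual deviance)
p_glm4 <- pchisq(37.10138, df = 21, lower.tail = FALSE)
p_glm5 <- pchisq(13.88574, df = 12, lower.tail = FALSE)

# Nested comparison of glm4 and glm5: drop in deviance on the extra 9 parameters
p_4_vs_5 <- pchisq(37.10138 - 13.88574, df = 21 - 12, lower.tail = FALSE)

round(c(glm4 = p_glm4, glm5 = p_glm5, glm4_vs_glm5 = p_4_vs_5), 4)
```

A small goodness-of-fit p-value means the model still differs detectably from the saturated model, while a large one means the remaining residuals are consistent with chance.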


End