applying generalized additive mixed modeling: tuscan dialects … · 2014-04-08 · applying...

20
Applying Generalized Additive Mixed Modeling: Tuscan Dialects vs. Standard Italian Martijn Wieling 1 , Simonetta Montemagni 2 , John Nerbonne 1 and Harald Baayen 3 1 University of Groningen, Center for Language and Cognition Groningen 2 Istituto di Linguistica Computationale “Antonio Zampolli”, CNR, Pisa 3 Eberhard Karls University, Tübingen; University of Alberta, Edmonton Leuven Statistics Days 2012, June 7 - 8, 2012 Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 1/20

Upload: others

Post on 05-Jun-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Applying Generalized Additive Mixed Modeling: Tuscan Dialects … · 2014-04-08 · Applying Generalized Additive Mixed Modeling: Tuscan Dialects vs. Standard Italian Martijn Wieling

Applying Generalized Additive Mixed Modeling:Tuscan Dialects vs. Standard Italian

Martijn Wieling1, Simonetta Montemagni2, John Nerbonne1 and HaraldBaayen3

1University of Groningen, Center for Language and Cognition Groningen

2Istituto di Linguistica Computationale “Antonio Zampolli”, CNR, Pisa

3Eberhard Karls University, Tübingen; University of Alberta, Edmonton

Leuven Statistics Days 2012, June 7 - 8, 2012

Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 1/20

Page 2: Applying Generalized Additive Mixed Modeling: Tuscan Dialects … · 2014-04-08 · Applying Generalized Additive Mixed Modeling: Tuscan Dialects vs. Standard Italian Martijn Wieling

Overview

IntroductionGeneralized Additive Mixed ModelingStandard Italian and Tuscan dialects

Material

Methods

Results

Discussion

Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 2/20

Page 3: Applying Generalized Additive Mixed Modeling: Tuscan Dialects … · 2014-04-08 · Applying Generalized Additive Mixed Modeling: Tuscan Dialects vs. Standard Italian Martijn Wieling

Generalized Additive Modeling (1)

linear model : linear relationship between predictors and dependentvariable: y = a1x1 + ...+ anxn

Non-linearities via explicit parametrization: y = a1x21 + a2x1 + ...

generalized linear model : linear relationship between predictors anddependent variable via link function: g(y) = a1x1 + ...+ anxn

Example: logistic regression for predicting a binary outcome

generalized additive model (GAM): relationship between individualpredictors and (possibly transformed) dependent variable is estimated bya non-linear smooth function: g(y) = s(x1) + s(x2, x3) + a4x4 + ...

multiple predictors can be combined in a (hyper)surface smooth

Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 3/20

Page 4: Applying Generalized Additive Mixed Modeling: Tuscan Dialects … · 2014-04-08 · Applying Generalized Additive Mixed Modeling: Tuscan Dialects vs. Standard Italian Martijn Wieling

Generalized Additive Modeling (2)

Advantage of GAM over manual specification of non-linearities: theoptimal shape of the non-linearity is determined automatically

appropriate degree of smoothness is automatically determined on the basisof cross validation to prevent overfitting

Choosing a smoothing basisSingle predictor or isotropic predictors: thin plate regression spline

Efficient approximation of the optimal (thin plate) spline

Combining non-isotropic predictors: tensor product spline

Generalized Additive Mixed Modeling:Random effects can be treated as smooths as well (Wood, 2008)R: gam and bam (package mgcv)

For more (mathematical) details, see Wood (2006)

Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 4/20

Page 5: Applying Generalized Additive Mixed Modeling: Tuscan Dialects … · 2014-04-08 · Applying Generalized Additive Mixed Modeling: Tuscan Dialects vs. Standard Italian Martijn Wieling

Standard Italian and Tuscan dialects

Standard Italian originated in the 14th century as a written languageIt originated from the prestigious Florentine varietyThe spoken standard Italian language was adopted in the 20th century

People used to speak in their local dialect

In this study, we investigate the relationship between standard Italian andTuscan dialects

We focus on lexical variationWe attempt to identify which social, geographical and lexical variablesinfluence this relationship

Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 5/20

Page 6: Applying Generalized Additive Mixed Modeling: Tuscan Dialects … · 2014-04-08 · Applying Generalized Additive Mixed Modeling: Tuscan Dialects vs. Standard Italian Martijn Wieling

Material: lexical data

We used lexical data from the Atlante Lessicale Toscano (ALT)We focus on 213 locations (> 2000 informants) and 170 conceptsWe grouped the informants based on age (young/old: born after/before1930) and used the majority’s lexical form

It was computationally not feasible to include the informants individually

Total number of cases: 69,259For every case, we identified if the lexical form was different from standardItalian (1) or the same (0)

Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 6/20

Page 7: Applying Generalized Additive Mixed Modeling: Tuscan Dialects … · 2014-04-08 · Applying Generalized Additive Mixed Modeling: Tuscan Dialects vs. Standard Italian Martijn Wieling

Geographic distribution of locations

S

FP

Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 7/20

Page 8: Applying Generalized Additive Mixed Modeling: Tuscan Dialects … · 2014-04-08 · Applying Generalized Additive Mixed Modeling: Tuscan Dialects vs. Standard Italian Martijn Wieling

Material: additional data

In addition, we obtained the following information:Number of inhabitants in each locationAverage income in each locationAverage age in each locationFrequency of each concept

Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 8/20

Page 9: Applying Generalized Additive Mixed Modeling: Tuscan Dialects … · 2014-04-08 · Applying Generalized Additive Mixed Modeling: Tuscan Dialects vs. Standard Italian Martijn Wieling

Modeling geography’s influence with a GAM

# logistic regression: family="binomial"> library(mgcv) # current version 1.7-17> geo = gam(NotStd ~ s(Lon,Lat), data=tusc, family="binomial", method="ML")> vis.gam(geo,view=c("Lon","Lat"),plot.type="contour",color="terrain",...)

10.0 10.5 11.0 11.5 12.0

42.5

43.0

43.5

44.0

Contour plot

Longitude

Latit

ude

−0.

5

−0.5

−0.4

−0.4

−0.4

−0.3

−0.3

−0.3

−0.2

−0.2

−0.2

−0.1

−0.1

−0.1

−0.1

0

0

0

0.1

0.1

0.1

Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 9/20

Page 10: Applying Generalized Additive Mixed Modeling: Tuscan Dialects … · 2014-04-08 · Applying Generalized Additive Mixed Modeling: Tuscan Dialects vs. Standard Italian Martijn Wieling

Adding a random intercept to create a GAMM

# bam is quicker and less memory-intensive than gam (here 0.3 GB vs. 1.8 GB)> model = bam(NotStd ~ s(Lon,Lat) + s(Concept,bs="re"),

data=tusc, family="binomial", method="ML")> summary(model,freq=F)

Family: binomialLink function: logit

Formula:NotStd ~ s(Lon,Lat) + s(Concept,bs="re")

Parametric coefficients:Estimate Std. Error z value Pr(>|z|)

(Intercept) -0.1859 0.1302 -1.428 0.153

Approximate significance of smooth terms:edf Ref.df Chi.sq p-value

s(Lon,Lat) 25.42 27.72 703.5 <2e-16 ***s(Concept) 167.47 168.00 15797.1 <2e-16 ***

R-sq.(adj) = 0.319 Deviance explained = 26.8%ML score = 97999 Scale est. = 1 n = 69259

Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 10/20

Page 11: Applying Generalized Additive Mixed Modeling: Tuscan Dialects … · 2014-04-08 · Applying Generalized Additive Mixed Modeling: Tuscan Dialects vs. Standard Italian Martijn Wieling

Adding a random slope to a GAMM

> model2 = bam(NotStd ~ s(Lon,Lat) + s(Concept,bs="re")+ s(Concept,CommSize,bs="re"),

data=tusc, family="binomial", method="ML")> summary(model2,freq=F)

Family: binomialLink function: logit

Formula:NotStd ~ s(Lon,Lat) + s(Concept,bs="re") + s(Concept,CommSize,bs="re")

Parametric coefficients:Estimate Std. Error z value Pr(>|z|)

(Intercept) -0.1865 0.1307 -1.427 0.154

Approximate significance of smooth terms:edf Ref.df Chi.sq p-value

s(Lon,Lat) 24.82 27.42 605.9 <2e-16 ***s(Concept) 167.47 168.00 15816.6 <2e-16 ***s(Concept,CommSize) 115.27 168.00 571.8 <2e-16 ***

R-sq.(adj) = 0.323 Deviance explained = 27.3%ML score = 97869 Scale est. = 1 n = 69259

Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 11/20

Page 12: Applying Generalized Additive Mixed Modeling: Tuscan Dialects … · 2014-04-08 · Applying Generalized Additive Mixed Modeling: Tuscan Dialects vs. Standard Italian Martijn Wieling

Varying geography’s influence based on concept freq.

Wieling, Nerbonne and Baayen (2011, PLoS ONE) showed that the effectof word frequency varied depending on geographyHere we explicitly include this in the GAMM and we also allow forvariation per age group

> m = bam(NotStd ~ te(Lon, Lat, Freq, by=IsOld, d=c(2,1)) + IsOld + CommSize+ ..., data=tusc, family="binomial", method="ML")

The results will be discussed next...Note that the exact random-effect structure is subject to change, as therewere some changes made to the mgcv package recently, affecting p-valuecalculations of the random-effect smooths (i.e. use version ≥ 1.7-17)

Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 12/20

Page 13: Applying Generalized Additive Mixed Modeling: Tuscan Dialects … · 2014-04-08 · Applying Generalized Additive Mixed Modeling: Tuscan Dialects vs. Standard Italian Martijn Wieling

Results: fixed effects and smooths

Estimate Std. Error z-value p-valueIntercept -0.4129 0.13950 -2.960 0.003

Old vs. young 0.4407 0.01925 22.896 < 0.001Community size (log) -0.1154 0.02658 -4.343 < 0.001

Est. d.o.f. Chi. sq. p-valueGeo × concept frequency (old) 53.88 221.5 < 0.001

Geo × concept frequency (young) 59.48 326.6 < 0.001

Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 13/20

Page 14: Applying Generalized Additive Mixed Modeling: Tuscan Dialects … · 2014-04-08 · Applying Generalized Additive Mixed Modeling: Tuscan Dialects vs. Standard Italian Martijn Wieling

A complex geographical pattern

10.0 10.5 11.0 11.5 12.0

42.5

43.0

43.5

44.0

Low freq. concepts (old speakers)

Longitude

Latit

ude 2.2

2.4 2.6

2.6

2.6

2.8

2.8

2.8

3

3

3

3 3

.2

3.4

3.4

3.6

10.0 10.5 11.0 11.5 12.0

42.5

43.0

43.5

44.0

Mean freq. concepts (old speakers)

Longitude

Latit

ude

2

2.2

2.8

2.8

3

3

3

3

3.2

3.2

3.4

10.0 10.5 11.0 11.5 12.0

42.5

43.0

43.5

44.0

High freq. concepts (old speakers)

Longitude

Latit

ude

1.8

2.4

2.6

2.6

2.8

2.8

2.8

3

3

3

3.2 3.4

3.6

3.8

10.0 10.5 11.0 11.5 12.0

42.5

43.0

43.5

44.0

Low freq. concepts (young speakers)

Longitude

Latit

ude 2

2.2

2.4

2.6

2.6

2.6

2.8

2.8

2.8

3

3 3

.2

3.4

3.4

3.6 10.0 10.5 11.0 11.5 12.0

42.5

43.0

43.5

44.0

Mean freq. concepts (young speakers)

Longitude

Latit

ude

2

2.2 2.4 2.4

2.4

2.6

2.6

2.6

2.6

2.8

2.8

3

10.0 10.5 11.0 11.5 12.0

42.5

43.0

43.5

44.0

High freq. concepts (young speakers)

Longitude

Latit

ude

2

2.2

2.2

2.4

2.4

2.4

2.6

2.6

2.8

2.8

3 3

Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 14/20

Page 15: Applying Generalized Additive Mixed Modeling: Tuscan Dialects … · 2014-04-08 · Applying Generalized Additive Mixed Modeling: Tuscan Dialects vs. Standard Italian Martijn Wieling

Results: random effects

Factors Random effects Std. dev. p-valueLocation Intercept 0.2410 < 0.001Concept Intercept 1.7748 < 0.001

Average community income (log) 0.3127 < 0.001Average community age (log) 0.2482 < 0.001Community size (log) 0.1166 0.006

Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 15/20

Page 16: Applying Generalized Additive Mixed Modeling: Tuscan Dialects … · 2014-04-08 · Applying Generalized Additive Mixed Modeling: Tuscan Dialects vs. Standard Italian Martijn Wieling

By-concept random slopes for community size

−0.4

−0.3

−0.2

−0.1

0.0

0.1

mirtilloorecchioscoiattoloovile

ditaleneve

melonecastagnaccio

Effe

ct o

f com

mun

ity si

ze p

er c

once

pt

Concepts sorted by the effect of community size

Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 16/20

Page 17: Applying Generalized Additive Mixed Modeling: Tuscan Dialects … · 2014-04-08 · Applying Generalized Additive Mixed Modeling: Tuscan Dialects vs. Standard Italian Martijn Wieling

By-concept random slopes for comm. age and income

−1.0 −0.5 0.0 0.5 1.0

−0.6

−0.4

−0.2

0.0

0.2

0.4

0.6

●●

●●● ●●●●● ●● ●●● ● ●● ●● ●● ●● ●●● ● ● ●●● ●●●●● ●●● ●●●● ●●●●●● ●●● ●● ● ●● ●●● ●● ● ●● ● ●● ●● ●● ●●● ●● ●●● ●●● ●● ●●● ● ●●●● ●●●● ●● ● ●●● ●● ●● ●●● ● ● ●● ●● ● ●● ● ●●● ●●● ● ● ●● ●●●● ●● ● ●● ●●● ●● ●●●● ● ●●● ●●●

● ●●●● ● ●●●●●● ●

ricciocastagnaccio

ditalecimice

susinacoccagrattugia

orecchio

stollo

pimpinella

pigna

capronesfoglia

orzaiolo

ovoloramarro

Effect of average community income per concept

Effe

ct o

f ave

rage

com

mun

ity a

ge p

er c

once

pt

Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 17/20

Page 18: Applying Generalized Additive Mixed Modeling: Tuscan Dialects … · 2014-04-08 · Applying Generalized Additive Mixed Modeling: Tuscan Dialects vs. Standard Italian Martijn Wieling

Discussion

Using a generalized additive mixed model (GAMM) to investigate lexicaldifferences between standard Italian and Tuscan dialects revealedinteresting dialectal patterns

GAMs are very suitable to model the non-linear influence of geographyThe regression approach allowed for the simultaneous identification ofimportant social, geographical and lexical predictorsBy including many concepts, results are less subjective than traditionalanalyses focusing on only a few pre-selected conceptsThe mixed-effects regression approach still allows a focus on individualconcepts

There are some drawbacks to GAMMs, however...gam and bam are computationally much more expensive than linearmixed-effects modeling using lmer (lme4 package)Model comparison is problematic when including random-effect smooths(i.e. using anova(gam1,gam2) is useless)

Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 18/20

Page 19: Applying Generalized Additive Mixed Modeling: Tuscan Dialects … · 2014-04-08 · Applying Generalized Additive Mixed Modeling: Tuscan Dialects vs. Standard Italian Martijn Wieling

Some advertisements

More information about applying GAMMs in LVLC research:

Martijn Wieling (2012). A Quantitative Approach to Social andGeographical Dialect Variation. PhD thesis, Rijksuniversiteit Groningen.Available at http://www.martijnwieling.nl/phd

Workshop on Quantitative Linguistics and Dialectology: June 29, 2012Speakers: Mark Liberman, Harald Baayen, Roeland van Hout, Robert G.Shackleton, Jack Chambers, Piet van Reenen, Simonetta Montemagni,Charlotte Gooskens and Wilbert HeeringaRegistration: http://www.martijnwieling.nl/workshop

Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 19/20

Page 20: Applying Generalized Additive Mixed Modeling: Tuscan Dialects … · 2014-04-08 · Applying Generalized Additive Mixed Modeling: Tuscan Dialects vs. Standard Italian Martijn Wieling

Thank you for your attention!

Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 20/20