applying generalized additive mixed modeling: tuscan dialects … · 2014-04-08 · applying...
TRANSCRIPT
Applying Generalized Additive Mixed Modeling:Tuscan Dialects vs. Standard Italian
Martijn Wieling1, Simonetta Montemagni2, John Nerbonne1 and HaraldBaayen3
1University of Groningen, Center for Language and Cognition Groningen
2Istituto di Linguistica Computationale “Antonio Zampolli”, CNR, Pisa
3Eberhard Karls University, Tübingen; University of Alberta, Edmonton
Leuven Statistics Days 2012, June 7 - 8, 2012
Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 1/20
Overview
IntroductionGeneralized Additive Mixed ModelingStandard Italian and Tuscan dialects
Material
Methods
Results
Discussion
Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 2/20
Generalized Additive Modeling (1)
linear model : linear relationship between predictors and dependentvariable: y = a1x1 + ...+ anxn
Non-linearities via explicit parametrization: y = a1x21 + a2x1 + ...
generalized linear model : linear relationship between predictors anddependent variable via link function: g(y) = a1x1 + ...+ anxn
Example: logistic regression for predicting a binary outcome
generalized additive model (GAM): relationship between individualpredictors and (possibly transformed) dependent variable is estimated bya non-linear smooth function: g(y) = s(x1) + s(x2, x3) + a4x4 + ...
multiple predictors can be combined in a (hyper)surface smooth
Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 3/20
Generalized Additive Modeling (2)
Advantage of GAM over manual specification of non-linearities: theoptimal shape of the non-linearity is determined automatically
appropriate degree of smoothness is automatically determined on the basisof cross validation to prevent overfitting
Choosing a smoothing basisSingle predictor or isotropic predictors: thin plate regression spline
Efficient approximation of the optimal (thin plate) spline
Combining non-isotropic predictors: tensor product spline
Generalized Additive Mixed Modeling:Random effects can be treated as smooths as well (Wood, 2008)R: gam and bam (package mgcv)
For more (mathematical) details, see Wood (2006)
Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 4/20
Standard Italian and Tuscan dialects
Standard Italian originated in the 14th century as a written languageIt originated from the prestigious Florentine varietyThe spoken standard Italian language was adopted in the 20th century
People used to speak in their local dialect
In this study, we investigate the relationship between standard Italian andTuscan dialects
We focus on lexical variationWe attempt to identify which social, geographical and lexical variablesinfluence this relationship
Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 5/20
Material: lexical data
We used lexical data from the Atlante Lessicale Toscano (ALT)We focus on 213 locations (> 2000 informants) and 170 conceptsWe grouped the informants based on age (young/old: born after/before1930) and used the majority’s lexical form
It was computationally not feasible to include the informants individually
Total number of cases: 69,259For every case, we identified if the lexical form was different from standardItalian (1) or the same (0)
Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 6/20
Geographic distribution of locations
S
FP
Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 7/20
Material: additional data
In addition, we obtained the following information:Number of inhabitants in each locationAverage income in each locationAverage age in each locationFrequency of each concept
Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 8/20
Modeling geography’s influence with a GAM
# logistic regression: family="binomial"> library(mgcv) # current version 1.7-17> geo = gam(NotStd ~ s(Lon,Lat), data=tusc, family="binomial", method="ML")> vis.gam(geo,view=c("Lon","Lat"),plot.type="contour",color="terrain",...)
10.0 10.5 11.0 11.5 12.0
42.5
43.0
43.5
44.0
Contour plot
Longitude
Latit
ude
−0.
5
−0.5
−0.4
−0.4
−0.4
−0.3
−0.3
−0.3
−0.2
−0.2
−0.2
−0.1
−0.1
−0.1
−0.1
0
0
0
0.1
0.1
0.1
Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 9/20
Adding a random intercept to create a GAMM
# bam is quicker and less memory-intensive than gam (here 0.3 GB vs. 1.8 GB)> model = bam(NotStd ~ s(Lon,Lat) + s(Concept,bs="re"),
data=tusc, family="binomial", method="ML")> summary(model,freq=F)
Family: binomialLink function: logit
Formula:NotStd ~ s(Lon,Lat) + s(Concept,bs="re")
Parametric coefficients:Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.1859 0.1302 -1.428 0.153
Approximate significance of smooth terms:edf Ref.df Chi.sq p-value
s(Lon,Lat) 25.42 27.72 703.5 <2e-16 ***s(Concept) 167.47 168.00 15797.1 <2e-16 ***
R-sq.(adj) = 0.319 Deviance explained = 26.8%ML score = 97999 Scale est. = 1 n = 69259
Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 10/20
Adding a random slope to a GAMM
> model2 = bam(NotStd ~ s(Lon,Lat) + s(Concept,bs="re")+ s(Concept,CommSize,bs="re"),
data=tusc, family="binomial", method="ML")> summary(model2,freq=F)
Family: binomialLink function: logit
Formula:NotStd ~ s(Lon,Lat) + s(Concept,bs="re") + s(Concept,CommSize,bs="re")
Parametric coefficients:Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.1865 0.1307 -1.427 0.154
Approximate significance of smooth terms:edf Ref.df Chi.sq p-value
s(Lon,Lat) 24.82 27.42 605.9 <2e-16 ***s(Concept) 167.47 168.00 15816.6 <2e-16 ***s(Concept,CommSize) 115.27 168.00 571.8 <2e-16 ***
R-sq.(adj) = 0.323 Deviance explained = 27.3%ML score = 97869 Scale est. = 1 n = 69259
Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 11/20
Varying geography’s influence based on concept freq.
Wieling, Nerbonne and Baayen (2011, PLoS ONE) showed that the effectof word frequency varied depending on geographyHere we explicitly include this in the GAMM and we also allow forvariation per age group
> m = bam(NotStd ~ te(Lon, Lat, Freq, by=IsOld, d=c(2,1)) + IsOld + CommSize+ ..., data=tusc, family="binomial", method="ML")
The results will be discussed next...Note that the exact random-effect structure is subject to change, as therewere some changes made to the mgcv package recently, affecting p-valuecalculations of the random-effect smooths (i.e. use version ≥ 1.7-17)
Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 12/20
Results: fixed effects and smooths
Estimate Std. Error z-value p-valueIntercept -0.4129 0.13950 -2.960 0.003
Old vs. young 0.4407 0.01925 22.896 < 0.001Community size (log) -0.1154 0.02658 -4.343 < 0.001
Est. d.o.f. Chi. sq. p-valueGeo × concept frequency (old) 53.88 221.5 < 0.001
Geo × concept frequency (young) 59.48 326.6 < 0.001
Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 13/20
A complex geographical pattern
10.0 10.5 11.0 11.5 12.0
42.5
43.0
43.5
44.0
Low freq. concepts (old speakers)
Longitude
Latit
ude 2.2
2.4 2.6
2.6
2.6
2.8
2.8
2.8
3
3
3
3 3
.2
3.4
3.4
3.6
10.0 10.5 11.0 11.5 12.0
42.5
43.0
43.5
44.0
Mean freq. concepts (old speakers)
Longitude
Latit
ude
2
2.2
2.8
2.8
3
3
3
3
3.2
3.2
3.4
10.0 10.5 11.0 11.5 12.0
42.5
43.0
43.5
44.0
High freq. concepts (old speakers)
Longitude
Latit
ude
1.8
2.4
2.6
2.6
2.8
2.8
2.8
3
3
3
3.2 3.4
3.6
3.8
10.0 10.5 11.0 11.5 12.0
42.5
43.0
43.5
44.0
Low freq. concepts (young speakers)
Longitude
Latit
ude 2
2.2
2.4
2.6
2.6
2.6
2.8
2.8
2.8
3
3 3
.2
3.4
3.4
3.6 10.0 10.5 11.0 11.5 12.0
42.5
43.0
43.5
44.0
Mean freq. concepts (young speakers)
Longitude
Latit
ude
2
2.2 2.4 2.4
2.4
2.6
2.6
2.6
2.6
2.8
2.8
3
10.0 10.5 11.0 11.5 12.0
42.5
43.0
43.5
44.0
High freq. concepts (young speakers)
Longitude
Latit
ude
2
2.2
2.2
2.4
2.4
2.4
2.6
2.6
2.8
2.8
3 3
Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 14/20
Results: random effects
Factors Random effects Std. dev. p-valueLocation Intercept 0.2410 < 0.001Concept Intercept 1.7748 < 0.001
Average community income (log) 0.3127 < 0.001Average community age (log) 0.2482 < 0.001Community size (log) 0.1166 0.006
Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 15/20
By-concept random slopes for community size
−0.4
−0.3
−0.2
−0.1
0.0
0.1
mirtilloorecchioscoiattoloovile
ditaleneve
melonecastagnaccio
Effe
ct o
f com
mun
ity si
ze p
er c
once
pt
Concepts sorted by the effect of community size
Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 16/20
By-concept random slopes for comm. age and income
−1.0 −0.5 0.0 0.5 1.0
−0.6
−0.4
−0.2
0.0
0.2
0.4
0.6
●●
●●● ●●●●● ●● ●●● ● ●● ●● ●● ●● ●●● ● ● ●●● ●●●●● ●●● ●●●● ●●●●●● ●●● ●● ● ●● ●●● ●● ● ●● ● ●● ●● ●● ●●● ●● ●●● ●●● ●● ●●● ● ●●●● ●●●● ●● ● ●●● ●● ●● ●●● ● ● ●● ●● ● ●● ● ●●● ●●● ● ● ●● ●●●● ●● ● ●● ●●● ●● ●●●● ● ●●● ●●●
● ●●●● ● ●●●●●● ●
●
ricciocastagnaccio
ditalecimice
susinacoccagrattugia
orecchio
stollo
pimpinella
pigna
capronesfoglia
orzaiolo
ovoloramarro
Effect of average community income per concept
Effe
ct o
f ave
rage
com
mun
ity a
ge p
er c
once
pt
Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 17/20
Discussion
Using a generalized additive mixed model (GAMM) to investigate lexicaldifferences between standard Italian and Tuscan dialects revealedinteresting dialectal patterns
GAMs are very suitable to model the non-linear influence of geographyThe regression approach allowed for the simultaneous identification ofimportant social, geographical and lexical predictorsBy including many concepts, results are less subjective than traditionalanalyses focusing on only a few pre-selected conceptsThe mixed-effects regression approach still allows a focus on individualconcepts
There are some drawbacks to GAMMs, however...gam and bam are computationally much more expensive than linearmixed-effects modeling using lmer (lme4 package)Model comparison is problematic when including random-effect smooths(i.e. using anova(gam1,gam2) is useless)
Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 18/20
Some advertisements
More information about applying GAMMs in LVLC research:
Martijn Wieling (2012). A Quantitative Approach to Social andGeographical Dialect Variation. PhD thesis, Rijksuniversiteit Groningen.Available at http://www.martijnwieling.nl/phd
Workshop on Quantitative Linguistics and Dialectology: June 29, 2012Speakers: Mark Liberman, Harald Baayen, Roeland van Hout, Robert G.Shackleton, Jack Chambers, Piet van Reenen, Simonetta Montemagni,Charlotte Gooskens and Wilbert HeeringaRegistration: http://www.martijnwieling.nl/workshop
Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 19/20
Thank you for your attention!
Martijn Wieling, Simonetta Montemagni, John Nerbonne and Harald Baayen Applying Generalized Additive Mixed Modeling 20/20