Dialectology Motivation Aggregating Signals Dutch Pronunciation Geographic Projections Features?
Aggregate vs. Feature-Based Perspectives on Dialect Geography
John [email protected]
Center for Language and Cognition, University of Groningen
Language in Space: Geographic Perspectives on Language Diversity and Diachrony
NSF Workshop, LSA Linguistics Institute
University of Colorado, Boulder, 23-24 July 2011
John Nerbonne [email protected] 1/35
Groningen dialectology team!
Charlotte Gooskens, Peter Houtzagers, Hermann Niebaum, Wilbert Heeringa, Jelena Prokic, Therese Leinonen, Martijn Wieling, Marco Spruit, Peter Kleiweg, Christine Siedle, Jens Moberg, ...
Sebastian Kurschner, Alexandra Lenz, Bob Shackleton, Renee van Bezooijen, ...
Bob de Jonge, Agnes de Bie, Cornelius Hasselblatt, ...
Simonetta Montemagni, Franz Manni, Petja Osenova, Esteve Valls, Lucija Simicic, Kristel Uiboaed, Boudewijn van den Berg
Overview
Old problems in dialectology
  Massive variation
  Counterindicating signals
Aggregating signals (dialectometry)
  Levenshtein distance
Dialectological law enabled by aggregate view
  Seguy’s curve
Features, “ranking isoglosses” (Chambers & Trudgill, p. 97)
One old problem in dialectology
Pronunciations are highly variable: 87 different pronunciations of ich are recorded in the PAD
[Figure: list of the 87 attested phonetic transcriptions of ich]
A second old problem in dialectology
We receive noisy signals of provenance.
[Figure: four maps: front/low V in Haus; [p] (dark) vs. [pf]; [t] vs. [ts]; [k] vs. [x(c)]]
“non-overlapping isoglosses”
Isoglosses seldom overlap
[Figure: eight maps. Top row: aggregate 2nd shift; [S] (dark) vs. s (non-initially); [z] (dark) vs. [s] (initially); N d/t (dark). Bottom row: apical [r] (dark) vs. uvular [ʀ]; final [n] drop (dark) vs. retention; medial [t] vs. s; init. lenited /g/]
Why dialectometry?
Strengthen geographic signals by aggregating
Solve problems of earlier dialectology
  Non-overlapping distributions
  Selection of features too arbitrary
  “Atomism” (Coseriu), idiosyncratic words (Bloomfield)
Introduce replicable procedures
Following Seguy, Goebl, Schiltz, Kretzschmar, Shackleton, ...
Seeking law-like relations in linguistic variation
Calculating dialect distances
To determine the aggregate distance between dialects:
We determine the distance between each dialect pair for every single linguistic element (in sample, e.g. dialect atlas)
  Perhaps just same (0) vs. different (1)
  ... but we’ve developed more sensitive measures (below)
We sum these distances for every element (hundreds of them)
Immediate result: place × place table of dialect differences
  Seguy (1971), Goebl (1980s and on), many others
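The procedure above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the simplest same (0) vs. different (1) measure; the function names, the site names, and the responses in `atlas` are invented for the example and are not taken from any atlas.

```python
from itertools import combinations


def item_distance(a, b):
    """Categorical distance for one linguistic element: same (0) vs. different (1)."""
    return 0 if a == b else 1


def aggregate_distances(atlas):
    """Sum the per-element distances for every pair of sites.

    `atlas` maps each site to its list of responses, all in the same item order.
    Returns the place x place table as a nested dict.
    """
    sites = sorted(atlas)
    table = {s: {t: 0 for t in sites} for s in sites}
    for s, t in combinations(sites, 2):
        total = sum(item_distance(x, y) for x, y in zip(atlas[s], atlas[t]))
        table[s][t] = table[t][s] = total
    return table


# Hypothetical toy data: three sites, four elements each.
atlas = {
    "site_A": ["ik", "huis", "melk", "water"],
    "site_B": ["ik", "hoes", "melk", "water"],
    "site_C": ["ich", "hoes", "molke", "water"],
}
table = aggregate_distances(atlas)
```

Replacing `item_distance` with a more sensitive measure leaves the summing step unchanged.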
Dialectometric “feature ranking”
Chambers & Trudgill (1998) ask for a ranking of features (and isoglosses) in order to identify dialect boundaries.
Implicit “feature ranking” in dialectometry: a feature that’s instantiated n times in dialect atlas material is weighted n times more heavily than one that appears once.
  Lexical items uniformly weighted
  Phonetic segment distances weighted in proportion to their frequency in the word list
Note that Goebl has also experimented with “inverse frequency” weighting of responses.
Aside: more sensitive pronunciation distance measure
Levenshtein distance enables analysis of phonetic transcriptions without manual alignment
  move from categorical to numerical analysis of data.
One of the most successful methods to determine sequence distance (Levenshtein, 1964)
  biological molecules, software engineering, ...
Levenshtein distance: minimum number of insertions, deletions and substitutions to transform one string into the other
Syllabicity constraint added: vowels never substitute for consonants
Example of the Levenshtein distance
mO@lk@   delete @     1
mOlk@    subst. O/E   1
mElk@    delete @     1
mElk     insert @     1
mEl@k
         total        4

Alignment (- marks a gap):
m  O  @  l  -  k  @
m  E  -  l  @  k  -
   1  1     1     1
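The example can be reproduced with a short dynamic-programming sketch. It assumes unit costs for every operation and implements the syllabicity constraint by forbidding vowel/consonant substitutions; the `VOWELS` inventory is an assumption that covers only the ASCII symbols used in this transcription, and the function name is illustrative.

```python
# Assumption: a small vowel inventory for the ASCII transcription used on the slide.
VOWELS = set("aeiouOEI@")


def levenshtein(source, target, syllabicity=True):
    """Minimum number of insertions, deletions and substitutions (unit costs).

    With `syllabicity=True`, a vowel may never substitute for a consonant.
    """
    prev = list(range(len(target) + 1))  # cost of building each target prefix from ""
    for i, s in enumerate(source, 1):
        cur = [i]  # cost of deleting the first i source segments
        for j, t in enumerate(target, 1):
            if s == t:
                sub = 0
            elif syllabicity and (s in VOWELS) != (t in VOWELS):
                sub = float("inf")  # vowel/consonant substitution is forbidden
            else:
                sub = 1
            cur.append(min(prev[j] + 1,        # delete s
                           cur[j - 1] + 1,     # insert t
                           prev[j - 1] + sub)) # substitute or match
        prev = cur
    return prev[-1]


distance = levenshtein("mO@lk@", "mEl@k")  # 4, as in the example above
```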
Example
Based on Dutch pronunciation data from the Goeman-Taeldeman-Van Reenen-Project data (GTRP; Goeman and Taeldeman, 1996)
We use 562 words for 424 varieties in the Netherlands
  Wieling, Heeringa & Nerbonne (2007) An Aggregate Analysis of Pronunciation in the Goeman-Taeldeman-van Reenen-Project Data. In: Taal en Tongval 59(1), 84-116
Calculating Levenshtein distances yields interesting sound correspondences contained in the alignments (more on that later)
Note that a 100-word comparison already yields about 500 sound correspondences
Distribution of sites
Analytical steps
Obtain the distance for each of the ≈ 90,000 pairs of varieties
  n.b. each pair involves about 500 × 5² segment comparisons, i.e. ≈ 1.1 × 10⁹ segment comparisons in total
Organize these in a 400 × 400 table
Seek groups (dialect areas) or continuum-like relations, e.g. by applying clustering or multi-dimensional scaling, respectively
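The counts in the first step can be checked with a little arithmetic. The sketch below takes the 424 varieties from the GTRP sample and rounds the word list to 500; reading the per-word term as 5 × 5 segment comparisons (roughly five segments per word) is an assumption about how the figure was derived.

```python
from math import comb

varieties = 424                 # varieties in the GTRP sample
pairs = comb(varieties, 2)      # unordered pairs of varieties

words = 500                     # word list, rounded as on the slide
segments_per_word = 5           # assumption: about five segments per word
per_pair = words * segments_per_word ** 2   # segment comparisons for one pair
total = pairs * per_pair

print(pairs)   # 89676, i.e. about 90,000
print(total)   # 1120950000, i.e. about 1.1 x 10^9
```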
Multi-Dimensional Scaling
[Figure: MDS plot with labeled groups: Frisian; Frisian cities, Het Bildt; Westerkwartier; Stellingwerf; Low Saxon; Central Gelderland; Dutch Low Franconian; Flemish Low Franconian]
MDS dimensions → colors, projected to map
Noisy Clustering
[Figure: dendrogram of German sites (Bonn, Köln, Iversheim, Aachen, ..., Leuth, Wemb) with branch support values between 52 and 100]
Seeks groups in data, enabling comparison to older dialectology, which sought areas
Only bootstrap (or noisy) clustering, to avoid instability
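A minimal sketch of the idea behind noisy clustering: cluster repeatedly on a distance table perturbed by random noise and report how often each group recurs. The average-linkage criterion, the uniform noise, the `noise_ceiling` parameter, and the four-site table are all assumptions made for illustration; they are not the settings behind the dendrogram above.

```python
import random
from itertools import combinations


def average_linkage(dist, sites):
    """Agglomerative clustering; returns every cluster formed on the way up."""
    clusters = [frozenset([s]) for s in sites]
    formed = set()

    def between(a, b):
        # Mean distance between the members of two disjoint clusters.
        return sum(dist[x][y] for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: between(*pair))
        merged = a | b
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        formed.add(merged)
    return formed


def noisy_clustering(dist, runs=100, noise_ceiling=0.2, seed=0):
    """Percentage of noisy runs in which each cluster appears."""
    rng = random.Random(seed)
    sites = sorted(dist)
    counts = {}
    for _ in range(runs):
        noisy = {s: {} for s in sites}
        for x, y in combinations(sites, 2):
            noisy[x][y] = noisy[y][x] = dist[x][y] + rng.uniform(0, noise_ceiling)
        for cluster in average_linkage(noisy, sites):
            counts[cluster] = counts.get(cluster, 0) + 1
    return {cluster: 100 * n / runs for cluster, n in counts.items()}


# Hypothetical distances: A and B are close, C and D are close.
dist = {
    "A": {"B": 0.1, "C": 1.0, "D": 1.0},
    "B": {"A": 0.1, "C": 1.0, "D": 1.0},
    "C": {"A": 1.0, "B": 1.0, "D": 0.1},
    "D": {"A": 1.0, "B": 1.0, "C": 0.1},
}
support = noisy_clustering(dist)
```

Groups that survive the noise in most runs can be treated as stable; groups with low support are the unstable ones that plain clustering would report without warning.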
Projecting groups to geography
[Figure: map of the Dutch-speaking area showing the groups, with place labels from Den Burg, Schiermonnikoog and Groningen in the north to Veurne, Kerkrade and Aubel in the south]
Large body of dialectometric work—positive aspects
Dutch, German, American English, Norwegian, Swedish, Afrikaans, Sardinian, Tuscan, Catalan, Bulgarian, Croatian, Estonian, Sino-Tibetan, Chinese, Central Asian (Turkic & Indo-Iranian), ...
Development of consistency measure (Cronbach’s α) indicating whether data set is sufficiently large
Novel reflection, work on validation aimed at assessing degree of detection of SIGNALS OF PROVENANCE
  Gooskens & Heeringa (2004) Perceptive Evaluation of Levenshtein Dialect Distance Measurements using Norwegian Dialect Data. Language Variation and Change 16(3), 189-207.
Criticisms of dialectometry, esp. Levenshtein-based work
Measure is too insensitive, 0/1 segment differences
Too little attention to phonetic/phonological conditioning
Too reliant on transcription: what about acoustics?
Where is the sociolinguistics? Isn’t variationist linguistics mostly about sociolinguistics?
“Distance-based” methods yield too little insight into the linguistic basis of differences (concrete differences lost in the aggregate sums)
  the hint is that it may be all smoke & mirrors
So what? Isn’t this all just confirming what we knew earlier?
... progress on all fronts, but presentation would take too long; question and discussion period for those interested
The Influence of Geography
Regression design
Dependent variable: varietal distance, as measured by aggregate categorical distance or Levenshtein distance
Independent variable: geographical distance, regarded as an operationalization of the chance of social contact
Statistical cautions:
  1 correlations involving averages are inflated
    but we’re interested in the entire varieties (dialects)
  2 distances are not independent, so significance may be inflated
    Mantel tests
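The second caution can be made concrete with a sketch of a Mantel test: the entries of a distance matrix are not independent, so significance is assessed by permuting the sites of one matrix rather than by the usual test on the correlation. The site positions and the log-shaped "linguistic" distances below are synthetic, and the one-sided p-value convention is one common choice, not necessarily the one used in the work reported here.

```python
import math
import random


def upper(m):
    """Upper-triangle entries of a square matrix, row by row."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def mantel(a, b, permutations=999, seed=0):
    """Correlation of two distance matrices and a one-sided permutation p-value."""
    rng = random.Random(seed)
    n = len(a)
    observed = pearson(upper(a), upper(b))
    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)  # relabel the sites of b, rows and columns together
        permuted = [[b[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(upper(a), upper(permuted)) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Synthetic example: six sites on a line, linguistic distance sublinear in geography.
positions = [0, 1, 2, 4, 7, 11]
geo = [[abs(p - q) for q in positions] for p in positions]
ling = [[math.log1p(d) for d in row] for row in geo]
r, p_value = mantel(geo, ling)
```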
Inspiration: Jean Seguy
Seguy (1971) La relation entre la distance spatiale et la distance lexicale. Revue de Linguistique Romane 35(138), 335-357:
Aggregate variation increases sublinearly with respect to geography
[Figure: Seguy’s mean curve (“courbe moyenne”), labeled Y = 36 √log(x + 1); horizontal axis ticks run from 0 to 110]
Sublinear spread is general
[Figure: six panels plotting linguistic distance against geographic distance: Bantu, Bulgaria, Germany, LAMSAS / Lowman, The Netherlands, Norway]
Aside: Trudgill’s “Gravity hypothesis”
[Figure: Sun, Venus, Earth with Moon, Mars with Deimos and Phobos]
According to Trudgill (1972) diffusion follows an inverse square
law, with the consequence that linguistic distance should likewise
increase with the square of the distance. Population size plays
the role of mass.
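As a rough illustration, the gravity analogy can be written as a one-line function. The sketch keeps only the ingredients named above (population as mass, inverse square of distance); the function name, the `scale` constant, and the population figures are assumptions, and Trudgill's own formulation is not reproduced here.

```python
def gravity_influence(pop_i, pop_j, distance, scale=1.0):
    """Mutual influence of two settlements under a simple gravity analogy."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return scale * pop_i * pop_j / distance ** 2


# Hypothetical settlements: doubling the distance quarters the influence.
near = gravity_influence(100_000, 50_000, 10.0)
far = gravity_influence(100_000, 50_000, 20.0)
ratio = near / far
```

Under this model the influence falls off quadratically, which is the behaviour the aggregate data are compared against on the next slide.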
Trudgill’s “Gravity hypothesis”
Sublinear aggregate relation incompatible with a quadratic influence (on individual features)
J. Nerbonne (2010) Measuring the Diffusion of Linguistic Change. Phil. Transactions of the Royal Society B: Biological Sciences 365.
How much does distance influence language?
Area            Corr.(l,geo)   r2
Gabon Bantu     0.47           0.22
Bulgaria        0.49           0.24
Germany         0.57           0.32
Eastern U.S.    0.51           0.26
Netherlands     0.62           0.38
Norway          0.41           0.16
Norwegian ling. dist. correlates better w. travel time in 1900 (r = 0.54)
  Gooskens (2005) Dialectologia et Geolinguistica 13.
Adding areas increases explained variance 50% (forthcoming in a Freiburg volume)
Geographic influence on language
Geography accounts for 33-57% of aggregate linguistic variation (distance alone: 22-38%; adding areas raises the explained variance by about 50%).
General, sublinear characterization of relation between geographical distance and linguistic differences
Like population geneticists’ “isolation by distance” (Wright, 1943; Malecot, 1955)
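One simple way to express a sublinear relation is a logarithmic curve, linguistic distance = a + b · log(geographic distance), fitted by ordinary least squares. The sketch below assumes that functional form for illustration only; the distances and coefficients are synthetic, not values from any of the data sets above.

```python
import math


def fit_log_curve(geo, ling):
    """Least-squares fit of ling = a + b * log(geo); returns (a, b)."""
    xs = [math.log(g) for g in geo]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ling) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ling)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b


# Synthetic data generated from a = 0.02, b = 0.01 (hypothetical values).
geo_km = [10, 50, 100, 200, 400]
ling_dist = [0.02 + 0.01 * math.log(g) for g in geo_km]
a, b = fit_log_curve(geo_km, ling_dist)

# Sublinearity: the same 100 km adds less linguistic distance further out.
gain_near = b * (math.log(200) - math.log(100))
gain_far = b * (math.log(400) - math.log(300))
```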
Features? (assuming aggregate analysis)
Argumentum ad auctoritatem: Groningen software supports free search (with measures of “importance”)
Post-hoc “feature mining”: We can look for words that correlate with significant dimensions of MDS solutions (of aggregate analyses).
Bipartite spectral graph partitioning (like two-dimensional factor analysis).
  Begin with matrix of varieties × features
  Cluster varieties and features simultaneously.
Mixed models
  Include feature choice (words) as random-effect factor in regression model.
“Importance” of feature wrt area
Representative(f,a) ≈ relative frequency of f among sites in a
Distinctive(f,a) ≈ proportion of occurrences of f in a as opposed to outside a
Importance(f,a) is average of representativeness and distinctiveness
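The three quantities can be sketched with plain set arithmetic. This follows the approximate definitions given here and nothing more; any normalization the actual software applies is not reproduced, and the function name and site labels are hypothetical.

```python
def importance(feature_sites, area_sites):
    """Return (representative, distinctive, importance) for feature f and area a.

    feature_sites: set of sites where f occurs
    area_sites:    set of sites belonging to a
    """
    inside = feature_sites & area_sites
    representative = len(inside) / len(area_sites)   # share of a's sites showing f
    distinctive = len(inside) / len(feature_sites)   # share of f's occurrences inside a
    return representative, distinctive, (representative + distinctive) / 2


# Hypothetical example: f occurs at three of the four sites in a and once outside.
area = {"s1", "s2", "s3", "s4"}
feature = {"s1", "s2", "s3", "s9"}
rep, dist_score, imp = importance(feature, area)
```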
MDS-based feature-mining
Co-clustering bi-partite spectral graph
[Figure: bipartite graph of varieties and features, with node values between -0.34 and 0.34]
Details during discussion if wanted.
“Mixed models”: modeling each word
LD = 0.00 + 0.01 WF − 0.005 PS + 0.004 PA   (general model)
LD = −0.01 + 0.01 WF + 0.010 PS + 0.004 PA   (word: bier)
LD = 0.20 + 0.01 WF − 0.008 PS + 0.004 PA   (word: zijn)
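Read as code, the equations above are one general set of coefficients plus word-specific intercepts and PS slopes. The sketch below simply evaluates them; WF, PS and PA are treated as opaque numeric predictors because the slide does not define them, and the input values are hypothetical.

```python
# Coefficients copied from the three equations above.
GENERAL = {"intercept": 0.00, "WF": 0.01, "PS": -0.005, "PA": 0.004}
WORD_MODELS = {
    "bier": {"intercept": -0.01, "WF": 0.01, "PS": 0.010, "PA": 0.004},
    "zijn": {"intercept": 0.20, "WF": 0.01, "PS": -0.008, "PA": 0.004},
}


def predict_ld(wf, ps, pa, word=None):
    """Predicted LD; falls back to the general model for words without their own line."""
    coef = WORD_MODELS.get(word, GENERAL)
    return coef["intercept"] + coef["WF"] * wf + coef["PS"] * ps + coef["PA"] * pa


# Hypothetical predictor values, identical for all three models.
general = predict_ld(1.0, 1.0, 1.0)
bier = predict_ld(1.0, 1.0, 1.0, word="bier")
zijn = predict_ld(1.0, 1.0, 1.0, word="zijn")
```

With identical inputs the three predictions differ only through the word-specific intercept and PS slope, which is what treating words as a random-effect factor allows.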
Ongoing work by Martijn Wieling (submitted)
A caution: dialect continua
Old vs. young speakers in Sweden (SveDia, Therese Leinonen, 2010)
“Feature ranking” could create spurious dialect areas, even where scientific consensus sees continua.
Features in aggregate analysis
Aggregate perspective enables identification & formulation of general law: distance models explain 22-38% of aggregate linguistic variation.
  Areal distinctions a bit collinear, but add (≈ 50%).
Features naturally ranked in dialectometric view, either as uniform, or as reflected in item sample / lexicon
Several means of identifying and ranking features
Emerging questions:
  What is the linguistic structure of the dialect differences we find?
  Do typological constraints play a (confounding) role?
  Can we tease apart geographical and historical explanations, and how?
Try Gabmap! www.gabmap.nl
Questions?
Thank You!