Language Typology and Areal Linguistics
Yiru
July 13, 2016
Yiru Language Typology July 13, 2016 1 / 26
Overview
1 Introduction
2 Typologically-based Clusters
3 Areal Linguistics
Language similarity
Why are some languages more alike than others?
the languages may be related "genetically": derived from a common ancestor language
the similarities may be due to chance: linguistic universals
the languages may be related areally: due to sharing
Language similarity
Differences between the concepts of genetic relatedness and language similarity lead us to the following questions:
If we cluster languages based only on their typological features, how do the induced clusters compare to phylogenetic groupings?
How well do induced clusters and genetic families perform in predicting values for typological features?
What typological features tend to stay the same within language families, and what features are likely to differ?
WALS (World Atlas of Language Structures)
The WALS project consists of a database that catalogs linguistic features for
over 2,556 languages in 208 language families
using 142 features in 11 different categories.
Data sparsity: only 16% of the cells are filled, which presents serious problems for clustering algorithms
Pruning Methods
Pruning the data to produce a smaller but denser subset
Prune Languages by Minimum Features: require languages to have a minimum of 25 features for the whole-world set, or 10 features for comparing across subfamilies
Prune Features by Minimum Coverage: prune features that do not cover more than 10% of the selected languages in the whole-world set, or 25% in comparisons across subfamilies
Use a Dense Language Family
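The first two pruning steps can be sketched as follows. This is only an illustration: the dictionary layout (language -> {feature: value}, filled cells only), the function names, and the toy data are assumptions, not the format used in the experiments.

```python
def prune_languages(data, min_features):
    """Keep only languages with at least `min_features` filled features."""
    return {lang: feats for lang, feats in data.items()
            if len(feats) >= min_features}


def prune_features(data, min_coverage):
    """Drop features that do not cover more than `min_coverage`
    (a fraction between 0 and 1) of the languages in `data`."""
    n_languages = len(data)
    counts = {}
    for feats in data.values():
        for feature in feats:
            counts[feature] = counts.get(feature, 0) + 1
    keep = {f for f, c in counts.items() if c / n_languages > min_coverage}
    return {lang: {f: v for f, v in feats.items() if f in keep}
            for lang, feats in data.items()}


# Toy data (hypothetical languages and features).
toy = {
    "A": {"f1": "SOV", "f2": "yes", "f3": "3"},
    "B": {"f1": "SVO", "f2": "no"},
    "C": {"f1": "SOV"},
    "D": {"f1": "VSO", "f2": "no", "f4": "x"},
}
dense = prune_features(prune_languages(toy, min_features=2), min_coverage=0.5)
print(dense)
```

With the thresholds from the slide, the calls would use min_features=25 and min_coverage=0.10 for the whole-world set.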
Features and Feature Values
the actual representation of the features
feature values are categorical, so they cannot be compared directly with distance measures; solution: binarization
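One way to binarize categorical feature values is to give every (feature, value) pair its own 0/1 column. The sketch below assumes that an unfilled feature yields 0 in all of its columns; the slides do not say how missing cells were handled, and the names and toy data are illustrative.

```python
def binarize(data):
    """Map each language to a 0/1 vector with one column per (feature, value) pair.

    Assumption: an unfilled feature contributes 0 to all of its columns.
    """
    columns = sorted({(f, v) for feats in data.values() for f, v in feats.items()})
    vectors = {
        lang: [1 if feats.get(f) == v else 0 for f, v in columns]
        for lang, feats in data.items()
    }
    return columns, vectors


# Toy data (hypothetical feature names and values).
toy = {
    "A": {"order": "SOV", "tone": "yes"},
    "B": {"order": "SVO"},
}
columns, vectors = binarize(toy)
print(columns)
print(vectors)
```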
Experimental Setup
Q1: how do induced clusters compare to phylogenetic groupings?
Clustering Methods
k-medoids algorithm
methods from the CLUTO toolkit: repeated-bisection (rb), a k-means implementation (direct), an agglomerative algorithm (agglo) using UPGMA to produce hierarchical clusters, and bagglo, a variant of agglo
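A minimal sketch of k-medoids over a precomputed distance matrix, assuming a simple alternating assign/update scheme and caller-supplied initial medoids. The function name, the initialization, and the toy distances are illustrative assumptions, not the setup used in the experiments.

```python
def k_medoids(dist, initial_medoids, max_iter=100):
    """Cluster items given a square distance matrix `dist` (list of lists).

    Returns (labels, medoids), where labels[i] is the cluster index of item i
    and medoids[c] is the item index acting as the center of cluster c.
    """
    n = len(dist)
    medoids = list(initial_medoids)

    def assign(meds):
        # Each item joins the cluster of its nearest medoid.
        return [min(range(len(meds)), key=lambda c: dist[i][meds[c]])
                for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c, old in enumerate(medoids):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(old)  # keep the old medoid for an empty cluster
                continue
            # New medoid: the member with the smallest total distance to its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(medoids), medoids


# Toy example: six points on a line forming two obvious groups.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
labels, medoids = k_medoids(dist, initial_medoids=[0, 1])
print(labels, medoids)
```

For languages, the distance could be taken as one minus a similarity score between two languages.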
Similarity Measures
CLUTO's default cosine similarity measure (cos)
$\text{shared overlap} = \dfrac{\#\,\text{features with the same value}}{\#\,\text{features both filled out in WALS}}$
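The shared overlap measure can be sketched directly from its definition. The dictionary representation and the choice to return 0.0 when two languages have no feature filled out in common are assumptions made for this illustration.

```python
def shared_overlap(a, b):
    """Fraction of features filled out for both languages that have the same value.

    `a` and `b` map feature names to values (filled cells only).
    Assumption: returns 0.0 when no feature is filled out for both languages.
    """
    both_filled = [f for f in a if f in b]
    if not both_filled:
        return 0.0
    same = sum(1 for f in both_filled if a[f] == b[f])
    return same / len(both_filled)


# Toy languages (hypothetical features): two features in common, one agrees.
lang_a = {"f1": "SOV", "f2": "yes", "f3": "x"}
lang_b = {"f1": "SOV", "f2": "no", "f4": "y"}
overlap = shared_overlap(lang_a, lang_b)
print(overlap)
```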
Clustering Performance Metrics
The genetic families as the gold standard
Rand Index
Cluster Precision, Recall, and F-Score
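The Rand Index counts the language pairs on which the induced clustering and the gold standard agree. The sketch below also computes precision, recall, and F-score over pairs; this pairwise definition is one common choice and is an assumption here, since the slides do not spell out how cluster precision and recall are defined.

```python
from itertools import combinations


def pair_counts(gold, pred):
    """Count language pairs by whether gold and predicted groupings co-cluster them."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(gold)), 2):
        same_gold = gold[i] == gold[j]
        same_pred = pred[i] == pred[j]
        if same_pred and same_gold:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_gold:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rand_index(gold, pred):
    """Share of pairs on which the two groupings agree."""
    tp, fp, fn, tn = pair_counts(gold, pred)
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 1.0


def pairwise_prf(gold, pred):
    """Pairwise precision, recall, and F-score (assumed definition)."""
    tp, fp, fn, _ = pair_counts(gold, pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score


# Toy example: five languages, gold families vs. induced cluster ids.
gold = ["IE", "IE", "IE", "ST", "ST"]
pred = [0, 0, 1, 1, 1]
ri = rand_index(gold, pred)
prf = pairwise_prf(gold, pred)
print(ri, prf)
```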
Prediction Accuracy
Q2: how do induced clusters and genetic families compare in predicting the values of features for languages in the same group?
Q3: what typological features tend to stay the same within related families?
Prediction accuracy:
use 90% of the filled cells to build clusters
predict the values of the remaining 10% of filled cells
the accuracy is calculated by comparing these predicted values with theactual values in the gold standard
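The evaluation loop above can be sketched as follows. The prediction rule used here (the most common value of the feature among the other languages in the same group) is an assumption for illustration, as are the function names and the toy data; the groups are passed in directly rather than built by clustering the training cells.

```python
import random
from collections import Counter


def split_cells(data, held_out_fraction, rng):
    """Hold out a fraction of the filled cells; return (train data, test cells)."""
    cells = [(lang, f) for lang, feats in sorted(data.items()) for f in sorted(feats)]
    rng.shuffle(cells)
    test = set(cells[:int(len(cells) * held_out_fraction)])
    train = {lang: {f: v for f, v in feats.items() if (lang, f) not in test}
             for lang, feats in data.items()}
    return train, sorted(test)


def predict(train, groups, lang, feature):
    """Majority value of `feature` among the other languages in the same group."""
    votes = Counter(
        train[other][feature]
        for other in train
        if other != lang and groups[other] == groups[lang] and feature in train[other]
    )
    return votes.most_common(1)[0][0] if votes else None


def prediction_accuracy(data, train, test_cells, groups):
    """Share of held-out cells whose predicted value matches the actual value."""
    if not test_cells:
        return 0.0
    correct = sum(1 for lang, f in test_cells
                  if predict(train, groups, lang, f) == data[lang][f])
    return correct / len(test_cells)


# Toy data (hypothetical): two groups of three languages, one feature.
data = {"A": {"f1": "SOV"}, "B": {"f1": "SOV"}, "C": {"f1": "SOV"},
        "D": {"f1": "SVO"}, "E": {"f1": "SVO"}, "F": {"f1": "VSO"}}
groups = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2, "F": 2}

# Fixed split so the outcome is easy to follow: hold out A and F.
test_cells = [("A", "f1"), ("F", "f1")]
train = {lang: {f: v for f, v in feats.items() if (lang, f) not in test_cells}
         for lang, feats in data.items()}
accuracy = prediction_accuracy(data, train, test_cells, groups)
print(accuracy)

# Random split, here holding out half of the filled cells.
random_train, random_test = split_cells(data, 0.5, random.Random(0))
```

In the fixed split, A is predicted correctly from B and C, while F is predicted as SVO from D and E, which does not match its actual value.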
Results & Analysis
Cluster Similarity
Results & Analysis
Prediction Accuracy
Results & Analysis
Feature Selection
Error Analysis
Language Similarity vs. Genetic
Error Analysis
WALS as the Dataset
The Feature Set in WALS
Data Sparsity and Shared Features
Areal Linguistics
The use of areas improves genetic reconstruction of languages according to a variety of metrics.
Basic ideas: develop a Bayesian model of typology that allows for
the existence of linguistic areas
preference for some features to be shared areally
to show that reconstructing language family trees is significantly aided by knowledge of areal features
Areal Linguistics
some of the well-known linguistic areas
The Balkans: Albanian, Bulgarian, Greek, Macedonian, Rumanian and Serbo-Croatian. (Sometimes: Romani and Turkish)
The Baltic: Baltic languages, Baltic German, and Finnic languages (especially Estonian and Livonian).
linguistic features most easily shared areally
Ross (1988): nouns > verbs > adjectives > syntax > non-bound function words > bound morphemes > phonemes
Curnow (2001): 15 categories of borrowable features, e.g. phonetics (rare), phonology (common), lexical (very common)
A Bayesian Model for Areal Linguistics
Pitman-Yor process for modeling linguistic areas
Kingman's coalescent for modeling linguistic phylogeny
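To give a feel for the first component, the sketch below draws a random partition from a Pitman-Yor process through its Chinese-restaurant representation: item n+1 joins an existing cluster k with probability (n_k - d) / (n + theta) and opens a new cluster with probability (theta + d K) / (n + theta), where d is the discount, theta the strength, and K the current number of clusters. It illustrates the prior only and is not the full areal model or its inference procedure; the function name and parameter values are assumptions.

```python
import random


def pitman_yor_partition(n_items, discount, strength, rng):
    """Sample a partition of `n_items` items from a Pitman-Yor process.

    Assumes 0 <= discount < 1 and strength > 0.
    Returns (assignments, cluster_sizes).
    """
    cluster_sizes = []
    assignments = []
    for _ in range(n_items):
        # Unnormalized weights: existing clusters first, then a new cluster.
        weights = [size - discount for size in cluster_sizes]
        weights.append(strength + discount * len(cluster_sizes))
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)   # open a new cluster
        else:
            cluster_sizes[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, cluster_sizes


# Toy run: 50 "languages" grouped into a random number of "areas".
assignments, sizes = pitman_yor_partition(50, discount=0.5, strength=1.0,
                                          rng=random.Random(0))
print(len(sizes), sizes)
```

A larger discount tends to produce more small clusters, which gives the heavy-tailed cluster sizes the Pitman-Yor process is known for.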
Identifying Language Areas
Identifying Areal Features
Genetic Reconstruction
Conclusion
1. Comparing clusters derived from typological features to genetic groups in the world's languages
the induced clusters look very different from genetic grouping
despite the differences, induced clusters show similar, or even greater, levels of typological similarity than genetic grouping
2. The use of areas improves genetic reconstruction of languages
References
Ryan Georgi, Fei Xia, William Lewis (2010)
Comparing Language Similarity across Genetic and Typologically-Based Groupings
Hal Daumé III (2009)
Non-Parametric Bayesian Areal Linguistics
Thank You!