Natural Language Processing · 2020. 4. 28.
Natural Language Processing
Diachronics
Dan Klein – UC Berkeley
Includes joint work with Alex Bouchard-Cote, Tom Griffiths, and David Hall
The Task
Lexical Reconstruction
Latin: focus
French: feu, Spanish: fuego, Italian: fuoco, Portuguese: fogo
Tree of Languages
§ We assume the phylogeny is known
§ Much work in biology, e.g. work by Warnow, Felsenstein, Steele…
§ Also in linguistics, e.g. Warnow et al., Gray and Atkinson…
http://andromeda.rutgers.edu/~jlynch/language.html
Evolution through Sound Changes
camera /kamera/ (Latin)
chambre /ʃambʁ/ (French)
Deletion: /e/, /a/
Change: /k/ .. /tʃ/ .. /ʃ/
Insertion: /b/
Eng. camera from Latin, “camera obscura”
Eng. chamber from Old Fr. before the initial /t/ dropped
Changes are Systematic
camera /kamera/ → camra /kamra/ (e → _)
numerus /numerus/ → numrus /numrus/ (e → _)
Changes are Contextual
camera /kamera/ → camra /kamra/
e → _
e → _ / after stress
Changes Have Structure
camra /kamra/ → cambra /kambra/
_ → b
_ → b / m_r
_ → [stop x] / [nasal x]_r
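The rule shapes on the last three slides (systematic, contextual, structured) can be tried out as ordered string rewrites. This is a minimal sketch, not the talk's model: it assumes one character per phoneme, and because stress is not marked in these transcriptions it approximates "after stress" as "between a nasal and r". The names `RULES`, `NASAL_TO_STOP`, `insert_stop`, and `apply_rules` are hypothetical.

```python
import re
from typing import Callable, List, Tuple, Union

# Homorganic stop for each nasal: the "[stop x] / [nasal x]" pairing.
# The inventory is an assumption made for this sketch.
NASAL_TO_STOP = {"m": "b", "n": "d"}


def insert_stop(match: "re.Match") -> str:
    """Keep the nasal and append the stop with the same place of articulation."""
    nasal = match.group(0)
    return nasal + NASAL_TO_STOP[nasal]


Rule = Tuple[str, Union[str, Callable[..., str]]]

# Rules apply in order, like successive sound changes.
RULES: List[Rule] = [
    # e -> _ : delete /e/ between a nasal and r (stand-in for "after stress").
    (r"(?<=[mn])e(?=r)", ""),
    # _ -> [stop x] / [nasal x]_r : insert a stop between a nasal and r.
    (r"[mn](?=r)", insert_stop),
]


def apply_rules(word: str, rules: List[Rule]) -> str:
    """Apply each rewrite rule everywhere it matches, one rule after another."""
    for pattern, replacement in rules:
        word = re.sub(pattern, replacement, word)
    return word


print(apply_rules("kamera", RULES[:1]))  # deletion only
print(apply_rules("kamera", RULES))      # deletion, then insertion
```

Because each rule is applied to every word in the same way, the same two rules also turn /numerus/ into a form with an inserted stop, which is the sense in which changes are systematic.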
Changes are Systematic
English Great Vowel Shift (Simplified!)
[Figure: simplified vowel chart with e, i, a, ai]
“time” = teem → “time” = taim
English Great Vowel Shift
Diachronic Evidence
tonitru non tonotru (Appendix Probi [ca 300])
tonight not tonite (Yahoo! Answers [ca 2000])
Synchronic (Comparative) Evidence
Key idea: changes occur uniformly across the lexicon
The Data
The Data
§ Data sets
  § Small: Romance
    § French, Italian, Portuguese, Spanish
    § 2344 words
    § Complete cognate sets
    § Target: (Vulgar) Latin
  § Large: Austronesian
    § 637 languages
    § 140K words
    § Incomplete cognate sets
    § Target: Proto-Austronesian
Austronesian
Austronesian Examples
From the Austronesian Basic Vocabulary Database
The Model
Simple Model: Single Characters
[Figure: phylogenetic tree whose nodes carry single characters, C or G]
[cf. Felsenstein 81]
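For the single-character case, the posterior over the ancestral character can be computed exactly by passing likelihoods up the tree, in the style of the pruning recursion associated with the Felsenstein reference. The sketch below assumes a two-letter alphabet, a uniform root prior, and one shared mutation probability `mu` on every branch; these choices and all function names are illustrative assumptions, not the talk's parameterization.

```python
from typing import Dict, Tuple, Union

# A tree is either a leaf (an observed character) or a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]

ALPHABET = ("C", "G")


def transition(parent: str, child: str, mu: float) -> float:
    """P(child | parent): keep the character with prob 1 - mu, else mutate uniformly."""
    if parent == child:
        return 1.0 - mu
    return mu / (len(ALPHABET) - 1)


def upward(tree: Tree, mu: float) -> Dict[str, float]:
    """For each state s, P(leaves below this node | this node is in state s)."""
    if isinstance(tree, str):
        return {s: 1.0 if s == tree else 0.0 for s in ALPHABET}
    child_messages = [upward(child, mu) for child in tree]
    likelihood = {}
    for s in ALPHABET:
        p = 1.0
        for message in child_messages:
            # Sum out the child's state along this branch.
            p *= sum(transition(s, t, mu) * message[t] for t in ALPHABET)
        likelihood[s] = p
    return likelihood


def root_posterior(tree: Tree, mu: float = 0.1) -> Dict[str, float]:
    """Posterior over the root character under a uniform prior."""
    likelihood = upward(tree, mu)
    prior = 1.0 / len(ALPHABET)
    z = sum(prior * likelihood[s] for s in ALPHABET)
    return {s: prior * likelihood[s] / z for s in ALPHABET}


# Two modern languages under one intermediate ancestor, two under another.
example = (("C", "G"), ("G", "G"))
print(root_posterior(example))
```

With three of four leaves observed as G, the root posterior favors G. The string-valued case on the later slides is harder precisely because this sum over states is no longer over a small alphabet.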
Changes are Systematic
[Figure: two cognate sets on the same tree]
/fokus/: /fweɣo/, /fogo/, /fogo/, /fwɔko/
/kentrum/: /sentro/, /sentro/, /sentro/, /tʃɛntro/
Parameters are Branch-Specific
LA: focus /fokus/
IB: /fogo/
ES: fuego /fweɣo/
PT: fogo /fogo/
IT: fuoco /fwɔko/
Branch parameters: θ_IB, θ_ES, θ_PT, θ_IT
[Bouchard-Cote, Griffiths, Klein, 07]
Edits are Contextual, Structured
/fokus/
/fwɔko/
f# o
f# w ɔ
θ_IT
Inference
Learning: Objective
[Figure: the /fokus/ tree (/fweɣo/, /fogo/, /fogo/, /fwɔko/) with variables z and w]
Learning: EM
§ M-Step
  § Find parameters which fit (expected) sound change counts
  § Easy: gradient ascent on theta
§ E-Step
  § Find (expected) change counts given parameters
  § Hard: variables are string-valued
[Figure: the /fokus/ tree (/fweɣo/, /fogo/, /fogo/, /fwɔko/)]
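To make the "easy" M-step concrete, here is a toy version: given expected counts for the possible outcomes of one sound in one context, fit softmax weights theta by gradient ascent on the expected log-likelihood. This is a sketch under strong assumptions: one weight per outcome and no shared features, whereas the slides do not spell out the model's actual features. The counts are made-up illustrative numbers, and `softmax` and `m_step` are hypothetical names.

```python
import math
from typing import Dict


def softmax(theta: Dict[str, float]) -> Dict[str, float]:
    """Turn unnormalized weights into a probability distribution."""
    top = max(theta.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - top) for k, v in theta.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}


def m_step(expected_counts: Dict[str, float],
           steps: int = 500, lr: float = 1.0) -> Dict[str, float]:
    """Gradient ascent on sum_k count_k * log p_k, with p = softmax(theta)."""
    theta = {k: 0.0 for k in expected_counts}
    total = sum(expected_counts.values())
    for _ in range(steps):
        probs = softmax(theta)
        for k in theta:
            # Gradient per unit count: observed share minus predicted share.
            theta[k] += lr * (expected_counts[k] / total - probs[k])
    return theta


# Made-up expected counts for what /o/ becomes on one branch:
# stays /o/, becomes /wɔ/, or is deleted (empty string).
counts = {"o": 7.0, "wɔ": 2.0, "": 1.0}
fitted = softmax(m_step(counts))
print({k: round(p, 3) for k, p in fitted.items()})
```

In this unconstrained toy case the fitted probabilities simply approach the normalized counts; gradient ascent matters once outcomes share features, so that evidence for one change informs related ones.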
Computing Expectations
‘grass’
Standard approach, e.g. [Holmes 2001]: Gibbs sampling each sequence
[Holmes 01, Bouchard-Cote, Griffiths, Klein 07]
A Gibbs Sampler
‘grass’
Getting Stuck
How could we jump to a state where the liquids /r/ and /l/ have a common ancestor?
Efficient Sampling: Vertical Slices
Single Sequence Resampling
Ancestry Resampling
[Bouchard-Cote, Griffiths, Klein, 08]
Results
Results: Romance
Learned Rules / Mutations
Results: Austronesian
Examples: Austronesian
[Bouchard-Cote, Hall, Griffiths, Klein, 13]
Result: More Languages Help
[Figure: mean edit distance from Blust [1993] reconstructions (y-axis) against number of modern languages used (x-axis)]
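The evaluation metric here is the mean edit distance between automatic and reference reconstructions. A minimal sketch of that computation follows, using the standard Levenshtein distance with unit costs. It treats each Python character as one segment, so a digraph such as /tʃ/ counts as two; that tokenization and the function names are assumptions of this sketch, not details given in the slides.

```python
from typing import List, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    # previous[j] holds the distance between a[:i-1] and b[:j].
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def mean_edit_distance(pairs: List[Tuple[str, str]]) -> float:
    """Average distance over (reconstruction, reference) pairs."""
    return sum(edit_distance(x, y) for x, y in pairs) / len(pairs)


print(edit_distance("fokus", "fogo"))
print(mean_edit_distance([("fokus", "fokus"), ("fokus", "fogo")]))
```

Passing lists of phoneme strings instead of plain strings would let multi-character segments count as single units.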
Visualization: Learned Universals
*The model did not have features encoding natural classes
Regularity and Functional Load
In a language, some pairs of sounds are more contrastive than others (higher functional load)
Example: English p/d versus t/th
High Load: p/d: pot/dot, pin/din, dress/press, pew/dew, ...
Low Load: th/t: thin/tin
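One rough way to see the contrast between the two examples is to count minimal pairs: word pairs that differ only in the two sounds in question. The sketch below does this over a toy lexicon built from the words on this slide. Minimal-pair counting is only a crude proxy used here for illustration; it is not the functional load measure of [King, 67] used later in the talk, and the spelling-based segmentation and the names `LEXICON` and `minimal_pairs` are assumptions.

```python
from typing import List, Set, Tuple

Word = Tuple[str, ...]  # a word as a tuple of segments, so "th" is one unit

# Toy lexicon from the slide's examples, segmented by spelling (an approximation).
LEXICON: Set[Word] = {
    ("p", "o", "t"), ("d", "o", "t"),
    ("p", "i", "n"), ("d", "i", "n"),
    ("p", "r", "e", "s", "s"), ("d", "r", "e", "s", "s"),
    ("p", "ew"), ("d", "ew"),
    ("th", "i", "n"), ("t", "i", "n"),
}


def minimal_pairs(lexicon: Set[Word], a: str, b: str) -> List[Tuple[Word, Word]]:
    """Pairs of words that become identical when one `a` is swapped for `b`."""
    pairs = []
    for word in sorted(lexicon):
        for i, segment in enumerate(word):
            if segment == a:
                swapped = word[:i] + (b,) + word[i + 1:]
                if swapped in lexicon:
                    pairs.append((word, swapped))
    return pairs


print(len(minimal_pairs(LEXICON, "p", "d")))
print(len(minimal_pairs(LEXICON, "th", "t")))
```

A merger of two sounds collapses every such pair into homophones, which is the intuition behind the hypothesis that high-load contrasts resist merging.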
Functional Load: Timeline
1955: Functional Load Hypothesis (FLH): Sound changes are less frequent when they merge phonemes with high functional load [Martinet, 55]
1967: Previous research within linguistics: “FLH does not seem to be supported by the data” [King, 67] (based on 4 languages, as noted by [Hockett, 67; Surendran et al., 06])
Our approach: we reexamined the question with two orders of magnitude more data [Bouchard-Cote, Hall, Griffiths, Klein, 13]
Regularity and Functional Load
Data: only 4 languages from the Austronesian data
[Figure: merger posterior probability (y-axis) against functional load as computed by [King, 67] (x-axis)]
Each dot is a sound change identified by the system
Regularity and Functional Load
Data: all 637 languages from the Austronesian data
[Figure: merger posterior probability (y-axis) against functional load as computed by [King, 67] (x-axis)]
Extensions
Cognate Detection
/fweɣo/ /fogo/ /fwɔko/
/berβo/ /vɛrbo/ /vɛrbo/
/tʃɛntro/ /sentro/ /sɛntro/
p‘fire’
[Hall and Klein, 11]
Grammar Induction
[Figure: bar chart for Dutch, Danish, Swedish, Spanish, Portuguese, Slovene, Chinese, English; axis ticks 0 to 70; labels WG, NG, RM, G, IE, GL]
Avg rel gain: 29%
[Berg-Kirkpatrick and Klein, 07]
Language Diversity
Why are the languages of the world so similar?
Universal grammar answer: Hardware constraints
Common source answer: Not much time has passed
[Rafferty, Griffiths, and Klein, 09]