TRANSCRIPT
Testing Functional Explanations of Word Order Universals
Michael Hahn (Stanford), Richard Futrell (UC Irvine)
(Greenberg 1963)
U3: ‘Languages with dominant VSO order are always prepositional.’
U4: ‘With overwhelmingly greater than chance frequency, languages with normal SOV order are postpositional.’
‘Relative position of adposition & noun ~ relative position of verb & object’
OV languages with postpositions
VO languages with prepositions
Source: https://wals.info/feature/95A
Why do these universals hold?
Innate constraints on language, ‘Universal Grammar’? (Chomsky 1981)
Facilitation of language processing? (Dryer 1992, Hawkins 1994)
Make languages learnable? (Culbertson 2017)
Approach: Test functional explanations by implementing efficiency measures, optimizing grammars, and checking whether universals hold in optimized grammars.
Three Efficiency Measures
Dependency Length Minimization (Rijkhoff, 1986; Hawkins, 1994, 2003; Gibson 1998)
Surprisal (Gildea and Jaeger, 2015; Ferrer-i Cancho, 2017)
Parsability (Hawkins, 1994, 2003)
Dependency Length Minimization: Dependencies are shorter than expected at random (Futrell et al., 2015)
[Figure: dependency length as a function of sentence length, comparing random orderings, real English, and the theoretical optimum]
Idea: In certain models, short dependencies reduce memory load (Gibson 1998)
Argued to explain several of the Greenberg correlations (Rijkhoff, 1986; Hawkins, 1994, 2003)
Example: three dependency arcs of lengths 2, 1, and 1 give a total dependency length of 2 + 1 + 1 = 4.
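The sum above can be sketched in code; the head-index representation below is an illustrative assumption, not the authors' implementation:

```python
def total_dependency_length(heads):
    """Sum of the lengths |dependent - head| of all dependency arcs.

    `heads` maps each word's 1-based position to the position of its
    head (0 marks the root, which has no incoming arc)."""
    return sum(abs(dep - head) for dep, head in heads.items() if head != 0)

# Three arcs of lengths 2, 1, and 1, as in the slide:
print(total_dependency_length({1: 3, 2: 3, 3: 0, 4: 3}))  # 4
```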
Three Efficiency Measures: Surprisal
Surprisal(w1...wn) = −Σi log P(wi | w1...wi-1)
[Figure: reading time as a function of surprisal; reading time increases with surprisal (Smith and Levy 2013)]
Estimated using recurrent neural networks, the strongest existing methods for estimating surprisal and predicting reading times (Frank 2011; Goodkind & Bicknell 2018).
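As a minimal sketch of the surprisal computation, the following uses an add-one-smoothed bigram model as a stand-in for the recurrent networks used in the actual work (the function and its smoothing are illustrative assumptions):

```python
import math
from collections import Counter

def bigram_surprisal(sentence, corpus, vocab_size):
    """Total surprisal -sum_i log2 P(w_i | w_{i-1}) under an
    add-one-smoothed bigram model."""
    unigrams = Counter()   # counts of each token as a left context
    bigrams = Counter()
    for sent in corpus:
        toks = ["<s>"] + sent
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks[:-1], toks[1:]))
    total = 0.0
    prev = "<s>"
    for w in sentence:
        p = (bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab_size)
        total += -math.log2(p)
        prev = w
    return total

corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
# A sentence with familiar bigrams is less surprising than a scrambled one:
print(bigram_surprisal(["the", "dog", "barks"], corpus, vocab_size=6))
print(bigram_surprisal(["dog", "the", "barks"], corpus, vocab_size=6))
```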
Three Efficiency Measures: Parsability
Example: Mary has two green books.
Parsability(utterance) := log P(tree | utterance)
Estimated using a neural network parser (Dozat and Manning 2017) with extremely generic architecture.
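A toy sketch of the parsability score, assuming an arc-factored parser that outputs a head distribution per word (an illustrative simplification; the real estimate comes from the neural parser cited above):

```python
import math

def parsability(gold_heads, head_probs):
    """log P(tree | utterance): the sum of the log-probabilities the
    parser assigns to each word's gold head.  `gold_heads` maps a
    word's position to its head's position; `head_probs[dep][head]`
    is the parser's probability for that arc."""
    return sum(math.log(head_probs[dep][head])
               for dep, head in gold_heads.items())

# A parser that is confident about the gold arcs scores higher
# (closer to 0) than one that guesses uniformly.
gold = {1: 2, 2: 0}  # word 1 attaches to word 2; word 2 is the root
confident = {1: {2: 0.9, 0: 0.1}, 2: {0: 0.9, 1: 0.1}}
uniform = {1: {2: 0.5, 0: 0.5}, 2: {0: 0.5, 1: 0.5}}
print(parsability(gold, confident) > parsability(gold, uniform))  # True
```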
Combining Parsability + Surprisal
Utility = Informativity − λ · Cost
Informativity (amount of meaning that can be extracted from the utterance) ~ Parsability
Cost of processing the utterance ~ Surprisal
λ can take values in (0,1); we will give similar weight to both factors (λ=0.9).
Formalizes Zipf’s (1949) Forces of Diversification & Unification.
Long tradition as an explanation of language (Gabelentz 1903, Zipf 1949, Horn 1984, …).
Formalized in Rational-Speech-Acts models (Frank and Goodman 2012).
Related to Signal Processing (Rate-Distortion Theory, Information Bottleneck).
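The combined objective then reduces to a one-liner; λ = 0.9 follows the talk, and the input values are illustrative:

```python
def utility(parsability, surprisal, lam=0.9):
    """Utility = Informativity - lambda * Cost, with informativity
    approximated by parsability and processing cost by surprisal
    (lambda = 0.9 as in the talk)."""
    return parsability - lam * surprisal

# E.g. a grammar with parsability -1.0 and surprisal 5.0:
print(utility(-1.0, 5.0))  # -5.5
```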
Testing Functional Explanations
Approach: Optimize the word orders of languages for the three objectives, keeping syntactic structures unchanged.
Languages have word order regularities ⇒ it is not sufficient to optimize the word orders of individual sentences.
Instead: optimize the word order rules of entire languages.
That is: optimized languages have optimized but internally consistent grammatical regularities in word order, and agree with an actual natural language in all other respects.
Dependency Corpus: Mary has two green books (arcs: nsubj, obj, nummod, amod)
Tree Topologies: the unordered dependency trees extracted from the corpus.
Ordering Grammar: one parameter per dependency relation, e.g.
  amod (NOUN ADJ): 0.3 — “Adjective precedes noun”
  nummod (NOUN NUM): 0.7 — “Numerals follow adjectives & precede nouns”
  nsubj (VERB NOUN): −0.2
  obj (VERB NOUN): 0.8 — “Object follows verb”
  ...
Counterfactual Corpus: linearizing the tree topologies with the ordering grammar yields a counterfactual ordering of each sentence.
Each parameter setting generates a different counterfactual corpus (e.g. parameter settings 0.9, 0.1, 0.5, 0.2 or 0.1, 0.95, 0.42, 0.82 order the same trees differently).
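A hypothetical linearizer illustrating how an ordering grammar turns tree topologies into a counterfactual corpus. The sign-based direction encoding and the weights below are illustrative assumptions, not the paper's exact parameterization:

```python
def linearize(tree, params):
    """Order an unordered dependency tree with an ordering grammar.

    Illustrative semantics: a dependent is placed after its head iff
    its relation's weight is positive, and same-side dependents are
    sorted by weight, so more-negative weights land farther left.
    `tree` maps a node id to (word, head_id, relation)."""
    children = {}
    root = None
    for node, (word, head, rel) in tree.items():
        if head == 0:
            root = node
        else:
            children.setdefault(head, []).append((node, rel))

    def order(node):
        deps = children.get(node, [])
        before = sorted((n for n, r in deps if params[r] <= 0),
                        key=lambda n: params[tree[n][2]])
        after = sorted((n for n, r in deps if params[r] > 0),
                       key=lambda n: params[tree[n][2]])
        out = []
        for c in before:
            out.extend(order(c))
        out.append(node)
        for c in after:
            out.extend(order(c))
        return out

    return [tree[n][0] for n in order(root)]

# Hypothetical weights: subject and modifiers precede their heads,
# the object follows the verb, numerals sit farther left than adjectives.
params = {"nsubj": -0.5, "obj": 0.8, "amod": -0.3, "nummod": -0.7}
tree = {1: ("Mary", 2, "nsubj"), 2: ("has", 0, "root"),
        3: ("two", 5, "nummod"), 4: ("green", 5, "amod"),
        5: ("books", 2, "obj")}
print(" ".join(linearize(tree, params)))  # Mary has two green books
```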
We compute the processing measures on each counterfactual corpus, e.g.:
Dependency Length 2.3, Surprisal 5.8, Parsability 1.8
Each parameter setting results in different values for the processing measures (e.g. 2.9 / 4.5 / 2.9, or 3.4 / 7.8 / 1.2).
Which settings optimise the measures?
Do the optimised settings replicate the Greenberg correlations?
For each objective, find the parameters that optimise it: Minimize Dependency Length, Minimize Surprisal, Maximize Parsability, and Optimize Parsability+Surprisal. Each objective yields its own optimized ordering grammar.
Repeat this for corpora from 51 real languages from the Universal Dependencies project.
1. How do the objectives compare?
2. Which universals are predicted?
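The optimization step can be sketched as simple hill climbing over grammar weights; the actual work uses stochastic optimization, and `optimize_grammar` and the toy objective below are illustrative assumptions:

```python
import random

def optimize_grammar(relations, objective, steps=200, seed=0):
    """Hill climbing over ordering-grammar weights (a simple stand-in
    for the stochastic optimization in the actual work).
    `objective(params)` scores a parameter setting, e.g. by generating
    a counterfactual corpus and computing negative dependency length,
    negative surprisal, or parsability."""
    rng = random.Random(seed)
    params = {r: rng.uniform(-1, 1) for r in relations}
    best = objective(params)
    for _ in range(steps):
        cand = {r: w + rng.gauss(0, 0.3) for r, w in params.items()}
        score = objective(cand)
        if score > best:
            params, best = cand, score
    return params, best

# Toy objective: prefer the object after the verb (weight near 1)
# and the subject before it (weight near -1).
def toy_objective(p):
    return -(p["obj"] - 1) ** 2 - (p["nsubj"] + 1) ** 2

best_params, best_score = optimize_grammar(["obj", "nsubj"], toy_objective)
print(best_params, best_score)
```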
Surprisal and Parsability minimize Dependency Length.
Functional Utility predicts Dependency Length Minimization.
Language optimizes Surprisal and Parsability.
[Figure: Better Parsability (x-axis) vs. Lower Surprisal (y-axis), z-transformed on the level of languages; plotted grammars: Random Grammars, Grammars fit to Real Orderings, Optimized for Surprisal, Optimized for Parsability, Optimized for Parsability+Surprisal]
(Dryer 1992, in Language)
‘Relative position of adposition & noun ~ relative position of verb & object’
We formalize the correlations in the Universal Dependencies format.
For any word order grammar, we can then check which correlations it satisfies.
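Checking a correlation against a grammar can be sketched as a sign test on the relevant relations' weights. The sign-based direction encoding and the `case`/`obj` mapping below are illustrative assumptions:

```python
def satisfies(params, universal):
    """Check one correlation universal on an ordering grammar.
    `universal` pairs two (relation, placed_after_head) conditions
    that must either both hold or both fail.  Illustrative encoding:
    weight > 0 places the dependent after its head."""
    (rel_a, after_a), (rel_b, after_b) = universal
    holds_a = (params[rel_a] > 0) == after_a
    holds_b = (params[rel_b] > 0) == after_b
    return holds_a == holds_b

# 'VO languages are prepositional': the object follows the verb iff
# the adposition (UD relation `case`) precedes the noun.
vo_prepositional = (("obj", True), ("case", False))
print(satisfies({"obj": 0.8, "case": -0.4}, vo_prepositional))  # True
print(satisfies({"obj": -0.3, "case": 0.5}, vo_prepositional))  # True (OV, postpositional)
print(satisfies({"obj": 0.8, "case": 0.5}, vo_prepositional))   # False
```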
Are the universals satisfied by models fit to the actual orderings for our 51 languages?
[Figure: percentage of fitted grammars satisfying each universal; annotated exceptions: prevalence of SVO (Dryer 1992), limitation of the formalisation]
Percentage of grammars optimized for each objective satisfying the universal.
Assessing Significance: X = “Object precedes verb”, Y = “Object-patterner precedes verb-patterner”
Logistic model: Y ~ X + (1+X|family) + (1+X|language)
Predictions largely complementary
Predictions mostly agree.
Functional Utility replicates the predictions of Dependency Length Minimization.
Both measures predict most of the correlation universals.
Two Objectives
Dependency Length Minimization: a particular component of complexity.
Utility: a broad description of functional efficiency in general.
Our results support the idea that Dependency Length Minimization emerges from optimizing for Parsability and Predictability (Futrell et al. 2017).
Why do these universals hold?
Innate constraints on language, ‘Universal Grammar’? (Chomsky 1981)
Facilitation of language processing? (Dryer 1992, Hawkins 1994)
Make languages learnable? (Culbertson 2017)
● These ideas need not be mutually exclusive
● If UG or learnability are relevant, our results suggest they may be tilted towards efficiency.
Conclusion
● Tested explanations of Greenberg correlation universals in terms of efficiency of human language processing
● Using corpora from 51 languages, constructed counterfactual optimized languages
● Most of the correlations can be derived from pressure to shorten dependencies, decrease surprisal, or increase parsability
● Clear evidence for functional explanations of word order universals