
Learning parametric stress without domain-specific mechanisms

Aleksei Nazarov (Harvard University)
Gaja Jarosz (University of Massachusetts Amherst)

Annual Meeting on Phonology 2016, USC

Please feel free to download these slides at:

http://bit.ly/2erv1Oa

or

http://alekseinazarov.org/papers

2

Introduction: Parameters

• Chomsky (1981): Principles and Parameters
  • UG: universals (principles) and finite choices (parameters)
• Language learner's task: find settings of parameters
• Applied to stress systems: Dresher and Kaye (1990), Hayes (1995)
  • UG specifies:
    • Directionality: L-to-R or R-to-L?
    • Foot form: Trochee or Iamb?
    • Extrametricality: Yes or No? Left or Right?
    • …

3


Introduction: Parameters

• Chomsky (1981): Principles and Parameters
  • UG: universals (principles) and finite choices (parameters)
• Language learner's task: find settings of parameters
• Applied to stress systems: Dresher and Kaye (1990), Hayes (1995)
  • Learning: multiple parses possible for the same stress pattern, e.g.:
    • ( ⇒ Left, Iambs…
    • (ˌ ⇒ Right, Trochees…
    • <>(ˌ ⇒ Extrametricality left, Trochees…

4


Introduction: Previous proposals

• For stress parameters: domain-specific learning mechanisms argued to be necessary, e.g., Dresher and Kaye (1990):
  • Parameters set in innately specified order
  • Each parameter has a default value
  • Marked value triggered by a "cue": an innately specified configuration
    • E.g., QS starts out set to Off. If the corpus contains two words of the same length with different stress, set QS to On.
• See similar work on "triggering" in learning syntax (Gibson and Wexler 1994, Berwick and Niyogi 1996, Lightfoot 1999)

5


Introduction: Previous proposals

• Statistical learning of parameters: argued to be insufficient for stress
  • Naïve Parameter Learner (NPL; Yang 2002): a domain-general learner for parameters (syntax or phonology)
  • Pearl (2011) shows failure of NPL on stress
  • Argues we need a learner (Pearl 2008, 2009) that has:
    • Parameter acquisition ordering
    • A way to find unambiguous evidence for parameters: cues (Dresher and Kaye) or parsing (Fodor 1998, Sakas and Fodor 2001)

6


Introduction: Current proposal

• Proposal: a slightly richer statistical learning model, with no domain-specific mechanisms (see Gould 2015 for a different proposal in this direction)
• Expectation Driven Parameter Learner (EDPL; based on Jarosz submitted)
  • Parameter update sensitive to ambiguity between parameter settings
  • Relies on learning principles from Expectation Maximization (EM; Dempster et al. 1977):
    • EM gradually increases the probability of successful analyses

7


Introduction: Testing proposal

• EDPL and NPL (no domain-specific mechanisms) tested on the languages predicted by Dresher and Kaye (1990)
  • First typologically extensive tests for the NPL
• EDPL massively outperforms NPL (96.0% success vs. 4.3% success)
• We argue that conclusions about the necessity of domain-specific mechanisms are premature
  • EDPL can learn stress parameters without domain-specific mechanisms
  • Increase in performance compared to NPL obtained by using EM principles

8


The Learner

NPL & EDPL overview

Shared by both learners:
• Stochastic parameter grammar: productions assessed as match vs. mismatch
• Grammar incrementally updated by Linear Reward-Penalty Scheme (Bush and Mosteller 1951) after each data point

NPL:
• Test all parameter settings simultaneously (one time)
  • Match → reward all parameters equally
  • Mismatch → penalize all parameters equally

EDPL:
• Test each parameter individually (fixed # of times)
  • Reward more successful settings more strongly
  • Computation still linear in the number of parameters

10

Grammar

• Stochastic parameter grammar
  • Probability distribution over each parameter's possible settings

  Foot Headedness:   L (Trochee) 0.6 | Right (Iamb) 0.4
  Footing Direction: L-to-R 0.3      | R-to-L 0.7

• Each time the grammar produces an output: one setting categorically chosen for each parameter (weighted coin flip)

11
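To make the weighted coin flip concrete, here is a minimal Python sketch of a stochastic parameter grammar and of sampling one categorical setting per parameter. The dictionary layout and function names are our own illustration, not the authors' implementation; the probabilities are the ones from the example grammar above.

```python
import random

# Example stochastic parameter grammar from the slide: each parameter maps to a
# probability distribution over its possible settings.
grammar = {
    "FootHeadedness":   {"L (Trochee)": 0.6, "Right (Iamb)": 0.4},
    "FootingDirection": {"L-to-R": 0.3, "R-to-L": 0.7},
}

def sample_settings(grammar):
    """Weighted coin flip per parameter: one setting categorically chosen for each."""
    return {
        param: random.choices(list(dist), weights=list(dist.values()), k=1)[0]
        for param, dist in grammar.items()
    }

def joint_probability(grammar, chosen):
    """Probability of having picked exactly this combination of settings."""
    p = 1.0
    for param, setting in chosen.items():
        p *= grammar[param][setting]
    return p

# e.g. Trochee together with R-to-L footing is chosen with probability 0.6 * 0.7 = 0.42
print(joint_probability(grammar, {"FootHeadedness": "L (Trochee)",
                                  "FootingDirection": "R-to-L"}))
```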

Grammar

• Stochastic parameter grammar
• Data point production: choosing categorical parameter settings

  Foot Headedness:   L (Trochee) 0.6 | Right (Iamb) 0.4
  Footing Direction: L-to-R 0.3      | R-to-L 0.7

data point: ka ta paa ta na
production: (kata)(paa)(tana) – Match
(probability of picking this option: 0.6 x 0.7 = 0.42)

12

Grammar

• Stochastic parameter grammar
• Data point production: choosing categorical parameter settings

  Foot Headedness:   L (Trochee) 0.6 | Right (Iamb) 0.4
  Footing Direction: L-to-R 0.3      | R-to-L 0.7

data point: ka ta paa ta na
production: (kata)(paa)(tana) – Mismatch
(probability of picking this option: 0.4 x 0.3 = 0.12)

13

Gradualness

• Gradual, incremental updating of grammar

• Start from a uniform distribution over parameter settings

• Data points presented to learner one-by-one

• Each parameter setting's new probability is a weighted average of its old probability and R (the reward value, between 0 and 1) (Linear Reward-Penalty Scheme)

50/50 grammar → (data point) → updated grammar → (data point) → … → (data point) → final grammar

14

Update rule

• Linear Reward-Penalty Scheme (Bush and Mosteller 1951):
  • For each parameter setting ψi (FootHead = L, FootHead = R, etc.):

    p(ψi)_new = R(ψi) · λ + p(ψi)_old · (1 − λ)

  • p(ψi)_old is the parameter setting's probability in the grammar before the update (e.g., p(FootHead=L)_old = 0.6)
  • R(ψi) is the reward value, between 0 and 1 (see next slide)
  • λ is the learning rate, between 0 and 1; here, λ = 0.1 was chosen

15
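A minimal sketch of this update in Python, under the slide's notation (the function name is ours; λ = 0.1 as chosen here):

```python
def lrp_update(p_old, reward, lam=0.1):
    """Linear Reward-Penalty Scheme (Bush and Mosteller 1951): the new probability
    is a weighted average of the reward value R and the old probability."""
    return reward * lam + p_old * (1 - lam)

# e.g. p(FootHead=L)_old = 0.6, rewarded with R = 1:
print(lrp_update(0.6, 1.0))  # 1 * 0.1 + 0.6 * 0.9 = 0.64
```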

Reward computation (NPL)

• Reward value (R) always 0 or 1, based on one single production of the current data point

• R = 1 for all parameter settings chosen if the production yields a match
  • R = 1 − 1 = 0 for all opposite values
• R = 0 for all parameter settings chosen if the production yields a mismatch
  • R = 1 − 0 = 1 for all opposite values

16
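A sketch of this all-or-nothing reward assignment, reusing the grammar dictionary from the earlier sketch (the helper and its name are ours, purely an illustration of the scheme described above):

```python
def npl_rewards(grammar, chosen, is_match):
    """NPL reward values after one production of the current data point:
    R = 1 for every chosen setting if the production matched the data point,
    R = 0 if it mismatched; the opposite setting of each parameter gets 1 - R."""
    r_chosen = 1.0 if is_match else 0.0
    return {
        (param, setting): (r_chosen if setting == chosen[param] else 1.0 - r_chosen)
        for param, settings in grammar.items()
        for setting in settings
    }
```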

NPL: Uniform reward values

• NPL has no way to determine when parameters matter (or not)
• E.g., produce [ka ta paa ta na] – chosen parameter settings include:
  • MainStress = L: essential for initial stress
  • ExtrametricalityEdge = L (given that Extrametricality = On): incompatible with initial stress
• Both receive R = 0 for producing an incorrect stress pattern
• Desirable setting MainStress = L penalized as an "accomplice" (Yang 2002)

17


NPL: Uniform reward values

• NPL has no way to determine when parameters matter (or not)
• E.g., produce [ka ta paa ta na] – chosen parameter settings include:
  • MainStress = L: essential for initial stress
  • Weight-by-Position = On (-VC rhymes count as heavy): irrelevant
• Both receive R = 1 for producing the correct stress pattern
• Irrelevant setting Weight-by-Position = On rewarded as a "hitchhiker" (Yang 2002); may conflict with other data points

19

Reward computation (EDPL)

• Each parameter setting's reward (R) is a probability
  • R(ψi) = p(ψi | data point) – How useful is this parameter setting for successfully analyzing this data point?
• Approximates the E-step in Expectation Maximization
• Easily computed through a Bayesian reformulation (Jarosz submitted):

  R(ψi) = p(ψi | data point) = p(data point | ψi) · p(ψi)_old / p(data point)   (by Bayes' Rule)

20

Reward computation (EDPL)

• Each parameter setting's reward (R) is a probability
  • R(ψi) = p(ψi | data point) – How useful is this parameter setting for successfully analyzing this data point?
• Approximates the E-step in Expectation Maximization
• Easily computed through a Bayesian reformulation (Jarosz submitted):

  R(ψi) = p(ψi | data point) = p(data point | ψi) · p(ψi)_old / p(data point)

  • p(ψi)_old: probability of choosing this parameter setting (look up in current grammar)

21

Reward computation (EDPL)

• Each parameter setting's reward (R) is a probability
  • R(ψi) = p(ψi | data point) – How useful is this parameter setting for successfully analyzing this data point?
• Approximates the E-step in Expectation Maximization
• Easily computed through a Bayesian reformulation (Jarosz submitted):

  R(ψi) = p(ψi | data point) = p(data point | ψi) · p(ψi)_old / p(data point)

  • p(data point | ψi): estimated by sampling (see below)

22

Reward computation (EDPL)

• Each parameter setting's reward (R) is a probability
  • R(ψi) = p(ψi | data point) – How useful is this parameter setting for successfully analyzing this data point?
• Approximates the E-step in Expectation Maximization
• Easily computed through a Bayesian reformulation (Jarosz submitted):

  R(ψi) = p(ψi | data point) = p(data point | ψi) · p(ψi)_old / p(data point)

  • p(data point): derived from both previously mentioned quantities

23

Reward computation (EDPL)

• Each parameter setting's reward (R) is a probability
  • R(ψi) = p(ψi | data point) – How useful is this parameter setting for successfully analyzing this data point?
• Approximates the E-step in Expectation Maximization
• Easily computed through a Bayesian reformulation (Jarosz submitted):

  p(data point) = p(data point | ψi) · p(ψi)_old + p(data point | ¬ψi) · p(¬ψi)_old

  • p(¬ψi)_old: probability of picking the opposite value of the same parameter

24
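Putting the numerator and denominator together, a sketch of the reward for one binary parameter setting (names are ours; the two likelihood terms would come from the sampling procedure described on the next slide):

```python
def edpl_reward(p_setting_old, lik_setting, lik_opposite):
    """R(psi_i) = p(psi_i | data point), computed by Bayes' Rule, where
      lik_setting   = p(data point | psi_i)       (estimated by sampling)
      lik_opposite  = p(data point | not-psi_i)
      p_setting_old = p(psi_i)_old, looked up in the current grammar."""
    p_data = lik_setting * p_setting_old + lik_opposite * (1.0 - p_setting_old)
    if p_data == 0.0:
        # Data point unproducible under either setting in this sample; leaving the
        # probability unchanged here is our assumption, not something the slides specify.
        return p_setting_old
    return (lik_setting * p_setting_old) / p_data
```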

Estimating p(data point | ψi)

• p(data point | ψi) defined as the proportion of matches out of a sample of productions, supposing that ψi is categorically in the grammar
  • Temporarily set p(ψi) to 1 in the current grammar (e.g., set FootHead to L, keep all other parameters probabilistic)
  • Produce the data point r times and assess match/mismatch (we chose r = 50)

  p(data point | ψi) ≈ (number of matches) / r

• Reward computations are linear in the number of parameters:
  • NPL: assess match/mismatch once for every data point
  • EDPL: number of match/mismatch assessments per data point is constant per parameter setting (r), so the total – number of parameters × 2 (settings) × r (productions) – remains linear in the number of parameters

25
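A sketch of this sampling estimate in Python (the `produce` callback stands in for the metrical production process of footing and stressing a word under the chosen settings, which is not spelled out here; r = 50 as on the slide):

```python
import copy
import random

def estimate_likelihood(grammar, param, setting, data_point, produce, r=50):
    """Estimate p(data point | psi_i): clamp p(psi_i) = 1 in a copy of the grammar,
    keep all other parameters probabilistic, produce the data point r times, and
    return the proportion of productions that match the observed stress pattern."""
    clamped = copy.deepcopy(grammar)
    clamped[param] = {setting: 1.0}             # psi_i categorically in the grammar
    matches = 0
    for _ in range(r):
        chosen = {p: random.choices(list(d), weights=list(d.values()), k=1)[0]
                  for p, d in clamped.items()}  # weighted coin flip per parameter
        if produce(data_point, chosen) == data_point:  # match / mismatch assessment
            matches += 1
    return matches / r
```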

EDPL: Gradient reward values

• EDPL estimates when parameters matter (or not), and to what degree
• E.g., produce [ka ta paa ta na]
  • MainStress = L: essential
    • Only MainStress = L can generate matches
    • R(MainStress=L) = 1
  • ExtrametricalityEdge = L: incompatible
    • ExtrametricalityEdge = L cannot generate matches
    • R(ExtrametricalityEdge=L) = 0
• Essential settings: max. reward; incompatible ones: max. punishment

26


EDPL: Gradient reward values

• EDPL estimates when parameters matter (or not), and to what degree
• E.g., produce [ka ta paa ta na]
  • MainStress = L: essential
    • Only MainStress = L can generate matches
    • R(MainStress=L) = 1
  • Weight-by-Position = On: irrelevant
    • Weight-by-Position = On/Off equally likely to yield a match
    • R(W-by-P=On) ≈ p(W-by-P=On), no change in p(ψi)
• Essential settings: max. reward; irrelevant settings: no change

28

EDPL: Gradient reward values

• Reward gradient and sensitive to other parameter settings
• E.g., [ma na ma na na] has trochaic and iambic parses:

  Trochee: <ma> (na ma) (na na)
  Iamb:    (ma na) (ma na) <na>

29

EDPL: Gradient reward values

• Reward gradient and sensitive to other parameter settings
• E.g., [ma na ma na na] has trochaic and iambic parses:

  Trochee: <ma> (na ma) (na na)
  Iamb:    (ma na) (ma na) <na>

• R(FootForm = Trochee) higher if ExtrametricalityEdge = L more likely in grammar (based on previous data points)
  • Lower as p(ExtrametricalityEdge=L) goes down
• Learner's beliefs influence analysis of incoming data

30

Simulations and Results

Stress Typology Test Set

• 23 languages in Dresher and Kaye's (1990) system
• Focus on 6 out of 10 parameters:
  • All parameters on foot properties, quantity-sensitivity, and extrametricality
  • All combinations explored except Unbounded QI Iambs: these have learning properties equal to Unbounded QI Trochees
• Constructed languages:
  • All possible 3- to 6-syllable combinations of [ta], [taa], and [tan] with corresponding stress patterns
  • However, many patterns correspond to real languages (16 out of 23)

32

Learning Set-up

• All 23 languages given to NPL and EDPL
• 10 runs for every language
• Every run continues for 1,000,000 iterations or until convergence
  • Convergence: every word's stress produced correctly 99 out of 100 times with the current grammar (checked every 100 iterations)
• Evaluation:
  • Did the learner converge on a correct grammar?
  • How quickly?

33
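As we read this set-up, one run could be organized roughly as below (a sketch only; `update` and `produces_correctly` are placeholders for the NPL/EDPL update and for a single production-and-comparison step, not the authors' code):

```python
import random

def run_learner(words, grammar, update, produces_correctly,
                max_iterations=1_000_000, check_every=100):
    """One run: present one randomly drawn word per iteration, update the grammar
    incrementally, and stop once every word's stress is produced correctly at
    least 99 times out of 100 under the current grammar."""
    for iteration in range(1, max_iterations + 1):
        data_point = random.choice(words)      # one data point at a time
        grammar = update(grammar, data_point)  # NPL or EDPL update of the grammar
        if iteration % check_every == 0 and all(
            sum(produces_correctly(grammar, w) for _ in range(100)) >= 99
            for w in words
        ):
            return grammar, iteration          # converged
    return grammar, max_iterations             # did not converge within the limit
```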

NPL vs. EDPL

• NPL failed to converge for all but one language
  • Overall success rate: 4.3%
  • Within 89,370 iterations on average
• EDPL showed convergence for all languages
  • Overall success rate: 96.0%
  • Within 200 iterations on average

34

NPL vs. EDPL

• NPL failed to converge for all but one language
  • Success only for initial stress, no secondary stress:
    ta.taa.tan.ta.ta.tan.tata.ta.ta.ta.taa etc.
  • Most ambiguous language: compatible with more parameter-setting combinations than all others
  • Over 200 times slower than the slowest language for the EDPL

35

NPL vs. EDPL

• EDPL showed convergence (≤400 iterations) for all languages
• Success on all runs, except Left-extrametrical QI Trochees (success on 1/10 runs)
  • Shares all forms with iambic languages, see below
  • FootHeadedness = R rewarded as long as extrametricality not settled
  • If p(FtHead=R) reaches 1, extrametricality can never be settled

36

L-to-R QI lang.   | 3 syllables | 4 syllables | 5 syllables | 6 syllables
Left-Ext Trochees | <> ( )      | <> ( )( )   | <> ( )( )   | <> ( )( )( )
Right-Ext Iambs   | ( ) <>      | ( )( ) <>   | ( )( ) <>   | ( )( )( ) <>
Iambs (No Ext)    | ( )( )      | ( )( )      | ( )( )( )   | ( )( )( )

Discussion and conclusion

Summary: NPL vs. EDPL

• EDPL retains advantages of NPL:
  • Gradual update of stochastic grammar; computed in linear time
• At the same time, EDPL has higher power of inference than NPL:
  • Able to distinguish relevant from irrelevant parameters given a data point
  • Leads to success on representative typology

38

Summary: NPL vs. EDPL

• EDPL retains advantages of NPL:
  • Gradual update of stochastic grammar; computed in linear time
• At the same time, EDPL has higher power of inference than NPL:
  • Able to distinguish relevant from irrelevant parameters given a data point
  • Leads to success on representative typology
• EDPL provides a mechanism for identifying unambiguous data (by gauging individual parameter settings' success on a data point)
  • Takes over the function of Dresher & Kaye's cues and Sakas and Fodor's (2001) parsing method

39

Implications for domain-specificity

• Pearl (2007) argues: domain-general statistical learning of parameters fails and has to be supplemented by domain-specific mechanisms
• We argue: the NPL's failure is not representative of statistical learning in general
  • EDPL performs well on stress parameter setting
• Therefore: the necessity of domain-specific mechanisms for stress parameters is in question
• Future work: understanding to what extent EDPL duplicates the effect of domain-specific mechanisms (cues and their ordering)

40

Thank you!

Acknowledgments

We would like to thank Adam Albright, Gasper Begus, Naomi Feldman, Edward Flemming, Michael Kenstowicz, Erin Olson, Joe Pater, Ezer Rasin, Juliet Stanton, Kristine Yu, Sam Zukoff, members of the UMass Amherst Sound Workshop, members of the MIT Phonology Circle, and participants of NECPhon 10 at UMass Amherst for their feedback that has greatly benefited this work. All errors remain our own.

42

References

Berwick, Robert C., and Partha Niyogi. 1996. Learning from Triggers. Linguistic Inquiry 27(4): 605-622.

Bush, Robert, and Frederick Mosteller. 1951. A mathematical model for simple learning. Psychological Review 58, 313–323.

Chomsky, Noam. 1981. Lectures on government and binding. Dordrecht: Foris.

Dresher, B. Elan, and Jonathan D. Kaye. 1990. A computational learning model for metrical phonology. Cognition 34: 137-195.

Fodor, Janet D. 1998. Parsing to Learn. Journal of Psycholinguistic Research 27(3), 339-374.

Gibson, Edward, and Kenneth Wexler. 1994. Triggers. Linguistic Inquiry 25(3): 407-454.

Gould, Isaac. 2015. Syntactic Learning from Ambiguous Evidence: Errors and End-States. PhD dissertation, Massachusetts Institute of Technology.

Jarosz, Gaja. Submitted. Expectation Driven Learning of Phonology.

Lightfoot, David. 1999. The Development of Language: Acquisition, Change, and Evolution. Oxford: Blackwell.

Pearl, Lisa. 2007. Necessary Bias in Natural Language Learning. Doctoral dissertation, University of Maryland.

Pearl, Lisa. 2008. Putting the emphasis on unambiguous: The feasibility of data filtering for learning English metrical phonology. In Harvey Chan, Heather Jacob & Enkeleida Kapia (eds.), Proceedings of the 32nd annual Boston University Conference on Child Language Development (BUCLD 32), 390–401. Somerville, MA: Cascadilla Press.

Pearl, Lisa. 2009. Acquiring complex linguistic systems from natural language data: What selective learning biases can do. Ms. University of California, Irvine.

Pearl, Lisa. 2011. When Unbiased Probabilistic Learning is Not Enough: Acquiring a Parametric System of Metrical Phonology. Language Acquisition 18(2): 87-120.

Sakas, William G., and Janet D. Fodor. 2001. The structural triggers learner. In Stefano Bertolo (ed.), Language Acquisition and Learnability, 172-233. Cambridge, UK: Cambridge University Press.

Yang, Charles. 2002. Knowledge and Learning in Natural Language. Oxford: Oxford University Press.

43

Appendix

Dresher and Kaye’s parameters

• 10 binary parameters on foot placement, quantity sensitivity, extrametricality, and other aspects of metrical phonology
  • Foot placement (3 parameters): Foot Headedness (Left/Right), Foot Boundedness (Y/N), Ban on feet headed by a Light syllable (Y/N)
  • Quantity Sensitivity (Y/N) (1 parameter)
  • Extrametricality (Y/N; Left/Right) (2 parameters)
  • Other parameters: CVC = Heavy, Footing Direction, Main Stress, Secondary Stress

45

EDPL: Non-uniform reward values

• Parameter settings rewarded based on relevance to the data point
  • Irrelevant parameter settings: R(ψi) ≈ p(ψi), no reward/punishment
    • Match and mismatch equally often under either setting of the parameter
  • Essential parameter settings: R(ψi) = 1, maximal reward
  • Incompatible parameter settings: R(ψi) = 0, maximal punishment
  • Possibly helpful parameter settings: R(ψi) > p(ψi), nonzero reward
    • Match more frequent for ψi than for ¬ψi
    • The better the parameter setting does, the larger the increase in p(ψi)

46

Parameter setting with cues (Dresher & Kaye)

Cues:
• MainStress = L if stress within 1 syllable of the L word edge; ditto for R
• QS = On if same-length words have different stress; initial value: Off

QI language:
• data point 1: ka ta pa → MainStress set to L
• data point 2: ka taa pa → No parameters set
• data point 3: kaa taa → No parameters set
• END: MainStress = L, QS = Off

QS language:
• data point 1: ka ta pa → MainStress set to L
• data point 2: ka taa pa → QS set to On (cf. data point 1)
• data point 3: kaa taa → No parameters set
• END: MainStress = L, QS = On

47

Parameter setting with cues (Pearl)

Cues:
• MainStress = L if stress within 1 syllable of the L word edge; ditto for R
• QS = On if a 2-syllable word has 2 stresses; QS = Off if there is an unstressed word-medial –VV

QI language:
• data point 1: ka ta pa → MainStress set to L
• data point 2: ka taa pa → QS set to Off
• data point 3: kaa taa → No parameters set
• END: MainStress = L, QS = Off

QS language:
• data point 1: ka ta pa → MainStress set to L
• data point 2: ka taa pa → No parameters set
• data point 3: kaa taa → QS set to On
• END: MainStress = L, QS = On

48

Parameter setting with EDPL (λ = 0.5)

Start with p(MainStress=L) = 0.5 and p(MainStress=R) = 0.5
Start with p(QS=On) = 0.5 and p(QS=Off) = 0.5

QI language:
• data point 1: ka ta pa → R(MS=L) = 1, R(QS=Off) = 0.5
• data point 2: ka taa pa → R(MS=L) = 1, R(QS=Off) = 1
• data point 3: kaa taa → R(MS=L) = 0.5, R(QS=Off) = 1
• END: MS=L ↑↑ 0.88, QS=Off ↑↑ 0.88

QS language:
• data point 1: ka ta pa → R(MS=L) = 1, R(QS=On) = 0.5
• data point 2: ka taa pa → R(MS=L) = 1, R(QS=On) = 1
• data point 3: kaa taa → R(MS=L) = 1, R(QS=On) = 1
• END: MS=L ↑↑↑ 0.94, QS=On ↑↑ 0.88

49

Parameter setting with NPL (λ = 0.5)

Start with p(MainStress=L) = 0.5 and p(MainStress=R) = 0.5
Start with p(QS=On) = 0.5 and p(QS=Off) = 0.5

QI language:
• data point 1: ka ta pa – MS=R and QS=Off: *(kata)(pa)
• data point 2: ka taa pa – MS=L and QS=On: *(ka)(taapa)
• data point 3: kaa taa – MS=R and QS=Off: (kaataa)
• END: MS=R ↓↑↑ 0.81, QS=Off ↓↑↑ 0.81

QS language:
• data point 1: ka ta pa – MS=L and QS=Off: (kata)(pa)
• data point 2: ka taa pa – MS=L and QS=Off: *(kataa)(pa)
• data point 3: kaa taa – MS=R and QS=On: *(kaa)(taa)
• END: MS=L ↑↓↑ 0.69, QS=Off ↑↓↑ 0.69

50
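For concreteness, the QI-language trajectory above can be reproduced with the Linear Reward-Penalty update at λ = 0.5: p(MS=R) is first penalized (mismatch with MS=R chosen), then rewarded as the opposite of the penalized MS=L, then rewarded directly (match with MS=R chosen). A small check in Python (our own verification, not the authors' code):

```python
def lrp_update(p_old, reward, lam=0.5):
    return reward * lam + p_old * (1 - lam)

p = 0.5                           # initial p(MainStress=R)
for reward in [0.0, 1.0, 1.0]:    # down, up, up (the three arrows on the slide)
    p = lrp_update(p, reward)
print(round(p, 2))                # 0.81
```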

Connection to Expectation Maximization

• EDPL’s reward computation approximates Expectation Maximization (EM; Dempster et al. 1977)• R(yi) = p(yi|data point): gradual (on-line) approximation of

p(yi)new = p(yi|data set)

• p(data point|yi) estimated through sampling rather than analytically

• EM guaranteed to find (local) maximum of data likelihood

• No such guarantee for the NPL

51

Summary: NPL vs. EDPL

• EDPL retains advantages of NPL:
  • Gradual update of stochastic grammar; computed in linear time
• At the same time, EDPL has higher power of inference than NPL:
  • Able to distinguish relevant from irrelevant parameters given a data point
  • Leads to success on representative typology
• EDPL also indirectly provides a mechanism for parameter ordering:
  • Some parameters (extrametricality, QS) can be found when the rest of the structure (foot headedness, foot boundaries) is not yet (completely) established

52