



© 2003 Association for Computational Linguistics

A Machine Learning Approach to Modeling Scope Preferences

Derrick Higgins*  Jerrold M. Sadock†

University of Chicago University of Chicago

This article describes a corpus-based investigation of quantifier scope preferences. Following recent work on multimodular grammar frameworks in theoretical linguistics and a long history of combining multiple information sources in natural language processing, scope is treated as a distinct module of grammar from syntax. This module incorporates multiple sources of evidence regarding the most likely scope reading for a sentence and is entirely data-driven. The experiments discussed in this article evaluate the performance of our models in predicting the most likely scope reading for a particular sentence, using Penn Treebank data both with and without syntactic annotation. We wish to focus attention on the issue of determining scope preferences, which has largely been ignored in theoretical linguistics, and to explore different models of the interaction between syntax and quantifier scope.

1. Overview

This article addresses the issue of determining the most accessible quantifier scope reading for a sentence. Quantifiers are elements of natural and logical languages (such as each, no, and some in English and ∀ and ∃ in predicate calculus) that have certain semantic properties. Loosely speaking, they express that a proposition holds for some proportion of a set of individuals. One peculiarity of these expressions is that there can be semantic differences that depend on the order in which the quantifiers are interpreted. These are known as scope differences.

(1) Everyone likes two songs on this album.

As an example of the sort of interpretive differences we are talking about, consider the sentence in (1). There are two readings of this sentence; which reading is meant depends on which of the two quantified expressions everyone and two songs on this album takes wide scope. The first reading, in which everyone takes wide scope, simply implies that every person has a certain preference, not necessarily related to anyone else's. This reading can be paraphrased as "Pick any person, and that person will like two songs on this album." The second reading, in which everyone takes narrow scope, implies that there are two specific songs on the album of which everyone is fond, say, "Blue Moon" and "My Way."

In theoretical linguistics, attention has been primarily focused on the issue of scope generation. Researchers applying the techniques of quantifier raising and Cooper storage have been concerned mainly with enumerating all of the scope readings for a

* Department of Linguistics, University of Chicago, 1010 East 59th Street, Chicago, IL 60637. E-mail: dchiggin@alumni.uchicago.edu.

† Department of Linguistics, University of Chicago, 1010 East 59th Street, Chicago, IL 60637. E-mail: j-sadock@uchicago.edu.


sentence that are possible, without regard to their relative likelihood or naturalness. Recently, however, linguists such as Kuno, Takami, and Wu (1999) have begun to turn their attention to scope prediction, or determining the relative accessibility of different scope readings.

In computational linguistics, more attention has been paid to the factors that determine scope preferences. Systems such as the SRI Core Language Engine (Moran 1988; Moran and Pereira 1992), LUNAR (Woods 1986), and TEAM (Martin, Appelt, and Pereira 1986) have employed scope critics that use heuristics to decide between alternative scopings. However, the rules that these systems use in making quantifier scope decisions are motivated only by the researchers' intuitions, and no empirical results have been published regarding their accuracy.

In this article we use the tools of machine learning to construct a data-driven model of quantifier scope preferences. For theoretical linguistics, this model serves as an illustration that Kuno, Takami, and Wu's approach can capture some of the clearest generalizations about quantifier scoping. For computational linguistics, this article provides a baseline result on the task of scope prediction with which other scope critics can be compared. In addition, it is the most extensive empirical investigation of which we are aware that collects data of any kind regarding the relative frequency of different quantifier scope readings in English text.[1]

Section 2 briefly discusses treatments of scoping issues in theoretical linguistics, and Section 3 reviews the computational work that has been done on natural language quantifier scope. In Section 4 we introduce the models that we use to predict quantifier scoping, as well as the data on which they are trained and tested. Section 5 combines the scope model of the previous section with a probabilistic context-free grammar (PCFG) model of syntax and addresses the issue of whether these two modules of grammar ought to be combined in serial, with information from the syntax feeding the quantifier scope module, or in parallel, with each module constraining the structures provided by the other.

2. Approaches to Quantifier Scope in Theoretical Linguistics

Most, if not all, linguistic treatments of quantifier scope have closely integrated it with the way in which the syntactic structure of a sentence is built up. Montague (1973) used a syntactic rule to introduce a quantified expression into a derivation at the point where it was to take scope, whereas generative semantic analyses such as McCawley (1998) represented the scope of quantification at deep structure, transformationally lowering quantifiers into their surface positions during the course of the derivation. More recent work in the interpretive paradigm takes the opposite approach, extracting quantifiers from their surface positions to their scope positions by means of a quantifier-raising (QR) transformation (May 1985; Aoun and Li 1993; Hornstein 1995). Another popular technique is to percolate scope information up through the syntactic tree using Cooper storage (Cooper 1983; Hobbs and Shieber 1987; Pollard 1989; Nerbonne 1993; Park 1995; Pollard and Yoo 1998).

The QR approach to dealing with scope in linguistics consists in the claim that there is a covert transformation applying to syntactic structures that moves quantified elements out of the position in which they are found on the surface and raises them to a higher position that reflects their scope. The various incarnations of the strategy that

1 See Carden (1976), however, for a questionnaire-based approach to gathering data on the accessibility of different quantifier scope readings.


Figure 1
Simple illustration of the QR approach to quantifier scope generation.

follows from this claim differ in the precise characterization of this QR transformation: what conditions are placed upon it and what tree-configurational relationship is required for one operator to take scope over another. The general idea of QR is represented in Figure 1, a schematic analysis of the reading of the sentence Someone saw everyone in which someone takes wide scope (i.e., 'there is some person x such that for all persons y, x saw y').

In the Cooper storage approach, quantifiers are gathered into a store and passed upward through a syntactic tree. At certain nodes along the way, quantifiers may be retrieved from the store and take scope. The relative scope of quantifiers is determined by where each quantifier is retrieved from the store, with quantifiers higher in the tree taking wide scope over lower ones. As with QR, different authors implement this scheme in slightly different ways, but the simplest case is represented in Figure 2, the Cooper storage analog of Figure 1.

These structural approaches, QR and Cooper storage, have in common that they allow syntactic factors to have an effect only on the scope readings that are available for a given sentence. They are also similar in addressing only the issue of scope generation, or identifying all and only the accessible readings for each sentence. That is to say, they do not address the issue of the relative salience of these readings.

Kuno, Takami, and Wu (1999, 2001) propose to model the scope of quantified elements with a set of interacting expert systems that basically consists of a weighted vote taken of the various factors that may influence scope readings. This model is meant to account not only for scope generation but also for "the relative strengths of the potential scope interpretations of a given sentence" (1999, page 63). They illustrate the plausibility of this approach in their paper by presenting a number of examples that are accounted for fairly well by the approach, even when an unweighted vote of the factors is allowed to be taken.

So, for example, in Kuno, Takami, and Wu's (49b) (1999), repeated here as (2), the correct prediction is made that the sentence is unambiguous, with the first quantified noun phrase (NP) taking wide scope over the second (the reading in which we don't all have to hate the same people). Table 1 illustrates how the votes of each of Kuno, Takami, and Wu's "experts" contribute to this outcome. Since the expression many of us/you receives more votes, and the numbers for the two competing quantified expressions are quite far apart, the first one is predicted to take wide scope unambiguously.

(2) Many of us/you hate some of them.


Figure 2
Simple illustration of the Cooper storage approach to quantifier scope generation.

Table 1
Voting to determine optimal scope readings for quantifiers, according to Kuno, Takami, and Wu (1999).

                      many of us/you    some of them
Baseline                    √                 √
Subject Q                   √
Lefthand Q                  √
Speaker/Hearer Q            √
Total                       4                 1
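The mechanics of such a vote are simple. The following Python sketch (our own illustration, not from Kuno, Takami, and Wu's paper; the expert names and judgments merely mirror Table 1) tallies one unweighted vote per expert for each quantified expression and returns the totals.

def vote(experts, judgments):
    # Tally one vote per expert for each quantified expression it endorses.
    totals = {"Q1": 0, "Q2": 0}
    for expert in experts:
        for qp in judgments.get(expert, []):
            totals[qp] += 1
    return totals

experts = ["baseline", "subject-Q", "lefthand-Q", "speaker/hearer-Q"]
# Each expert endorses the expressions it favors (both, one, or neither).
judgments = {
    "baseline": ["Q1", "Q2"],
    "subject-Q": ["Q1"],
    "lefthand-Q": ["Q1"],
    "speaker/hearer-Q": ["Q1"],
}

print(vote(experts, judgments))
# {'Q1': 4, 'Q2': 1} -> Q1 ("many of us/you") is predicted to take wide scope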

Some adherents of the structural approaches also seem to acknowledge the necessity of eventually coming to terms with the factors that play a role in determining scope preferences in language. Aoun and Li (2000) claim that the lexical scope preferences of quantifiers "are not ruled out under a structural account" (page 140). It is clear from the surrounding discussion, though, that they intend such lexical requirements to be taken care of in some nonsyntactic component of grammar. Although Kuno, Takami, and Wu's dialogue with Aoun and Li in Language has been portrayed by both sides as a debate over the correct way of modeling quantifier scope, they are not really modeling the same things. Whereas Aoun and Li (1993) provide an account of scope generation, Kuno, Takami, and Wu (1999) intend to model both scope generation and scope prediction. The model of scope preferences provided in this article is an empirically based refinement of the approach taken by Kuno, Takami, and Wu, but in principle it is consistent with a structural account of scope generation.


3. Approaches to Quantifier Scope in Computational Linguistics

Many studies, such as Pereira (1990) and Park (1995), have dealt with the issue of scope generation from a computational perspective. Attempts have also been made in computational work to extend a pure Cooper storage approach to handle scope prediction. Hobbs and Shieber (1987) discuss the possibility of incorporating some sort of ordering heuristics into the SRI scope generation system, in the hopes of producing a ranked list of possible scope readings, but ultimately are forced to acknowledge that "[t]he modifications turn out to be quite complicated if we wish to order quantifiers according to lexical heuristics, such as having each out-scope some. Because of the recursive nature of the algorithm, there are limits to the amount of ordering that can be done in this manner" (page 55). The stepwise nature of these scope mechanisms makes it hard to state the factors that influence the preference for one quantifier to take scope over another.

Those natural language processing (NLP) systems that have managed to provide some sort of account of quantifier scope preferences have done so by using a separate system of heuristics (or scope critics) that apply postsyntactically to determine the most likely scoping. LUNAR (Woods 1986), TEAM (Martin, Appelt, and Pereira 1986), and the SRI Core Language Engine as described by Moran (1988; Moran and Pereira 1992) all employ scope rules of this sort. By and large, these rules are of an ad hoc nature, implementing a linguist's intuitive idea of what factors determine scope possibilities, and no results have been published regarding the accuracy of these methods. For example, Moran (1988) incorporates rules from other NLP systems and from VanLehn (1978), such as a preference for a logically weaker interpretation, the tendency for each to take wide scope, and a ban on raising a quantifier across multiple major clause boundaries. The testing of Moran's system is "limited to checking conformance to the stated rules" (pages 40-41). In addition, these systems are generally incapable of handling unrestricted text such as that found in the Wall Street Journal corpus in a robust way, because they need to do a full semantic analysis of a sentence in order to make scope predictions. The statistical basis of the model presented in this article offers increased robustness and the possibility of more serious evaluation on the basis of corpus data.

4. Modeling Quantifier Scope

In this section we argue for an empirically driven machine learning approach to the identification of factors relevant to quantifier scope and the modeling of scope preferences. Following much recent work that applies the tools of machine learning to linguistic problems (Brill 1995; Pedersen 2000; van Halteren, Zavrel, and Daelemans 2001; Soon, Ng, and Lim 2001), we will treat the prediction of quantifier scope as an example of a classification task. Our aim is to provide a robust model of scope prediction based on Kuno, Takami, and Wu's theoretical foundation and to address the serious lack of empirical results regarding quantifier scope in computational work. We describe here the modeling tools borrowed from the field of artificial intelligence for the scope prediction task and the data from which the generalizations are to be learned. Finally, we present the results of training different incarnations of our scope module on the data and assess the implications of this exercise for theoretical and computational linguistics.


4.1 Classification in Machine Learning
Determining which among multiple quantifiers in a sentence takes wide scope, given a number of different sources of evidence, is an example of what is known in machine learning as a classification task (Mitchell 1996). There are many types of classifiers that may be applied to this task that both are more sophisticated than the approach suggested by Kuno, Takami, and Wu and have a more solid probabilistic foundation. These include the naive Bayes classifier (Manning and Schütze 1999; Jurafsky and Martin 2000), maximum-entropy models (Berger, Della Pietra, and Della Pietra 1996; Ratnaparkhi 1997), and the single-layer perceptron (Bishop 1995). We employ these classifier models here primarily because of their straightforward probabilistic interpretation and their similarity to the scope model of Kuno, Takami, and Wu (since they each could be said to implement a kind of weighted voting of factors). In Section 4.3 we describe how classifiers of these types can be constructed to serve as a grammatical module responsible for quantifier scope determination.

All of these classifiers can be trained in a supervised manner. That is, given a sample of training data that provides all of the information that is deemed to be relevant to quantifier scope, and the actual scope reading assigned to a sentence, these classifiers will attempt to extract generalizations that can be fruitfully applied in classifying as-yet-unseen examples.

4.2 Data
The data on which the quantifier scope classifiers are trained and tested is an extract from the Penn Treebank (Marcus, Santorini, and Marcinkiewicz 1993) that we have tagged to indicate the most salient scope interpretation of each sentence in context. Figure 3 shows an example of a training sentence with the scope reading indicated. The quantifier lower in the tree bears the tag "Q1" and the higher quantifier bears the tag "Q2", so this sentence is interpreted such that the lower quantifier has wide scope. Reversing the tags would have meant that the higher quantifier takes wide scope, while if both quantifiers had been marked "Q1", this would have indicated that there is no scope interaction between them (as when they are logically independent or take scope in different conjuncts of a conjoined phrase).[2]

The sentences tagged were chosen from the Wall Street Journal (WSJ) section of the Penn Treebank to have a certain set of attributes that simplify the task of designing the quantifier scope module of the grammar. First, in order to simplify the coding process, each sentence has exactly two scope-taking elements of the sort considered for this project.[3] These include most NPs that begin with a determiner, predeterminer, or quantifier phrase (QP),[4] but exclude NPs in which the determiner is a, an, or the.

2 This "no interaction" class is a sort of "elsewhere" category that results from phrasing the classification question as "Which quantifier takes wider scope in the preferred reading?" Where there is no scope interaction, the answer is "neither." This includes cases in which the relative scope of operators does not correspond to a difference in meaning, as in One woman bought one horse, or when they take scope in different propositional domains, such as in Mary bought two horses and sold three sheep. The human coders used in this study were instructed to choose class 0 whenever there was not a clear preference for one of the two scope readings.

3 This restriction that each sentence contain only two quantified elements does not actually exclude many sentences from consideration. We identified only 61 sentences with three quantifiers of the sort we consider, and 12 sentences with four. In addition, our review of these sentences revealed that many of them simply involve lists in which the quantifiers do not interact in terms of scope (as in, for example, "We ask that you turn off all cell phones, extinguish all cigarettes, and open any candy before the performance begins"). Thus the class of sentences with more than two quantifiers is small and seems to involve even simpler quantifier interactions than those found in our corpus.

4 These categories are intended to be understood as they are used in the tagging and parsing of the Penn Treebank. See Santorini (1990) and Bies et al. (1995) for details; the Appendix lists selected codes used


( (S (NP-SBJ
       (NP (DT Those) )
       (SBAR
         (WHNP-1 (WP who) )
         (S
           (NP-SBJ-2 (-NONE- *T*-1) )
           (ADVP (RB still) )
           (VP (VBP want)
             (S
               (NP-SBJ (-NONE- *-2) )
               (VP (TO to)
                 (VP (VB do)
                   (NP (PRP it) ))))))))
     (`` ``)
     (VP (MD will)
       (ADVP (RB just) )
       (VP (VB find)
         (NP
           (NP (DT-Q2 some) (NN way) )
           (SBAR
             (WHADVP-3 (-NONE- 0) )
             (S
               (NP-SBJ (-NONE- *) )
               (VP (TO to)
                 (VP (VB get)
                   (PP (IN around) ('' '')
                     (NP (DT-Q1 any) (NN attempt)
                       (S
                         (NP-SBJ (-NONE- *) )
                         (VP (TO to)
                           (VP (VB curb)
                             (NP (PRP it) ))))))
                   (ADVP-MNR (-NONE- *T*-3) ))))))))
     (. .) ))

Figure 3
Tagged Wall Street Journal text from the Penn Treebank. The lower quantifier takes wide scope, indicated by its tag "Q1".

Excluding these determiners from consideration largely avoids the problem of generics and the complexities of assigning scope readings to definite descriptions. In addition, only sentences that had the root node S were considered. This serves to exclude sentence fragments and interrogative sentence types. Our data set therefore differs systematically from the full WSJ corpus, but we believe it is sufficient to allow many generalizations about English quantification to be induced. Given these restrictions on the input data, the task of the scope classifier is a choice among three alternatives:[5]

(Class 0) There is no scopal interaction.

(Class 1) The first quantifier takes wide scope.

(Class 2) The second quantifier takes wide scope.

for annotating the Penn Treebank corpus. The category QP is particularly unintuitive in that it does not correspond to a quantified noun phrase, but to a measure expression such as more than half.

5 Some linguists may find it strange that we have chosen to treat the choice of preferred scoping for two quantified elements as a tripartite decision, since the possibility of independence is seldom treated in the linguistic literature. As we are dealing with corpus data in this experiment, we cannot afford to ignore this possibility.


The result is a set of 893 sentences,[6] annotated with Penn Treebank II parse trees and hand-tagged for the primary scope reading.

To assess the reliability of the hand-tagged data used in this project, the data were coded a second time by an independent coder, in addition to the reference coding. The independent codings agreed with the reference coding on 76.3% of sentences. The kappa statistic (Cohen 1960) for agreement was .52, with a 95% confidence interval between .40 and .64. Krippendorff (1980) has been widely cited as advocating the view that kappa values greater than .8 should be taken as indicating good reliability, with values between .67 and .8 indicating tentative reliability, but we are satisfied with the level of intercoder agreement on this task. As Carletta (1996) notes, many tasks in computational linguistics are simply more difficult than the content analysis classifications addressed by Krippendorff, and according to Fleiss (1981), kappa values between .4 and .75 indicate fair to good agreement anyhow.
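For reference, the kappa statistic used here is the standard chance-corrected agreement measure; the following short Python sketch (a generic illustration, not the authors' code) computes it from two coders' label sequences over the three scope classes.

from collections import Counter

def cohen_kappa(coder1, coder2):
    # Cohen's (1960) kappa: observed agreement corrected for expected chance agreement.
    n = len(coder1)
    p_o = sum(a == b for a, b in zip(coder1, coder2)) / n
    c1, c2 = Counter(coder1), Counter(coder2)
    p_e = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy example with the three scope classes 0, 1, 2:
print(cohen_kappa([0, 1, 0, 2, 1, 0], [0, 1, 1, 2, 0, 0]))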

Discussion between the coders revealed that there was no single cause for their differences in judgments, when such differences existed. Many cases of disagreement stem from different assumptions regarding the lexical quantifiers involved. For example, the coders sometimes differed on whether a given instance of the word any corresponds to a narrow-scope existential, as we conventionally treat it when it is in the scope of negation, or the "free-choice" version of any. To take another example, two universal quantifiers are independent in predicate calculus (∀x∀y[φ] ⟺ ∀y∀x[φ]), but in creating our scope-tagged corpus it was often difficult to decide whether two universal-like English quantifiers (such as each, any, every, and all) were actually independent in a given sentence. Some differences in coding stemmed from coder disagreements about whether a quantifier within a fixed expression (e.g., all the hoopla) truly interacts with other operators in the sentence. Of course, another major factor contributing to intercoder variation is the fact that our data sentences, taken from Wall Street Journal text, are sometimes quite long and complex in structure, involving multiple scope-taking operators in addition to the quantified NPs. In such cases the coders sometimes had difficulty clearly distinguishing the readings in question.

Because of the relatively small amount of data we had, we used the technique of tenfold cross-validation in evaluating our classifiers, in each case choosing 89 of the 893 total data sentences from the data as a test set and training on the remaining 804. We preprocessed the data in order to extract the information from each sentence that we would be treating as relevant to the prediction of quantifier scoping in this project. (Although the initial coding of the preferred scope reading for each sentence was done manually, this preprocessing of the data was done automatically.) At the end of this preprocessing, each sentence was represented as a record containing the following information (see the Appendix for a list of annotation codes for the Penn Treebank):

• the syntactic category, according to Penn Treebank conventions, of the first quantifier (e.g., DT for each, NN for everyone, or QP for more than half)

• the first quantifier as a lexical item (e.g., each or everyone). For a QP consisting of multiple words, this field contains the head word, or "CD" in case the head is a cardinal number

• the syntactic category of the second quantifier

• the second quantifier as a lexical item

6 These data have been made publicly available to all licensees of the Penn Treebank by means of a patch file that may be retrieved from ⟨http://humanities.uchicago.edu/linguistics/students/dchiggin/qscope-data.tgz⟩. This file also includes the coding guidelines used for this project.


class:               2
first cat:           DT
first head:          some
second cat:          DT
second head:         any
join cat:            NP
first c-commands:    YES
second c-commands:   NO
nodes intervening:   6
VP intervenes:       YES
ADVP intervenes:     NO
S intervenes:        YES
conj intervenes:     NO
, intervenes:        NO
. intervenes:        NO
'' intervenes:       YES

Figure 4
Example record corresponding to the sentence shown in Figure 3.

• the syntactic category of the lowest node dominating both quantified NPs (the "join" node)

• whether the first quantified NP c-commands the second

• whether the second quantified NP c-commands the first

• the number of nodes intervening[7] between the two quantified NPs

• a list of the different categories of nodes that intervene between the quantified NPs (actually, for each nonterminal category there is a distinct binary feature indicating whether a node of that category intervenes)

• whether a conjoined node intervenes between the quantified NPs

• a list of the punctuation types that are immediately dominated by nodes intervening between the two NPs (again, for each punctuation tag in the treebank there is a distinct binary feature indicating whether such punctuation intervenes)

Figure 4 illustrates how these features would be used to encode the example in Figure 3.

The items of information included in the record, as listed above, are not the exact factors that Kuno, Takami, and Wu (1999) suggest be taken into consideration in making scope predictions, and they are certainly not sufficient to determine the proper scope reading for all sentences completely. Surely pragmatic factors and real-world knowledge influence our interpretations as well, although these are not represented here. This list does, however, provide information that could potentially be useful in predicting the best scope reading for a particular sentence. For example, information
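For concreteness, one such record can be represented as a simple mapping from feature names to values. The sketch below is our own illustration (the field names are hypothetical, not the labels used in our preprocessing scripts); the values mirror the record shown in Figure 4.

# Illustrative encoding of one preprocessed sentence record (cf. Figure 4).
record = {
    "class": 2,                  # 0 = no interaction, 1 = first wide, 2 = second wide
    "first_cat": "DT",
    "first_head": "some",
    "second_cat": "DT",
    "second_head": "any",
    "join_cat": "NP",
    "first_c_commands": True,
    "second_c_commands": False,
    "nodes_intervening": 6,
    "VP_intervenes": True,
    "ADVP_intervenes": False,
    "S_intervenes": True,
    "conj_intervenes": False,
    "quote_intervenes": True,    # punctuation-type features, one per punctuation tag
}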

7 We take a node α to intervene between two other nodes β and γ in a tree if and only if δ is the lowest node dominating both β and γ, δ dominates α or δ = α, and α dominates either β or γ.


Table 2
Baseline performance, summed over all ten test sets.

Condition                 Correct   Incorrect   Percentage correct
First has wide scope          0        281        0/281 =   0.0
Second has wide scope         0         64         0/64 =   0.0
No scope interaction        545          0      545/545 = 100.0
Total                       545        345      545/890 =  61.2

about whether one quantified NP in a given sentence c-commands the other corresponds to Kuno, Takami, and Wu's observation that subject quantifiers tend to take wide scope over object quantifiers, and topicalized quantifiers tend to outscope everything. The identity of each lexical quantifier clearly should allow our classifiers to make the generalization that each tends to take wide scope, if this word is found in the data, and perhaps even learn the regularity underlying Kuno, Takami, and Wu's observation that universal quantifiers tend to outscope existentials.

4.3 Classifier Design
In this section we present the three types of model that we have trained to predict the preferred quantifier scoping on Penn Treebank sentences: a naive Bayes classifier, a maximum-entropy classifier, and a single-layer perceptron.[8] In evaluating how well these models do in assigning the proper scope reading to each test sentence, it is important to have a baseline for comparison. The baseline model for this task is one that simply guesses the most frequent category of the data ("no scope interaction") every time. This simplistic strategy already classifies 61.2% of the test examples correctly, as shown in Table 2.

It may surprise some linguists that this third class of sentences, in which there is no scopal interaction between the two quantifiers, is the largest. In part this may be due to special features of the Wall Street Journal text of which the corpus consists. For example, newspaper articles may contain more direct quotations than other genres. In the process of tagging the data, however, it was also apparent that in a large proportion of cases the two quantifiers were taking scope in different conjuncts of a conjoined phrase. This further tendency supports the idea that people may intentionally avoid constructions in which there is even the possibility of quantifier scope interactions, perhaps because of some hearer-oriented pragmatic principle. Linguists may also be concerned that this additional category, in which there is no scope interaction between quantifiers, makes it difficult to compare the results of the present work with theoretical accounts of quantifier scope that ignore this case and concentrate on instances in which one quantifier does take scope over another. In response to such concerns, however, we point out first that we provide a model of scope prediction rather than scope generation, and so it is in any case not directly comparable with work in theoretical linguistics, which has largely ignored scope preferences. Second, we point out that the empirical nature of this study requires that we take note of cases in which the quantifiers simply do not interact.

8 The implementations of these classifiers are publicly available as Perl modules at ⟨http://humanities.uchicago.edu/linguistics/students/dchiggin/classifiers.tgz⟩.


Table 3
Performance of the naive Bayes classifier, summed over all 10 test runs.

Condition                 Correct   Incorrect   Percentage correct
First has wide scope        177        104      177/281 = 63.0
Second has wide scope        41         23        41/64 = 64.1
No scope interaction        428        117      428/545 = 78.5
Total                       646        244      646/890 = 72.6

4.3.1 Naive Bayes Classifier. Our data D will consist of a vector of features (d_0, ..., d_n) that represent aspects of the sentence under examination, such as whether one quantified expression c-commands the other, as described in Section 4.2. The fundamental simplifying assumption that we make in designing a naive Bayes classifier is that these features are independent of one another and therefore can be aggregated as independent sources of evidence about which class c* a given sentence belongs to. This independence assumption is formalized in equations (1) and (2).

c^* = \arg\max_c P(c)\, P(d_0, \ldots, d_n \mid c)   (1)

    \approx \arg\max_c P(c) \prod_{k=0}^{n} P(d_k \mid c)   (2)

We constructed an empirical estimate of the prior probability P(c) by simply counting the frequency with which each class occurs in the training data. We constructed each P(d_k | c) by counting how often each feature d_k co-occurs with the class c, to obtain an empirical estimate of P(d_k | c), and interpolated this with the empirical frequency P(d_k) of the feature d_k, not conditioned on the class c. This interpolated probability model was used in order to smooth the probability distribution, avoiding the problems that can arise if certain feature-value pairs are assigned a probability of zero.
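A minimal Python sketch of such a classifier, with the interpolated (smoothed) conditional estimates just described, is given below. The interpolation weight lam is an assumption on our part, since the article does not report the value used, and the data layout (a dictionary of feature values per sentence) is likewise only illustrative.

import math
from collections import Counter, defaultdict

class NaiveBayesScope:
    """Naive Bayes over discrete (feature, value) pairs, smoothing P(d_k | c)
    by interpolation with the unconditioned frequency P(d_k)."""

    def __init__(self, lam=0.9):       # lam is an assumed interpolation weight
        self.lam = lam
        self.class_counts = Counter()
        self.feat_given_class = defaultdict(Counter)   # per-class (feature, value) counts
        self.feat_counts = Counter()
        self.n = 0

    def train(self, records):
        for features, c in records:    # features: dict, c: class label (0, 1, or 2)
            self.n += 1
            self.class_counts[c] += 1
            for fv in features.items():
                self.feat_given_class[c][fv] += 1
                self.feat_counts[fv] += 1

    def _p_feat(self, fv, c):
        # Interpolated estimate of P(d_k | c).
        p_cond = self.feat_given_class[c][fv] / self.class_counts[c]
        p_marg = self.feat_counts[fv] / self.n
        return self.lam * p_cond + (1 - self.lam) * p_marg

    def classify(self, features):
        # Choose c maximizing P(c) * prod_k P(d_k | c), as in equation (2).
        best, best_score = None, float("-inf")
        for c in self.class_counts:
            score = math.log(self.class_counts[c] / self.n)
            for fv in features.items():
                score += math.log(self._p_feat(fv, c) or 1e-12)
            if score > best_score:
                best, best_score = c, score
        return best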

The performance of the naive Bayes classifier is summarized in Table 3. For each of the 10 test sets of 89 items taken from the corpus, the remaining 804 of the total 893 sentences were used to train the model. The naive Bayes classifier outperformed the baseline by a considerable margin.

In addition to the raw counts of test examples correctly classified, though, we would like to know something of the internal structure of the model (i.e., what sort of features it has induced from the data). For this classifier, we can assume that a feature f is a good predictor of a class c* when the value of P(f | c*) is significantly larger than the (geometric) mean value of P(f | c) for all other values of c. Those features with the greatest ratio

\frac{P(f) \times P(f \mid c^*)}{\mathrm{geommean}(\forall c \neq c^*\ [P(f \mid c)])}

are listed in Table 4.[9]

The first-ranked feature in Table 4 shows that there is a tendency for quantified elements not to interact when they are found in conjoined constituents, and the second-ranked feature indicates a preference for quantifiers not to interact when there is an intervening comma (presumably an indicator of greater syntactic "distance"). Feature 3 indicates a preference for class 1 when there is an intervening S node,

9 We include the term P(f) in the product in order to prevent sparsely instantiated features from showing up as highly ranked.


Table 4
Most active features from naive Bayes classifier.

Rank   Feature                                                   Predicted class   Ratio
1      There is an intervening conjunct node                            0          1.63
2      There is an intervening comma                                    0          1.51
3      There is an intervening S node                                   1          1.33
4      The first quantified NP does not c-command the second            0          1.25
5      Second quantifier is tagged QP                                   1          1.16
6      There is an intervening S node                                   0          1.12
...
15     The second quantified NP c-commands the first                    2          1.00

whereas feature 6 indicates a preference for class 0 under the same conditions. Presumably this reflects a dispreference for the second quantifier to take wide scope when there is a clause boundary intervening between it and the first quantifier. The fourth-ranked feature in Table 4 indicates that if the first quantified NP does not c-command the second, it is less likely to take wide scope. This is not surprising, given the importance that c-command relations have had in theoretical discussions of quantifier scope. The fifth-ranked feature expresses a preference for quantified expressions of category QP to take narrow scope if they are the second of the two quantifiers under consideration. This may simply be reflective of the fact that class 1 is more common than class 2, and the measure expressions found in QP phrases in the Penn Treebank (such as more than three or about half) tend not to be logically independent of other quantifiers. Finally, feature 15 in Table 4 indicates a high correlation between the second quantified expression's c-commanding the first and the second quantifier's taking wide scope. We can easily see this as a translation into our feature set of Kuno, Takami, and Wu's claim that subjects tend to outscope objects and obliques, and topicalized elements tend to take wide scope. Some of these top-ranked features have to do with information found only in the written medium, but on the whole the features induced by the naive Bayes classifier seem consistent with those suggested by Kuno, Takami, and Wu, although they are distinct by necessity.

4.3.2 Maximum-Entropy Classifier. The maximum-entropy classifier is a sort of log-linear model, defining the joint probability of a class and a data vector (d_0, ..., d_n) as the product of the prior probability of the class c with a set of features related to the data:[10]

P(d_0, \ldots, d_n, c) = \frac{P(c)}{Z} \prod_{k=0}^{n} \alpha_k   (3)

This classifier superficially resembles in form the naive Bayes classifier in equation (2), but it differs from that classifier in that the way in which values for each α are chosen does not assume that the features in the data are independent. For each of the 10 training sets, we used the generalized iterative scaling algorithm to train this classifier on 654 training examples, using 150 examples for validation to choose the best set of

10 Z in equation (3) is simply a normalizing constant that ensures that we end up with a probability distribution.


Table 5
Performance of the maximum-entropy classifier, summed over all 10 test runs.

Condition                 Correct   Incorrect   Percentage correct
First has wide scope        148        133      148/281 = 52.7
Second has wide scope        31         33        31/64 = 48.4
No scope interaction        475         70      475/545 = 87.2
Total                       654        236      654/890 = 73.5

Table 6
Most active features from maximum-entropy classifier.

Rank   Feature                                               Predicted class   α_{c·2.5}
1      Second quantifier is each                                     2           1.13
2      There is an intervening comma                                 0           1.01
3      There is an intervening conjunct node                         0           1.00
4      First quantified NP does not c-command the second             0           0.99
5      Second quantifier is every                                    2           0.98
6      There is an intervening quotation mark ('')                   0           0.95
7      There is an intervening colon                                 0           0.95
...
12     First quantified NP c-commands the second                     1           0.92
25     There is no intervening comma                                 1           0.90

values for the αs.[11] Test data could then be classified by choosing the class for the data that maximizes the joint probability in equation (3).
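Concretely, once the weights have been set, classification under equation (3) amounts to the comparison sketched below in Python. The layout of the weight table and the treatment of inactive features (contributing α = 1) are our own assumptions, and the normalizing constant Z can be ignored, since it is the same for every class given a fixed datum.

import math

def maxent_classify(features, priors, alphas):
    """Choose the class maximizing P(c)/Z * prod_k alpha_k (equation (3)).
    alphas[c] maps active (feature, value) pairs to their weights for class c."""
    best, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for fv in features.items():
            score += math.log(alphas[c].get(fv, 1.0))   # inactive features contribute alpha = 1
        if score > best_score:
            best, best_score = c, score
    return best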

The results of training with the maximum-entropy classifier are shown in Table 5. The classifier showed slightly higher performance than the naive Bayes classifier, with the lowest error rate on the class of sentences having no scope interaction.

To determine exactly which features of the data the maximum-entropy classifier sees as relevant to the classification problem, we can simply look at the α values (from equation (3)) for each feature. Those features with higher values for α are weighted more heavily in determining the proper scoping. Some of the features with the highest values for α are listed in Table 6. Because of the way the classifier is built, predictor features for class 2 need to have higher loadings to overcome the lower prior probability of the class. Therefore, we actually rank the features in Table 6 according to α · P(c)^k (which we denote as α_{c·k}). P(c) represents the empirical prior probability of a class c, and k is simply a constant (2.5 in this case) chosen to try to get a mix of features for different classes at the top of the list.

The features ranked first and fifth in Table 6 express lexical preferences for certain quantifiers to take wide scope, even when they are the second of the two quantifiers according to linear order in the string of words. The tendency for each to take wide scope is stronger than for the other quantifier, which is in line with Kuno, Takami, and Wu's decision to list it as the only quantifier with a lexical preference for scoping. Feature 2 makes the "no scope interaction" class more likely if a comma intervenes, and

11 Overtraining is not a problem with the pure version of the generalized iterative scaling algorithm. For efficiency reasons, however, we chose to take the training corpus as representative of the event space, rather than enumerating the space exhaustively (see Jelinek [1998] for details). For this reason it was necessary to employ validation in training.


Table 7
Performance of the single-layer perceptron, summed over all 10 test runs.

Condition                 Correct   Incorrect   Percentage correct
First has wide scope        182         99      182/281 = 64.8
Second has wide scope        35         29        35/64 = 54.7
No scope interaction        468         77      468/545 = 85.9
Total                       685        205      685/890 = 77.0

feature 25 makes a wide-scope reading for the first quantifier more likely if there is no intervening comma. The third-ranked feature expresses the tendency, mentioned above, for quantifiers in conjoined clauses not to interact. Features 4 and 12 indicate that if the first quantified expression c-commands the second, it is likely to take wide scope, and that if this is not the case, there is likely to be no scope interaction. Finally, the sixth- and seventh-ranked features in the table show that an intervening quotation mark or colon will make the classifier tend toward class 0, "no scope interaction," which is easy to understand: quotations are often opaque to quantifier scope interactions. The top features found by the maximum-entropy classifier largely coincide with those found by the naive Bayes model, which indicates that these generalizations are robust and objectively present in the data.

4.3.3 Single-Layer Perceptron. For our neural network classifier, we employed a feed-forward, single-layer perceptron, with the softmax function used to determine the activation of nodes at the output layer, because this is a one-of-n classification task (Bridle 1990). The data to be classified are presented as a vector of features at the input layer, and the output layer has three nodes, representing the three possible classes for the data: "first has wide scope," "second has wide scope," and "no scope interaction." The output node with the highest activation is interpreted as the class of the datum presented at the input layer.

For each of the 10 test sets of 89 examples, we trained the connection weights of the network using error backpropagation on 654 training sentences, reserving 150 sentences for validation in order to choose the weights from the training epoch with the highest classification performance. In Table 7 we present the results of the single-layer neural network in classifying our test sentences. As the table shows, the single-layer perceptron has much better classification performance than the naive Bayes classifier and maximum-entropy model, possibly because the training of the network aims to minimize error in the activation of the classification output nodes, which is directly related to the classification task at hand, whereas the other models do not directly make use of the notion of "classification error." The perceptron also uses a sort of weighted voting and could be interpreted as an implementation of Kuno, Takami, and Wu's proposal for scope determination. This clearly illustrates that the tenability of their proposal hinges on the exact details of its implementation, since all of our classifier models are reasonable interpretations of their approach, but they have very different performance results on our scope determination task.
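As a rough illustration of this architecture (not the authors' implementation), the following Python/NumPy sketch trains a single-layer softmax network by stochastic gradient descent on the cross-entropy error; the learning rate, epoch count, and binary feature encoding are arbitrary choices, and the validation-based epoch selection described above is omitted.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_perceptron(X, y, n_classes=3, lr=0.1, epochs=100):
    """Single-layer feed-forward network with softmax outputs.
    X: (n_examples, n_features) feature matrix; y: class labels 0, 1, 2."""
    W = np.zeros((n_classes, X.shape[1]))
    b = np.zeros(n_classes)
    for _ in range(epochs):
        for x, c in zip(X, y):
            p = softmax(W @ x + b)
            p[c] -= 1.0                 # gradient of cross-entropy at the output layer
            W -= lr * np.outer(p, x)    # error backpropagation for a single layer
            b -= lr * p
    return W, b

def classify(W, b, x):
    return int(np.argmax(softmax(W @ x + b)))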

To determine exactly which features of the data the network sees as relevant to the classification problem, we can simply look at the connection weights for each feature-class pair. Higher connection weights indicate a greater correlation between input features and output classes. For one of the 10 networks we trained, some of the features with the highest connection weights are listed in Table 8. Since class 0 is


Table 8
Most active features from single-layer perceptron.

Rank   Feature                                             Predicted class   Weight
1      There is an intervening comma                              0           4.31
2      Second quantifier is all                                   0           3.77
3      There is an intervening colon                              0           2.98
4      There is an intervening conjunct node                      0           2.72
...
17     The first quantified NP c-commands the second              1           1.69
18     Second quantifier is tagged RBS                            2           1.69
19     There is an intervening S node                             1           1.61
20     Second quantifier is each                                  2           1.50

simply more frequent in the training data than the other two classes, the weights for this class tend to be higher. Therefore, we also list some of the best predictor features for classes 1 and 2 in the table.

The first- and third-ranked features in Table 8 show that an intervening comma or colon will make the classifier tend toward class 0, "no scope interaction." This finding by the classifier is similar to the maximum-entropy classifier's finding an intervening quotation mark relevant, and can be taken as an indication that quantifiers in distant syntactic subdomains are unlikely to interact. Similarly, the fourth-ranked feature indicates that quantifiers in separate conjuncts are unlikely to interact. The second-ranked feature in the table expresses a tendency for there to be no scope interaction between two quantifiers if the second of them is headed by all. This may be related to the independence of universal quantifiers (∀x∀y[φ] ⟺ ∀y∀x[φ]). Feature 17 in Table 8 indicates a high correlation between the first quantified expression's c-commanding the second and the first quantifier's taking wide scope, which again supports Kuno, Takami, and Wu's claim that scope preferences are related to syntactic superiority relations. Feature 18 expresses a preference for a quantified expression headed by most to take wide scope, even if it is the second of the two quantifiers (since most is the only quantifier in the corpus that bears the tag RBS). Feature 19 indicates that the first quantifier is more likely to take wide scope if there is a clause boundary intervening between the two quantifiers, which supports the notion that the syntactic distance between the quantifiers is relevant to scope preferences. Finally, feature 20 expresses the well-known tendency for quantified expressions headed by each to take wide scope.

4.4 Summary of Results
Table 9 summarizes the performance of the quantifier scope models we have presented here. All of the classifiers have test set accuracy above the baseline, which a paired t-test reveals to be significant at the .001 level. The differences between the naive Bayes, maximum-entropy, and single-layer perceptron classifiers are not statistically significant.

The classifiers performed significantly better on those sentences annotated consistently by both human coders at the beginning of the study, reinforcing the view that this subset of the data is somehow simpler and more representative of the basic regularities in scope preferences. For example, the single-layer perceptron classified 82.9% of these sentences correctly. To further investigate the nature of the variation between the two coders, we constructed a version of our single-layer network that was trained


Table 9
Summary of classifier results.

                           Training data   Validation data   Test data
Baseline                        —                 —             61.2
Naive Bayes                    76.7               —             72.6
Maximum entropy                78.3              75.5           73.5
Single-layer perceptron        84.7              76.8           77.0

on the data on which both coders agreed and tested on the remaining sentences. This classifier agreed with the reference coding (the coding of the first coder) 51.4% of the time, and with the additional independent coder 35.8% of the time. The first coder constructed the annotation guidelines for this project and may have been more successful in applying them consistently. Alternatively, it is possible that different individuals use different strategies in determining scope preferences, and the strategy of the second coder may simply have been less similar than the strategy of the first coder to that of the single-layer network.

These three classifiers directly implement a sort of weighted voting, the method of aggregating evidence proposed by Kuno, Takami, and Wu (although the classifiers' implementation is slightly more sophisticated than the unweighted voting that is actually used in Kuno, Takami, and Wu's paper). Of course, since we do not use exactly the set of features suggested by Kuno, Takami, and Wu, our model should not be seen as a straightforward implementation of the theory outlined in their 1999 paper. Nevertheless, the results in Table 9 suggest that Kuno, Takami, and Wu's suggested design can be used with some success in modeling scope preferences. Moreover, the project undertaken here provides an answer to some of the objections that Aoun and Li (2000) raise to Kuno, Takami, and Wu. Aoun and Li claim that Kuno, Takami, and Wu's choice of experts is seemingly arbitrary and that it is unclear how the voting weights of each expert are to be set, but the machine learning approach we employ in this article is capable of addressing both of these potential problems. Supervised training of our classifiers is a straightforward approach to setting the weights and also constitutes our approach to selecting features (or "experts," in Kuno, Takami, and Wu's terminology). In the training process, any feature that is irrelevant to scoping preferences should receive weights that make its effect negligible.

5. Syntax and Scope

In this section we show how the classifier models of quantifier scope determination introduced in Section 4 may be integrated with a PCFG model of syntax. We compare two different ways in which the two components may be combined, which may loosely be termed serial and parallel, and argue for the latter on the basis of empirical results.

5.1 Modular Design
Our use of a phrase structure syntactic component and a quantifier scope component to define a combined language model is simplified by the fact that our classifiers are probabilistic and define a conditional probability distribution over quantifier scopings. The probability distributions that our classifiers define for quantifier scope structures are conditional on syntactic phrase structure, because they are computed on the basis


of syntactically provided features, such as the number of nodes of a certain type that intervene between two quantifiers in a phrase structure tree.

Thus, the combined language model that we define in this article assigns probabilities according to the pairs of structures that may be assigned to a sentence by the Q-structure and phrase structure syntax modules. The probability of a word string w_{1...n} is therefore defined as in equation (4), where Q ranges over all possible Q-structures in the set Q and S ranges over all possible syntactic structures in the set S.

P(w_{1 \ldots n}) = \sum_{S \in \mathcal{S},\ Q \in \mathcal{Q}} P(S, Q \mid w_{1 \ldots n})   (4)

              = \sum_{S \in \mathcal{S},\ Q \in \mathcal{Q}} P(S \mid w_{1 \ldots n})\, P(Q \mid S, w_{1 \ldots n})   (5)

Equation (5) shows how we can use the definition of conditional probability to break our calculation of the language model probability into two parts. The first of these parts, P(S | w_{1...n}), which we may abbreviate as simply P(S), is the probability of a particular syntactic tree structure's being assigned to a particular word string. We model this probability using a probabilistic phrase structure grammar (cf. Charniak [1993, 1996]). The second distribution on the right side of equation (5) is the conditional probability of a particular quantifier scope structure's being assigned to a particular word string, given the syntactic structure of that string. This probability is written as P(Q | S, w_{1...n}), or simply P(Q | S), and represents the quantity we estimated above in constructing classifiers to predict the scopal representation of a sentence based on aspects of its syntactic structure.

Thus, given a PCFG model of syntactic structure and a probabilistically defined classifier of the sort introduced in Section 4, it is simple to determine the probability of any pairing of two particular structures from each domain for a given sentence: we simply multiply the values of P(S) and P(Q | S) to obtain the joint probability P(Q, S). In the current section we examine two different models of combination for these components: one in which scope determination is applied to the optimal syntactic structure (the Viterbi parse), and one in which optimization is performed in the space of both modules to find the optimal pairing of syntactic and quantifier scope structures.
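In code, the joint score of a candidate pairing is simply the product of the two module probabilities. The one-line Python sketch below assumes hypothetical pcfg_prob and scope_prob functions standing in for the syntactic module and the scope classifier; it is only an illustration of equation (5), not part of the implemented system.

def pairing_probability(tree, scoping, sentence, pcfg_prob, scope_prob):
    # P(Q, S | w) = P(S | w) * P(Q | S, w), as in equation (5).
    return pcfg_prob(tree, sentence) * scope_prob(scoping, tree, sentence)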

5.2 The Syntactic Module
Before turning to the application of our multimodular approach to the problem of scope determination in Section 5.3, we present here a short overview of the phrase structure syntactic component used in these projects. As noted above, we model syntax as a probabilistic phrase structure grammar (PCFG), and in particular we use a treebank grammar (Charniak 1996) trained on the Penn Treebank.

A PCFG defines the probability of a string of words as the sum of the probabilities of all admissible phrase structure parses (trees) for that string. The probability of a given tree is the product of the probability of all of the rule instances used in the construction of that tree, where rules take the form N → φ, with N a nonterminal symbol and φ a finite sequence of one or more terminals or nonterminals.

To take an example, Figure 5 illustrates a phrase structure tree for the sentence Susan might not believe you, which is admissible according to the grammar in Table 10. (All of the minimal subtrees in Figure 5 are instances of one of our rules.) The probability


Figure 5
A simple phrase structure tree.

Table 10
A simple probabilistic phrase structure grammar.

Rule                 Probability
S → NP VP                .7
S → VP                   .2
S → V NP VP              .1
VP → V VP                .3
VP → ADV VP              .1
VP → V                   .1
VP → V NP                .3
VP → V NP NP             .2
NP → Susan               .3
NP → you                 .4
NP → Yves                .3
V → might                .2
V → believe              .3
V → show                 .3
V → stay                 .2
ADV → not                .5
ADV → always             .5


of this tree, which we can indicate as τ, can be calculated as in equation (6), where R(τ) is the set of rule instances used in the construction of τ:

P(\tau) = \prod_{\xi \in R(\tau)} P(\xi)   (6)

        = P(\mathrm{S} \to \mathrm{NP\ VP}) \times P(\mathrm{VP} \to \mathrm{V\ VP}) \times P(\mathrm{VP} \to \mathrm{ADV\ VP})
          \times P(\mathrm{VP} \to \mathrm{V\ NP}) \times P(\mathrm{NP} \to \mathrm{Susan}) \times P(\mathrm{V} \to \mathrm{might})
          \times P(\mathrm{ADV} \to \mathrm{not}) \times P(\mathrm{V} \to \mathrm{believe}) \times P(\mathrm{NP} \to \mathrm{you})   (7)

        = .7 \times .3 \times .1 \times .3 \times .3 \times .2 \times .5 \times .3 \times .4 = 2.268 \times 10^{-5}   (8)
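A small Python sketch of this computation, using just the rule probabilities of Table 10 that the example tree requires, reproduces the value in (8). The nested-tuple encoding of the tree is our own representation, not a format used in the article.

# Rule probabilities from Table 10 (only those needed for the example tree).
RULES = {
    ("S", ("NP", "VP")): 0.7,
    ("VP", ("V", "VP")): 0.3,
    ("VP", ("ADV", "VP")): 0.1,
    ("VP", ("V", "NP")): 0.3,
    ("NP", ("Susan",)): 0.3,
    ("V", ("might",)): 0.2,
    ("ADV", ("not",)): 0.5,
    ("V", ("believe",)): 0.3,
    ("NP", ("you",)): 0.4,
}

def tree_prob(tree):
    # Probability of a tree: the product over all rule instances used to build it.
    if isinstance(tree, str):          # a terminal contributes no rule of its own
        return 1.0
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULES[(label, rhs)]
    for c in children:
        p *= tree_prob(c)
    return p

# The tree of Figure 5 for "Susan might not believe you".
tree = ("S", ("NP", "Susan"),
             ("VP", ("V", "might"),
                    ("VP", ("ADV", "not"),
                           ("VP", ("V", "believe"), ("NP", "you")))))
print(tree_prob(tree))   # 2.268e-05, matching equation (8)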

The actual grammar rules and associated probabilities that we use in defining our syntactic module are derived from the WSJ corpus of the Penn Treebank by maximum-likelihood estimation. That is, for each rule N → φ used in the treebank, we add the rule to the grammar and set its probability to

\frac{C(N \to \phi)}{\sum_{\psi} C(N \to \psi)},

where C(·) denotes the "count" of a rule (i.e., the number of times it is used in the corpus). A grammar composed in this manner is referred to as a treebank grammar, because its rules are directly derived from those in a treebank corpus.

We used sections 00-20 of the WSJ corpus of the Penn Treebank for collecting the rules and associated probabilities of our PCFG, which is implemented as a bottom-up chart parser. Before constructing the grammar, the treebank was preprocessed using known procedures (cf. Krotov et al. [1998]; Belz [2001]) to facilitate the construction of a rule list. Functional and anaphoric annotations (basically anything following a "-" in a node label; cf. Santorini [1990]; Bies et al. [1995]) were removed from nonterminal labels. Nodes that dominate only "empty categories," such as traces, were removed. In addition, unary-branching constructions were removed by replacing the mother category in such a structure with the daughter node. (For example, given an instance of the rule X → Y Z, if the daughter category Y were expanded by the unary rule Y → W, our algorithm would induce the single rule X → W Z.) Finally, we discarded all rules that had more than 10 symbols on the right-hand side (an arbitrary limit of our parser implementation). This resulted in a total of 18,820 rules, of which 11,156 were discarded as hapax legomena, leaving 7,664 rules in our treebank grammar. Table 11 shows some of the rules in our grammar with the highest and lowest corpus counts.
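The maximum-likelihood estimation itself is straightforward. The following Python sketch (our own, operating over a hypothetical list of (lhs, rhs) rule instances harvested from preprocessed trees) computes relative-frequency probabilities and drops hapax legomena; whether the surviving rules' probabilities were renormalized after discarding is an assumption here, since the article does not say.

from collections import Counter

def treebank_grammar(rule_instances, min_count=2):
    """Estimate P(N -> phi) = C(N -> phi) / sum_psi C(N -> psi), discarding rules
    seen fewer than min_count times (hapax legomena)."""
    counts = Counter(rule_instances)
    kept = {r: c for r, c in counts.items() if c >= min_count}
    lhs_totals = Counter()
    for (lhs, _), c in kept.items():
        lhs_totals[lhs] += c
    return {r: c / lhs_totals[r[0]] for r, c in kept.items()}

# Toy example (not real treebank counts):
rules = [("PP", ("IN", "NP"))] * 3 + [("NP", ("DT", "NN"))] * 2 + [("NP", ("NP", "PP"))]
print(treebank_grammar(rules))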

5.3 Unlabeled Scope Determination

In this section we describe an experiment designed to assess the performance of parallel and serial approaches to combining grammatical modules, focusing on the task of unlabeled scope determination. This task involves predicting the most likely Q-structure representation for a sentence, basically the same task we addressed in Section 4 in comparing the performance levels of each type of classifier. The experiment of this section differs, however, from the task presented in Section 4 in that instead of providing a syntactic tree from the Penn Treebank as input to the classifier, we provide the model only with a string of words (a sentence). Our dual-component model will search for the optimal syntactic and scopal structures for the sentence (the pairing (τ*, χ*)) and will be evaluated based on its success in identifying the correct scope reading χ*.

Our concern in this section will be to determine whether it is necessary to search the space of possible pairings (τ, χ) of syntactic and scopal structures, or whether it is sufficient to use our PCFG first to fix the syntactic tree τ and then to choose



Table 11
Rules derived from sections 00–20 of the Penn Treebank WSJ corpus. "TOP" is a special "start" symbol that may expand to any of the symbols found at the root of a tree in the corpus.

Rule                            Corpus count

PP → IN NP                            59,053
TOP → S                               34,614
NP → DT NN                            28,074
NP → NP PP                            25,192
S → NP VP                             14,032
S → NP VP                             12,901
VP → TO VP                            11,598

S → CC PP NNP NNP VP                       2
NP → DT `` NN NN NN ''                     2
NP → NP PP PP PP PP PP                     2
INTJ → UH UH                               2
NP → DT `` NN NNS                          2
SBARQ → `` WP VP ''                        2
S → PP NP VP ''                            2

a scope reading to maximize the probability of the pairing. That is, are syntax and quantifier scope mutually dependent components of grammar, or can scope relations be "read off of" syntax? The serial model suggests that the optimal syntactic structure τ* should be chosen on the basis of the syntactic module only, as in equation (9), and the optimal quantifier scope structure χ* then chosen on the basis of τ*, as in equation (10). The parallel model, on the other hand, suggests that the most likely pairing of structures must be chosen in the joint probability space of both components, as in equation (11).

τ* = argmax_{τ ∈ S} P_S(τ | w_{1…n})                                          (9)

χ* = argmax_{χ ∈ Q} P_Q(χ | τ*, w_{1…n})                                      (10)

τ* = {τ | (τ, χ) = argmax_{τ ∈ S, χ ∈ Q} P_S(τ | w_{1…n}) P_Q(χ | τ, w_{1…n})}  (11)

5.3.1 Experimental Design. For this experiment we implement the scoping component as a single-layer feed-forward network, because the single-layer perceptron classifier had the best prediction rate among the three classifiers tested in Section 4. The softmax activation function we use for the output nodes of the classifier guarantees that the activations of all of the output nodes sum to one and can be interpreted as class probabilities. The syntactic component, of course, is determined by the treebank PCFG grammar described above.

Given these two models, which respectively define P_Q(χ | τ, w_{1…n}) and P_S(τ | w_{1…n}) from equation (11), it remains only to specify how to search the space of pairings (τ, χ) in performing this optimization to find χ*. Unfortunately, it is not feasible to examine all values τ ∈ S, since our PCFG will generally admit a huge number of



Table 12
Performance of models on the unlabeled scope prediction task, summed over all 10 test runs

Condition                 Correct   Incorrect   Percentage correct

Parallel model
  First has wide scope        168         113      168/281 = 59.8%
  Second has wide scope        26          38       26/64 = 40.6%
  No scope interaction        467          78      467/545 = 85.7%
  Total                       661         229      661/890 = 74.3%

Serial model
  First has wide scope        163         118      163/281 = 58.0%
  Second has wide scope        27          37       27/64 = 42.2%
  No scope interaction        461          84      461/545 = 84.6%
  Total                       651         239      651/890 = 73.1%

trees for a sentence (especially given a mean sentence length of over 20 words in the WSJ corpus).12 Our solution to this search problem is to make the simplifying assumption that the syntactic tree that is used in the optimal set of structures (τ*, χ*) will always be among the top few trees τ for which P_S(τ | w_{1…n}) is the greatest. That is, although we suppose that quantifier scope information is relevant to parsing, we do not suppose that it is so strong a determinant as to completely override syntactic factors. In practice, this means that our parser will return the top 10 parses for each sentence, along with the probabilities assigned to them, and these are the only parses that are considered in looking for the optimal set of linguistic structures.

We again used 10-fold cross-validation in evaluating the competing models, dividing the scope-tagged corpus into 10 test sections of 89 sentences each, and we used the same version of the treebank grammar for our PCFG. The first model retrieved the top 10 syntactic parses (τ_0, …, τ_9) for each sentence and computed the probability P(τ, χ) for each τ ∈ {τ_0, …, τ_9}, χ ∈ {0, 1, 2}, choosing that scopal representation χ that was found in the maximum-probability pairing. We call this the parallel model, because the properties of each probabilistic model may influence the optimal structure chosen by the other. The second model retrieved only the Viterbi parse τ_0 from the PCFG and chose the scopal representation χ for which the pairing (τ_0, χ) took on the highest probability. We call this the serial model, because it represents syntactic phrase structure as independent of other components of grammar (in this case, quantifier scope), though other components are dependent upon it.
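The two decision procedures can be spelled out as in the following sketch, our own illustration of equations (9)–(11); parser and scope_model are hypothetical stand-ins for the treebank PCFG and the trained scope classifier, and their method names are assumptions rather than the actual interfaces used in the study.

def parallel_scope(sentence, parser, scope_model, n_best=10):
    """Search the top-n parses jointly with the three scope readings
    (equation (11)) and return the reading from the best pairing."""
    best_prob, best_reading = -1.0, None
    for tree, p_syn in parser.nbest_parses(sentence, n_best):      # P_S(tau | w)
        for reading in (0, 1, 2):
            p = p_syn * scope_model.prob(reading, tree, sentence)  # P_Q(chi | tau, w)
            if p > best_prob:
                best_prob, best_reading = p, reading
    return best_reading

def serial_scope(sentence, parser, scope_model):
    """Fix the Viterbi parse first (equation (9)), then choose the most
    probable scope reading given that single tree (equation (10))."""
    tree, _ = parser.nbest_parses(sentence, 1)[0]
    return max((0, 1, 2),
               key=lambda reading: scope_model.prob(reading, tree, sentence))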

5.3.2 Results. There was an appreciable difference in performance between these two models on the quantifier scope test sets. As shown in Table 12, the parallel model narrowly outperformed the serial model, by 1.2%. A 10-fold paired t-test on the test sections of the scope-tagged corpus shows that the parallel model is significantly better (p < .05).

12 Since we are allowing χ to range only over the three scope readings (0, 1, 2), however, it is possible to enumerate all values of χ to be paired with a given syntactic tree τ.



This result suggests that in determining the syntactic structure of a sentence, we must take aspects of structure into account that are not purely syntactic (such as quantifier scope). Searching both dimensions of the hypothesis space for our dual-component model allowed the composite model to handle the interdependencies between different aspects of grammatical structure, whereas fixing a phrase structure tree purely on the basis of syntactic considerations led to suboptimal performance in using that structure as a basis for determining quantifier scope.

6 Conclusion

In this article we have taken a statistical, corpus-based approach to the modeling of quantifier scope preferences, a subject that has previously been addressed only with systems of ad hoc rules derived from linguists' intuitive judgments. Our model takes its theoretical inspiration from Kuno, Takami, and Wu (1999), who suggest an "expert system" approach to scope preferences, and follows many other projects in the machine learning of natural language that combine information from multiple sources in solving linguistic problems.

Our results are generally supportive of the design that Kuno, Takami, and Wu propose for the quantifier scope component of grammar, and some of the features induced by our models find clear parallels in the factors that Kuno, Takami, and Wu claim to be relevant to scoping. In addition, our final experiment, in which we combine our quantifier scope module with a PCFG model of syntactic phrase structure, provides evidence of a grammatical architecture in which different aspects of structure mutually constrain one another. This result casts doubt on approaches in which syntactic processing is completed prior to the determination of other grammatical properties of a sentence, such as quantifier scope relations.

Appendix: Selected Codes Used to Annotate Syntactic Categories in the Penn Treebank, from Marcus et al. (1993) and Bies et al. (1995)

Part-of-speech tags

Tag     Meaning
CC      Conjunction
CD      Cardinal number
DT      Determiner
IN      Preposition
JJ      Adjective
JJR     Comparative adjective
JJS     Superlative adjective
NN      Singular or mass noun
NNS     Plural noun
NNP     Singular proper noun
NNPS    Plural proper noun
PDT     Predeterminer
PRP     Personal pronoun
PRP$    Possessive pronoun
RB      Adverb
RBR     Comparative adverb
RBS     Superlative adverb
TO      "to"
UH      Interjection
VB      Verb in base form
VBD     Past-tense verb
VBG     Gerundive verb
VBN     Past participial verb
VBP     Non-3sg present-tense verb
VBZ     3sg present-tense verb
WP      WH pronoun
WP$     Possessive WH pronoun



Phrasal categories

Code    Meaning
ADJP    Adjective phrase
ADVP    Adverb phrase
INTJ    Interjection
NP      Noun phrase
PP      Prepositional phrase
QP      Quantifier phrase (i.e., measure/amount phrase)
S       Declarative clause
SBAR    Clause introduced by a subordinating conjunction
SBARQ   Clause introduced by a WH phrase
SINV    Inverted declarative sentence
SQ      Inverted yes/no question following the WH phrase in SBARQ
VP      Verb phrase

Acknowledgments
The authors are grateful for an Academic Technology Innovation Grant from the University of Chicago, which helped to make this work possible, and to John Goldsmith, Terry Regier, Anne Pycha, and Bob Moore, whose advice and collaboration have considerably aided the research reported in this article. Any remaining errors are, of course, our own.

References

Aoun, Joseph and Yen-hui Audrey Li. 1993. The Syntax of Scope. MIT Press, Cambridge.
Aoun, Joseph and Yen-hui Audrey Li. 2000. Scope, structure, and expert systems: A reply to Kuno et al. Language, 76(1):133–155.
Belz, Anja. 2001. Optimisation of corpus-derived probabilistic grammars. In Proceedings of Corpus Linguistics 2001, pages 46–57.
Berger, Adam L., Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.
Bies, Ann, Mark Ferguson, Karen Katz, and Robert MacIntyre. 1995. Bracketing guidelines for Treebank II style. Technical report, Penn Treebank Project, University of Pennsylvania.
Bishop, Christopher M. 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bridle, John S. 1990. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fougelman-Soulie and J. Herault, editors, Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag, Berlin, pages 227–236.
Brill, Eric. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565.
Carden, Guy. 1976. English Quantifiers: Logical Structure and Linguistic Variation. Academic Press, New York.
Carletta, Jean. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249–254.
Charniak, Eugene. 1993. Statistical Language Learning. MIT Press, Cambridge.
Charniak, Eugene. 1996. Tree-bank grammars. In AAAI/IAAI, vol. 2, pages 1031–1036.
Cohen, Jacob. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46.
Cooper, Robin. 1983. Quantification and Syntactic Theory. Reidel, Dordrecht.
Fleiss, Joseph L. 1981. Statistical Methods for Rates and Proportions. John Wiley & Sons, New York.
Hobbs, Jerry R. and Stuart M. Shieber. 1987. An algorithm for generating quantifier scopings. Computational Linguistics, 13:47–63.
Hornstein, Norbert. 1995. Logical Form: From GB to Minimalism. Blackwell, Oxford and Cambridge.
Jelinek, Frederick. 1998. Statistical Methods for Speech Recognition. MIT Press, Cambridge.
Jurafsky, Daniel and James H. Martin. 2000. Speech and Language Processing. Prentice Hall, Upper Saddle River, New Jersey.
Krippendorff, Klaus. 1980. Content Analysis: An Introduction to Its Methodology. Sage Publications, Beverly Hills, California.
Krotov, Alexander, Mark Hepple, Robert J. Gaizauskas, and Yorick Wilks. 1998. Compacting the Penn treebank grammar. In COLING-ACL, pages 699–703.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 1999. Quantifier scope in English, Chinese, and Japanese. Language, 75(1):63–111.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 2001. Response to Aoun and Li. Language, 77(1):134–143.
Manning, Christopher D. and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge.
Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
Martin, Paul, Douglas Appelt, and Fernando Pereira. 1986. Transportability and generality in a natural-language interface system. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 585–593.
May, Robert. 1985. Logical Form: Its Structure and Derivation. MIT Press, Cambridge.
McCawley, James D. 1998. The Syntactic Phenomena of English. University of Chicago Press, Chicago, second edition.
Mitchell, Tom M. 1996. Machine Learning. McGraw Hill, New York.
Montague, Richard. 1973. The proper treatment of quantification in ordinary English. In J. Hintikka et al., editors, Approaches to Natural Language. Reidel, Dordrecht, pages 221–242.
Moran, Douglas B. 1988. Quantifier scoping in the SRI core language engine. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics (ACL'88), pages 33–40.
Moran, Douglas B. and Fernando C. N. Pereira. 1992. Quantifier scoping. In Hiyan Alshawi, editor, The Core Language Engine. MIT Press, Cambridge, pages 149–172.
Nerbonne, John. 1993. A feature-based syntax/semantics interface. In A. Manaster-Ramer and W. Zadrozsny, editors, Annals of Mathematics and Artificial Intelligence (Special Issue on Mathematics of Language), 8(1–2):107–132. Also published as DFKI Research Report RR-92-42.
Park, Jong C. 1995. Quantifier scope and constituency. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL'95), pages 205–212.
Pedersen, Ted. 2000. A simple approach to building ensembles of naïve Bayesian classifiers for word sense disambiguation. In Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2000), pages 63–69.
Pereira, Fernando. 1990. Categorial semantics and scoping. Computational Linguistics, 16(1):1–10.
Pollard, Carl. 1989. The syntax-semantics interface in a unification-based phrase structure grammar. In S. Busemann, C. Hauenschild, and C. Umbach, editors, Views of the Syntax/Semantics Interface. KIT-FAST Report 74, Technical University of Berlin, pages 167–185.
Pollard, Carl and Eun Jung Yoo. 1998. A unified theory of scope for quantifiers and WH-phrases. Journal of Linguistics, 34(2):415–446.
Ratnaparkhi, Adwait. 1997. A simple introduction to maximum entropy models for natural language processing. Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania.
Santorini, Beatrice. 1990. Part-of-speech tagging guidelines for the Penn Treebank project. Technical Report MS-CIS-90-47, Department of Computer and Information Science, University of Pennsylvania.
Soon, Wee Meng, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544.
van Halteren, Hans, Jakub Zavrel, and Walter Daelemans. 2001. Improving accuracy in word class tagging through the combination of machine learning systems. Computational Linguistics, 27(2):199–229.
VanLehn, Kurt A. 1978. Determining the scope of English quantifiers. Technical Report AITR-483, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Cambridge.
Woods, William A. 1986. Semantics and quantification in natural language question answering. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 205–248.




sentence that are possible, without regard to their relative likelihood or naturalness. Recently, however, linguists such as Kuno, Takami, and Wu (1999) have begun to turn their attention to scope prediction, or determining the relative accessibility of different scope readings.

In computational linguistics, more attention has been paid to the factors that determine scope preferences. Systems such as the SRI Core Language Engine (Moran 1988; Moran and Pereira 1992), LUNAR (Woods 1986), and TEAM (Martin, Appelt, and Pereira 1986) have employed scope critics that use heuristics to decide between alternative scopings. However, the rules that these systems use in making quantifier scope decisions are motivated only by the researchers' intuitions, and no empirical results have been published regarding their accuracy.

In this article we use the tools of machine learning to construct a data-driven model of quantifier scope preferences. For theoretical linguistics, this model serves as an illustration that Kuno, Takami, and Wu's approach can capture some of the clearest generalizations about quantifier scoping. For computational linguistics, this article provides a baseline result on the task of scope prediction with which other scope critics can be compared. In addition, it is the most extensive empirical investigation of which we are aware that collects data of any kind regarding the relative frequency of different quantifier scope readings in English text.1

Section 2 briefly discusses treatments of scoping issues in theoretical linguistics, and Section 3 reviews the computational work that has been done on natural language quantifier scope. In Section 4 we introduce the models that we use to predict quantifier scoping, as well as the data on which they are trained and tested. Section 5 combines the scope model of the previous section with a probabilistic context-free grammar (PCFG) model of syntax and addresses the issue of whether these two modules of grammar ought to be combined in serial, with information from the syntax feeding the quantifier scope module, or in parallel, with each module constraining the structures provided by the other.

2 Approaches to Quantifier Scope in Theoretical Linguistics

Most, if not all, linguistic treatments of quantifier scope have closely integrated it with the way in which the syntactic structure of a sentence is built up. Montague (1973) used a syntactic rule to introduce a quantified expression into a derivation at the point where it was to take scope, whereas generative semantic analyses such as McCawley (1998) represented the scope of quantification at deep structure, transformationally lowering quantifiers into their surface positions during the course of the derivation. More recent work in the interpretive paradigm takes the opposite approach, extracting quantifiers from their surface positions to their scope positions by means of a quantifier-raising (QR) transformation (May 1985; Aoun and Li 1993; Hornstein 1995). Another popular technique is to percolate scope information up through the syntactic tree using Cooper storage (Cooper 1983; Hobbs and Shieber 1987; Pollard 1989; Nerbonne 1993; Park 1995; Pollard and Yoo 1998).

The QR approach to dealing with scope in linguistics consists in the claim that there is a covert transformation applying to syntactic structures that moves quantified elements out of the position in which they are found on the surface and raises them to a higher position that reflects their scope. The various incarnations of the strategy that

1 See Carden (1976), however, for a questionnaire-based approach to gathering data on the accessibility of different quantifier scope readings.



Figure 1
Simple illustration of the QR approach to quantifier scope generation

follows from this claim differ in the precise characterization of this QR transformation, what conditions are placed upon it, and what tree-configurational relationship is required for one operator to take scope over another. The general idea of QR is represented in Figure 1, a schematic analysis of the reading of the sentence Someone saw everyone in which someone takes wide scope (i.e., 'there is some person x such that for all persons y, x saw y').

In the Cooper storage approach, quantifiers are gathered into a store and passed upward through a syntactic tree. At certain nodes along the way, quantifiers may be retrieved from the store and take scope. The relative scope of quantifiers is determined by where each quantifier is retrieved from the store, with quantifiers higher in the tree taking wide scope over lower ones. As with QR, different authors implement this scheme in slightly different ways, but the simplest case is represented in Figure 2, the Cooper storage analog of Figure 1.

These structural approaches, QR and Cooper storage, have in common that they allow syntactic factors to have an effect only on the scope readings that are available for a given sentence. They are also similar in addressing only the issue of scope generation, or identifying all and only the accessible readings for each sentence. That is to say, they do not address the issue of the relative salience of these readings.

Kuno, Takami, and Wu (1999, 2001) propose to model the scope of quantified elements with a set of interacting expert systems that basically consists of a weighted vote taken of the various factors that may influence scope readings. This model is meant to account not only for scope generation but also for "the relative strengths of the potential scope interpretations of a given sentence" (1999, page 63). They illustrate the plausibility of this approach in their paper by presenting a number of examples that are accounted for fairly well by the approach, even when an unweighted vote of the factors is allowed to be taken.

So, for example, in Kuno, Takami, and Wu's (1999) example (49b), repeated here as (2), the correct prediction is made that the sentence is unambiguous, with the first quantified noun phrase (NP) taking wide scope over the second (the reading in which we don't all have to hate the same people). Table 1 illustrates how the votes of each of Kuno, Takami, and Wu's "experts" contribute to this outcome. Since the expression many of us/you receives more votes, and the numbers for the two competing quantified expressions are quite far apart, the first one is predicted to take wide scope unambiguously.

(2) Many of us/you hate some of them.



Figure 2
Simple illustration of the Cooper storage approach to quantifier scope generation

Table 1
Voting to determine optimal scope readings for quantifiers, according to Kuno, Takami, and Wu (1999)

                      many of us/you    some of them
Baseline                    ✓                 ✓
Subject Q                   ✓
Lefthand Q                  ✓
Speaker/Hearer Q            ✓
Total                       4                 1

Some adherents of the structural approaches also seem to acknowledge the necessity of eventually coming to terms with the factors that play a role in determining scope preferences in language. Aoun and Li (2000) claim that the lexical scope preferences of quantifiers "are not ruled out under a structural account" (page 140). It is clear from the surrounding discussion, though, that they intend such lexical requirements to be taken care of in some nonsyntactic component of grammar. Although Kuno, Takami, and Wu's dialogue with Aoun and Li in Language has been portrayed by both sides as a debate over the correct way of modeling quantifier scope, they are not really modeling the same things. Whereas Aoun and Li (1993) provide an account of scope generation, Kuno, Takami, and Wu (1999) intend to model both scope generation and scope prediction. The model of scope preferences provided in this article is an empirically based refinement of the approach taken by Kuno, Takami, and Wu, but in principle it is consistent with a structural account of scope generation.



3 Approaches to Quantifier Scope in Computational Linguistics

Many studies, such as Pereira (1990) and Park (1995), have dealt with the issue of scope generation from a computational perspective. Attempts have also been made in computational work to extend a pure Cooper storage approach to handle scope prediction. Hobbs and Shieber (1987) discuss the possibility of incorporating some sort of ordering heuristics into the SRI scope generation system, in the hopes of producing a ranked list of possible scope readings, but ultimately are forced to acknowledge that "[t]he modifications turn out to be quite complicated if we wish to order quantifiers according to lexical heuristics, such as having each out-scope some. Because of the recursive nature of the algorithm, there are limits to the amount of ordering that can be done in this manner" (page 55). The stepwise nature of these scope mechanisms makes it hard to state the factors that influence the preference for one quantifier to take scope over another.

Those natural language processing (NLP) systems that have managed to provide some sort of account of quantifier scope preferences have done so by using a separate system of heuristics (or scope critics) that apply postsyntactically to determine the most likely scoping. LUNAR (Woods 1986), TEAM (Martin, Appelt, and Pereira 1986), and the SRI Core Language Engine as described by Moran (1988; Moran and Pereira 1992) all employ scope rules of this sort. By and large, these rules are of an ad hoc nature, implementing a linguist's intuitive idea of what factors determine scope possibilities, and no results have been published regarding the accuracy of these methods. For example, Moran (1988) incorporates rules from other NLP systems and from VanLehn (1978), such as a preference for a logically weaker interpretation, the tendency for each to take wide scope, and a ban on raising a quantifier across multiple major clause boundaries. The testing of Moran's system is "limited to checking conformance to the stated rules" (pages 40–41). In addition, these systems are generally incapable of handling unrestricted text such as that found in the Wall Street Journal corpus in a robust way, because they need to do a full semantic analysis of a sentence in order to make scope predictions. The statistical basis of the model presented in this article offers increased robustness and the possibility of more serious evaluation on the basis of corpus data.

4 Modeling Quantifier Scope

In this section we argue for an empirically driven, machine learning approach to the identification of factors relevant to quantifier scope and the modeling of scope preferences. Following much recent work that applies the tools of machine learning to linguistic problems (Brill 1995; Pedersen 2000; van Halteren, Zavrel, and Daelemans 2001; Soon, Ng, and Lim 2001), we will treat the prediction of quantifier scope as an example of a classification task. Our aim is to provide a robust model of scope prediction based on Kuno, Takami, and Wu's theoretical foundation, and to address the serious lack of empirical results regarding quantifier scope in computational work. We describe here the modeling tools borrowed from the field of artificial intelligence for the scope prediction task and the data from which the generalizations are to be learned. Finally, we present the results of training different incarnations of our scope module on the data and assess the implications of this exercise for theoretical and computational linguistics.



4.1 Classification in Machine Learning

Determining which among multiple quantifiers in a sentence takes wide scope, given a number of different sources of evidence, is an example of what is known in machine learning as a classification task (Mitchell 1996). There are many types of classifiers that may be applied to this task that both are more sophisticated than the approach suggested by Kuno, Takami, and Wu and have a more solid probabilistic foundation. These include the naive Bayes classifier (Manning and Schütze 1999; Jurafsky and Martin 2000), maximum-entropy models (Berger, Della Pietra, and Della Pietra 1996; Ratnaparkhi 1997), and the single-layer perceptron (Bishop 1995). We employ these classifier models here primarily because of their straightforward probabilistic interpretation and their similarity to the scope model of Kuno, Takami, and Wu (since they each could be said to implement a kind of weighted voting of factors). In Section 4.3 we describe how classifiers of these types can be constructed to serve as a grammatical module responsible for quantifier scope determination.

All of these classifiers can be trained in a supervised manner. That is, given a sample of training data that provides all of the information that is deemed to be relevant to quantifier scope, along with the actual scope reading assigned to a sentence, these classifiers will attempt to extract generalizations that can be fruitfully applied in classifying as-yet-unseen examples.

4.2 Data

The data on which the quantifier scope classifiers are trained and tested is an extract from the Penn Treebank (Marcus, Santorini, and Marcinkiewicz 1993) that we have tagged to indicate the most salient scope interpretation of each sentence in context. Figure 3 shows an example of a training sentence with the scope reading indicated. The quantifier lower in the tree bears the tag "Q1" and the higher quantifier bears the tag "Q2," so this sentence is interpreted such that the lower quantifier has wide scope. Reversing the tags would have meant that the higher quantifier takes wide scope, while if both quantifiers had been marked "Q1," this would have indicated that there is no scope interaction between them (as when they are logically independent or take scope in different conjuncts of a conjoined phrase).2

The sentences tagged were chosen from the Wall Street Journal (WSJ) section of the Penn Treebank to have a certain set of attributes that simplify the task of designing the quantifier scope module of the grammar. First, in order to simplify the coding process, each sentence has exactly two scope-taking elements of the sort considered for this project.3 These include most NPs that begin with a determiner, predeterminer, or quantifier phrase (QP)4 but exclude NPs in which the determiner is a, an, or the. Ex-

2 This "no interaction" class is a sort of "elsewhere" category that results from phrasing the classification question as "Which quantifier takes wider scope in the preferred reading?" Where there is no scope interaction, the answer is "neither." This includes cases in which the relative scope of operators does not correspond to a difference in meaning, as in One woman bought one horse, or when they take scope in different propositional domains, such as in Mary bought two horses and sold three sheep. The human coders used in this study were instructed to choose class 0 whenever there was not a clear preference for one of the two scope readings.

3 This restriction that each sentence contain only two quantified elements does not actually exclude many sentences from consideration. We identified only 61 sentences with three quantifiers of the sort we consider, and 12 sentences with four. In addition, our review of these sentences revealed that many of them simply involve lists in which the quantifiers do not interact in terms of scope (as in, for example, "We ask that you turn off all cell phones, extinguish all cigarettes, and open any candy before the performance begins"). Thus, the class of sentences with more than two quantifiers is small and seems to involve even simpler quantifier interactions than those found in our corpus.

4 These categories are intended to be understood as they are used in the tagging and parsing of the Penn Treebank. See Santorini (1990) and Bies et al. (1995) for details; the Appendix lists selected codes used



( (S
    (NP-SBJ
      (NP (DT Those) )
      (SBAR
        (WHNP-1 (WP who) )
        (S
          (NP-SBJ-2 (-NONE- *T*-1) )
          (ADVP (RB still) )
          (VP (VBP want)
            (S
              (NP-SBJ (-NONE- *-2) )
              (VP (TO to)
                (VP (VB do)
                  (NP (PRP it) ))))))))
    (`` ``)
    (VP (MD will)
      (ADVP (RB just) )
      (VP (VB find)
        (NP
          (NP (DT-Q2 some) (NN way) )
          (SBAR
            (WHADVP-3 (-NONE- 0) )
            (S
              (NP-SBJ (-NONE- *) )
              (VP (TO to)
                (VP (VB get)
                  (PP (IN around) ('' '')
                    (NP (DT-Q1 any) (NN attempt)
                      (S
                        (NP-SBJ (-NONE- *) )
                        (VP (TO to)
                          (VP (VB curb)
                            (NP (PRP it) ))))))
                  (ADVP-MNR (-NONE- *T*-3) ))))))))
    (. .) ))

Figure 3
Tagged Wall Street Journal text from the Penn Treebank. The lower quantifier takes wide scope, indicated by its tag "Q1."

cluding these determiners from consideration largely avoids the problem of generics and the complexities of assigning scope readings to definite descriptions. In addition, only sentences that had the root node S were considered. This serves to exclude sentence fragments and interrogative sentence types. Our data set therefore differs systematically from the full WSJ corpus, but we believe it is sufficient to allow many generalizations about English quantification to be induced. Given these restrictions on the input data, the task of the scope classifier is a choice among three alternatives:5

(Class 0) There is no scopal interaction

(Class 1) The first quantifier takes wide scope

(Class 2) The second quantifier takes wide scope

for annotating the Penn Treebank corpus. The category QP is particularly unintuitive, in that it does not correspond to a quantified noun phrase but to a measure expression, such as more than half.

5 Some linguists may find it strange that we have chosen to treat the choice of preferred scoping for two quantified elements as a tripartite decision, since the possibility of independence is seldom treated in the linguistic literature. As we are dealing with corpus data in this experiment, we cannot afford to ignore this possibility.



The result is a set of 893 sentences,6 annotated with Penn Treebank II parse trees and hand-tagged for the primary scope reading.

To assess the reliability of the hand-tagged data used in this project, the data were coded a second time by an independent coder, in addition to the reference coding. The independent codings agreed with the reference coding on 76.3% of sentences. The kappa statistic (Cohen 1960) for agreement was .52, with a 95% confidence interval between .40 and .64. Krippendorff (1980) has been widely cited as advocating the view that kappa values greater than .8 should be taken as indicating good reliability, with values between .67 and .8 indicating tentative reliability, but we are satisfied with the level of intercoder agreement on this task. As Carletta (1996) notes, many tasks in computational linguistics are simply more difficult than the content analysis classifications addressed by Krippendorff, and according to Fleiss (1981), kappa values between .4 and .75 indicate fair to good agreement anyhow.
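For reference, the kappa statistic cited here (Cohen 1960) compares the observed proportion of agreement p_o with the agreement p_e expected by chance from the coders' marginal class frequencies:

κ = (p_o − p_e) / (1 − p_e)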

Discussion between the coders revealed that there was no single cause for their differences in judgments, when such differences existed. Many cases of disagreement stem from different assumptions regarding the lexical quantifiers involved. For example, the coders sometimes differed on whether a given instance of the word any corresponds to a narrow-scope existential, as we conventionally treat it when it is in the scope of negation, or the "free-choice" version of any. To take another example, two universal quantifiers are independent in predicate calculus (∀x∀y[φ] ⟺ ∀y∀x[φ]), but in creating our scope-tagged corpus it was often difficult to decide whether two universal-like English quantifiers (such as each, any, every, and all) were actually independent in a given sentence. Some differences in coding stemmed from coder disagreements about whether a quantifier within a fixed expression (e.g., all the hoopla) truly interacts with other operators in the sentence. Of course, another major factor contributing to intercoder variation is the fact that our data sentences, taken from Wall Street Journal text, are sometimes quite long and complex in structure, involving multiple scope-taking operators in addition to the quantified NPs. In such cases the coders sometimes had difficulty clearly distinguishing the readings in question.

Because of the relatively small amount of data we had, we used the technique of tenfold cross-validation in evaluating our classifiers, in each case choosing 89 of the 893 total data sentences from the data as a test set and training on the remaining 804. We preprocessed the data in order to extract the information from each sentence that we would be treating as relevant to the prediction of quantifier scoping in this project. (Although the initial coding of the preferred scope reading for each sentence was done manually, this preprocessing of the data was done automatically.) At the end of this preprocessing, each sentence was represented as a record containing the following information (see the Appendix for a list of annotation codes for the Penn Treebank):

• the syntactic category, according to Penn Treebank conventions, of the first quantifier (e.g., DT for each, NN for everyone, or QP for more than half)

• the first quantifier as a lexical item (e.g., each or everyone); for a QP consisting of multiple words, this field contains the head word, or "CD" in case the head is a cardinal number

• the syntactic category of the second quantifier

• the second quantifier as a lexical item

6 These data have been made publicly available to all licensees of the Penn Treebank by means of a patch file that may be retrieved from ⟨http://humanities.uchicago.edu/linguistics/students/dchiggin/qscope-data.tgz⟩. This file also includes the coding guidelines used for this project.



[ class:               2
  first cat:           DT
  first head:          some
  second cat:          DT
  second head:         any
  join cat:            NP
  first c-commands:    YES
  second c-commands:   NO
  nodes intervening:   6
  VP intervenes:       YES
  ADVP intervenes:     NO
  S intervenes:        YES
  conj intervenes:     NO
  , intervenes:        NO
  : intervenes:        NO
  " intervenes:        YES ]

Figure 4
Example record corresponding to the sentence shown in Figure 3

• the syntactic category of the lowest node dominating both quantified NPs (the "join" node)

• whether the first quantified NP c-commands the second

• whether the second quantified NP c-commands the first

• the number of nodes intervening7 between the two quantified NPs

• a list of the different categories of nodes that intervene between the quantified NPs (actually, for each nonterminal category, there is a distinct binary feature indicating whether a node of that category intervenes)

• whether a conjoined node intervenes between the quantified NPs

• a list of the punctuation types that are immediately dominated by nodes intervening between the two NPs (again, for each punctuation tag in the treebank, there is a distinct binary feature indicating whether such punctuation intervenes)

Figure 4 illustrates how these features would be used to encode the example in Figure 3.
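As an illustration of how two of the structural features above might be computed from a parse tree, the following Python sketch (our own, not the preprocessing code used in the study) checks c-command and collects intervening nodes in the sense of footnote 7, treating tree positions as tuples of child indices (with the root at the empty tuple).

def dominates(a, b):
    """Position a properly dominates position b (positions are tuples of
    child indices; a proper prefix of b's address dominates it)."""
    return len(a) < len(b) and b[:len(a)] == a

def c_commands(a, b, branching_positions):
    """a c-commands b iff neither dominates the other and the lowest
    branching node properly dominating a also dominates b."""
    if dominates(a, b) or dominates(b, a):
        return False
    ancestors = [p for p in branching_positions if dominates(p, a)]
    if not ancestors:                      # a is the root node
        return False
    lowest = max(ancestors, key=len)       # lowest branching ancestor of a
    return dominates(lowest, b)

def intervening_nodes(a, b, positions):
    """Nodes intervening between a and b as defined in footnote 7: with
    delta the lowest node dominating both, a node intervenes if delta
    dominates it (or equals it) and it dominates either a or b.
    positions is the set of all node addresses in the tree."""
    delta = max((p for p in positions if dominates(p, a) and dominates(p, b)),
                key=len)
    return [p for p in positions
            if (dominates(delta, p) or p == delta)
            and (dominates(p, a) or dominates(p, b))]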

The items of information included in the record, as listed above, are not the exact factors that Kuno, Takami, and Wu (1999) suggest be taken into consideration in making scope predictions, and they are certainly not sufficient to determine the proper scope reading for all sentences completely. Surely pragmatic factors and real-world knowledge influence our interpretations as well, although these are not represented here. This list does, however, provide information that could potentially be useful in predicting the best scope reading for a particular sentence. For example, information

7 We take a node α to intervene between two other nodes β and γ in a tree if and only if δ is the lowest node dominating both β and γ, δ dominates α or δ = α, and α dominates either β or γ.



Table 2
Baseline performance, summed over all 10 test sets

Condition                 Correct   Incorrect   Percentage correct

First has wide scope          0         281        0/281 = 0.0%
Second has wide scope         0          64        0/64 = 0.0%
No scope interaction        545           0        545/545 = 100.0%
Total                       545         345        545/890 = 61.2%

about whether one quantified NP in a given sentence c-commands the other corresponds to Kuno, Takami, and Wu's observation that subject quantifiers tend to take wide scope over object quantifiers and topicalized quantifiers tend to outscope everything. The identity of each lexical quantifier clearly should allow our classifiers to make the generalization that each tends to take wide scope, if this word is found in the data, and perhaps even learn the regularity underlying Kuno, Takami, and Wu's observation that universal quantifiers tend to outscope existentials.

4.3 Classifier Design

In this section we present the three types of model that we have trained to predict the preferred quantifier scoping on Penn Treebank sentences: a naive Bayes classifier, a maximum-entropy classifier, and a single-layer perceptron.8 In evaluating how well these models do in assigning the proper scope reading to each test sentence, it is important to have a baseline for comparison. The baseline model for this task is one that simply guesses the most frequent category of the data ("no scope interaction") every time. This simplistic strategy already classifies 61.2% of the test examples correctly, as shown in Table 2.

It may surprise some linguists that this third class of sentences, in which there is no scopal interaction between the two quantifiers, is the largest. In part this may be due to special features of the Wall Street Journal text of which the corpus consists. For example, newspaper articles may contain more direct quotations than other genres. In the process of tagging the data, however, it was also apparent that in a large proportion of cases the two quantifiers were taking scope in different conjuncts of a conjoined phrase. This further tendency supports the idea that people may intentionally avoid constructions in which there is even the possibility of quantifier scope interactions, perhaps because of some hearer-oriented pragmatic principle. Linguists may also be concerned that this additional category, in which there is no scope interaction between quantifiers, makes it difficult to compare the results of the present work with theoretical accounts of quantifier scope that ignore this case and concentrate on instances in which one quantifier does take scope over another. In response to such concerns, however, we point out first that we provide a model of scope prediction rather than scope generation, and so it is in any case not directly comparable with work in theoretical linguistics, which has largely ignored scope preferences. Second, we point out that the empirical nature of this study requires that we take note of cases in which the quantifiers simply do not interact.

8 The implementations of these classifiers are publicly available as Perl modules at ⟨http://humanities.uchicago.edu/linguistics/students/dchiggin/classifiers.tgz⟩.



Table 3
Performance of the naive Bayes classifier, summed over all 10 test runs

Condition                 Correct   Incorrect   Percentage correct

First has wide scope        177         104        177/281 = 63.0%
Second has wide scope        41          23        41/64 = 64.1%
No scope interaction        428         117        428/545 = 78.5%
Total                       646         244        646/890 = 72.6%

4.3.1 Naive Bayes Classifier. Our data D will consist of a vector of features (d_0, …, d_n) that represent aspects of the sentence under examination, such as whether one quantified expression c-commands the other, as described in Section 4.2. The fundamental simplifying assumption that we make in designing a naive Bayes classifier is that these features are independent of one another and therefore can be aggregated as independent sources of evidence about which class c* a given sentence belongs to. This independence assumption is formalized in equations (1) and (2):

c* = argmax_c P(c) P(d_0, …, d_n | c)                    (1)

   ≈ argmax_c P(c) ∏_{k=0}^{n} P(d_k | c)                 (2)

We constructed an empirical estimate of the prior probability P(c) by simply counting the frequency with which each class occurs in the training data. We constructed each P(d_k | c) by counting how often each feature d_k co-occurs with the class c, to construct the empirical estimate of P(d_k | c), and interpolated this with the empirical frequency P(d_k) of the feature d_k, not conditioned on the class c. This interpolated probability model was used in order to smooth the probability distribution, avoiding the problems that can arise if certain feature-value pairs are assigned a probability of zero.
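A minimal sketch of the resulting decision rule (equation (2)) is given below; it is our own illustration, and in particular the interpolation weight lam is a placeholder, since the text does not state how the class-conditional and unconditioned frequencies were weighted.

def naive_bayes_class(datum, prior, cond, marg, lam=0.9, classes=(0, 1, 2)):
    """Choose the class maximizing P(c) * prod_k P(d_k | c), where each
    conditional is smoothed by interpolation with the unconditioned
    feature frequency:  lam * P(d_k | c) + (1 - lam) * P(d_k).
    prior[c], cond[(k, value, c)], and marg[(k, value)] hold empirical
    relative frequencies gathered from the training data."""
    best_class, best_score = None, -1.0
    for c in classes:
        score = prior[c]
        for k, value in enumerate(datum):
            score *= (lam * cond.get((k, value, c), 0.0)
                      + (1 - lam) * marg.get((k, value), 0.0))
        if score > best_score:
            best_class, best_score = c, score
    return best_class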

The performance of the naive Bayes classifier is summarized in Table 3. For each of the 10 test sets of 89 items taken from the corpus, the remaining 804 of the total 893 sentences were used to train the model. The naive Bayes classifier outperformed the baseline by a considerable margin.

In addition to the raw counts of test examples correctly classified, though, we would like to know something of the internal structure of the model (i.e., what sort of features it has induced from the data). For this classifier, we can assume that a feature f is a good predictor of a class c* when the value of P(f | c*) is significantly larger than the (geometric) mean value of P(f | c) for all other values of c. Those features with the greatest ratio

P(f) × P(f | c*) / geommean(∀c ≠ c* [P(f | c)])

are listed in Table 4.9

The first-ranked feature in Table 4 shows that there is a tendency for quantified elements not to interact when they are found in conjoined constituents, and the second-ranked feature indicates a preference for quantifiers not to interact when there is an intervening comma (presumably an indicator of greater syntactic "distance"). Feature 3 indicates a preference for class 1 when there is an intervening S node,

9 We include the term P(f) in the product in order to prevent sparsely instantiated features from showing up as highly ranked.



Table 4
Most active features from naive Bayes classifier

Rank   Feature                                                  Predicted class   Ratio

1      There is an intervening conjunct node                           0           1.63
2      There is an intervening comma                                   0           1.51
3      There is an intervening S node                                  1           1.33
4      The first quantified NP does not c-command the second           0           1.25
5      Second quantifier is tagged QP                                  1           1.16
6      There is an intervening S node                                  0           1.12
…
15     The second quantified NP c-commands the first                   2           1.00

whereas feature 6 indicates a preference for class 0 under the same conditions. Presumably this reflects a dispreference for the second quantifier to take wide scope when there is a clause boundary intervening between it and the first quantifier. The fourth-ranked feature in Table 4 indicates that if the first quantified NP does not c-command the second, it is less likely to take wide scope. This is not surprising, given the importance that c-command relations have had in theoretical discussions of quantifier scope. The fifth-ranked feature expresses a preference for quantified expressions of category QP to take narrow scope if they are the second of the two quantifiers under consideration. This may simply be reflective of the fact that class 1 is more common than class 2, and the measure expressions found in QP phrases in the Penn Treebank (such as more than three or about half) tend not to be logically independent of other quantifiers. Finally, feature 15 in Table 4 indicates a high correlation between the second quantified expression's c-commanding the first and the second quantifier's taking wide scope. We can easily see this as a translation into our feature set of Kuno, Takami, and Wu's claim that subjects tend to outscope objects and obliques, and topicalized elements tend to take wide scope. Some of these top-ranked features have to do with information found only in the written medium, but on the whole the features induced by the naive Bayes classifier seem consistent with those suggested by Kuno, Takami, and Wu, although they are distinct by necessity.

4.3.2 Maximum-Entropy Classifier. The maximum-entropy classifier is a sort of log-linear model, defining the joint probability of a class and a data vector (d_0, …, d_n) as the product of the prior probability of the class c with a set of features related to the data:10

P(d_0, …, d_n, c) = (P(c) / Z) ∏_{k=0}^{n} α_k                    (3)

This classifier superficially resembles in form the naive Bayes classifier in equation (2), but it differs from that classifier in that the way in which values for each α are chosen does not assume that the features in the data are independent. For each of the 10 training sets, we used the generalized iterative scaling algorithm to train this classifier on 654 training examples, using 150 examples for validation to choose the best set of

10 Z in equation (3) is simply a normalizing constant that ensures that we end up with a probability distribution.



Table 5
Performance of the maximum-entropy classifier, summed over all 10 test runs

Condition                 Correct   Incorrect   Percentage correct

First has wide scope        148         133        148/281 = 52.7%
Second has wide scope        31          33        31/64 = 48.4%
No scope interaction        475          70        475/545 = 87.2%
Total                       654         236        654/890 = 73.5%

Table 6
Most active features from maximum-entropy classifier

Rank   Feature                                                Predicted class   α · P(c)^.25

1      Second quantifier is each                                     2              1.13
2      There is an intervening comma                                 0              1.01
3      There is an intervening conjunct node                         0              1.00
4      First quantified NP does not c-command the second             0              0.99
5      Second quantifier is every                                    2              0.98
6      There is an intervening quotation mark (")                    0              0.95
7      There is an intervening colon                                 0              0.95
…
12     First quantified NP c-commands the second                     1              0.92
…
25     There is no intervening comma                                 1              0.90

values for the αs.11 Test data could then be classified by choosing the class for the data that maximizes the joint probability in equation (3).
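This classification step amounts to the following sketch (our own illustration; the data structures are assumptions, and the weights α stand for those fitted by generalized iterative scaling). Since the normalizing constant Z is the same for every class, it can be ignored when choosing a class.

def maxent_class(active_features, prior, alpha, classes=(0, 1, 2)):
    """Choose the class maximizing P(c) * prod_k alpha[(k, c)] over the
    features active in the datum, as in equation (3) with Z dropped.
    alpha maps (feature, class) pairs to their fitted weights; features
    not associated with a class contribute a neutral factor of 1."""
    def score(c):
        s = prior[c]
        for k in active_features:
            s *= alpha.get((k, c), 1.0)
        return s
    return max(classes, key=score)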

The results of training with the maximum-entropy classifier are shown in Table 5. The classifier showed slightly higher performance than the naive Bayes classifier, with the lowest error rate on the class of sentences having no scope interaction.

To determine exactly which features of the data the maximum-entropy classifier sees as relevant to the classification problem, we can simply look at the α values (from equation (3)) for each feature. Those features with higher values for α are weighted more heavily in determining the proper scoping. Some of the features with the highest values for α are listed in Table 6. Because of the way the classifier is built, predictor features for class 2 need to have higher loadings to overcome the lower prior probability of the class. Therefore, we actually rank the features in Table 6 according to α · P(c)^k (which we denote as α_c^k), where P(c) represents the empirical prior probability of a class c and k is simply a constant (.25 in this case) chosen to try to get a mix of features for different classes at the top of the list.

The features ranked first and fifth in Table 6 express lexical preferences for certain quantifiers to take wide scope, even when they are the second of the two quantifiers according to linear order in the string of words. The tendency for each to take wide scope is stronger than for the other quantifier, which is in line with Kuno, Takami, and Wu's decision to list it as the only quantifier with a lexical preference for scoping. Feature 2 makes the "no scope interaction" class more likely if a comma intervenes, and

11 Overtraining is not a problem with the pure version of the generalized iterative scaling algorithm. For efficiency reasons, however, we chose to take the training corpus as representative of the event space, rather than enumerating the space exhaustively (see Jelinek [1998] for details). For this reason it was necessary to employ validation in training.



Table 7
Performance of the single-layer perceptron, summed over all 10 test runs

Condition                 Correct   Incorrect   Percentage correct

First has wide scope        182          99        182/281 = 64.8%
Second has wide scope        35          29        35/64 = 54.7%
No scope interaction        468          77        468/545 = 85.9%
Total                       685         205        685/890 = 77.0%

feature 25 makes a wide-scope reading for the first quantifier more likely if there is no intervening comma. The third-ranked feature expresses the tendency mentioned above for quantifiers in conjoined clauses not to interact. Features 4 and 12 indicate that if the first quantified expression c-commands the second, it is likely to take wide scope, and that if this is not the case, there is likely to be no scope interaction. Finally, the sixth- and seventh-ranked features in the table show that an intervening quotation mark or colon will make the classifier tend toward class 0, "no scope interaction," which is easy to understand: Quotations are often opaque to quantifier scope interactions. The top features found by the maximum-entropy classifier largely coincide with those found by the naive Bayes model, which indicates that these generalizations are robust and objectively present in the data.

4.3.3 Single-Layer Perceptron. For our neural network classifier, we employed a feed-forward single-layer perceptron, with the softmax function used to determine the activation of nodes at the output layer, because this is a one-of-n classification task (Bridle 1990). The data to be classified are presented as a vector of features at the input layer, and the output layer has three nodes representing the three possible classes for the data: "first has wide scope," "second has wide scope," and "no scope interaction." The output node with the highest activation is interpreted as the class of the datum presented at the input layer.
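The forward pass of such a network is sketched below; this is our own illustration of the architecture just described, with placeholder weight matrices rather than the trained values.

import math

def softmax(activations):
    """Normalize output activations so that they sum to one and can be
    interpreted as class probabilities."""
    m = max(activations)
    exps = [math.exp(a - m) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

def perceptron_class(x, W, b):
    """Single-layer network with one output node per class (0 = no scope
    interaction, 1 = first wide, 2 = second wide).  x is the feature
    vector; W[c][i] and b[c] are the trained connection weights and bias
    for output node c.  Returns the index of the most active output node."""
    activations = [sum(w_i * x_i for w_i, x_i in zip(W[c], x)) + b[c]
                   for c in range(len(W))]
    probs = softmax(activations)
    return max(range(len(probs)), key=probs.__getitem__)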

For each of the 10 test sets of 89 examples, we trained the connection weights of the network using error backpropagation on 654 training sentences, reserving 150 sentences for validation in order to choose the weights from the training epoch with the highest classification performance. In Table 7 we present the results of the single-layer neural network in classifying our test sentences. As the table shows, the single-layer perceptron has much better classification performance than the naive Bayes classifier and maximum-entropy model, possibly because the training of the network aims to minimize error in the activation of the classification output nodes, which is directly related to the classification task at hand, whereas the other models do not directly make use of the notion of "classification error." The perceptron also uses a sort of weighted voting and could be interpreted as an implementation of Kuno, Takami, and Wu's proposal for scope determination. This clearly illustrates that the tenability of their proposal hinges on the exact details of its implementation, since all of our classifier models are reasonable interpretations of their approach, but they have very different performance results on our scope determination task.

To determine exactly which features of the data the network sees as relevant to the classification problem, we can simply look at the connection weights for each feature-class pair. Higher connection weights indicate a greater correlation between input features and output classes. For one of the 10 networks we trained, some of the features with the highest connection weights are listed in Table 8. Since class 0 is



Table 8
Most active features from single-layer perceptron

Rank   Feature                                            Predicted class   Weight

1      There is an intervening comma                             0           4.31
2      Second quantifier is all                                  0           3.77
3      There is an intervening colon                             0           2.98
4      There is an intervening conjunct node                     0           2.72
…
17     The first quantified NP c-commands the second             1           1.69
18     Second quantifier is tagged RBS                           2           1.69
19     There is an intervening S node                            1           1.61
20     Second quantifier is each                                 2           1.50

simply more frequent in the training data than the other two classes, the weights for this class tend to be higher. Therefore, we also list some of the best predictor features for classes 1 and 2 in the table.

The first- and third-ranked features in Table 8 show that an intervening comma or colon will make the classifier tend toward class 0, "no scope interaction." This finding by the classifier is similar to the maximum-entropy classifier's finding an intervening quotation mark relevant, and can be taken as an indication that quantifiers in distant syntactic subdomains are unlikely to interact. Similarly, the fourth-ranked feature indicates that quantifiers in separate conjuncts are unlikely to interact. The second-ranked feature in the table expresses a tendency for there to be no scope interaction between two quantifiers if the second of them is headed by all. This may be related to the independence of universal quantifiers (∀x∀y[φ] ⟺ ∀y∀x[φ]). Feature 17 in Table 8 indicates a high correlation between the first quantified expression's c-commanding the second and the first quantifier's taking wide scope, which again supports Kuno, Takami, and Wu's claim that scope preferences are related to syntactic superiority relations. Feature 18 expresses a preference for a quantified expression headed by most to take wide scope, even if it is the second of the two quantifiers (since most is the only quantifier in the corpus that bears the tag RBS). Feature 19 indicates that the first quantifier is more likely to take wide scope if there is a clause boundary intervening between the two quantifiers, which supports the notion that the syntactic distance between the quantifiers is relevant to scope preferences. Finally, feature 20 expresses the well-known tendency for quantified expressions headed by each to take wide scope.

4.4 Summary of Results

Table 9 summarizes the performance of the quantifier scope models we have presented here. All of the classifiers have test set accuracy above the baseline, which a paired t-test reveals to be significant at the .001 level. The differences between the naive Bayes, maximum-entropy, and single-layer perceptron classifiers are not statistically significant.

The classifiers performed significantly better on those sentences annotated consistently by both human coders at the beginning of the study, reinforcing the view that this subset of the data is somehow simpler and more representative of the basic regularities in scope preferences. For example, the single-layer perceptron classified 82.9% of these sentences correctly. To further investigate the nature of the variation between the two coders, we constructed a version of our single-layer network that was trained on the data on which both coders agreed and tested on the remaining sentences.

Table 9
Summary of classifier results

Model                     Training data  Validation data  Test data
Baseline                  —              —                61.2%
Naive Bayes               76.7%          —                72.6%
Maximum entropy           78.3%          75.5%            73.5%
Single-layer perceptron   84.7%          76.8%            77.0%

This classifier agreed with the reference coding (the coding of the first coder) 51.4% of the time and with the additional independent coder 35.8% of the time. The first coder constructed the annotation guidelines for this project and may have been more successful in applying them consistently. Alternatively, it is possible that different individuals use different strategies in determining scope preferences, and the strategy of the second coder may simply have been less similar than the strategy of the first coder to that of the single-layer network.

These three classifiers directly implement a sort of weighted voting, the method of aggregating evidence proposed by Kuno, Takami, and Wu (although the classifiers' implementation is slightly more sophisticated than the unweighted voting that is actually used in Kuno, Takami, and Wu's paper). Of course, since we do not use exactly the set of features suggested by Kuno, Takami, and Wu, our model should not be seen as a straightforward implementation of the theory outlined in their 1999 paper. Nevertheless, the results in Table 9 suggest that Kuno, Takami, and Wu's suggested design can be used with some success in modeling scope preferences. Moreover, the project undertaken here provides an answer to some of the objections that Aoun and Li (2000) raise to Kuno, Takami, and Wu. Aoun and Li claim that Kuno, Takami, and Wu's choice of experts is seemingly arbitrary and that it is unclear how the voting weights of each expert are to be set, but the machine learning approach we employ in this article is capable of addressing both of these potential problems. Supervised training of our classifiers is a straightforward approach to setting the weights and also constitutes our approach to selecting features (or "experts," in Kuno, Takami, and Wu's terminology). In the training process, any feature that is irrelevant to scoping preferences should receive weights that make its effect negligible.

5. Syntax and Scope

In this section we show how the classifier models of quantifier scope determination introduced in Section 4 may be integrated with a PCFG model of syntax. We compare two different ways in which the two components may be combined, which may loosely be termed serial and parallel, and argue for the latter on the basis of empirical results.

5.1 Modular Design

Our use of a phrase structure syntactic component and a quantifier scope component to define a combined language model is simplified by the fact that our classifiers are probabilistic and define a conditional probability distribution over quantifier scopings. The probability distributions that our classifiers define for quantifier scope structures are conditional on syntactic phrase structure, because they are computed on the basis of syntactically provided features, such as the number of nodes of a certain type that intervene between two quantifiers in a phrase structure tree.

Thus the combined language model that we define in this article assigns probabilities according to the pairs of structures that may be assigned to a sentence by the Q-structure and phrase structure syntax modules. The probability of a word string w_{1-n} is therefore defined as in equation (4), where Q ranges over all possible Q-structures in the set Q and S ranges over all possible syntactic structures in the set S:

    P(w_{1-n}) = Σ_{S ∈ S, Q ∈ Q} P(S, Q | w_{1-n})                      (4)

               = Σ_{S ∈ S, Q ∈ Q} P(S | w_{1-n}) P(Q | S, w_{1-n})       (5)

Equation (5) shows how we can use the definition of conditional probability to break our calculation of the language model probability into two parts. The first of these parts, P(S | w_{1-n}), which we may abbreviate as simply P(S), is the probability of a particular syntactic tree structure's being assigned to a particular word string. We model this probability using a probabilistic phrase structure grammar (cf. Charniak [1993, 1996]). The second distribution on the right side of equation (5) is the conditional probability of a particular quantifier scope structure's being assigned to a particular word string, given the syntactic structure of that string. This probability is written as P(Q | S, w_{1-n}), or simply P(Q | S), and represents the quantity we estimated above in constructing classifiers to predict the scopal representation of a sentence based on aspects of its syntactic structure.

Thus, given a PCFG model of syntactic structure and a probabilistically defined classifier of the sort introduced in Section 4, it is simple to determine the probability of any pairing of two particular structures from each domain for a given sentence: we simply multiply the values of P(S) and P(Q | S) to obtain the joint probability P(Q, S). In the current section we examine two different models of combination for these components, one in which scope determination is applied to the optimal syntactic structure (the Viterbi parse) and one in which optimization is performed in the space of both modules to find the optimal pairing of syntactic and quantifier scope structures.
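For concreteness, the joint probability of one candidate pairing can be computed as in the sketch below; pcfg_prob, scope_classifier, and features are placeholder names for the two modules and the feature extractor, not the authors' actual interfaces.

    def joint_probability(tree, scope_class, pcfg_prob, scope_classifier, features):
        """Score one pairing (S, Q): P(S) from the PCFG times P(Q | S) from the
        scope classifier, where scope_class is one of the readings 0, 1, or 2."""
        p_s = pcfg_prob(tree)                                          # P(S | w_1-n)
        p_q_given_s = scope_classifier(features(tree))[scope_class]    # P(Q | S, w_1-n)
        return p_s * p_q_given_s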

5.2 The Syntactic Module

Before turning to the application of our multimodular approach to the problem of scope determination in Section 5.3, we present here a short overview of the phrase structure syntactic component used in these projects. As noted above, we model syntax as a probabilistic phrase structure grammar (PCFG), and in particular we use a treebank grammar (Charniak 1996) trained on the Penn Treebank.

A PCFG defines the probability of a string of words as the sum of the probabilities of all admissible phrase structure parses (trees) for that string. The probability of a given tree is the product of the probabilities of all of the rule instances used in the construction of that tree, where rules take the form N → φ, with N a nonterminal symbol and φ a finite sequence of one or more terminals or nonterminals.

To take an example, Figure 5 illustrates a phrase structure tree for the sentence Susan might not believe you, which is admissible according to the grammar in Table 10. (All of the minimal subtrees in Figure 5 are instances of one of our rules.)

Figure 5
A simple phrase structure tree

Table 10
A simple probabilistic phrase structure grammar

Rule                Probability
S   → NP VP         .7
S   → VP            .2
S   → V NP VP       .1
VP  → V VP          .3
VP  → ADV VP        .1
VP  → V             .1
VP  → V NP          .3
VP  → V NP NP       .2
NP  → Susan         .3
NP  → you           .4
NP  → Yves          .3
V   → might         .2
V   → believe       .3
V   → show          .3
V   → stay          .2
ADV → not           .5
ADV → always        .5

The probability of this tree, which we can indicate as τ, can be calculated as in equation (6), where ρ ranges over the rule instances R(τ) used in the construction of τ:

    P(τ) = Π_{ρ ∈ R(τ)} P(ρ)                                           (6)

         = P(S → NP VP) × P(VP → V VP) × P(VP → ADV VP)
           × P(VP → V NP) × P(NP → Susan) × P(V → might)
           × P(ADV → not) × P(V → believe) × P(NP → you)               (7)

         = .7 × .3 × .1 × .3 × .3 × .2 × .5 × .3 × .4 = 2.268 × 10⁻⁵   (8)
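The arithmetic in equations (6)-(8) can be reproduced directly from Table 10. The sketch below hard-codes the grammar probabilities and the rule instances of the tree in Figure 5; it is only an illustration of the computation, not part of the parser itself.

    # Probabilities from Table 10, keyed by (parent, right-hand side).
    rule_prob = {
        ("S", ("NP", "VP")): 0.7,   ("VP", ("V", "VP")): 0.3,
        ("VP", ("ADV", "VP")): 0.1, ("VP", ("V", "NP")): 0.3,
        ("NP", ("Susan",)): 0.3,    ("V", ("might",)): 0.2,
        ("ADV", ("not",)): 0.5,     ("V", ("believe",)): 0.3,
        ("NP", ("you",)): 0.4,
    }

    # Rule instances used in the Figure 5 tree for "Susan might not believe you".
    tree_rules = [
        ("S", ("NP", "VP")), ("VP", ("V", "VP")), ("VP", ("ADV", "VP")),
        ("VP", ("V", "NP")), ("NP", ("Susan",)), ("V", ("might",)),
        ("ADV", ("not",)), ("V", ("believe",)), ("NP", ("you",)),
    ]

    p_tree = 1.0
    for rule in tree_rules:
        p_tree *= rule_prob[rule]   # equation (6): product over rule instances

    print(p_tree)                   # approximately 2.268e-05, as in equation (8)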

The actual grammar rules and associated probabilities that we use in defining our syntactic module are derived from the WSJ corpus of the Penn Treebank by maximum-likelihood estimation. That is, for each rule N → φ used in the treebank, we add the rule to the grammar and set its probability to

    C(N → φ) / Σ_ψ C(N → ψ),

where C(·) denotes the "count" of a rule (i.e., the number of times it is used in the corpus). A grammar composed in this manner is referred to as a treebank grammar, because its rules are directly derived from those in a treebank corpus.

We used sections 00–20 of the WSJ corpus of the Penn Treebank for collecting the rules and associated probabilities of our PCFG, which is implemented as a bottom-up chart parser. Before constructing the grammar, the treebank was preprocessed using known procedures (cf. Krotov et al. [1998]; Belz [2001]) to facilitate the construction of a rule list. Functional and anaphoric annotations (basically anything following a "-" in a node label; cf. Santorini [1990]; Bies et al. [1995]) were removed from nonterminal labels. Nodes that dominate only "empty categories," such as traces, were removed. In addition, unary-branching constructions were removed by replacing the mother category in such a structure with the daughter node. (For example, given an instance of the rule X → Y Z, if the daughter category Y were expanded by the unary rule Y → W, our algorithm would induce the single rule X → W Z.) Finally, we discarded all rules that had more than 10 symbols on the right-hand side (an arbitrary limit of our parser implementation). This resulted in a total of 18,820 rules, of which 11,156 were discarded as hapax legomena, leaving 7,664 rules in our treebank grammar. Table 11 shows some of the rules in our grammar with the highest and lowest corpus counts.
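The maximum-likelihood estimation just described can be sketched as follows. The local_trees helper (assumed to yield one (nonterminal, right-hand-side) pair per internal node of an already preprocessed tree), the input format, and the cutoff parameters are assumptions made for the illustration; note that the sketch normalizes over the retained rules after the hapax legomena have been dropped.

    from collections import Counter, defaultdict

    def induce_treebank_grammar(trees, max_rhs=10, min_count=2):
        """Estimate P(N -> phi) = C(N -> phi) / sum_psi C(N -> psi)
        from a collection of preprocessed treebank trees."""
        counts = Counter()
        for tree in trees:
            for parent, children in local_trees(tree):   # assumed helper
                if len(children) <= max_rhs:             # drop overly long rules
                    counts[(parent, tuple(children))] += 1
        # Drop hapax legomena (rules occurring only once), as in the text.
        counts = {rule: c for rule, c in counts.items() if c >= min_count}
        totals = defaultdict(int)
        for (parent, _), c in counts.items():
            totals[parent] += c
        return {rule: c / totals[rule[0]] for rule, c in counts.items()}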

5.3 Unlabeled Scope Determination

In this section we describe an experiment designed to assess the performance of parallel and serial approaches to combining grammatical modules, focusing on the task of unlabeled scope determination. This task involves predicting the most likely Q-structure representation for a sentence, basically the same task we addressed in Section 4 in comparing the performance levels of each type of classifier. The experiment of this section differs, however, from the task presented in Section 4 in that instead of providing a syntactic tree from the Penn Treebank as input to the classifier, we provide the model only with a string of words (a sentence). Our dual-component model will search for the optimal syntactic and scopal structures for the sentence (the pairing (τ*, χ*)) and will be evaluated based on its success in identifying the correct scope reading χ*.

Our concern in this section will be to determine whether it is necessary to search the space of possible pairings (τ, χ) of syntactic and scopal structures, or whether it is sufficient to use our PCFG first to fix the syntactic tree τ and then to choose a scope reading to maximize the probability of the pairing.

Table 11
Rules derived from sections 00–20 of the Penn Treebank WSJ corpus. "TOP" is a special "start" symbol that may expand to any of the symbols found at the root of a tree in the corpus.

Rule                          Corpus count
PP → IN NP                    59,053
TOP → S                       34,614
NP → DT NN                    28,074
NP → NP PP                    25,192
S → NP VP                     14,032
S → NP VP .                   12,901
VP → TO VP                    11,598
...
S → CC PP NNP NNP VP          2
NP → DT `` NN NN NN ''        2
NP → NP PP PP PP PP PP        2
INTJ → UH UH                  2
NP → DT `` NN NNS             2
SBARQ → `` WP VP ''           2
S → PP NP VP ''               2

That is, are syntax and quantifier scope mutually dependent components of grammar, or can scope relations be "read off of" syntax? The serial model suggests that the optimal syntactic structure τ* should be chosen on the basis of the syntactic module only, as in equation (9), and the optimal quantifier scope structure χ* then chosen on the basis of τ*, as in equation (10). The parallel model, on the other hand, suggests that the most likely pairing of structures must be chosen in the joint probability space of both components, as in equation (11).

    τ* = argmax_{τ ∈ S} P_S(τ | w_{1-n})                                              (9)

    χ* = argmax_{χ ∈ Q} P_Q(χ | τ*, w_{1-n})                                          (10)

    τ* = {τ | (τ, χ) = argmax_{τ ∈ S, χ ∈ Q} P_S(τ | w_{1-n}) P_Q(χ | τ, w_{1-n})}    (11)
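The two decision procedures in equations (9)-(11) can be sketched as follows. Here parser.top_parses, scope_classifier, and features are hypothetical interfaces, and, as described later in this section, the parallel search is restricted to a k-best list of parses rather than the full space S.

    def serial_scope(sentence, parser, scope_classifier, features):
        """Equations (9)-(10): fix the Viterbi parse first, then pick the
        scope reading that is most probable given that single tree."""
        tree, _ = parser.top_parses(sentence, k=1)[0]      # (tree, P_S(tree | w))
        probs = scope_classifier(features(tree))           # P_Q(chi | tree, w), chi in {0,1,2}
        return max(range(3), key=lambda chi: probs[chi])

    def parallel_scope(sentence, parser, scope_classifier, features, k=10):
        """Equation (11): search the joint space of the k best trees and the
        three scope readings, and return the reading in the best pairing."""
        best_chi, best_p = None, -1.0
        for tree, p_tree in parser.top_parses(sentence, k=k):
            probs = scope_classifier(features(tree))
            for chi in range(3):
                if p_tree * probs[chi] > best_p:
                    best_p, best_chi = p_tree * probs[chi], chi
        return best_chi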

5.3.1 Experimental Design. For this experiment, we implement the scoping component as a single-layer feed-forward network, because the single-layer perceptron classifier had the best prediction rate among the three classifiers tested in Section 4. The softmax activation function we use for the output nodes of the classifier guarantees that the activations of all of the output nodes sum to one and can be interpreted as class probabilities. The syntactic component, of course, is determined by the treebank PCFG grammar described above.

Given these two models, which respectively define P_Q(χ | τ, w_{1-n}) and P_S(τ | w_{1-n}) from equation (11), it remains only to specify how to search the space of pairings (τ, χ) in performing this optimization to find χ*. Unfortunately, it is not feasible to examine all values τ ∈ S, since our PCFG will generally admit a huge number of trees for a sentence (especially given a mean sentence length of over 20 words in the WSJ corpus).¹²

Table 12
Performance of models on the unlabeled scope prediction task, summed over all 10 test runs

Condition                  Correct  Incorrect  Percentage correct
Parallel model
  First has wide scope     168      113        168/281 = 59.8%
  Second has wide scope    26       38         26/64 = 40.6%
  No scope interaction     467      78         467/545 = 85.7%
  Total                    661      229        661/890 = 74.3%
Serial model
  First has wide scope     163      118        163/281 = 58.0%
  Second has wide scope    27       37         27/64 = 42.2%
  No scope interaction     461      84         461/545 = 84.6%
  Total                    651      239        651/890 = 73.1%

Our solution to this search problem is to make the simplifying assumption that the syntactic tree that is used in the optimal set of structures (τ*, χ*) will always be among the top few trees τ for which P_S(τ | w_{1-n}) is the greatest. That is, although we suppose that quantifier scope information is relevant to parsing, we do not suppose that it is so strong a determinant as to completely override syntactic factors. In practice, this means that our parser will return the top 10 parses for each sentence, along with the probabilities assigned to them, and these are the only parses that are considered in looking for the optimal set of linguistic structures.

We again used 10-fold cross-validation in evaluating the competing models, dividing the scope-tagged corpus into 10 test sections of 89 sentences each, and we used the same version of the treebank grammar for our PCFG. The first model retrieved the top 10 syntactic parses (τ₀, ..., τ₉) for each sentence and computed the probability P(τ, χ) for each τ ∈ {τ₀, ..., τ₉}, χ ∈ {0, 1, 2}, choosing the scopal representation χ that was found in the maximum-probability pairing. We call this the parallel model, because the properties of each probabilistic model may influence the optimal structure chosen by the other. The second model retrieved only the Viterbi parse τ₀ from the PCFG and chose the scopal representation χ for which the pairing (τ₀, χ) took on the highest probability. We call this the serial model, because it represents syntactic phrase structure as independent of other components of grammar (in this case, quantifier scope), though other components are dependent upon it.

5.3.2 Results. There was an appreciable difference in performance between these two models on the quantifier scope test sets. As shown in Table 12, the parallel model narrowly outperformed the serial model, by 1.2%. A 10-fold paired t-test on the test sections of the scope-tagged corpus shows that the parallel model is significantly better (p < .05).

12. Since we are allowing χ to range only over the three scope readings (0, 1, 2), however, it is possible to enumerate all values of χ to be paired with a given syntactic tree τ.
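The 10-fold paired comparison reported above can be run as a standard paired t-test over the per-fold accuracies of the two models, as in the sketch below; the accuracy values shown are placeholders, not the actual per-fold results.

    from scipy.stats import ttest_rel

    # Hypothetical per-fold accuracies for the two models (10 folds each).
    parallel_acc = [0.76, 0.73, 0.75, 0.74, 0.72, 0.75, 0.77, 0.73, 0.74, 0.74]
    serial_acc   = [0.74, 0.72, 0.74, 0.73, 0.71, 0.74, 0.76, 0.72, 0.73, 0.72]

    t_stat, p_value = ttest_rel(parallel_acc, serial_acc)
    print(t_stat, p_value)   # the models differ significantly if p < .05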

This result suggests that in determining the syntactic structure of a sentence, we must take aspects of structure into account that are not purely syntactic (such as quantifier scope). Searching both dimensions of the hypothesis space for our dual-component model allowed the composite model to handle the interdependencies between different aspects of grammatical structure, whereas fixing a phrase structure tree purely on the basis of syntactic considerations led to suboptimal performance in using that structure as a basis for determining quantifier scope.

6. Conclusion

In this article we have taken a statistical, corpus-based approach to the modeling of quantifier scope preferences, a subject that has previously been addressed only with systems of ad hoc rules derived from linguists' intuitive judgments. Our model takes its theoretical inspiration from Kuno, Takami, and Wu (1999), who suggest an "expert system" approach to scope preferences, and follows many other projects in the machine learning of natural language that combine information from multiple sources in solving linguistic problems.

Our results are generally supportive of the design that Kuno, Takami, and Wu propose for the quantifier scope component of grammar, and some of the features induced by our models find clear parallels in the factors that Kuno, Takami, and Wu claim to be relevant to scoping. In addition, our final experiment, in which we combine our quantifier scope module with a PCFG model of syntactic phrase structure, provides evidence of a grammatical architecture in which different aspects of structure mutually constrain one another. This result casts doubt on approaches in which syntactic processing is completed prior to the determination of other grammatical properties of a sentence, such as quantifier scope relations.

Appendix: Selected Codes Used to Annotate Syntactic Categories in the Penn Treebank, from Marcus et al. (1993) and Bies et al. (1995)

Part-of-speech tags

Tag    Meaning
CC     Conjunction
CD     Cardinal number
DT     Determiner
IN     Preposition
JJ     Adjective
JJR    Comparative adjective
JJS    Superlative adjective
NN     Singular or mass noun
NNS    Plural noun
NNP    Singular proper noun
NNPS   Plural proper noun
PDT    Predeterminer
PRP    Personal pronoun
PRP$   Possessive pronoun
RB     Adverb
RBR    Comparative adverb
RBS    Superlative adverb
TO     "to"
UH     Interjection
VB     Verb in base form
VBD    Past-tense verb
VBG    Gerundive verb
VBN    Past participial verb
VBP    Non-3sg present-tense verb
VBZ    3sg present-tense verb
WP     WH pronoun
WP$    Possessive WH pronoun

Phrasal categories

Code    Meaning
ADJP    Adjective phrase
ADVP    Adverb phrase
INTJ    Interjection
NP      Noun phrase
PP      Prepositional phrase
QP      Quantifier phrase (i.e., measure/amount phrase)
S       Declarative clause
SBAR    Clause introduced by a subordinating conjunction
SBARQ   Clause introduced by a WH phrase
SINV    Inverted declarative sentence
SQ      Inverted yes/no question following the WH phrase in SBARQ
VP      Verb phrase

Acknowledgments
The authors are grateful for an Academic Technology Innovation Grant from the University of Chicago, which helped to make this work possible, and to John Goldsmith, Terry Regier, Anne Pycha, and Bob Moore, whose advice and collaboration have considerably aided the research reported in this article. Any remaining errors are, of course, our own.

References

Aoun, Joseph and Yen-hui Audrey Li. 1993. The Syntax of Scope. MIT Press, Cambridge.
Aoun, Joseph and Yen-hui Audrey Li. 2000. Scope, structure, and expert systems: A reply to Kuno et al. Language, 76(1):133–155.
Belz, Anja. 2001. Optimisation of corpus-derived probabilistic grammars. In Proceedings of Corpus Linguistics 2001, pages 46–57.
Berger, Adam L., Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.
Bies, Ann, Mark Ferguson, Karen Katz, and Robert MacIntyre. 1995. Bracketing guidelines for Treebank II style. Technical report, Penn Treebank Project, University of Pennsylvania.
Bishop, Christopher M. 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bridle, John S. 1990. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fougelman-Soulie and J. Herault, editors, Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag, Berlin, pages 227–236.
Brill, Eric. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565.
Carden, Guy. 1976. English Quantifiers: Logical Structure and Linguistic Variation. Academic Press, New York.
Carletta, Jean. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249–254.
Charniak, Eugene. 1993. Statistical Language Learning. MIT Press, Cambridge.
Charniak, Eugene. 1996. Tree-bank grammars. In AAAI/IAAI, Vol. 2, pages 1031–1036.
Cohen, Jacob. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46.
Cooper, Robin. 1983. Quantification and Syntactic Theory. Reidel, Dordrecht.
Fleiss, Joseph L. 1981. Statistical Methods for Rates and Proportions. John Wiley & Sons, New York.
Hobbs, Jerry R. and Stuart M. Shieber. 1987. An algorithm for generating quantifier scopings. Computational Linguistics, 13:47–63.
Hornstein, Norbert. 1995. Logical Form: From GB to Minimalism. Blackwell, Oxford and Cambridge.
Jelinek, Frederick. 1998. Statistical Methods for Speech Recognition. MIT Press, Cambridge.
Jurafsky, Daniel and James H. Martin. 2000. Speech and Language Processing. Prentice Hall, Upper Saddle River, New Jersey.
Krippendorff, Klaus. 1980. Content Analysis: An Introduction to Its Methodology. Sage Publications, Beverly Hills, California.
Krotov, Alexander, Mark Hepple, Robert J. Gaizauskas, and Yorick Wilks. 1998. Compacting the Penn treebank grammar. In COLING-ACL, pages 699–703.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 1999. Quantifier scope in English, Chinese, and Japanese. Language, 75(1):63–111.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 2001. Response to Aoun and Li. Language, 77(1):134–143.
Manning, Christopher D. and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge.
Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
Martin, Paul, Douglas Appelt, and Fernando Pereira. 1986. Transportability and generality in a natural-language interface system. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 585–593.
May, Robert. 1985. Logical Form: Its Structure and Derivation. MIT Press, Cambridge.
McCawley, James D. 1998. The Syntactic Phenomena of English. University of Chicago Press, Chicago, second edition.
Mitchell, Tom M. 1996. Machine Learning. McGraw Hill, New York.
Montague, Richard. 1973. The proper treatment of quantification in ordinary English. In J. Hintikka et al., editors, Approaches to Natural Language. Reidel, Dordrecht, pages 221–242.
Moran, Douglas B. 1988. Quantifier scoping in the SRI core language engine. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics (ACL'88), pages 33–40.
Moran, Douglas B. and Fernando C. N. Pereira. 1992. Quantifier scoping. In Hiyan Alshawi, editor, The Core Language Engine. MIT Press, Cambridge, pages 149–172.
Nerbonne, John. 1993. A feature-based syntax/semantics interface. In A. Manaster-Ramer and W. Zadrozsny, editors, Annals of Mathematics and Artificial Intelligence (Special Issue on Mathematics of Language), 8(1–2):107–132. Also published as DFKI Research Report RR-92-42.
Park, Jong C. 1995. Quantifier scope and constituency. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL'95), pages 205–212.
Pedersen, Ted. 2000. A simple approach to building ensembles of naive Bayesian classifiers for word sense disambiguation. In Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2000), pages 63–69.
Pereira, Fernando. 1990. Categorial semantics and scoping. Computational Linguistics, 16(1):1–10.
Pollard, Carl. 1989. The syntax-semantics interface in a unification-based phrase structure grammar. In S. Busemann, C. Hauenschild, and C. Umbach, editors, Views of the Syntax/Semantics Interface. KIT-FAST Report 74, Technical University of Berlin, pages 167–185.
Pollard, Carl and Eun Jung Yoo. 1998. A unified theory of scope for quantifiers and WH-phrases. Journal of Linguistics, 34(2):415–446.
Ratnaparkhi, Adwait. 1997. A simple introduction to maximum entropy models for natural language processing. Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania.
Santorini, Beatrice. 1990. Part-of-speech tagging guidelines for the Penn Treebank project. Technical Report MS-CIS-90-47, Department of Computer and Information Science, University of Pennsylvania.
Soon, Wee Meng, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544.
van Halteren, Hans, Jakub Zavrel, and Walter Daelemans. 2001. Improving accuracy in word class tagging through the combination of machine learning systems. Computational Linguistics, 27(2):199–229.
VanLehn, Kurt A. 1978. Determining the scope of English quantifiers. Technical Report AITR-483, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Cambridge.
Woods, William A. 1986. Semantics and quantification in natural language question answering. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 205–248.

Page 3: A Machine Learning Approach to Modeling Scope Preferences · 2003-11-13 · ®c2003 Association for Computational Linguistics A Machine Learning Approach to Modeling Scope Preferences

75

Higgins and Sadock Modeling Scope Preferences

Figure 1Simple illustration of the QR approach to quantier scope generation

follows from this claim differ in the precise characterization of this QR transforma-tion what conditions are placed upon it and what tree-congurational relationshipis required for one operator to take scope over another The general idea of QR isrepresented in Figure 1 a schematic analysis of the reading of the sentence Someonesaw everyone in which someone takes wide scope (ie lsquothere is some person x such thatfor all persons y x saw yrsquo)

In the Cooper storage approach quantiers are gathered into a store and passedupward through a syntactic tree At certain nodes along the way quantiers may beretrieved from the store and take scope The relative scope of quantiers is determinedby where each quantier is retrieved from the store with quantiers higher in the treetaking wide scope over lower ones As with QR different authors implement thisscheme in slightly different ways but the simplest case is represented in Figure 2 theCooper storage analog of Figure 1

These structural approaches QR and Cooper storage have in common that theyallow syntactic factors to have an effect only on the scope readings that are available fora given sentence They are also similar in addressing only the issue of scope generationor identifying all and only the accessible readings for each sentence That is to saythey do not address the issue of the relative salience of these readings

Kuno Takami and Wu (1999 2001) propose to model the scope of quantiedelements with a set of interacting expert systems that basically consists of a weightedvote taken of the various factors that may inuence scope readings This model ismeant to account not only for scope generation but also for ldquothe relative strengths ofthe potential scope interpretations of a given sentencerdquo (1999 page 63) They illustratethe plausibility of this approach in their paper by presenting a number of examplesthat are accounted for fairly well by the approach even when an unweighted vote ofthe factors is allowed to be taken

So for example in Kuno Takami and Wursquos (49b) (1999) repeated here as (2) thecorrect prediction is made that the sentence is unambiguous with the rst quantiednoun phrase (NP) taking wide scope over the second (the reading in which we donrsquotall have to hate the same people) Table 1 illustrates how the votes of each of KunoTakami and Wursquos ldquoexpertsrdquo contribute to this outcome Since the expression many ofus=you receives more votes and the numbers for the two competing quantied expres-sions are quite far apart the rst one is predicted to take wide scope unambiguously

(2) Many of us=you hate some of them

76

Computational Linguistics Volume 29 Number 1

Figure 2Simple illustration of the Cooper storage approach to quantier scope generation

Table 1Voting to determine optimal scope readings for quantiers according to Kuno Takami andWu (1999)

many of us you some of them

Baselinep p

Subject Qp

Lefthand Qp

SpeakerHearer Qp

Total 4 1

Some adherents of the structural approaches also seem to acknowledge the ne-cessity of eventually coming to terms with the factors that play a role in determiningscope preferences in language Aoun and Li (2000) claim that the lexical scope pref-erences of quantiers ldquoare not ruled out under a structural accountrdquo (page 140) It isclear from the surrounding discussion though that they intend such lexical require-ments to be taken care of in some nonsyntactic component of grammar AlthoughKuno Takami and Wursquos dialogue with Aoun and Li in Language has been portrayedby both sides as a debate over the correct way of modeling quantier scope they arenot really modeling the same things Whereas Aoun and Li (1993) provide an accountof scope generation Kuno Takami and Wu (1999) intend to model both scope gen-eration and scope prediction The model of scope preferences provided in this articleis an empirically based renement of the approach taken by Kuno Takami and Wubut in principle it is consistent with a structural account of scope generation

77

Higgins and Sadock Modeling Scope Preferences

3 Approaches to Quantier Scope in Computational Linguistics

Many studies such as Pereira (1990) and Park (1995) have dealt with the issue ofscope generation from a computational perspective Attempts have also been madein computational work to extend a pure Cooper storage approach to handle scopeprediction Hobbs and Shieber (1987) discuss the possibility of incorporating some sortof ordering heuristics into the SRI scope generation system in the hopes of producinga ranked list of possible scope readings but ultimately are forced to acknowledge thatldquo[t]he modications turn out to be quite complicated if we wish to order quantiersaccording to lexical heuristics such as having each out-scope some Because of therecursive nature of the algorithm there are limits to the amount of ordering that canbe done in this mannerrdquo (page 55) The stepwise nature of these scope mechanismsmakes it hard to state the factors that inuence the preference for one quantier totake scope over another

Those natural language processing (NLP) systems that have managed to providesome sort of account of quantier scope preferences have done so by using a separatesystem of heuristics (or scope critics) that apply postsyntactically to determine the mostlikely scoping LUNAR (Woods 1986) TEAM (Martin Appelt and Pereira 1986) andthe SRI Core Language Engine as described by Moran (1988 Moran and Pereira 1992)all employ scope rules of this sort By and large these rules are of an ad hoc natureimplementing a linguistrsquos intuitive idea of what factors determine scope possibilitiesand no results have been published regarding the accuracy of these methods Forexample Moran (1988) incorporates rules from other NLP systems and from VanLehn(1978) such as a preference for a logically weaker interpretation the tendency for eachto take wide scope and a ban on raising a quantier across multiple major clauseboundaries The testing of Moranrsquos system is ldquolimited to checking conformance tothe stated rulesrdquo (pages 40ndash41) In addition these systems are generally incapable ofhandling unrestricted text such as that found in the Wall Street Journal corpus in arobust way because they need to do a full semantic analysis of a sentence in orderto make scope predictions The statistical basis of the model presented in this articleoffers increased robustness and the possibility of more serious evaluation on the basisof corpus data

4 Modeling Quantier Scope

In this section we argue for an empirically driven machine learning approach tothe identication of factors relevant to quantier scope and the modeling of scopepreferences Following much recent work that applies the tools of machine learning tolinguistic problems (Brill 1995 Pedersen 2000 van Halteren Zavrel and Daelemans2001 Soon Ng and Lim 2001) we will treat the prediction of quantier scope asan example of a classication task Our aim is to provide a robust model of scopeprediction based on Kuno Takami and Wursquos theoretical foundation and to addressthe serious lack of empirical results regarding quantier scope in computational workWe describe here the modeling tools borrowed from the eld of articial intelligencefor the scope prediction task and the data from which the generalizations are to belearned Finally we present the results of training different incarnations of our scopemodule on the data and assess the implications of this exercise for theoretical andcomputational linguistics

78

Computational Linguistics Volume 29 Number 1

41 Classication in Machine LearningDetermining which among multiple quantiers in a sentence takes wide scope givena number of different sources of evidence is an example of what is known in machinelearning as a classication task (Mitchell 1996) There are many types of classiersthat may be applied to this task that both are more sophisticated than the approachsuggested by Kuno Takami and Wu and have a more solid probabilistic foundationThese include the naive Bayes classier (Manning and Schutze 1999 Jurafsky andMartin 2000) maximum-entropy models (Berger Della Pietra and Della Pietra 1996Ratnaparkhi 1997) and the single-layer perceptron (Bishop 1995) We employ theseclassier models here primarily because of their straightforward probabilistic inter-pretation and their similarity to the scope model of Kuno Takami and Wu (since theyeach could be said to implement a kind of weighted voting of factors) In Section 43we describe how classiers of these types can be constructed to serve as a grammaticalmodule responsible for quantier scope determination

All of these classiers can be trained in a supervised manner That is given a sam-ple of training data that provides all of the information that is deemed to be relevantto quantier scope and the actual scope reading assigned to a sentence these classi-ers will attempt to extract generalizations that can be fruitfully applied in classifyingas-yet-unseen examples

42 DataThe data on which the quantier scope classiers are trained and tested is an extractfrom the Penn Treebank (Marcus Santorini and Marcinkiewicz 1993) that we havetagged to indicate the most salient scope interpretation of each sentence in contextFigure 3 shows an example of a training sentence with the scope reading indicatedThe quantier lower in the tree bears the tag ldquoQ1rdquo and the higher quantier bears thetag ldquoQ2rdquo so this sentence is interpreted such that the lower quantier has wide scopeReversing the tags would have meant that the higher quantier takes wide scope andwhile if both quantiers had been marked ldquoQ1rdquo this would have indicated that thereis no scope interaction between them (as when they are logically independent or takescope in different conjuncts of a conjoined phrase)2

The sentences tagged were chosen from the Wall Street Journal (WSJ) section ofthe Penn Treebank to have a certain set of attributes that simplify the task of design-ing the quantier scope module of the grammar First in order to simplify the codingprocess each sentence has exactly two scope-taking elements of the sort consideredfor this project3 These include most NPs that begin with a determiner predetermineror quantier phrase (QP)4 but exclude NPs in which the determiner is a an or the Ex-

2 This ldquono interactionrdquo class is a sort of ldquoelsewhererdquo category that results from phrasing the classicationquestion as ldquoWhich quantier takes wider scope in the preferred readingrdquo Where there is no scopeinteraction the answer is ldquoneitherrdquo This includes cases in which the relative scope of operators doesnot correspond to a difference in meaning as in One woman bought one horse or when they take scopein different propositional domains such as in Mary bought two horses and sold three sheep The humancoders used in this study were instructed to choose class 0 whenever there was not a clear preferencefor one of the two scope readings

3 This restriction that each sentence contain only two quantied elements does not actually excludemany sentences from consideration We identied only 61 sentences with three quantiers of the sortwe consider and 12 sentences with four In addition our review of these sentences revealed that manyof them simply involve lists in which the quantiers do not interact in terms of scope (as in forexample ldquoWe ask that you turn off all cell phones extinguish all cigarettes and open any candy beforethe performance beginsrdquo) Thus the class of sentences with more than two quantiers is small andseems to involve even simpler quantier interactions than those found in our corpus

4 These categories are intended to be understood as they are used in the tagging and parsing of the PennTreebank See Santorini (1990) and Bies et al (1995) for details the Appendix lists selected codes used

79

Higgins and Sadock Modeling Scope Preferences

( (S(NP-SBJ

(NP (DT Those) )(SBAR

(WHNP-1 (WP who) )(S

(NP-SBJ-2 (-NONE- T-1) )(ADVP (RB still) )(VP (VBP want)

(S(NP-SBJ (-NONE- -2) )(VP (TO to)

(VP (VB do)(NP (PRP it) ))))))))

(lsquo lsquo lsquo lsquo )(VP (MD will)

(ADVP (RB just) )(VP (VB find)

(NP(NP (DT-Q2 some) (NN way) )(SBAR

(WHADVP-3 (-NONE- 0) )(S

(NP-SBJ (-NONE- ) )(VP (TO to)

(VP (VB get)(PP (IN around) (rsquorsquo rsquorsquo)

(NP (DT-Q1 any) (NN attempt)(S

(NP-SBJ (-NONE- ) )(VP (TO to)

(VP (VB curb)(NP (PRP it) ))))))

(ADVP-MNR (-NONE- T-3) ))))))))( ) ))

Figure 3Tagged Wall Street Journal text from the Penn Treebank The lower quantier takes widescope indicated by its tag ldquoQ1rdquo

cluding these determiners from consideration largely avoids the problem of genericsand the complexities of assigning scope readings to denite descriptions In addi-tion only sentences that had the root node S were considered This serves to excludesentence fragments and interrogative sentence types Our data set therefore differssystematically from the full WSJ corpus but we believe it is sufcient to allow manygeneralizations about English quantication to be induced Given these restrictions onthe input data the task of the scope classier is a choice among three alternatives5

(Class 0) There is no scopal interaction

(Class 1) The rst quantier takes wide scope

(Class 2) The second quantier takes wide scope

for annotating the Penn Treebank corpus The category QP is particularly unintuitive in that it does notcorrespond to a quantied noun phrase but to a measure expression such as more than half

5 Some linguists may nd it strange that we have chosen to treat the choice of preferred scoping for twoquantied elements as a tripartite decision since the possibility of independence is seldom treated inthe linguistic literature As we are dealing with corpus data in this experiment we cannot afford toignore this possibility

80

Computational Linguistics Volume 29 Number 1

The result is a set of 893 sentences6 annotated with Penn Treebank II parse trees andhand-tagged for the primary scope reading

To assess the reliability of the hand-tagged data used in this project the data werecoded a second time by an independent coder in addition to the reference codingThe independent codings agreed with the reference coding on 763 of sentences Thekappa statistic (Cohen 1960) for agreement was 52 with a 95 condence intervalbetween 40 and 64 Krippendorff (1980) has been widely cited as advocating theview that kappa values greater than 8 should be taken as indicating good reliabilitywith values between 67 and 8 indicating tentative reliability but we are satisedwith the level of intercoder agreement on this task As Carletta (1996) notes manytasks in computational linguistics are simply more difcult than the content analysisclassications addressed by Krippendorff and according to Fleiss (1981) kappa valuesbetween 4 and 75 indicate fair to good agreement anyhow

Discussion between the coders revealed that there was no single cause for their dif-ferences in judgments when such differences existed Many cases of disagreement stemfrom different assumptions regarding the lexical quantiers involved For example thecoders sometimes differed on whether a given instance of the word any correspondsto a narrow-scope existential as we conventionally treat it when it is in the scope ofnegation or the ldquofree-choicerdquo version of any To take another example two universalquantiers are independent in predicate calculus (8x8y[ iquest ] () 8y8x[ iquest ]) but in creat-ing our scope-tagged corpus it was often difcult to decide whether two universal-likeEnglish quantiers (such as each any every and all) were actually independent in agiven sentence Some differences in coding stemmed from coder disagreements aboutwhether a quantier within a xed expression (eg all the hoopla) truly interacts withother operators in the sentence Of course another major factor contributing to inter-coder variation is the fact that our data sentences taken from Wall Street Journal textare sometimes quite long and complex in structure involving multiple scope-takingoperators in addition to the quantied NPs In such cases the coders sometimes haddifculty clearly distinguishing the readings in question

Because of the relatively small amount of data we had we used the technique oftenfold cross-validation in evaluating our classiers in each case choosing 89 of the893 total data sentences from the data as a test set and training on the remaining 804We preprocessed the data in order to extract the information from each sentence thatwe would be treating as relevant to the prediction of quantier scoping in this project(Although the initial coding of the preferred scope reading for each sentence was donemanually this preprocessing of the data was done automatically) At the end of thispreprocessing each sentence was represented as a record containing the followinginformation (see the Appendix for a list of annotation codes for Penn Treebank)

deg the syntactic category according to Penn Treebank conventions of therst quantier (eg DT for each NN for everyone or QP for more than half )

deg the rst quantier as a lexical item (eg each or everyone) For a QPconsisting of multiple words this eld contains the head word or ldquoCDrdquoin case the head is a cardinal number

deg the syntactic category of the second quantier

deg the second quantier as a lexical item

6 These data have been made publicly available to all licensees of the Penn Treebank by means of apatch le that may be retrieved from hhttphumanitiesuchicagoedulinguisticsstudentsdchigginqscope-datatgzi This le also includes the coding guidelines used for this project

81

Higgins and Sadock Modeling Scope Preferences

2

66666666666666666666666666664

class 2

rst cat DTrst head somesecond cat DTsecond head anyjoin cat NPrst c-commands YESsecond c-commands NOnodes intervening 6

VP intervenes YESADVP intervenes NO

S intervenes YES

conj intervenes NO

intervenes NO intervenes NO

rdquo intervenes YES

3

77777777777777777777777777775

Figure 4Example record corresponding to the sentence shown in Figure 3

deg the syntactic category of the lowest node dominating both quantiedNPs (the ldquojoinrdquo node)

deg whether the rst quantied NP c-commands the second

deg whether the second quantied NP c-commands the rst

deg the number of nodes intervening7 between the two quantied NPs

deg a list of the different categories of nodes that intervene between thequantied NPs (actually for each nonterminal category there is a distinctbinary feature indicating whether a node of that category intervenes)

deg whether a conjoined node intervenes between the quantied NPs

deg a list of the punctuation types that are immediately dominated by nodesintervening between the two NPs (again for each punctuation tag in thetreebank there is a distinct binary feature indicating whether suchpunctuation intervenes)

Figure 4 illustrates how these features would be used to encode the example in Fig-ure 3

The items of information included in the record as listed above are not the exactfactors that Kuno Takami and Wu (1999) suggest be taken into consideration in mak-ing scope predictions and they are certainly not sufcient to determine the properscope reading for all sentences completely Surely pragmatic factors and real-worldknowledge inuence our interpretations as well although these are not representedhere This list does however provide information that could potentially be useful inpredicting the best scope reading for a particular sentence For example information

7 We take a node to intervene between two other nodes macr and deg in a tree if and only if plusmn is the lowestnode dominating both macr and deg plusmn dominates or plusmn = and dominates either macr or deg

82

Computational Linguistics Volume 29 Number 1

Table 2Baseline performance summed over all ten test sets

Condition Correct Incorrect Percentage correct

First has wide scope 0 64 064 = 00Second has wide scope 0 281 0281 = 00

No scope interaction 545 0 545545 = 1000Total 545 345 545890 = 612

about whether one quantied NP in a given sentence c-commands the other corre-sponds to Kuno Takami and Wursquos observation that subject quantiers tend to takewide scope over object quantiers and topicalized quantiers tend to outscope ev-erything The identity of each lexical quantier clearly should allow our classiers tomake the generalization that each tends to take wide scope if this word is found inthe data and perhaps even learn the regularity underlying Kuno Takami and Wursquosobservation that universal quantiers tend to outscope existentials

43 Classier DesignIn this section we present the three types of model that we have trained to predictthe preferred quantier scoping on Penn Treebank sentences a naive Bayes classiera maximum-entropy classier and a single-layer perceptron8 In evaluating how wellthese models do in assigning the proper scope reading to each test sentence it is im-portant to have a baseline for comparison The baseline model for this task is one thatsimply guesses the most frequent category of the data (ldquono scope interactionrdquo) everytime This simplistic strategy already classies 612 of the test examples correctly asshown in Table 2

It may surprise some linguists that this third class of sentences in which there isno scopal interaction between the two quantiers is the largest In part this may bedue to special features of the Wall Street Journal text of which the corpus consists Forexample newspaper articles may contain more direct quotations than other genres Inthe process of tagging the data however it was also apparent that in a large proportionof cases the two quantiers were taking scope in different conjuncts of a conjoinedphrase This further tendency supports the idea that people may intentionally avoidconstructions in which there is even the possibility of quantier scope interactionsperhaps because of some hearer-oriented pragmatic principle Linguists may also beconcerned that this additional category in which there is no scope interaction betweenquantiers makes it difcult to compare the results of the present work with theoreticalaccounts of quantier scope that ignore this case and concentrate on instances in whichone quantier does take scope over another In response to such concerns howeverwe point out rst that we provide a model of scope prediction rather than scopegeneration and so it is in any case not directly comparable with work in theoreticallinguistics which has largely ignored scope preferences Second we point out thatthe empirical nature of this study requires that we take note of cases in which thequantiers simply do not interact

8 The implementations of these classiers are publicly available as Perl modules at hhttphumanitiesuchicagoedulinguisticsstudentsdchigginclassierstgzi

83

Higgins and Sadock Modeling Scope Preferences

Table 3Performance of the naive Bayes classier summed over all 10 test runs

Condition Correct Incorrect Percentage correct

First has wide scope 177 104 177281 = 630Second has wide scope 41 23 4164 = 641

No scope interaction 428 117 428545 = 785Total 646 244 646890 = 726

431 Naive Bayes Classier Our data D will consist of a vector of features (d0 dn)that represent aspects of the sentence under examination such as whether one quan-tied expression c-commands the other as described in Section 42 The fundamentalsimplifying assumption that we make in designing a naive Bayes classier is thatthese features are independent of one another and therefore can be aggregated as in-dependent sources of evidence about which class c curren a given sentence belongs to Thisindependence assumption is formalized in equations (1) and (2)

c curren = arg maxc

P(c)P(d0 dn j c) (1)

ordm arg maxc

P(c)nY

k = 0

P(dk j c) (2)

We constructed an empirical estimate of the prior probability P(c) by simply count-ing the frequency with which each class occurs in the training data We constructedeach P(dk j c) by counting how often each feature dk co-occurs with the class c toconstruct the empirical estimate P(dk j c) and interpolated this with the empiricalfrequency P(dk) of the feature dk not conditioned on the class c This interpolatedprobability model was used in order to smooth the probability distribution avoidingthe problems that can arise if certain feature-value pairs are assigned a probability ofzero

The performance of the naive Bayes classier is summarized in Table 3 For eachof the 10 test sets of 89 items taken from the corpus the remaining 804 of the total893 sentences were used to train the model The naive Bayes classier outperformedthe baseline by a considerable margin

In addition to the raw counts of test examples correctly classied though wewould like to know something of the internal structure of the model (ie what sort offeatures it has induced from the data) For this classier we can assume that a featuref is a good predictor of a class c curren when the value of P(f j c curren ) is signicantly largerthan the (geometric) mean value of P(f j c) for all other values of c Those featureswith the greatest ratio P(f ) pound P(f jccurren)

geommean( 8 c 6= ccurren [P(f jc)])are listed in Table 49

The rst-ranked feature in Table 4 shows that there is a tendency for quanti-ed elements not to interact when they are found in conjoined constituents and thesecond-ranked feature indicates a preference for quantiers not to interact when thereis an intervening comma (presumably an indicator of greater syntactic ldquodistancerdquo)Feature 3 indicates a preference for class 1 when there is an intervening S node

9 We include the term P(f ) in the product in order to prevent sparsely instantiated features fromshowing up as highly-ranked

84

Computational Linguistics Volume 29 Number 1

Table 4Most active features from naive Bayes classier

Rank Feature Predicted Ratioclass

1 There is an intervening conjunct node 0 1632 There is an intervening comma 0 1513 There is an intervening S node 1 1334 The rst quantied NP does not c-command the second 0 1255 Second quantier is tagged QP 1 1166 There is an intervening S node 0 112

15 The second quantied NP c-commands the rst 2 100

whereas feature 6 indicates a preference for class 0 under the same conditions Pre-sumably this reects a dispreference for the second quantier to take wide scopewhen there is a clause boundary intervening between it and the rst quantier Thefourth-ranked feature in Table 4 indicates that if the rst quantied NP does notc-command the second it is less likely to take wide scope This is not surprisinggiven the importance that c-command relations have had in theoretical discussionsof quantier scope The fth-ranked feature expresses a preference for quantied ex-pressions of category QP to take narrow scope if they are the second of the twoquantiers under consideration This may simply be reective of the fact that class1 is more common than class 2 and the measure expressions found in QP phrasesin the Penn Treebank (such as more than three or about half ) tend not to be logicallyindependent of other quantiers Finally the feature 15 in Table 4 indicates a highcorrelation between the second quantied expressionrsquos c-commanding the rst andthe second quantierrsquos taking wide scope We can easily see this as a translation intoour feature set of Kuno Takami and Wursquos claim that subjects tend to outscope ob-jects and obliques and topicalized elements tend to take wide scope Some of thesetop-ranked features have to do with information found only in the written mediumbut on the whole the features induced by the naive Bayes classier seem consis-tent with those suggested by Kuno Takami and Wu although they are distinct bynecessity

432 Maximum-Entropy Classier The maximum-entropy classier is a sort of log-linear model dening the joint probability of a class and a data vector (d0 dn) asthe product of the prior probability of the class c with a set of features related to thedata10

P(d0 dn c) =P(c)

Z

nY

k = 0

not k (3)

This classier supercially resembles in form the naive Bayes classier in equation (2)but it differs from that classier in that the way in which values for each not are chosendoes not assume that the features in the data are independent For each of the 10training sets we used the generalized iterative scaling algorithm to train this classieron 654 training examples using 150 examples for validation to choose the best set of

10 Z in Equation 3 is simply a normalizing constant that ensures that we end up with a probabilitydistribution

85

Higgins and Sadock Modeling Scope Preferences

Table 5Performance of the maximum-entropy classier summed over all 10 test runs

Condition Correct Incorrect Percentage correct

First has wide scope 148 133 148281 = 527Second has wide scope 31 33 3164 = 484

No scope interaction 475 70 475545 = 872Total 654 236 654890 = 735

Table 6Most active features from maximum-entropy classier

Rank Feature Predicted c25

class

1 Second quantier is each 2 1132 There is an intervening comma 0 1013 There is an intervening conjunct node 0 1004 First quantied NP does not c-command the second 0 0995 Second quantier is every 2 0986 There is an intervening quotation mark (rdquo) 0 0957 There is an intervening colon 0 095

12 First quantied NP c-commands the second 1 09225 There is no intervening comma 1 090

values for the not s11 Test data could then be classied by choosing the class for the datathat maximizes the joint probability in equation (3)

The results of training with the maximum-entropy classier are shown in Table 5The classier showed slightly higher performance than the naive Bayes classier withthe lowest error rate on the class of sentences having no scope interaction

To determine exactly which features of the data the maximum-entropy classiersees as relevant to the classication problem we can simply look at the not values (fromequation (3)) for each feature Those features with higher values for not are weightedmore heavily in determining the proper scoping Some of the features with the highestvalues for not are listed in Table 6 Because of the way the classier is built predictorfeatures for class 2 need to have higher loadings to overcome the lower prior probabil-ity of the class Therefore we actually rank the features in Table 6 according to not P(c)k

(which we denote as not ck) P(c) represents the empirical prior probability of a class cand k is simply a constant (25 in this case) chosen to try to get a mix of features fordifferent classes at the top of the list

The features ranked rst and fth in Table 6 express lexical preferences for certainquantiers to take wide scope even when they are the second of the two quantiersaccording to linear order in the string of words The tendency for each to take widescope is stronger than for the other quantier which is in line with Kuno Takamiand Wursquos decision to list it as the only quantier with a lexical preference for scopingFeature 2 makes the ldquono scope interactionrdquo class more likely if a comma intervenes and

11 Overtraining is not a problem with the pure version of the generalized iterative scaling algorithm. For efficiency reasons, however, we chose to take the training corpus as representative of the event space, rather than enumerating the space exhaustively (see Jelinek [1998] for details). For this reason it was necessary to employ validation in training.


Table 7
Performance of the single-layer perceptron, summed over all 10 test runs.

Condition               Correct  Incorrect  Percentage correct
First has wide scope        182         99  182/281 = 64.8%
Second has wide scope        35         29   35/64 = 54.7%
No scope interaction        468         77  468/545 = 85.9%
Total                       685        205  685/890 = 77.0%

feature 25 makes a wide-scope reading for the first quantifier more likely if there is no intervening comma. The third-ranked feature expresses the tendency, mentioned above, for quantifiers in conjoined clauses not to interact. Features 4 and 12 indicate that if the first quantified expression c-commands the second, it is likely to take wide scope, and that if this is not the case, there is likely to be no scope interaction. Finally, the sixth- and seventh-ranked features in the table show that an intervening quotation mark or colon will make the classifier tend toward class 0, "no scope interaction," which is easy to understand: Quotations are often opaque to quantifier scope interactions. The top features found by the maximum-entropy classifier largely coincide with those found by the naive Bayes model, which indicates that these generalizations are robust and objectively present in the data.

4.3.3 Single-Layer Perceptron. For our neural network classifier, we employed a feed-forward single-layer perceptron, with the softmax function used to determine the activation of nodes at the output layer, because this is a one-of-n classification task (Bridle 1990). The data to be classified are presented as a vector of features at the input layer, and the output layer has three nodes, representing the three possible classes for the data: "first has wide scope," "second has wide scope," and "no scope interaction." The output node with the highest activation is interpreted as the class of the datum presented at the input layer.
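A minimal sketch of such a network's forward pass (in Python/NumPy, with random weights and an arbitrary feature vector standing in for the real trained model and feature encoding):

    import numpy as np

    def softmax(z):
        z = z - z.max()                  # subtract the max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    rng = np.random.default_rng(0)
    n_features = 20                      # placeholder; the real model has one input per feature
    W = rng.normal(scale=0.1, size=(3, n_features))   # weights for the three output nodes
    b = np.zeros(3)

    def classify(x):
        # Softmax activations sum to one and can be read as class probabilities;
        # the most activated output node gives the predicted scope class.
        p = softmax(W @ x + b)
        return int(np.argmax(p)), p

    x = rng.integers(0, 2, size=n_features).astype(float)   # a made-up binary feature vector
    print(classify(x))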

For each of the 10 test sets of 89 examples, we trained the connection weights of the network using error backpropagation on 654 training sentences, reserving 150 sentences for validation in order to choose the weights from the training epoch with the highest classification performance. In Table 7 we present the results of the single-layer neural network in classifying our test sentences. As the table shows, the single-layer perceptron has much better classification performance than the naive Bayes classifier and maximum-entropy model, possibly because the training of the network aims to minimize error in the activation of the classification output nodes, which is directly related to the classification task at hand, whereas the other models do not directly make use of the notion of "classification error." The perceptron also uses a sort of weighted voting and could be interpreted as an implementation of Kuno, Takami, and Wu's proposal for scope determination. This clearly illustrates that the tenability of their proposal hinges on the exact details of its implementation, since all of our classifier models are reasonable interpretations of their approach, but they have very different performance results on our scope determination task.
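The training regime described above can be sketched as follows (Python/NumPy; the data, dimensions, and hyperparameters are invented, and the real experiments may have differed in such details):

    import numpy as np

    def softmax_rows(Z):
        Z = Z - Z.max(axis=1, keepdims=True)
        E = np.exp(Z)
        return E / E.sum(axis=1, keepdims=True)

    def train(X_tr, y_tr, X_val, y_val, n_classes=3, epochs=200, lr=0.1):
        # Gradient descent on the cross-entropy error of a single-layer softmax
        # network, keeping the weights from the epoch with the best validation
        # accuracy, as in the procedure described in the text.
        W = np.zeros((X_tr.shape[1], n_classes))
        Y = np.eye(n_classes)[y_tr]                       # one-hot targets
        best_W, best_acc = W.copy(), 0.0
        for _ in range(epochs):
            P = softmax_rows(X_tr @ W)
            W -= lr * X_tr.T @ (P - Y) / len(X_tr)
            acc = (softmax_rows(X_val @ W).argmax(axis=1) == y_val).mean()
            if acc > best_acc:
                best_acc, best_W = acc, W.copy()
        return best_W, best_acc

    rng = np.random.default_rng(1)
    X = rng.integers(0, 2, size=(804, 20)).astype(float)  # 654 training + 150 validation rows
    y = rng.integers(0, 3, size=804)
    W, acc = train(X[:654], y[:654], X[654:], y[654:])
    print(acc)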

To determine exactly which features of the data the network sees as relevant to the classification problem, we can simply look at the connection weights for each feature-class pair. Higher connection weights indicate a greater correlation between input features and output classes. For one of the 10 networks we trained, some of the features with the highest connection weights are listed in Table 8. Since class 0 is


Table 8
Most active features from single-layer perceptron.

Rank  Feature                                            Predicted class  Weight
 1    There is an intervening comma                             0          4.31
 2    Second quantifier is all                                  0          3.77
 3    There is an intervening colon                             0          2.98
 4    There is an intervening conjunct node                     0          2.72
17    The first quantified NP c-commands the second             1          1.69
18    Second quantifier is tagged RBS                           2          1.69
19    There is an intervening S node                            1          1.61
20    Second quantifier is each                                 2          1.50

simply more frequent in the training data than the other two classes, the weights for this class tend to be higher. Therefore, we also list some of the best predictor features for classes 1 and 2 in the table.

The first- and third-ranked features in Table 8 show that an intervening comma or colon will make the classifier tend toward class 0, "no scope interaction." This finding by the classifier is similar to the maximum-entropy classifier's finding an intervening quotation mark relevant and can be taken as an indication that quantifiers in distant syntactic subdomains are unlikely to interact. Similarly, the fourth-ranked feature indicates that quantifiers in separate conjuncts are unlikely to interact. The second-ranked feature in the table expresses a tendency for there to be no scope interaction between two quantifiers if the second of them is headed by all. This may be related to the independence of universal quantifiers (∀x∀y[φ] ⇔ ∀y∀x[φ]). Feature 17 in Table 8 indicates a high correlation between the first quantified expression's c-commanding the second and the first quantifier's taking wide scope, which again supports Kuno, Takami, and Wu's claim that scope preferences are related to syntactic superiority relations. Feature 18 expresses a preference for a quantified expression headed by most to take wide scope, even if it is the second of the two quantifiers (since most is the only quantifier in the corpus that bears the tag RBS). Feature 19 indicates that the first quantifier is more likely to take wide scope if there is a clause boundary intervening between the two quantifiers, which supports the notion that the syntactic distance between the quantifiers is relevant to scope preferences. Finally, feature 20 expresses the well-known tendency for quantified expressions headed by each to take wide scope.

4.4 Summary of Results
Table 9 summarizes the performance of the quantifier scope models we have presented here. All of the classifiers have test set accuracy above the baseline, which a paired t-test reveals to be significant at the .001 level. The differences between the naive Bayes, maximum-entropy, and single-layer perceptron classifiers are not statistically significant.
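The significance test is the standard paired t-test over the 10 cross-validation folds; a sketch of the computation (Python with SciPy, using invented per-fold accuracies rather than our actual figures):

    from scipy import stats

    baseline   = [0.61, 0.62, 0.60, 0.61, 0.63, 0.61, 0.60, 0.62, 0.61, 0.61]
    classifier = [0.76, 0.78, 0.75, 0.77, 0.79, 0.76, 0.77, 0.78, 0.76, 0.78]

    # Paired because both systems are evaluated on exactly the same ten test folds.
    t, p = stats.ttest_rel(classifier, baseline)
    print(f"t = {t:.2f}, two-sided p = {p:.5f}")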

The classifiers performed significantly better on those sentences annotated consistently by both human coders at the beginning of the study, reinforcing the view that this subset of the data is somehow simpler and more representative of the basic regularities in scope preferences. For example, the single-layer perceptron classified 82.9% of these sentences correctly. To further investigate the nature of the variation between the two coders, we constructed a version of our single-layer network that was trained


Table 9
Summary of classifier results.

                          Training data  Validation data  Test data
Baseline                        —               —           61.2%
Naive Bayes                   76.7%             —           72.6%
Maximum entropy               78.3%           75.5%         73.5%
Single-layer perceptron       84.7%           76.8%         77.0%

on the data on which both coders agreed and tested on the remaining sentences. This classifier agreed with the reference coding (the coding of the first coder) 51.4% of the time, and with the additional independent coder 35.8% of the time. The first coder constructed the annotation guidelines for this project and may have been more successful in applying them consistently. Alternatively, it is possible that different individuals use different strategies in determining scope preferences, and the strategy of the second coder may simply have been less similar than the strategy of the first coder to that of the single-layer network.

These three classifiers directly implement a sort of weighted voting, the method of aggregating evidence proposed by Kuno, Takami, and Wu (although the classifiers' implementation is slightly more sophisticated than the unweighted voting that is actually used in Kuno, Takami, and Wu's paper). Of course, since we do not use exactly the set of features suggested by Kuno, Takami, and Wu, our model should not be seen as a straightforward implementation of the theory outlined in their 1999 paper. Nevertheless, the results in Table 9 suggest that Kuno, Takami, and Wu's suggested design can be used with some success in modeling scope preferences. Moreover, the project undertaken here provides an answer to some of the objections that Aoun and Li (2000) raise to Kuno, Takami, and Wu. Aoun and Li claim that Kuno, Takami, and Wu's choice of experts is seemingly arbitrary and that it is unclear how the voting weights of each expert are to be set, but the machine learning approach we employ in this article is capable of addressing both of these potential problems. Supervised training of our classifiers is a straightforward approach to setting the weights and also constitutes our approach to selecting features (or "experts," in Kuno, Takami, and Wu's terminology). In the training process, any feature that is irrelevant to scoping preferences should receive weights that make its effect negligible.

5. Syntax and Scope

In this section we show how the classifier models of quantifier scope determination introduced in Section 4 may be integrated with a PCFG model of syntax. We compare two different ways in which the two components may be combined, which may loosely be termed serial and parallel, and argue for the latter on the basis of empirical results.

5.1 Modular Design
Our use of a phrase structure syntactic component and a quantifier scope component to define a combined language model is simplified by the fact that our classifiers are probabilistic and define a conditional probability distribution over quantifier scopings. The probability distributions that our classifiers define for quantifier scope structures are conditional on syntactic phrase structure, because they are computed on the basis


of syntactically provided features, such as the number of nodes of a certain type that intervene between two quantifiers in a phrase structure tree.
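For instance, a feature such as whether the first quantified NP c-commands the second can be computed directly from the tree; the following schematic Python fragment (with a toy tree encoding of our own choosing, not the representation actually used) illustrates the idea:

    def path_to(tree, target, path=()):
        # Return the sequence of child indices leading from the root to target.
        if tree is target:
            return path
        for i, child in enumerate(tree[1:]):
            if isinstance(child, tuple):
                found = path_to(child, target, path + (i,))
                if found is not None:
                    return found
        return None

    def c_commands(tree, a, b):
        # Approximation: a c-commands b if a's parent dominates b (the path to
        # a's parent is a prefix of the path to b) and a does not dominate b.
        pa, pb = path_to(tree, a), path_to(tree, b)
        return pa[:-1] == pb[:len(pa) - 1] and pb[:len(pa)] != pa

    np1 = ("NP", ("DT", "every"), ("NN", "student"))     # first quantified NP
    np2 = ("NP", ("DT", "some"), ("NN", "book"))         # second quantified NP
    tree = ("S", np1, ("VP", ("V", "read"), np2))

    print(c_commands(tree, np1, np2))   # True: the subject c-commands the object
    print(c_commands(tree, np2, np1))   # False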

Thus, the combined language model that we define in this article assigns probabilities according to the pairs of structures that may be assigned to a sentence by the Q-structure and phrase structure syntax modules. The probability of a word string w_{1...n} is therefore defined as in equation (4), where Q ranges over all possible Q-structures in the set Q and S ranges over all possible syntactic structures in the set S:

P(w_{1 \ldots n}) = \sum_{S \in \mathcal{S},\, Q \in \mathcal{Q}} P(S, Q \mid w_{1 \ldots n}) \qquad (4)

= \sum_{S \in \mathcal{S},\, Q \in \mathcal{Q}} P(S \mid w_{1 \ldots n}) \, P(Q \mid S, w_{1 \ldots n}) \qquad (5)

Equation (5) shows how we can use the definition of conditional probability to break our calculation of the language model probability into two parts. The first of these parts, P(S | w_{1...n}), which we may abbreviate as simply P(S), is the probability of a particular syntactic tree structure's being assigned to a particular word string. We model this probability using a probabilistic phrase structure grammar (cf. Charniak [1993, 1996]). The second distribution on the right side of equation (5) is the conditional probability of a particular quantifier scope structure's being assigned to a particular word string, given the syntactic structure of that string. This probability is written as P(Q | S, w_{1...n}), or simply P(Q | S), and represents the quantity we estimated above in constructing classifiers to predict the scopal representation of a sentence based on aspects of its syntactic structure.

Thus, given a PCFG model of syntactic structure and a probabilistically defined classifier of the sort introduced in Section 4, it is simple to determine the probability of any pairing of two particular structures from each domain for a given sentence: We simply multiply the values of P(S) and P(Q | S) to obtain the joint probability P(Q, S). In the current section, we examine two different models of combination for these components: one in which scope determination is applied to the optimal syntactic structure (the Viterbi parse), and one in which optimization is performed in the space of both modules to find the optimal pairing of syntactic and quantifier scope structures.
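Schematically (Python, with invented numbers standing in for the outputs of the two modules):

    def joint_probability(parse_prob, scope_probs, scope_class):
        # P(Q, S) = P(S) * P(Q | S): the PCFG probability of the parse times the
        # scope classifier's conditional probability of the reading given that parse.
        return parse_prob * scope_probs[scope_class]

    parse_prob = 3.2e-12                          # illustrative P(S) from the PCFG
    scope_probs = {0: 0.70, 1: 0.25, 2: 0.05}     # illustrative P(Q | S) from the classifier
    print(joint_probability(parse_prob, scope_probs, 1))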

5.2 The Syntactic Module
Before turning to the application of our multimodular approach to the problem of scope determination in Section 5.3, we present here a short overview of the phrase structure syntactic component used in these projects. As noted above, we model syntax as a probabilistic phrase structure grammar (PCFG), and in particular we use a treebank grammar (Charniak 1996) trained on the Penn Treebank.

A PCFG defines the probability of a string of words as the sum of the probabilities of all admissible phrase structure parses (trees) for that string. The probability of a given tree is the product of the probability of all of the rule instances used in the construction of that tree, where rules take the form N → φ, with N a nonterminal symbol and φ a finite sequence of one or more terminals or nonterminals.

To take an example, Figure 5 illustrates a phrase structure tree for the sentence Susan might not believe you, which is admissible according to the grammar in Table 10. (All of the minimal subtrees in Figure 5 are instances of one of our rules.) The probability


Figure 5
A simple phrase structure tree.

Table 10
A simple probabilistic phrase structure grammar.

Rule                 Probability
S   → NP VP              .7
S   → VP                 .2
S   → V NP VP            .1
VP  → V VP               .3
VP  → ADV VP             .1
VP  → V                  .1
VP  → V NP               .3
VP  → V NP NP            .2
NP  → Susan              .3
NP  → you                .4
NP  → Yves               .3
V   → might              .2
V   → believe            .3
V   → show               .3
V   → stay               .2
ADV → not                .5
ADV → always             .5


of this tree, which we can indicate as τ, can be calculated as in equation (6), where R(τ) denotes the set of rule instances used in constructing the tree:

P(\tau) = \prod_{\xi \in R(\tau)} P(\xi) \qquad (6)

= P(\mathrm{S} \to \mathrm{NP\ VP}) \times P(\mathrm{VP} \to \mathrm{V\ VP}) \times P(\mathrm{VP} \to \mathrm{ADV\ VP})
  \times P(\mathrm{VP} \to \mathrm{V\ NP}) \times P(\mathrm{NP} \to \mathit{Susan}) \times P(\mathrm{V} \to \mathit{might})
  \times P(\mathrm{ADV} \to \mathit{not}) \times P(\mathrm{V} \to \mathit{believe}) \times P(\mathrm{NP} \to \mathit{you}) \qquad (7)

= .7 \times .3 \times .1 \times .3 \times .3 \times .2 \times .5 \times .3 \times .4 = 2.268 \times 10^{-5} \qquad (8)
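The arithmetic of equations (7) and (8) can be checked mechanically; in Python, using the rule probabilities of Table 10:

    rule_probs = {
        ("S", ("NP", "VP")):   0.7,
        ("VP", ("V", "VP")):   0.3,
        ("VP", ("ADV", "VP")): 0.1,
        ("VP", ("V", "NP")):   0.3,
        ("NP", ("Susan",)):    0.3,
        ("V", ("might",)):     0.2,
        ("ADV", ("not",)):     0.5,
        ("V", ("believe",)):   0.3,
        ("NP", ("you",)):      0.4,
    }

    # Each of these nine rule instances is used exactly once in the tree of Figure 5,
    # so the tree probability is simply their product.
    p = 1.0
    for prob in rule_probs.values():
        p *= prob
    print(f"{p:.3e}")    # 2.268e-05, as in equation (8)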

The actual grammar rules and associated probabilities that we use in defining our syntactic module are derived from the WSJ corpus of the Penn Treebank by maximum-likelihood estimation. That is, for each rule N → φ used in the treebank, we add the rule to the grammar and set its probability to C(N → φ) / Σ_ψ C(N → ψ), where C(·) denotes the "count" of a rule (i.e., the number of times it is used in the corpus). A grammar composed in this manner is referred to as a treebank grammar, because its rules are directly derived from those in a treebank corpus.

We used sections 00–20 of the WSJ corpus of the Penn Treebank for collecting the rules and associated probabilities of our PCFG, which is implemented as a bottom-up chart parser. Before constructing the grammar, the treebank was preprocessed using known procedures (cf. Krotov et al. [1998]; Belz [2001]) to facilitate the construction of a rule list. Functional and anaphoric annotations (basically anything following a "-" in a node label; cf. Santorini [1990]; Bies et al. [1995]) were removed from nonterminal labels. Nodes that dominate only "empty categories," such as traces, were removed. In addition, unary-branching constructions were removed by replacing the mother category in such a structure with the daughter node. (For example, given an instance of the rule X → Y Z, if the daughter category Y were expanded by the unary rule Y → W, our algorithm would induce the single rule X → W Z.) Finally, we discarded all rules that had more than 10 symbols on the right-hand side (an arbitrary limit of our parser implementation). This resulted in a total of 18,820 rules, of which 11,156 were discarded as hapax legomena, leaving 7,664 rules in our treebank grammar. Table 11 shows some of the rules in our grammar with the highest and lowest corpus counts.
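A simplified sketch of this estimation step (Python; trees are encoded here as nested tuples, and the preprocessing steps just described, along with the hapax filtering, are omitted for brevity):

    from collections import Counter, defaultdict

    def count_rules(tree, counts):
        # tree is a (label, children...) tuple; leaves are word strings.
        label, children = tree[0], tree[1:]
        rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
        counts[(label, rhs)] += 1
        for c in children:
            if isinstance(c, tuple):
                count_rules(c, counts)

    def estimate_pcfg(trees):
        # Maximum-likelihood estimation: P(N -> phi) is the count of N -> phi
        # divided by the total count of rules expanding N.
        counts = Counter()
        for t in trees:
            count_rules(t, counts)
        lhs_totals = defaultdict(int)
        for (lhs, _), n in counts.items():
            lhs_totals[lhs] += n
        return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

    toy = ("S", ("NP", "Susan"), ("VP", ("V", "might"), ("VP", ("V", "stay"))))
    print(estimate_pcfg([toy]))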

5.3 Unlabeled Scope Determination
In this section we describe an experiment designed to assess the performance of parallel and serial approaches to combining grammatical modules, focusing on the task of unlabeled scope determination. This task involves predicting the most likely Q-structure representation for a sentence, basically the same task we addressed in Section 4 in comparing the performance levels of each type of classifier. The experiment of this section differs, however, from the task presented in Section 4 in that instead of providing a syntactic tree from the Penn Treebank as input to the classifier, we provide the model only with a string of words (a sentence). Our dual-component model will search for the optimal syntactic and scopal structures for the sentence (the pairing (τ*, χ*)) and will be evaluated based on its success in identifying the correct scope reading χ*.

Our concern in this section will be to determine whether it is necessary to search the space of possible pairings (τ, χ) of syntactic and scopal structures, or whether it is sufficient to use our PCFG first to fix the syntactic tree τ and then to choose


Table 11
Rules derived from sections 00–20 of the Penn Treebank WSJ corpus. "TOP" is a special "start" symbol that may expand to any of the symbols found at the root of a tree in the corpus.

Rule                              Corpus count
PP → IN NP                             59,053
TOP → S                                34,614
NP → DT NN                             28,074
NP → NP PP                             25,192
S → NP VP                              14,032
S → NP VP                              12,901
VP → TO VP                             11,598

S → CC PP NNP NNP VP                        2
NP → DT “ NN NN NN ” ”                      2
NP → NP PP PP PP PP PP                      2
INTJ → UH UH                                2
NP → DT “ NN NNS                            2
SBARQ → “ WP VP ” ”                         2
S → PP NP VP ” ”                            2

a scope reading to maximize the probability of the pairing. That is, are syntax and quantifier scope mutually dependent components of grammar, or can scope relations be "read off of" syntax? The serial model suggests that the optimal syntactic structure τ* should be chosen on the basis of the syntactic module only, as in equation (9), and the optimal quantifier scope structure χ* then chosen on the basis of τ*, as in equation (10). The parallel model, on the other hand, suggests that the most likely pairing of structures must be chosen in the joint probability space of both components, as in equation (11).

\tau^{*} = \arg\max_{\tau \in \mathcal{S}} P_S(\tau \mid w_{1 \ldots n}) \qquad (9)

\chi^{*} = \arg\max_{\chi \in \mathcal{Q}} P_Q(\chi \mid \tau^{*}, w_{1 \ldots n}) \qquad (10)

\tau^{*} = \{ \tau \mid (\tau, \chi) = \arg\max_{\tau \in \mathcal{S},\, \chi \in \mathcal{Q}} P_S(\tau \mid w_{1 \ldots n}) \, P_Q(\chi \mid \tau, w_{1 \ldots n}) \} \qquad (11)

5.3.1 Experimental Design. For this experiment, we implement the scoping component as a single-layer feed-forward network, because the single-layer perceptron classifier had the best prediction rate among the three classifiers tested in Section 4. The softmax activation function we use for the output nodes of the classifier guarantees that the activations of all of the output nodes sum to one and can be interpreted as class probabilities. The syntactic component, of course, is determined by the treebank PCFG grammar described above.

Given these two models, which respectively define P_Q(χ | τ, w_{1...n}) and P_S(τ | w_{1...n}) from equation (11), it remains only to specify how to search the space of pairings (τ, χ) in performing this optimization, to find χ*. Unfortunately, it is not feasible to examine all values τ ∈ S, since our PCFG will generally admit a huge number of


Table 12
Performance of models on the unlabeled scope prediction task, summed over all 10 test runs.

Condition               Correct  Incorrect  Percentage correct
Parallel model
First has wide scope        168        113  168/281 = 59.8%
Second has wide scope        26         38   26/64 = 40.6%
No scope interaction        467         78  467/545 = 85.7%
Total                       661        229  661/890 = 74.3%

Serial model
First has wide scope        163        118  163/281 = 58.0%
Second has wide scope        27         37   27/64 = 42.2%
No scope interaction        461         84  461/545 = 84.6%
Total                       651        239  651/890 = 73.1%

trees for a sentence (especially given a mean sentence length of over 20 words in the WSJ corpus).12 Our solution to this search problem is to make the simplifying assumption that the syntactic tree that is used in the optimal set of structures (τ*, χ*) will always be among the top few trees τ for which P_S(τ | w_{1...n}) is the greatest. That is, although we suppose that quantifier scope information is relevant to parsing, we do not suppose that it is so strong a determinant as to completely override syntactic factors. In practice, this means that our parser will return the top 10 parses for each sentence, along with the probabilities assigned to them, and these are the only parses that are considered in looking for the optimal set of linguistic structures.

We again used 10-fold cross-validation in evaluating the competing models, dividing the scope-tagged corpus into 10 test sections of 89 sentences each, and we used the same version of the treebank grammar for our PCFG. The first model retrieved the top 10 syntactic parses (τ_0, ..., τ_9) for each sentence and computed the probability P(τ, χ) for each τ ∈ {τ_0, ..., τ_9}, χ ∈ {0, 1, 2}, choosing that scopal representation χ that was found in the maximum-probability pairing. We call this the parallel model, because the properties of each probabilistic model may influence the optimal structure chosen by the other. The second model retrieved only the Viterbi parse τ_0 from the PCFG and chose the scopal representation χ for which the pairing (τ_0, χ) took on the highest probability. We call this the serial model, because it represents syntactic phrase structure as independent of other components of grammar (in this case, quantifier scope), though other components are dependent upon it.
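The difference between the two models can be made concrete with a small sketch (Python; the candidate parses, probabilities, and classifier outputs are all invented):

    def parallel_choice(parses, scope_model):
        # Search the joint space: the scope class comes from the highest-probability
        # (parse, scope) pairing, as in equation (11).
        best_prob, best_scope = max(
            (p_syn * q, scope)
            for parse, p_syn in parses
            for scope, q in scope_model(parse).items()
        )
        return best_scope

    def serial_choice(parses, scope_model):
        # Fix the Viterbi parse first (equation (9)), then choose the best scope
        # class for that single parse (equation (10)).
        viterbi_parse, _ = max(parses, key=lambda t: t[1])
        probs = scope_model(viterbi_parse)
        return max(probs, key=probs.get)

    parses = [("tree-A", 4.0e-12), ("tree-B", 3.5e-12)]      # top parses with PCFG probabilities
    outputs = {"tree-A": {0: 0.50, 1: 0.30, 2: 0.20},        # classifier output per parse
               "tree-B": {0: 0.20, 1: 0.75, 2: 0.05}}
    scope_model = outputs.get

    print(serial_choice(parses, scope_model))     # 0: decided on tree-A alone
    print(parallel_choice(parses, scope_model))   # 1: tree-B's best pairing wins overall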

5.3.2 Results. There was an appreciable difference in performance between these two models on the quantifier scope test sets. As shown in Table 12, the parallel model narrowly outperformed the serial model, by 1.2%. A 10-fold paired t-test on the test sections of the scope-tagged corpus shows that the parallel model is significantly better (p < .05).

12 Since we are allowing χ to range only over the three scope readings (0, 1, 2), however, it is possible to enumerate all values of χ to be paired with a given syntactic tree τ.


This result suggests that in determining the syntactic structure of a sentence, we must take aspects of structure into account that are not purely syntactic (such as quantifier scope). Searching both dimensions of the hypothesis space for our dual-component model allowed the composite model to handle the interdependencies between different aspects of grammatical structure, whereas fixing a phrase structure tree purely on the basis of syntactic considerations led to suboptimal performance in using that structure as a basis for determining quantifier scope.

6. Conclusion

In this article we have taken a statistical, corpus-based approach to the modeling of quantifier scope preferences, a subject that has previously been addressed only with systems of ad hoc rules derived from linguists' intuitive judgments. Our model takes its theoretical inspiration from Kuno, Takami, and Wu (1999), who suggest an "expert system" approach to scope preferences, and follows many other projects in the machine learning of natural language that combine information from multiple sources in solving linguistic problems.

Our results are generally supportive of the design that Kuno, Takami, and Wu propose for the quantifier scope component of grammar, and some of the features induced by our models find clear parallels in the factors that Kuno, Takami, and Wu claim to be relevant to scoping. In addition, our final experiment, in which we combine our quantifier scope module with a PCFG model of syntactic phrase structure, provides evidence of a grammatical architecture in which different aspects of structure mutually constrain one another. This result casts doubt on approaches in which syntactic processing is completed prior to the determination of other grammatical properties of a sentence, such as quantifier scope relations.

Appendix: Selected Codes Used to Annotate Syntactic Categories in the Penn Treebank, from Marcus et al. (1993) and Bies et al. (1995)

Part-of-speech tags

Tag    Meaning                        Tag    Meaning
CC     Conjunction                    RB     Adverb
CD     Cardinal number                RBR    Comparative adverb
DT     Determiner                     RBS    Superlative adverb
IN     Preposition                    TO     "to"
JJ     Adjective                      UH     Interjection
JJR    Comparative adjective          VB     Verb in base form
JJS    Superlative adjective          VBD    Past-tense verb
NN     Singular or mass noun          VBG    Gerundive verb
NNS    Plural noun                    VBN    Past participial verb
NNP    Singular proper noun           VBP    Non-3sg present-tense verb
NNPS   Plural proper noun             VBZ    3sg present-tense verb
PDT    Predeterminer                  WP     WH pronoun
PRP    Personal pronoun               WP$    Possessive WH pronoun
PRP$   Possessive pronoun


Phrasal categories

Code    Meaning
ADJP    Adjective phrase
ADVP    Adverb phrase
INTJ    Interjection
NP      Noun phrase
PP      Prepositional phrase
QP      Quantifier phrase (i.e., measure/amount phrase)
S       Declarative clause
SBAR    Clause introduced by a subordinating conjunction
SBARQ   Clause introduced by a WH phrase
SINV    Inverted declarative sentence
SQ      Inverted yes/no question following the WH phrase in SBARQ
VP      Verb phrase

Acknowledgments
The authors are grateful for an Academic Technology Innovation Grant from the University of Chicago, which helped to make this work possible, and to John Goldsmith, Terry Regier, Anne Pycha, and Bob Moore, whose advice and collaboration have considerably aided the research reported in this article. Any remaining errors are, of course, our own.

References
Aoun, Joseph and Yen-hui Audrey Li. 1993. The Syntax of Scope. MIT Press, Cambridge.
Aoun, Joseph and Yen-hui Audrey Li. 2000. Scope, structure, and expert systems: A reply to Kuno et al. Language, 76(1):133–155.
Belz, Anja. 2001. Optimisation of corpus-derived probabilistic grammars. In Proceedings of Corpus Linguistics 2001, pages 46–57.
Berger, Adam L., Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.
Bies, Ann, Mark Ferguson, Karen Katz, and Robert MacIntyre. 1995. Bracketing guidelines for Treebank II style. Technical report, Penn Treebank Project, University of Pennsylvania.
Bishop, Christopher M. 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bridle, John S. 1990. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fougelman-Soulie and J. Herault, editors, Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag, Berlin, pages 227–236.
Brill, Eric. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565.
Carden, Guy. 1976. English Quantifiers: Logical Structure and Linguistic Variation. Academic Press, New York.
Carletta, Jean. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249–254.
Charniak, Eugene. 1993. Statistical Language Learning. MIT Press, Cambridge.
Charniak, Eugene. 1996. Tree-bank grammars. In AAAI/IAAI, vol. 2, pages 1031–1036.
Cohen, Jacob. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46.
Cooper, Robin. 1983. Quantification and Syntactic Theory. Reidel, Dordrecht.
Fleiss, Joseph L. 1981. Statistical Methods for Rates and Proportions. John Wiley & Sons, New York.
Hobbs, Jerry R. and Stuart M. Shieber. 1987. An algorithm for generating quantifier scopings. Computational Linguistics, 13:47–63.
Hornstein, Norbert. 1995. Logical Form: From GB to Minimalism. Blackwell, Oxford and Cambridge.
Jelinek, Frederick. 1998. Statistical Methods for Speech Recognition. MIT Press, Cambridge.
Jurafsky, Daniel and James H. Martin. 2000. Speech and Language Processing. Prentice Hall, Upper Saddle River, New Jersey.
Krippendorff, Klaus. 1980. Content Analysis: An Introduction to Its Methodology. Sage Publications, Beverly Hills, California.
Krotov, Alexander, Mark Hepple, Robert J. Gaizauskas, and Yorick Wilks. 1998. Compacting the Penn treebank grammar. In COLING-ACL, pages 699–703.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 1999. Quantifier scope in English, Chinese, and Japanese. Language, 75(1):63–111.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 2001. Response to Aoun and Li. Language, 77(1):134–143.
Manning, Christopher D. and Hinrich Schutze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge.
Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
Martin, Paul, Douglas Appelt, and Fernando Pereira. 1986. Transportability and generality in a natural-language interface system. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 585–593.
May, Robert. 1985. Logical Form: Its Structure and Derivation. MIT Press, Cambridge.
McCawley, James D. 1998. The Syntactic Phenomena of English. University of Chicago Press, Chicago, second edition.
Mitchell, Tom M. 1996. Machine Learning. McGraw Hill, New York.
Montague, Richard. 1973. The proper treatment of quantification in ordinary English. In J. Hintikka et al., editors, Approaches to Natural Language. Reidel, Dordrecht, pages 221–242.
Moran, Douglas B. 1988. Quantifier scoping in the SRI core language engine. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics (ACL'88), pages 33–40.
Moran, Douglas B. and Fernando C. N. Pereira. 1992. Quantifier scoping. In Hiyan Alshawi, editor, The Core Language Engine. MIT Press, Cambridge, pages 149–172.
Nerbonne, John. 1993. A feature-based syntax/semantics interface. In A. Manaster-Ramer and W. Zadrozsny, editors, Annals of Mathematics and Artificial Intelligence (Special Issue on Mathematics of Language), 8(1–2):107–132. Also published as DFKI Research Report RR-92-42.
Park, Jong C. 1995. Quantifier scope and constituency. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL'95), pages 205–212.
Pedersen, Ted. 2000. A simple approach to building ensembles of naive Bayesian classifiers for word sense disambiguation. In Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2000), pages 63–69.
Pereira, Fernando. 1990. Categorial semantics and scoping. Computational Linguistics, 16(1):1–10.
Pollard, Carl. 1989. The syntax-semantics interface in a unification-based phrase structure grammar. In S. Busemann, C. Hauenschild, and C. Umbach, editors, Views of the Syntax/Semantics Interface. KIT-FAST Report 74, Technical University of Berlin, pages 167–185.
Pollard, Carl and Eun Jung Yoo. 1998. A unified theory of scope for quantifiers and WH-phrases. Journal of Linguistics, 34(2):415–446.
Ratnaparkhi, Adwait. 1997. A simple introduction to maximum entropy models for natural language processing. Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania.
Santorini, Beatrice. 1990. Part-of-speech tagging guidelines for the Penn Treebank project. Technical Report MS-CIS-90-47, Department of Computer and Information Science, University of Pennsylvania.
Soon, Wee Meng, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544.
van Halteren, Hans, Jakub Zavrel, and Walter Daelemans. 2001. Improving accuracy in word class tagging through the combination of machine learning systems. Computational Linguistics, 27(2):199–229.
VanLehn, Kurt A. 1978. Determining the scope of English quantifiers. Technical Report AITR-483, Massachusetts Institute of Technology Artificial Intelligence Laboratory, Cambridge.
Woods, William A. 1986. Semantics and quantification in natural language question answering. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 205–248.

Page 4: A Machine Learning Approach to Modeling Scope Preferences · 2003-11-13 · ®c2003 Association for Computational Linguistics A Machine Learning Approach to Modeling Scope Preferences

76

Computational Linguistics Volume 29 Number 1

Figure 2Simple illustration of the Cooper storage approach to quantier scope generation

Table 1Voting to determine optimal scope readings for quantiers according to Kuno Takami andWu (1999)

many of us you some of them

Baselinep p

Subject Qp

Lefthand Qp

SpeakerHearer Qp

Total 4 1

Some adherents of the structural approaches also seem to acknowledge the ne-cessity of eventually coming to terms with the factors that play a role in determiningscope preferences in language Aoun and Li (2000) claim that the lexical scope pref-erences of quantiers ldquoare not ruled out under a structural accountrdquo (page 140) It isclear from the surrounding discussion though that they intend such lexical require-ments to be taken care of in some nonsyntactic component of grammar AlthoughKuno Takami and Wursquos dialogue with Aoun and Li in Language has been portrayedby both sides as a debate over the correct way of modeling quantier scope they arenot really modeling the same things Whereas Aoun and Li (1993) provide an accountof scope generation Kuno Takami and Wu (1999) intend to model both scope gen-eration and scope prediction The model of scope preferences provided in this articleis an empirically based renement of the approach taken by Kuno Takami and Wubut in principle it is consistent with a structural account of scope generation

77

Higgins and Sadock Modeling Scope Preferences

3 Approaches to Quantier Scope in Computational Linguistics

Many studies such as Pereira (1990) and Park (1995) have dealt with the issue ofscope generation from a computational perspective Attempts have also been madein computational work to extend a pure Cooper storage approach to handle scopeprediction Hobbs and Shieber (1987) discuss the possibility of incorporating some sortof ordering heuristics into the SRI scope generation system in the hopes of producinga ranked list of possible scope readings but ultimately are forced to acknowledge thatldquo[t]he modications turn out to be quite complicated if we wish to order quantiersaccording to lexical heuristics such as having each out-scope some Because of therecursive nature of the algorithm there are limits to the amount of ordering that canbe done in this mannerrdquo (page 55) The stepwise nature of these scope mechanismsmakes it hard to state the factors that inuence the preference for one quantier totake scope over another

Those natural language processing (NLP) systems that have managed to providesome sort of account of quantier scope preferences have done so by using a separatesystem of heuristics (or scope critics) that apply postsyntactically to determine the mostlikely scoping LUNAR (Woods 1986) TEAM (Martin Appelt and Pereira 1986) andthe SRI Core Language Engine as described by Moran (1988 Moran and Pereira 1992)all employ scope rules of this sort By and large these rules are of an ad hoc natureimplementing a linguistrsquos intuitive idea of what factors determine scope possibilitiesand no results have been published regarding the accuracy of these methods Forexample Moran (1988) incorporates rules from other NLP systems and from VanLehn(1978) such as a preference for a logically weaker interpretation the tendency for eachto take wide scope and a ban on raising a quantier across multiple major clauseboundaries The testing of Moranrsquos system is ldquolimited to checking conformance tothe stated rulesrdquo (pages 40ndash41) In addition these systems are generally incapable ofhandling unrestricted text such as that found in the Wall Street Journal corpus in arobust way because they need to do a full semantic analysis of a sentence in orderto make scope predictions The statistical basis of the model presented in this articleoffers increased robustness and the possibility of more serious evaluation on the basisof corpus data

4 Modeling Quantier Scope

In this section we argue for an empirically driven machine learning approach tothe identication of factors relevant to quantier scope and the modeling of scopepreferences Following much recent work that applies the tools of machine learning tolinguistic problems (Brill 1995 Pedersen 2000 van Halteren Zavrel and Daelemans2001 Soon Ng and Lim 2001) we will treat the prediction of quantier scope asan example of a classication task Our aim is to provide a robust model of scopeprediction based on Kuno Takami and Wursquos theoretical foundation and to addressthe serious lack of empirical results regarding quantier scope in computational workWe describe here the modeling tools borrowed from the eld of articial intelligencefor the scope prediction task and the data from which the generalizations are to belearned Finally we present the results of training different incarnations of our scopemodule on the data and assess the implications of this exercise for theoretical andcomputational linguistics

78

Computational Linguistics Volume 29 Number 1

41 Classication in Machine LearningDetermining which among multiple quantiers in a sentence takes wide scope givena number of different sources of evidence is an example of what is known in machinelearning as a classication task (Mitchell 1996) There are many types of classiersthat may be applied to this task that both are more sophisticated than the approachsuggested by Kuno Takami and Wu and have a more solid probabilistic foundationThese include the naive Bayes classier (Manning and Schutze 1999 Jurafsky andMartin 2000) maximum-entropy models (Berger Della Pietra and Della Pietra 1996Ratnaparkhi 1997) and the single-layer perceptron (Bishop 1995) We employ theseclassier models here primarily because of their straightforward probabilistic inter-pretation and their similarity to the scope model of Kuno Takami and Wu (since theyeach could be said to implement a kind of weighted voting of factors) In Section 43we describe how classiers of these types can be constructed to serve as a grammaticalmodule responsible for quantier scope determination

All of these classiers can be trained in a supervised manner That is given a sam-ple of training data that provides all of the information that is deemed to be relevantto quantier scope and the actual scope reading assigned to a sentence these classi-ers will attempt to extract generalizations that can be fruitfully applied in classifyingas-yet-unseen examples

42 DataThe data on which the quantier scope classiers are trained and tested is an extractfrom the Penn Treebank (Marcus Santorini and Marcinkiewicz 1993) that we havetagged to indicate the most salient scope interpretation of each sentence in contextFigure 3 shows an example of a training sentence with the scope reading indicatedThe quantier lower in the tree bears the tag ldquoQ1rdquo and the higher quantier bears thetag ldquoQ2rdquo so this sentence is interpreted such that the lower quantier has wide scopeReversing the tags would have meant that the higher quantier takes wide scope andwhile if both quantiers had been marked ldquoQ1rdquo this would have indicated that thereis no scope interaction between them (as when they are logically independent or takescope in different conjuncts of a conjoined phrase)2

The sentences tagged were chosen from the Wall Street Journal (WSJ) section ofthe Penn Treebank to have a certain set of attributes that simplify the task of design-ing the quantier scope module of the grammar First in order to simplify the codingprocess each sentence has exactly two scope-taking elements of the sort consideredfor this project3 These include most NPs that begin with a determiner predetermineror quantier phrase (QP)4 but exclude NPs in which the determiner is a an or the Ex-

2 This ldquono interactionrdquo class is a sort of ldquoelsewhererdquo category that results from phrasing the classicationquestion as ldquoWhich quantier takes wider scope in the preferred readingrdquo Where there is no scopeinteraction the answer is ldquoneitherrdquo This includes cases in which the relative scope of operators doesnot correspond to a difference in meaning as in One woman bought one horse or when they take scopein different propositional domains such as in Mary bought two horses and sold three sheep The humancoders used in this study were instructed to choose class 0 whenever there was not a clear preferencefor one of the two scope readings

3 This restriction that each sentence contain only two quantied elements does not actually excludemany sentences from consideration We identied only 61 sentences with three quantiers of the sortwe consider and 12 sentences with four In addition our review of these sentences revealed that manyof them simply involve lists in which the quantiers do not interact in terms of scope (as in forexample ldquoWe ask that you turn off all cell phones extinguish all cigarettes and open any candy beforethe performance beginsrdquo) Thus the class of sentences with more than two quantiers is small andseems to involve even simpler quantier interactions than those found in our corpus

4 These categories are intended to be understood as they are used in the tagging and parsing of the PennTreebank See Santorini (1990) and Bies et al (1995) for details the Appendix lists selected codes used

79

Higgins and Sadock Modeling Scope Preferences

( (S(NP-SBJ

(NP (DT Those) )(SBAR

(WHNP-1 (WP who) )(S

(NP-SBJ-2 (-NONE- T-1) )(ADVP (RB still) )(VP (VBP want)

(S(NP-SBJ (-NONE- -2) )(VP (TO to)

(VP (VB do)(NP (PRP it) ))))))))

(lsquo lsquo lsquo lsquo )(VP (MD will)

(ADVP (RB just) )(VP (VB find)

(NP(NP (DT-Q2 some) (NN way) )(SBAR

(WHADVP-3 (-NONE- 0) )(S

(NP-SBJ (-NONE- ) )(VP (TO to)

(VP (VB get)(PP (IN around) (rsquorsquo rsquorsquo)

(NP (DT-Q1 any) (NN attempt)(S

(NP-SBJ (-NONE- ) )(VP (TO to)

(VP (VB curb)(NP (PRP it) ))))))

(ADVP-MNR (-NONE- T-3) ))))))))( ) ))

Figure 3Tagged Wall Street Journal text from the Penn Treebank The lower quantier takes widescope indicated by its tag ldquoQ1rdquo

cluding these determiners from consideration largely avoids the problem of genericsand the complexities of assigning scope readings to denite descriptions In addi-tion only sentences that had the root node S were considered This serves to excludesentence fragments and interrogative sentence types Our data set therefore differssystematically from the full WSJ corpus but we believe it is sufcient to allow manygeneralizations about English quantication to be induced Given these restrictions onthe input data the task of the scope classier is a choice among three alternatives5

(Class 0) There is no scopal interaction

(Class 1) The rst quantier takes wide scope

(Class 2) The second quantier takes wide scope

for annotating the Penn Treebank corpus The category QP is particularly unintuitive in that it does notcorrespond to a quantied noun phrase but to a measure expression such as more than half

5 Some linguists may nd it strange that we have chosen to treat the choice of preferred scoping for twoquantied elements as a tripartite decision since the possibility of independence is seldom treated inthe linguistic literature As we are dealing with corpus data in this experiment we cannot afford toignore this possibility

80

Computational Linguistics Volume 29 Number 1

The result is a set of 893 sentences6 annotated with Penn Treebank II parse trees andhand-tagged for the primary scope reading

To assess the reliability of the hand-tagged data used in this project the data werecoded a second time by an independent coder in addition to the reference codingThe independent codings agreed with the reference coding on 763 of sentences Thekappa statistic (Cohen 1960) for agreement was 52 with a 95 condence intervalbetween 40 and 64 Krippendorff (1980) has been widely cited as advocating theview that kappa values greater than 8 should be taken as indicating good reliabilitywith values between 67 and 8 indicating tentative reliability but we are satisedwith the level of intercoder agreement on this task As Carletta (1996) notes manytasks in computational linguistics are simply more difcult than the content analysisclassications addressed by Krippendorff and according to Fleiss (1981) kappa valuesbetween 4 and 75 indicate fair to good agreement anyhow

Discussion between the coders revealed that there was no single cause for their dif-ferences in judgments when such differences existed Many cases of disagreement stemfrom different assumptions regarding the lexical quantiers involved For example thecoders sometimes differed on whether a given instance of the word any correspondsto a narrow-scope existential as we conventionally treat it when it is in the scope ofnegation or the ldquofree-choicerdquo version of any To take another example two universalquantiers are independent in predicate calculus (8x8y[ iquest ] () 8y8x[ iquest ]) but in creat-ing our scope-tagged corpus it was often difcult to decide whether two universal-likeEnglish quantiers (such as each any every and all) were actually independent in agiven sentence Some differences in coding stemmed from coder disagreements aboutwhether a quantier within a xed expression (eg all the hoopla) truly interacts withother operators in the sentence Of course another major factor contributing to inter-coder variation is the fact that our data sentences taken from Wall Street Journal textare sometimes quite long and complex in structure involving multiple scope-takingoperators in addition to the quantied NPs In such cases the coders sometimes haddifculty clearly distinguishing the readings in question

Because of the relatively small amount of data we had we used the technique oftenfold cross-validation in evaluating our classiers in each case choosing 89 of the893 total data sentences from the data as a test set and training on the remaining 804We preprocessed the data in order to extract the information from each sentence thatwe would be treating as relevant to the prediction of quantier scoping in this project(Although the initial coding of the preferred scope reading for each sentence was donemanually this preprocessing of the data was done automatically) At the end of thispreprocessing each sentence was represented as a record containing the followinginformation (see the Appendix for a list of annotation codes for Penn Treebank)

deg the syntactic category according to Penn Treebank conventions of therst quantier (eg DT for each NN for everyone or QP for more than half )

deg the rst quantier as a lexical item (eg each or everyone) For a QPconsisting of multiple words this eld contains the head word or ldquoCDrdquoin case the head is a cardinal number

deg the syntactic category of the second quantier

deg the second quantier as a lexical item

6 These data have been made publicly available to all licensees of the Penn Treebank by means of apatch le that may be retrieved from hhttphumanitiesuchicagoedulinguisticsstudentsdchigginqscope-datatgzi This le also includes the coding guidelines used for this project

81

Higgins and Sadock Modeling Scope Preferences

2

66666666666666666666666666664

class 2

rst cat DTrst head somesecond cat DTsecond head anyjoin cat NPrst c-commands YESsecond c-commands NOnodes intervening 6

VP intervenes YESADVP intervenes NO

S intervenes YES

conj intervenes NO

intervenes NO intervenes NO

rdquo intervenes YES

3

77777777777777777777777777775

Figure 4Example record corresponding to the sentence shown in Figure 3

deg the syntactic category of the lowest node dominating both quantiedNPs (the ldquojoinrdquo node)

deg whether the rst quantied NP c-commands the second

deg whether the second quantied NP c-commands the rst

deg the number of nodes intervening7 between the two quantied NPs

deg a list of the different categories of nodes that intervene between thequantied NPs (actually for each nonterminal category there is a distinctbinary feature indicating whether a node of that category intervenes)

deg whether a conjoined node intervenes between the quantied NPs

deg a list of the punctuation types that are immediately dominated by nodesintervening between the two NPs (again for each punctuation tag in thetreebank there is a distinct binary feature indicating whether suchpunctuation intervenes)

Figure 4 illustrates how these features would be used to encode the example in Fig-ure 3

The items of information included in the record as listed above are not the exactfactors that Kuno Takami and Wu (1999) suggest be taken into consideration in mak-ing scope predictions and they are certainly not sufcient to determine the properscope reading for all sentences completely Surely pragmatic factors and real-worldknowledge inuence our interpretations as well although these are not representedhere This list does however provide information that could potentially be useful inpredicting the best scope reading for a particular sentence For example information

7 We take a node to intervene between two other nodes macr and deg in a tree if and only if plusmn is the lowestnode dominating both macr and deg plusmn dominates or plusmn = and dominates either macr or deg

82

Computational Linguistics Volume 29 Number 1

Table 2Baseline performance summed over all ten test sets

Condition Correct Incorrect Percentage correct

First has wide scope 0 64 064 = 00Second has wide scope 0 281 0281 = 00

No scope interaction 545 0 545545 = 1000Total 545 345 545890 = 612

about whether one quantied NP in a given sentence c-commands the other corre-sponds to Kuno Takami and Wursquos observation that subject quantiers tend to takewide scope over object quantiers and topicalized quantiers tend to outscope ev-erything The identity of each lexical quantier clearly should allow our classiers tomake the generalization that each tends to take wide scope if this word is found inthe data and perhaps even learn the regularity underlying Kuno Takami and Wursquosobservation that universal quantiers tend to outscope existentials

43 Classier DesignIn this section we present the three types of model that we have trained to predictthe preferred quantier scoping on Penn Treebank sentences a naive Bayes classiera maximum-entropy classier and a single-layer perceptron8 In evaluating how wellthese models do in assigning the proper scope reading to each test sentence it is im-portant to have a baseline for comparison The baseline model for this task is one thatsimply guesses the most frequent category of the data (ldquono scope interactionrdquo) everytime This simplistic strategy already classies 612 of the test examples correctly asshown in Table 2

It may surprise some linguists that this third class of sentences in which there isno scopal interaction between the two quantiers is the largest In part this may bedue to special features of the Wall Street Journal text of which the corpus consists Forexample newspaper articles may contain more direct quotations than other genres Inthe process of tagging the data however it was also apparent that in a large proportionof cases the two quantiers were taking scope in different conjuncts of a conjoinedphrase This further tendency supports the idea that people may intentionally avoidconstructions in which there is even the possibility of quantier scope interactionsperhaps because of some hearer-oriented pragmatic principle Linguists may also beconcerned that this additional category in which there is no scope interaction betweenquantiers makes it difcult to compare the results of the present work with theoreticalaccounts of quantier scope that ignore this case and concentrate on instances in whichone quantier does take scope over another In response to such concerns howeverwe point out rst that we provide a model of scope prediction rather than scopegeneration and so it is in any case not directly comparable with work in theoreticallinguistics which has largely ignored scope preferences Second we point out thatthe empirical nature of this study requires that we take note of cases in which thequantiers simply do not interact

8 The implementations of these classiers are publicly available as Perl modules at hhttphumanitiesuchicagoedulinguisticsstudentsdchigginclassierstgzi

83

Higgins and Sadock Modeling Scope Preferences

Table 3Performance of the naive Bayes classier summed over all 10 test runs

Condition Correct Incorrect Percentage correct

First has wide scope 177 104 177281 = 630Second has wide scope 41 23 4164 = 641

No scope interaction 428 117 428545 = 785Total 646 244 646890 = 726

431 Naive Bayes Classier Our data D will consist of a vector of features (d0 dn)that represent aspects of the sentence under examination such as whether one quan-tied expression c-commands the other as described in Section 42 The fundamentalsimplifying assumption that we make in designing a naive Bayes classier is thatthese features are independent of one another and therefore can be aggregated as in-dependent sources of evidence about which class c curren a given sentence belongs to Thisindependence assumption is formalized in equations (1) and (2)

c curren = arg maxc

P(c)P(d0 dn j c) (1)

ordm arg maxc

P(c)nY

k = 0

P(dk j c) (2)

We constructed an empirical estimate of the prior probability P(c) by simply count-ing the frequency with which each class occurs in the training data We constructedeach P(dk j c) by counting how often each feature dk co-occurs with the class c toconstruct the empirical estimate P(dk j c) and interpolated this with the empiricalfrequency P(dk) of the feature dk not conditioned on the class c This interpolatedprobability model was used in order to smooth the probability distribution avoidingthe problems that can arise if certain feature-value pairs are assigned a probability ofzero

The performance of the naive Bayes classier is summarized in Table 3 For eachof the 10 test sets of 89 items taken from the corpus the remaining 804 of the total893 sentences were used to train the model The naive Bayes classier outperformedthe baseline by a considerable margin

In addition to the raw counts of test examples correctly classied though wewould like to know something of the internal structure of the model (ie what sort offeatures it has induced from the data) For this classier we can assume that a featuref is a good predictor of a class c curren when the value of P(f j c curren ) is signicantly largerthan the (geometric) mean value of P(f j c) for all other values of c Those featureswith the greatest ratio P(f ) pound P(f jccurren)

geommean( 8 c 6= ccurren [P(f jc)])are listed in Table 49

The rst-ranked feature in Table 4 shows that there is a tendency for quanti-ed elements not to interact when they are found in conjoined constituents and thesecond-ranked feature indicates a preference for quantiers not to interact when thereis an intervening comma (presumably an indicator of greater syntactic ldquodistancerdquo)Feature 3 indicates a preference for class 1 when there is an intervening S node

9 We include the term P(f ) in the product in order to prevent sparsely instantiated features fromshowing up as highly-ranked

84

Computational Linguistics Volume 29 Number 1

Table 4Most active features from naive Bayes classier

Rank Feature Predicted Ratioclass

1 There is an intervening conjunct node 0 1632 There is an intervening comma 0 1513 There is an intervening S node 1 1334 The rst quantied NP does not c-command the second 0 1255 Second quantier is tagged QP 1 1166 There is an intervening S node 0 112

15 The second quantied NP c-commands the rst 2 100

whereas feature 6 indicates a preference for class 0 under the same conditions Pre-sumably this reects a dispreference for the second quantier to take wide scopewhen there is a clause boundary intervening between it and the rst quantier Thefourth-ranked feature in Table 4 indicates that if the rst quantied NP does notc-command the second it is less likely to take wide scope This is not surprisinggiven the importance that c-command relations have had in theoretical discussionsof quantier scope The fth-ranked feature expresses a preference for quantied ex-pressions of category QP to take narrow scope if they are the second of the twoquantiers under consideration This may simply be reective of the fact that class1 is more common than class 2 and the measure expressions found in QP phrasesin the Penn Treebank (such as more than three or about half ) tend not to be logicallyindependent of other quantiers Finally the feature 15 in Table 4 indicates a highcorrelation between the second quantied expressionrsquos c-commanding the rst andthe second quantierrsquos taking wide scope We can easily see this as a translation intoour feature set of Kuno Takami and Wursquos claim that subjects tend to outscope ob-jects and obliques and topicalized elements tend to take wide scope Some of thesetop-ranked features have to do with information found only in the written mediumbut on the whole the features induced by the naive Bayes classier seem consis-tent with those suggested by Kuno Takami and Wu although they are distinct bynecessity

4.3.2 Maximum-Entropy Classifier. The maximum-entropy classifier is a sort of log-linear model, defining the joint probability of a class and a data vector (d_0, ..., d_n) as the product of the prior probability of the class c with a set of features related to the data:¹⁰

    P(d_0, ..., d_n, c) = (P(c) / Z) × ∏_{k=0}^{n} α_k                             (3)

This classifier superficially resembles in form the naive Bayes classifier in equation (2), but it differs from that classifier in that the way in which values for each α are chosen does not assume that the features in the data are independent. For each of the 10 training sets, we used the generalized iterative scaling algorithm to train this classifier on 654 training examples, using 150 examples for validation to choose the best set of values for the αs.¹¹

10 Z in equation (3) is simply a normalizing constant that ensures that we end up with a probability distribution.


Table 5
Performance of the maximum-entropy classifier, summed over all 10 test runs.

Condition               Correct  Incorrect  Percentage correct
First has wide scope    148      133        148/281 = 52.7%
Second has wide scope    31       33         31/64  = 48.4%
No scope interaction    475       70        475/545 = 87.2%
Total                   654      236        654/890 = 73.5%

Table 6
Most active features from maximum-entropy classifier.

Rank  Feature                                                 Predicted class  α · P(c)^2.5
 1    Second quantifier is each                               2                1.13
 2    There is an intervening comma                           0                1.01
 3    There is an intervening conjunct node                   0                1.00
 4    First quantified NP does not c-command the second       0                0.99
 5    Second quantifier is every                              2                0.98
 6    There is an intervening quotation mark (")              0                0.95
 7    There is an intervening colon                           0                0.95
 ...
12    First quantified NP c-commands the second               1                0.92
25    There is no intervening comma                           1                0.90

Test data could then be classified by choosing the class for the data that maximizes the joint probability in equation (3).
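
The following sketch indicates how a classifier of this form might be trained with generalized iterative scaling. It is only a rough illustration: the text does not spell out the feature parameterization, so the code assumes one weight per (feature, class) conjunction, which is one standard way to obtain the per-class feature weights reported in Table 6, and it follows footnote 11 in computing model expectations over the training vectors rather than the full event space.

```python
from collections import Counter, defaultdict

def train_gis(examples, classes, iterations=100):
    """Rough sketch of generalized iterative scaling for a model in the
    spirit of equation (3).  `examples` is a list of (feature_set, label)
    pairs, where feature_set is the set of binary features that are on for
    a sentence.  The (feature, class) parameterization is an assumption,
    not a detail given in the article."""
    n = len(examples)
    prior_counts = Counter(label for _, label in examples)
    prior = {c: prior_counts[c] / n for c in classes}

    # Empirical expectations of the (feature, class) indicators.
    emp = Counter()
    for feats, label in examples:
        for f in feats:
            emp[(f, label)] += 1.0 / n

    C = max(len(feats) for feats, _ in examples)   # GIS slack constant
    alpha = defaultdict(lambda: 1.0)               # unseen pairs keep weight 1 (a simplification)

    def class_probs(feats):
        score = {}
        for c in classes:
            s = prior[c]
            for f in feats:
                s *= alpha[(f, c)]
            score[c] = s
        z = sum(score.values())
        return {c: score[c] / z for c in score}

    for _ in range(iterations):
        # Model expectations, taking the training vectors as the event space.
        model = Counter()
        for feats, _ in examples:
            q = class_probs(feats)
            for f in feats:
                for c in classes:
                    model[(f, c)] += q[c] / n
        # Multiplicative GIS update toward the empirical expectations.
        for key, target in emp.items():
            if model[key] > 0:
                alpha[key] *= (target / model[key]) ** (1.0 / C)

    return prior, alpha, class_probs
```

A new sentence would then be classified by calling class_probs on its feature set and taking the class with the highest probability.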

The results of training with the maximum-entropy classifier are shown in Table 5. The classifier showed slightly higher performance than the naive Bayes classifier, with the lowest error rate on the class of sentences having no scope interaction.

To determine exactly which features of the data the maximum-entropy classifier sees as relevant to the classification problem, we can simply look at the α values (from equation (3)) for each feature. Those features with higher values for α are weighted more heavily in determining the proper scoping. Some of the features with the highest values for α are listed in Table 6. Because of the way the classifier is built, predictor features for class 2 need to have higher loadings to overcome the lower prior probability of the class. Therefore, we actually rank the features in Table 6 according to the quantity α · P(c)^k shown in the table's final column, where P(c) represents the empirical prior probability of a class c and k is simply a constant (2.5 in this case) chosen to try to get a mix of features for different classes at the top of the list.

The features ranked first and fifth in Table 6 express lexical preferences for certain quantifiers to take wide scope, even when they are the second of the two quantifiers according to linear order in the string of words. The tendency for each to take wide scope is stronger than for the other quantifier, which is in line with Kuno, Takami, and Wu's decision to list it as the only quantifier with a lexical preference for scoping. Feature 2 makes the "no scope interaction" class more likely if a comma intervenes, and

11 Overtraining is not a problem with the pure version of the generalized iterative scaling algorithm. For efficiency reasons, however, we chose to take the training corpus as representative of the event space, rather than enumerating the space exhaustively (see Jelinek [1998] for details). For this reason it was necessary to employ validation in training.


Table 7
Performance of the single-layer perceptron, summed over all 10 test runs.

Condition               Correct  Incorrect  Percentage correct
First has wide scope    182       99        182/281 = 64.8%
Second has wide scope    35       29         35/64  = 54.7%
No scope interaction    468       77        468/545 = 85.9%
Total                   685      205        685/890 = 77.0%

feature 25 makes a wide-scope reading for the first quantifier more likely if there is no intervening comma. The third-ranked feature expresses the tendency, mentioned above, for quantifiers in conjoined clauses not to interact. Features 4 and 12 indicate that if the first quantified expression c-commands the second, it is likely to take wide scope, and that if this is not the case, there is likely to be no scope interaction. Finally, the sixth- and seventh-ranked features in the table show that an intervening quotation mark or colon will make the classifier tend toward class 0, "no scope interaction," which is easy to understand: Quotations are often opaque to quantifier scope interactions. The top features found by the maximum-entropy classifier largely coincide with those found by the naive Bayes model, which indicates that these generalizations are robust and objectively present in the data.

4.3.3 Single-Layer Perceptron. For our neural network classifier, we employed a feed-forward single-layer perceptron, with the softmax function used to determine the activation of nodes at the output layer, because this is a one-of-n classification task (Bridle 1990). The data to be classified are presented as a vector of features at the input layer, and the output layer has three nodes representing the three possible classes for the data: "first has wide scope," "second has wide scope," and "no scope interaction." The output node with the highest activation is interpreted as the class of the datum presented at the input layer.

For each of the 10 test sets of 89 examples, we trained the connection weights of the network using error backpropagation on 654 training sentences, reserving 150 sentences for validation in order to choose the weights from the training epoch with the highest classification performance. In Table 7 we present the results of the single-layer neural network in classifying our test sentences. As the table shows, the single-layer perceptron has much better classification performance than the naive Bayes classifier and maximum-entropy model, possibly because the training of the network aims to minimize error in the activation of the classification output nodes, which is directly related to the classification task at hand, whereas the other models do not directly make use of the notion of "classification error." The perceptron also uses a sort of weighted voting and could be interpreted as an implementation of Kuno, Takami, and Wu's proposal for scope determination. This clearly illustrates that the tenability of their proposal hinges on the exact details of its implementation, since all of our classifier models are reasonable interpretations of their approach, but they have very different performance results on our scope determination task.
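
A minimal sketch of such a network is given below, assuming binary feature vectors and three output classes. The learning rate, epoch count, and the exact form of validation-based weight selection are illustrative choices, not details reported in the text; with no hidden layer, error backpropagation reduces to the delta rule used here.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def train_perceptron(X, y, X_val, y_val, n_classes=3, lr=0.1, epochs=50):
    """Single-layer softmax network: a weight matrix plus bias maps the
    feature vector to three output activations, trained by gradient descent
    on cross-entropy error.  The weights from the best-scoring epoch on the
    validation data are kept, mirroring the procedure described above."""
    n_feats = X.shape[1]
    W = np.zeros((n_classes, n_feats))
    b = np.zeros(n_classes)
    best, best_acc = (W.copy(), b.copy()), -1.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            p = softmax(W @ x + b)
            delta = p.copy()
            delta[target] -= 1.0          # gradient of cross-entropy w.r.t. the pre-softmax activations
            W -= lr * np.outer(delta, x)
            b -= lr * delta
        preds = [int(np.argmax(W @ x + b)) for x in X_val]
        acc = float(np.mean([p == t for p, t in zip(preds, y_val)]))
        if acc > best_acc:                # keep weights from the best epoch
            best_acc, best = acc, (W.copy(), b.copy())
    return best
```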

To determine exactly which features of the data the network sees as relevant to the classification problem, we can simply look at the connection weights for each feature-class pair. Higher connection weights indicate a greater correlation between input features and output classes. For one of the 10 networks we trained, some of the features with the highest connection weights are listed in Table 8. Since class 0 is


Table 8
Most active features from single-layer perceptron.

Rank  Feature                                                 Predicted class  Weight
 1    There is an intervening comma                           0                4.31
 2    Second quantifier is all                                0                3.77
 3    There is an intervening colon                           0                2.98
 4    There is an intervening conjunct node                   0                2.72
 ...
17    The first quantified NP c-commands the second           1                1.69
18    Second quantifier is tagged RBS                         2                1.69
19    There is an intervening S node                          1                1.61
20    Second quantifier is each                               2                1.50

simply more frequent in the training data than the other two classes, the weights for this class tend to be higher. Therefore, we also list some of the best predictor features for classes 1 and 2 in the table.

The first- and third-ranked features in Table 8 show that an intervening comma or colon will make the classifier tend toward class 0, "no scope interaction." This finding by the classifier is similar to the maximum-entropy classifier's finding an intervening quotation mark relevant, and can be taken as an indication that quantifiers in distant syntactic subdomains are unlikely to interact. Similarly, the fourth-ranked feature indicates that quantifiers in separate conjuncts are unlikely to interact. The second-ranked feature in the table expresses a tendency for there to be no scope interaction between two quantifiers if the second of them is headed by all. This may be related to the independence of universal quantifiers (∀x∀y[φ] ⟺ ∀y∀x[φ]). Feature 17 in Table 8 indicates a high correlation between the first quantified expression's c-commanding the second and the first quantifier's taking wide scope, which again supports Kuno, Takami, and Wu's claim that scope preferences are related to syntactic superiority relations. Feature 18 expresses a preference for a quantified expression headed by most to take wide scope, even if it is the second of the two quantifiers (since most is the only quantifier in the corpus that bears the tag RBS). Feature 19 indicates that the first quantifier is more likely to take wide scope if there is a clause boundary intervening between the two quantifiers, which supports the notion that the syntactic distance between the quantifiers is relevant to scope preferences. Finally, feature 20 expresses the well-known tendency for quantified expressions headed by each to take wide scope.

4.4 Summary of Results

Table 9 summarizes the performance of the quantifier scope models we have presented here. All of the classifiers have test set accuracy above the baseline, which a paired t-test reveals to be significant at the .001 level. The differences between the naive Bayes, maximum-entropy, and single-layer perceptron classifiers are not statistically significant.

The classifiers performed significantly better on those sentences annotated consistently by both human coders at the beginning of the study, reinforcing the view that this subset of the data is somehow simpler and more representative of the basic regularities in scope preferences. For example, the single-layer perceptron classified 82.9% of these sentences correctly. To further investigate the nature of the variation between the two coders, we constructed a version of our single-layer network that was trained


Table 9
Summary of classifier results.

                          Training data  Validation data  Test data
Baseline                  —              —                61.2%
Naive Bayes               76.7%          —                72.6%
Maximum entropy           78.3%          75.5%            73.5%
Single-layer perceptron   84.7%          76.8%            77.0%

on the data on which both coders agreed and tested on the remaining sentences. This classifier agreed with the reference coding (the coding of the first coder) 51.4% of the time, and with the additional independent coder 35.8% of the time. The first coder constructed the annotation guidelines for this project and may have been more successful in applying them consistently. Alternatively, it is possible that different individuals use different strategies in determining scope preferences, and the strategy of the second coder may simply have been less similar than the strategy of the first coder to that of the single-layer network.

These three classifiers directly implement a sort of weighted voting, the method of aggregating evidence proposed by Kuno, Takami, and Wu (although the classifiers' implementation is slightly more sophisticated than the unweighted voting that is actually used in Kuno, Takami, and Wu's paper). Of course, since we do not use exactly the set of features suggested by Kuno, Takami, and Wu, our model should not be seen as a straightforward implementation of the theory outlined in their 1999 paper. Nevertheless, the results in Table 9 suggest that Kuno, Takami, and Wu's suggested design can be used with some success in modeling scope preferences. Moreover, the project undertaken here provides an answer to some of the objections that Aoun and Li (2000) raise to Kuno, Takami, and Wu. Aoun and Li claim that Kuno, Takami, and Wu's choice of experts is seemingly arbitrary and that it is unclear how the voting weights of each expert are to be set, but the machine learning approach we employ in this article is capable of addressing both of these potential problems. Supervised training of our classifiers is a straightforward approach to setting the weights, and it also constitutes our approach to selecting features (or "experts," in Kuno, Takami, and Wu's terminology). In the training process, any feature that is irrelevant to scoping preferences should receive weights that make its effect negligible.

5. Syntax and Scope

In this section we show how the classifier models of quantifier scope determination introduced in Section 4 may be integrated with a PCFG model of syntax. We compare two different ways in which the two components may be combined, which may loosely be termed serial and parallel, and argue for the latter on the basis of empirical results.

5.1 Modular Design

Our use of a phrase structure syntactic component and a quantifier scope component to define a combined language model is simplified by the fact that our classifiers are probabilistic and define a conditional probability distribution over quantifier scopings. The probability distributions that our classifiers define for quantifier scope structures are conditional on syntactic phrase structure, because they are computed on the basis


of syntactically provided features, such as the number of nodes of a certain type that intervene between two quantifiers in a phrase structure tree.

Thus, the combined language model that we define in this article assigns probabilities according to the pairs of structures that may be assigned to a sentence by the Q-structure and phrase structure syntax modules. The probability of a word string w_{1...n} is therefore defined as in equation (4), where Q ranges over all possible Q-structures in the set Q and S ranges over all possible syntactic structures in the set S:

    P(w_{1...n}) = Σ_{S ∈ S, Q ∈ Q} P(S, Q | w_{1...n})                            (4)

                 = Σ_{S ∈ S, Q ∈ Q} P(S | w_{1...n}) P(Q | S, w_{1...n})           (5)

Equation (5) shows how we can use the definition of conditional probability to break our calculation of the language model probability into two parts. The first of these parts, P(S | w_{1...n}), which we may abbreviate as simply P(S), is the probability of a particular syntactic tree structure's being assigned to a particular word string. We model this probability using a probabilistic phrase structure grammar (cf. Charniak [1993, 1996]). The second distribution on the right side of equation (5) is the conditional probability of a particular quantifier scope structure's being assigned to a particular word string, given the syntactic structure of that string. This probability is written as P(Q | S, w_{1...n}), or simply P(Q | S), and represents the quantity we estimated above in constructing classifiers to predict the scopal representation of a sentence based on aspects of its syntactic structure.

Thus, given a PCFG model of syntactic structure and a probabilistically defined classifier of the sort introduced in Section 4, it is simple to determine the probability of any pairing of two particular structures from each domain for a given sentence: We simply multiply the values of P(S) and P(Q | S) to obtain the joint probability P(Q, S). In the current section, we examine two different models of combination for these components, one in which scope determination is applied to the optimal syntactic structure (the Viterbi parse) and one in which optimization is performed in the space of both modules to find the optimal pairing of syntactic and quantifier scope structures.

5.2 The Syntactic Module

Before turning to the application of our multimodular approach to the problem of scope determination in Section 5.3, we present here a short overview of the phrase structure syntactic component used in these projects. As noted above, we model syntax as a probabilistic phrase structure grammar (PCFG), and in particular we use a treebank grammar (Charniak 1996) trained on the Penn Treebank.

A PCFG defines the probability of a string of words as the sum of the probabilities of all admissible phrase structure parses (trees) for that string. The probability of a given tree is the product of the probability of all of the rule instances used in the construction of that tree, where rules take the form N → φ, with N a nonterminal symbol and φ a finite sequence of one or more terminals or nonterminals.

To take an example, Figure 5 illustrates a phrase structure tree for the sentence Susan might not believe you, which is admissible according to the grammar in Table 10. (All of the minimal subtrees in Figure 5 are instances of one of our rules.) The probability


Figure 5
A simple phrase structure tree.

Table 10
A simple probabilistic phrase structure grammar.

Rule              Probability
S → NP VP         0.7
S → VP            0.2
S → V NP VP       0.1
VP → V VP         0.3
VP → ADV VP       0.1
VP → V            0.1
VP → V NP         0.3
VP → V NP NP      0.2
NP → Susan        0.3
NP → you          0.4
NP → Yves         0.3
V → might         0.2
V → believe       0.3
V → show          0.3
V → stay          0.2
ADV → not         0.5
ADV → always      0.5


of this tree, which we can indicate as τ, can be calculated as in equation (6), where the product ranges over all of the rule instances ξ used in constructing τ:

    P(τ) = ∏_{ξ ∈ τ} P(ξ)                                                          (6)

         = P(S → NP VP) × P(VP → V VP) × P(VP → ADV VP)
           × P(VP → V NP) × P(NP → Susan) × P(V → might)
           × P(ADV → not) × P(V → believe) × P(NP → you)                           (7)

         = 0.7 × 0.3 × 0.1 × 0.3 × 0.3 × 0.2 × 0.5 × 0.3 × 0.4 = 2.268 × 10⁻⁵      (8)
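
The calculation in equations (6)-(8) can be reproduced in a few lines of code. The sketch below encodes the tree of Figure 5 and the relevant rules of Table 10 in a nested-tuple format of our own choosing and multiplies the probabilities of the rule instances.

```python
def tree_prob(tree, rule_probs):
    """Probability of a phrase structure tree under a PCFG, as in equation (6):
    the product of the probabilities of all rule instances in the tree.
    A tree is written as (label, [children]); a leaf is a bare string."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_probs[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_prob(child, rule_probs)
    return p

# The rules of Table 10 that are actually used in the example tree.
rule_probs = {
    ("S", ("NP", "VP")): 0.7, ("VP", ("V", "VP")): 0.3,
    ("VP", ("ADV", "VP")): 0.1, ("VP", ("V", "NP")): 0.3,
    ("NP", ("Susan",)): 0.3, ("NP", ("you",)): 0.4,
    ("V", ("might",)): 0.2, ("V", ("believe",)): 0.3,
    ("ADV", ("not",)): 0.5,
}

# The tree of Figure 5 for "Susan might not believe you".
tree = ("S", [("NP", ["Susan"]),
              ("VP", [("V", ["might"]),
                      ("VP", [("ADV", ["not"]),
                              ("VP", [("V", ["believe"]),
                                      ("NP", ["you"])])])])])

print(tree_prob(tree, rule_probs))   # ≈ 2.268e-05, matching equation (8)
```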

The actual grammar rules and associated probabilities that we use in defining our syntactic module are derived from the WSJ corpus of the Penn Treebank by maximum-likelihood estimation. That is, for each rule N → φ used in the treebank, we add the rule to the grammar and set its probability to

    C(N → φ) / Σ_ψ C(N → ψ),

where C(·) denotes the "count" of a rule (i.e., the number of times it is used in the corpus). A grammar composed in this manner is referred to as a treebank grammar, because its rules are directly derived from those in a treebank corpus.

We used sections 00–20 of the WSJ corpus of the Penn Treebank for collecting the rules and associated probabilities of our PCFG, which is implemented as a bottom-up chart parser. Before constructing the grammar, the treebank was preprocessed using known procedures (cf. Krotov et al. [1998]; Belz [2001]) to facilitate the construction of a rule list. Functional and anaphoric annotations (basically anything following a "-" in a node label; cf. Santorini [1990]; Bies et al. [1995]) were removed from nonterminal labels. Nodes that dominate only "empty categories," such as traces, were removed. In addition, unary-branching constructions were removed by replacing the mother category in such a structure with the daughter node. (For example, given an instance of the rule X → Y Z, if the daughter category Y were expanded by the unary rule Y → W, our algorithm would induce the single rule X → W Z.) Finally, we discarded all rules that had more than 10 symbols on the right-hand side (an arbitrary limit of our parser implementation). This resulted in a total of 18,820 rules, of which 11,156 were discarded as hapax legomena, leaving 7,664 rules in our treebank grammar. Table 11 shows some of the rules in our grammar with the highest and lowest corpus counts.
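
A rough sketch of this maximum-likelihood estimation step is shown below. It reuses the (label, [children]) tree encoding from the earlier example and assumes that the preprocessing just described (stripping functional tags, removing empty categories, collapsing unary branches) has already been applied; the cutoff arguments echo the length limit and hapax filtering reported above, though the exact order of filtering and renormalization is our own choice.

```python
from collections import Counter, defaultdict

def treebank_grammar(trees, max_rhs=10, min_count=2):
    """Collect every local tree as a rule N -> phi, then set P(N -> phi) to
    its count divided by the total count of rules expanding N."""
    counts = Counter()

    def collect(node):
        label, children = node
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        counts[(label, rhs)] += 1
        for child in children:
            if not isinstance(child, str):
                collect(child)

    for tree in trees:
        counts[("TOP", (tree[0],))] += 1   # special start symbol, as in Table 11
        collect(tree)

    # Discard overlong rules and hapax legomena, then renormalize per left-hand side.
    kept = {r: c for r, c in counts.items()
            if len(r[1]) <= max_rhs and c >= min_count}
    lhs_totals = defaultdict(int)
    for (lhs, _), c in kept.items():
        lhs_totals[lhs] += c
    return {rule: c / lhs_totals[rule[0]] for rule, c in kept.items()}
```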

5.3 Unlabeled Scope Determination

In this section we describe an experiment designed to assess the performance of parallel and serial approaches to combining grammatical modules, focusing on the task of unlabeled scope determination. This task involves predicting the most likely Q-structure representation for a sentence, basically the same task we addressed in Section 4 in comparing the performance levels of each type of classifier. The experiment of this section differs, however, from the task presented in Section 4 in that instead of providing a syntactic tree from the Penn Treebank as input to the classifier, we provide the model only with a string of words (a sentence). Our dual-component model will search for the optimal syntactic and scopal structures for the sentence (the pairing (τ*, χ*)) and will be evaluated based on its success in identifying the correct scope reading χ*.

Our concern in this section will be to determine whether it is necessary to search the space of possible pairings (τ, χ) of syntactic and scopal structures, or whether it is sufficient to use our PCFG first to fix the syntactic tree τ and then to choose


Table 11
Rules derived from sections 00–20 of the Penn Treebank WSJ corpus. "TOP" is a special "start" symbol that may expand to any of the symbols found at the root of a tree in the corpus.

Rule                        Corpus count
PP → IN NP                  59,053
TOP → S                     34,614
NP → DT NN                  28,074
NP → NP PP                  25,192
S → NP VP                   14,032
S → NP VP .                 12,901
VP → TO VP                  11,598
...
S → CC PP NNP NNP VP        2
NP → DT `` NN NN NN ''      2
NP → NP PP PP PP PP PP      2
INTJ → UH UH                2
NP → DT `` NN NNS           2
SBARQ → `` WP VP ''         2
S → PP NP VP ''             2

a scope reading to maximize the probability of the pairing. That is, are syntax and quantifier scope mutually dependent components of grammar, or can scope relations be "read off of" syntax? The serial model suggests that the optimal syntactic structure τ* should be chosen on the basis of the syntactic module only, as in equation (9), and the optimal quantifier scope structure χ* then chosen on the basis of τ*, as in equation (10). The parallel model, on the other hand, suggests that the most likely pairing of structures must be chosen in the joint probability space of both components, as in equation (11).

    τ* = argmax_{τ ∈ S} P_S(τ | w_{1...n})                                         (9)

    χ* = argmax_{χ ∈ Q} P_Q(χ | τ*, w_{1...n})                                     (10)

    τ* = {τ | (τ, χ) = argmax_{τ ∈ S, χ ∈ Q} P_S(τ | w_{1...n}) P_Q(χ | τ, w_{1...n})}   (11)
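
In code, the difference between the two models amounts to where the argmax is taken. The sketch below assumes a hypothetical parser interface returning the n best trees with their probabilities and a scope classifier returning a distribution over the three readings; both interfaces are our own stand-ins for exposition, not the article's actual implementation.

```python
def parallel_scope(sentence, parser, scope_probs, n_best=10):
    """Parallel model of equation (11): search the top syntactic parses jointly
    with the three scope classes and return the highest-probability pairing.
    `parser(sentence, n)` is assumed to return a list of (tree, P_S) pairs and
    `scope_probs(tree, sentence)` a dict mapping classes 0, 1, 2 to P_Q."""
    best_pair, best_p = None, -1.0
    for tree, p_tree in parser(sentence, n_best):
        for reading, p_scope in scope_probs(tree, sentence).items():
            if p_tree * p_scope > best_p:
                best_p, best_pair = p_tree * p_scope, (tree, reading)
    return best_pair

def serial_scope(sentence, parser, scope_probs):
    """Serial alternative of equations (9)-(10): fix the Viterbi parse first,
    then choose the scope reading that is best given that single tree."""
    tree, _ = parser(sentence, 1)[0]
    probs = scope_probs(tree, sentence)
    return tree, max(probs, key=probs.get)
```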

5.3.1 Experimental Design. For this experiment, we implement the scoping component as a single-layer feed-forward network, because the single-layer perceptron classifier had the best prediction rate among the three classifiers tested in Section 4. The softmax activation function we use for the output nodes of the classifier guarantees that the activations of all of the output nodes sum to one and can be interpreted as class probabilities. The syntactic component, of course, is determined by the treebank PCFG grammar described above.

Given these two models, which respectively define P_Q(χ | τ, w_{1...n}) and P_S(τ | w_{1...n}) from equation (11), it remains only to specify how to search the space of pairings (τ, χ) in performing this optimization to find χ*. Unfortunately, it is not feasible to examine all values τ ∈ S, since our PCFG will generally admit a huge number of


Table 12
Performance of models on the unlabeled scope prediction task, summed over all 10 test runs.

Condition                 Correct  Incorrect  Percentage correct
Parallel model
  First has wide scope    168      113        168/281 = 59.8%
  Second has wide scope    26       38         26/64  = 40.6%
  No scope interaction    467       78        467/545 = 85.7%
  Total                   661      229        661/890 = 74.3%
Serial model
  First has wide scope    163      118        163/281 = 58.0%
  Second has wide scope    27       37         27/64  = 42.2%
  No scope interaction    461       84        461/545 = 84.6%
  Total                   651      239        651/890 = 73.1%

trees for a sentence (especially given a mean sentence length of over 20 words in the WSJ corpus).¹² Our solution to this search problem is to make the simplifying assumption that the syntactic tree that is used in the optimal set of structures (τ*, χ*) will always be among the top few trees τ for which P_S(τ | w_{1...n}) is the greatest. That is, although we suppose that quantifier scope information is relevant to parsing, we do not suppose that it is so strong a determinant as to completely override syntactic factors. In practice, this means that our parser will return the top 10 parses for each sentence, along with the probabilities assigned to them, and these are the only parses that are considered in looking for the optimal set of linguistic structures.

We again used 10-fold cross-validation in evaluating the competing models, dividing the scope-tagged corpus into 10 test sections of 89 sentences each, and we used the same version of the treebank grammar for our PCFG. The first model retrieved the top 10 syntactic parses (τ_0, ..., τ_9) for each sentence and computed the probability P(τ, χ) for each τ ∈ {τ_0, ..., τ_9}, χ ∈ {0, 1, 2}, choosing that scopal representation χ that was found in the maximum-probability pairing. We call this the parallel model, because the properties of each probabilistic model may influence the optimal structure chosen by the other. The second model retrieved only the Viterbi parse τ_0 from the PCFG and chose the scopal representation χ for which the pairing (τ_0, χ) took on the highest probability. We call this the serial model, because it represents syntactic phrase structure as independent of other components of grammar (in this case quantifier scope), though other components are dependent upon it.

5.3.2 Results. There was an appreciable difference in performance between these two models on the quantifier scope test sets. As shown in Table 12, the parallel model narrowly outperformed the serial model, by 1.2%. A 10-fold paired t-test on the test sections of the scope-tagged corpus shows that the parallel model is significantly better (p < .05).

12 Since we are allowing χ to range only over the three scope readings (0, 1, 2), however, it is possible to enumerate all values of χ to be paired with a given syntactic tree τ.


This result suggests that in determining the syntactic structure of a sentence, we must take aspects of structure into account that are not purely syntactic (such as quantifier scope). Searching both dimensions of the hypothesis space for our dual-component model allowed the composite model to handle the interdependencies between different aspects of grammatical structure, whereas fixing a phrase structure tree purely on the basis of syntactic considerations led to suboptimal performance in using that structure as a basis for determining quantifier scope.

6. Conclusion

In this article we have taken a statistical, corpus-based approach to the modeling of quantifier scope preferences, a subject that has previously been addressed only with systems of ad hoc rules derived from linguists' intuitive judgments. Our model takes its theoretical inspiration from Kuno, Takami, and Wu (1999), who suggest an "expert system" approach to scope preferences, and follows many other projects in the machine learning of natural language that combine information from multiple sources in solving linguistic problems.

Our results are generally supportive of the design that Kuno, Takami, and Wu propose for the quantifier scope component of grammar, and some of the features induced by our models find clear parallels in the factors that Kuno, Takami, and Wu claim to be relevant to scoping. In addition, our final experiment, in which we combine our quantifier scope module with a PCFG model of syntactic phrase structure, provides evidence of a grammatical architecture in which different aspects of structure mutually constrain one another. This result casts doubt on approaches in which syntactic processing is completed prior to the determination of other grammatical properties of a sentence, such as quantifier scope relations.

Appendix: Selected Codes Used to Annotate Syntactic Categories in the Penn Treebank, from Marcus et al. (1993) and Bies et al. (1995)

Part-of-speech tags

Tag    Meaning
CC     Conjunction
CD     Cardinal number
DT     Determiner
IN     Preposition
JJ     Adjective
JJR    Comparative adjective
JJS    Superlative adjective
NN     Singular or mass noun
NNS    Plural noun
NNP    Singular proper noun
NNPS   Plural proper noun
PDT    Predeterminer
PRP    Personal pronoun
PRP$   Possessive pronoun
RB     Adverb
RBR    Comparative adverb
RBS    Superlative adverb
TO     "to"
UH     Interjection
VB     Verb in base form
VBD    Past-tense verb
VBG    Gerundive verb
VBN    Past participial verb
VBP    Non-3sg present-tense verb
VBZ    3sg present-tense verb
WP     WH pronoun
WP$    Possessive WH pronoun


Phrasal categories

Code   Meaning
ADJP   Adjective phrase
ADVP   Adverb phrase
INTJ   Interjection
NP     Noun phrase
PP     Prepositional phrase
QP     Quantifier phrase (i.e., measure/amount phrase)
S      Declarative clause
SBAR   Clause introduced by a subordinating conjunction
SBARQ  Clause introduced by a WH phrase
SINV   Inverted declarative sentence
SQ     Inverted yes/no question following the WH phrase in SBARQ
VP     Verb phrase

Acknowledgments
The authors are grateful for an Academic Technology Innovation Grant from the University of Chicago, which helped to make this work possible, and to John Goldsmith, Terry Regier, Anne Pycha, and Bob Moore, whose advice and collaboration have considerably aided the research reported in this article. Any remaining errors are, of course, our own.

References
Aoun, Joseph and Yen-hui Audrey Li. 1993. The Syntax of Scope. MIT Press, Cambridge.
Aoun, Joseph and Yen-hui Audrey Li. 2000. Scope, structure, and expert systems: A reply to Kuno et al. Language, 76(1):133-155.
Belz, Anja. 2001. Optimisation of corpus-derived probabilistic grammars. In Proceedings of Corpus Linguistics 2001, pages 46-57.
Berger, Adam L., Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71.
Bies, Ann, Mark Ferguson, Karen Katz, and Robert MacIntyre. 1995. Bracketing guidelines for Treebank II style. Technical report, Penn Treebank Project, University of Pennsylvania.
Bishop, Christopher M. 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bridle, John S. 1990. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fougelman-Soulie and J. Herault, editors, Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag, Berlin, pages 227-236.
Brill, Eric. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543-565.
Carden, Guy. 1976. English Quantifiers: Logical Structure and Linguistic Variation. Academic Press, New York.
Carletta, Jean. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249-254.
Charniak, Eugene. 1993. Statistical Language Learning. MIT Press, Cambridge.
Charniak, Eugene. 1996. Tree-bank grammars. In AAAI/IAAI, vol. 2, pages 1031-1036.
Cohen, Jacob. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37-46.
Cooper, Robin. 1983. Quantification and Syntactic Theory. Reidel, Dordrecht.
Fleiss, Joseph L. 1981. Statistical Methods for Rates and Proportions. John Wiley & Sons, New York.
Hobbs, Jerry R. and Stuart M. Shieber. 1987. An algorithm for generating quantifier scopings. Computational Linguistics, 13:47-63.
Hornstein, Norbert. 1995. Logical Form: From GB to Minimalism. Blackwell, Oxford and Cambridge.
Jelinek, Frederick. 1998. Statistical Methods for Speech Recognition. MIT Press, Cambridge.
Jurafsky, Daniel and James H. Martin. 2000. Speech and Language Processing. Prentice Hall, Upper Saddle River, New Jersey.
Krippendorff, Klaus. 1980. Content Analysis: An Introduction to Its Methodology. Sage Publications, Beverly Hills, California.
Krotov, Alexander, Mark Hepple, Robert J. Gaizauskas, and Yorick Wilks. 1998. Compacting the Penn treebank grammar. In COLING-ACL, pages 699-703.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 1999. Quantifier scope in English, Chinese, and Japanese. Language, 75(1):63-111.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 2001. Response to Aoun and Li. Language, 77(1):134-143.
Manning, Christopher D. and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge.
Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330.
Martin, Paul, Douglas Appelt, and Fernando Pereira. 1986. Transportability and generality in a natural-language interface system. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 585-593.
May, Robert. 1985. Logical Form: Its Structure and Derivation. MIT Press, Cambridge.
McCawley, James D. 1998. The Syntactic Phenomena of English. University of Chicago Press, Chicago, second edition.
Mitchell, Tom M. 1996. Machine Learning. McGraw Hill, New York.
Montague, Richard. 1973. The proper treatment of quantification in ordinary English. In J. Hintikka et al., editors, Approaches to Natural Language. Reidel, Dordrecht, pages 221-242.
Moran, Douglas B. 1988. Quantifier scoping in the SRI core language engine. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics (ACL'88), pages 33-40.
Moran, Douglas B. and Fernando C. N. Pereira. 1992. Quantifier scoping. In Hiyan Alshawi, editor, The Core Language Engine. MIT Press, Cambridge, pages 149-172.
Nerbonne, John. 1993. A feature-based syntax/semantics interface. In A. Manaster-Ramer and W. Zadrozsny, editors, Annals of Mathematics and Artificial Intelligence (Special Issue on Mathematics of Language), 8(1-2):107-132. Also published as DFKI Research Report RR-92-42.
Park, Jong C. 1995. Quantifier scope and constituency. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL'95), pages 205-212.
Pedersen, Ted. 2000. A simple approach to building ensembles of naive Bayesian classifiers for word sense disambiguation. In Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2000), pages 63-69.
Pereira, Fernando. 1990. Categorial semantics and scoping. Computational Linguistics, 16(1):1-10.
Pollard, Carl. 1989. The syntax-semantics interface in a unification-based phrase structure grammar. In S. Busemann, C. Hauenschild, and C. Umbach, editors, Views of the Syntax/Semantics Interface. KIT-FAST Report 74, Technical University of Berlin, pages 167-185.
Pollard, Carl and Eun Jung Yoo. 1998. A unified theory of scope for quantifiers and WH-phrases. Journal of Linguistics, 34(2):415-446.
Ratnaparkhi, Adwait. 1997. A simple introduction to maximum entropy models for natural language processing. Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania.
Santorini, Beatrice. 1990. Part-of-speech tagging guidelines for the Penn Treebank project. Technical Report MS-CIS-90-47, Department of Computer and Information Science, University of Pennsylvania.
Soon, Wee Meng, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521-544.
van Halteren, Hans, Jakub Zavrel, and Walter Daelemans. 2001. Improving accuracy in word class tagging through the combination of machine learning systems. Computational Linguistics, 27(2):199-229.
VanLehn, Kurt A. 1978. Determining the scope of English quantifiers. Technical Report AITR-483, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Cambridge.
Woods, William A. 1986. Semantics and quantification in natural language question answering. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 205-248.

Page 5: A Machine Learning Approach to Modeling Scope Preferences · 2003-11-13 · ®c2003 Association for Computational Linguistics A Machine Learning Approach to Modeling Scope Preferences

77

Higgins and Sadock Modeling Scope Preferences

3 Approaches to Quantier Scope in Computational Linguistics

Many studies such as Pereira (1990) and Park (1995) have dealt with the issue ofscope generation from a computational perspective Attempts have also been madein computational work to extend a pure Cooper storage approach to handle scopeprediction Hobbs and Shieber (1987) discuss the possibility of incorporating some sortof ordering heuristics into the SRI scope generation system in the hopes of producinga ranked list of possible scope readings but ultimately are forced to acknowledge thatldquo[t]he modications turn out to be quite complicated if we wish to order quantiersaccording to lexical heuristics such as having each out-scope some Because of therecursive nature of the algorithm there are limits to the amount of ordering that canbe done in this mannerrdquo (page 55) The stepwise nature of these scope mechanismsmakes it hard to state the factors that inuence the preference for one quantier totake scope over another

Those natural language processing (NLP) systems that have managed to providesome sort of account of quantier scope preferences have done so by using a separatesystem of heuristics (or scope critics) that apply postsyntactically to determine the mostlikely scoping LUNAR (Woods 1986) TEAM (Martin Appelt and Pereira 1986) andthe SRI Core Language Engine as described by Moran (1988 Moran and Pereira 1992)all employ scope rules of this sort By and large these rules are of an ad hoc natureimplementing a linguistrsquos intuitive idea of what factors determine scope possibilitiesand no results have been published regarding the accuracy of these methods Forexample Moran (1988) incorporates rules from other NLP systems and from VanLehn(1978) such as a preference for a logically weaker interpretation the tendency for eachto take wide scope and a ban on raising a quantier across multiple major clauseboundaries The testing of Moranrsquos system is ldquolimited to checking conformance tothe stated rulesrdquo (pages 40ndash41) In addition these systems are generally incapable ofhandling unrestricted text such as that found in the Wall Street Journal corpus in arobust way because they need to do a full semantic analysis of a sentence in orderto make scope predictions The statistical basis of the model presented in this articleoffers increased robustness and the possibility of more serious evaluation on the basisof corpus data

4 Modeling Quantier Scope

In this section we argue for an empirically driven machine learning approach tothe identication of factors relevant to quantier scope and the modeling of scopepreferences Following much recent work that applies the tools of machine learning tolinguistic problems (Brill 1995 Pedersen 2000 van Halteren Zavrel and Daelemans2001 Soon Ng and Lim 2001) we will treat the prediction of quantier scope asan example of a classication task Our aim is to provide a robust model of scopeprediction based on Kuno Takami and Wursquos theoretical foundation and to addressthe serious lack of empirical results regarding quantier scope in computational workWe describe here the modeling tools borrowed from the eld of articial intelligencefor the scope prediction task and the data from which the generalizations are to belearned Finally we present the results of training different incarnations of our scopemodule on the data and assess the implications of this exercise for theoretical andcomputational linguistics

78

Computational Linguistics Volume 29 Number 1

41 Classication in Machine LearningDetermining which among multiple quantiers in a sentence takes wide scope givena number of different sources of evidence is an example of what is known in machinelearning as a classication task (Mitchell 1996) There are many types of classiersthat may be applied to this task that both are more sophisticated than the approachsuggested by Kuno Takami and Wu and have a more solid probabilistic foundationThese include the naive Bayes classier (Manning and Schutze 1999 Jurafsky andMartin 2000) maximum-entropy models (Berger Della Pietra and Della Pietra 1996Ratnaparkhi 1997) and the single-layer perceptron (Bishop 1995) We employ theseclassier models here primarily because of their straightforward probabilistic inter-pretation and their similarity to the scope model of Kuno Takami and Wu (since theyeach could be said to implement a kind of weighted voting of factors) In Section 43we describe how classiers of these types can be constructed to serve as a grammaticalmodule responsible for quantier scope determination

All of these classiers can be trained in a supervised manner That is given a sam-ple of training data that provides all of the information that is deemed to be relevantto quantier scope and the actual scope reading assigned to a sentence these classi-ers will attempt to extract generalizations that can be fruitfully applied in classifyingas-yet-unseen examples

42 DataThe data on which the quantier scope classiers are trained and tested is an extractfrom the Penn Treebank (Marcus Santorini and Marcinkiewicz 1993) that we havetagged to indicate the most salient scope interpretation of each sentence in contextFigure 3 shows an example of a training sentence with the scope reading indicatedThe quantier lower in the tree bears the tag ldquoQ1rdquo and the higher quantier bears thetag ldquoQ2rdquo so this sentence is interpreted such that the lower quantier has wide scopeReversing the tags would have meant that the higher quantier takes wide scope andwhile if both quantiers had been marked ldquoQ1rdquo this would have indicated that thereis no scope interaction between them (as when they are logically independent or takescope in different conjuncts of a conjoined phrase)2

The sentences tagged were chosen from the Wall Street Journal (WSJ) section ofthe Penn Treebank to have a certain set of attributes that simplify the task of design-ing the quantier scope module of the grammar First in order to simplify the codingprocess each sentence has exactly two scope-taking elements of the sort consideredfor this project3 These include most NPs that begin with a determiner predetermineror quantier phrase (QP)4 but exclude NPs in which the determiner is a an or the Ex-

2 This ldquono interactionrdquo class is a sort of ldquoelsewhererdquo category that results from phrasing the classicationquestion as ldquoWhich quantier takes wider scope in the preferred readingrdquo Where there is no scopeinteraction the answer is ldquoneitherrdquo This includes cases in which the relative scope of operators doesnot correspond to a difference in meaning as in One woman bought one horse or when they take scopein different propositional domains such as in Mary bought two horses and sold three sheep The humancoders used in this study were instructed to choose class 0 whenever there was not a clear preferencefor one of the two scope readings

3 This restriction that each sentence contain only two quantied elements does not actually excludemany sentences from consideration We identied only 61 sentences with three quantiers of the sortwe consider and 12 sentences with four In addition our review of these sentences revealed that manyof them simply involve lists in which the quantiers do not interact in terms of scope (as in forexample ldquoWe ask that you turn off all cell phones extinguish all cigarettes and open any candy beforethe performance beginsrdquo) Thus the class of sentences with more than two quantiers is small andseems to involve even simpler quantier interactions than those found in our corpus

4 These categories are intended to be understood as they are used in the tagging and parsing of the PennTreebank See Santorini (1990) and Bies et al (1995) for details the Appendix lists selected codes used

79

Higgins and Sadock Modeling Scope Preferences

( (S(NP-SBJ

(NP (DT Those) )(SBAR

(WHNP-1 (WP who) )(S

(NP-SBJ-2 (-NONE- T-1) )(ADVP (RB still) )(VP (VBP want)

(S(NP-SBJ (-NONE- -2) )(VP (TO to)

(VP (VB do)(NP (PRP it) ))))))))

(lsquo lsquo lsquo lsquo )(VP (MD will)

(ADVP (RB just) )(VP (VB find)

(NP(NP (DT-Q2 some) (NN way) )(SBAR

(WHADVP-3 (-NONE- 0) )(S

(NP-SBJ (-NONE- ) )(VP (TO to)

(VP (VB get)(PP (IN around) (rsquorsquo rsquorsquo)

(NP (DT-Q1 any) (NN attempt)(S

(NP-SBJ (-NONE- ) )(VP (TO to)

(VP (VB curb)(NP (PRP it) ))))))

(ADVP-MNR (-NONE- T-3) ))))))))( ) ))

Figure 3Tagged Wall Street Journal text from the Penn Treebank The lower quantier takes widescope indicated by its tag ldquoQ1rdquo

cluding these determiners from consideration largely avoids the problem of genericsand the complexities of assigning scope readings to denite descriptions In addi-tion only sentences that had the root node S were considered This serves to excludesentence fragments and interrogative sentence types Our data set therefore differssystematically from the full WSJ corpus but we believe it is sufcient to allow manygeneralizations about English quantication to be induced Given these restrictions onthe input data the task of the scope classier is a choice among three alternatives5

(Class 0) There is no scopal interaction

(Class 1) The rst quantier takes wide scope

(Class 2) The second quantier takes wide scope

for annotating the Penn Treebank corpus The category QP is particularly unintuitive in that it does notcorrespond to a quantied noun phrase but to a measure expression such as more than half

5 Some linguists may nd it strange that we have chosen to treat the choice of preferred scoping for twoquantied elements as a tripartite decision since the possibility of independence is seldom treated inthe linguistic literature As we are dealing with corpus data in this experiment we cannot afford toignore this possibility

80

Computational Linguistics Volume 29 Number 1

The result is a set of 893 sentences6 annotated with Penn Treebank II parse trees andhand-tagged for the primary scope reading

To assess the reliability of the hand-tagged data used in this project the data werecoded a second time by an independent coder in addition to the reference codingThe independent codings agreed with the reference coding on 763 of sentences Thekappa statistic (Cohen 1960) for agreement was 52 with a 95 condence intervalbetween 40 and 64 Krippendorff (1980) has been widely cited as advocating theview that kappa values greater than 8 should be taken as indicating good reliabilitywith values between 67 and 8 indicating tentative reliability but we are satisedwith the level of intercoder agreement on this task As Carletta (1996) notes manytasks in computational linguistics are simply more difcult than the content analysisclassications addressed by Krippendorff and according to Fleiss (1981) kappa valuesbetween 4 and 75 indicate fair to good agreement anyhow

Discussion between the coders revealed that there was no single cause for their dif-ferences in judgments when such differences existed Many cases of disagreement stemfrom different assumptions regarding the lexical quantiers involved For example thecoders sometimes differed on whether a given instance of the word any correspondsto a narrow-scope existential as we conventionally treat it when it is in the scope ofnegation or the ldquofree-choicerdquo version of any To take another example two universalquantiers are independent in predicate calculus (8x8y[ iquest ] () 8y8x[ iquest ]) but in creat-ing our scope-tagged corpus it was often difcult to decide whether two universal-likeEnglish quantiers (such as each any every and all) were actually independent in agiven sentence Some differences in coding stemmed from coder disagreements aboutwhether a quantier within a xed expression (eg all the hoopla) truly interacts withother operators in the sentence Of course another major factor contributing to inter-coder variation is the fact that our data sentences taken from Wall Street Journal textare sometimes quite long and complex in structure involving multiple scope-takingoperators in addition to the quantied NPs In such cases the coders sometimes haddifculty clearly distinguishing the readings in question

Because of the relatively small amount of data we had we used the technique oftenfold cross-validation in evaluating our classiers in each case choosing 89 of the893 total data sentences from the data as a test set and training on the remaining 804We preprocessed the data in order to extract the information from each sentence thatwe would be treating as relevant to the prediction of quantier scoping in this project(Although the initial coding of the preferred scope reading for each sentence was donemanually this preprocessing of the data was done automatically) At the end of thispreprocessing each sentence was represented as a record containing the followinginformation (see the Appendix for a list of annotation codes for Penn Treebank)

deg the syntactic category according to Penn Treebank conventions of therst quantier (eg DT for each NN for everyone or QP for more than half )

deg the rst quantier as a lexical item (eg each or everyone) For a QPconsisting of multiple words this eld contains the head word or ldquoCDrdquoin case the head is a cardinal number

deg the syntactic category of the second quantier

deg the second quantier as a lexical item

6 These data have been made publicly available to all licensees of the Penn Treebank by means of apatch le that may be retrieved from hhttphumanitiesuchicagoedulinguisticsstudentsdchigginqscope-datatgzi This le also includes the coding guidelines used for this project

81

Higgins and Sadock Modeling Scope Preferences

2

66666666666666666666666666664

class 2

rst cat DTrst head somesecond cat DTsecond head anyjoin cat NPrst c-commands YESsecond c-commands NOnodes intervening 6

VP intervenes YESADVP intervenes NO

S intervenes YES

conj intervenes NO

intervenes NO intervenes NO

rdquo intervenes YES

3

77777777777777777777777777775

Figure 4Example record corresponding to the sentence shown in Figure 3

deg the syntactic category of the lowest node dominating both quantiedNPs (the ldquojoinrdquo node)

deg whether the rst quantied NP c-commands the second

deg whether the second quantied NP c-commands the rst

deg the number of nodes intervening7 between the two quantied NPs

deg a list of the different categories of nodes that intervene between thequantied NPs (actually for each nonterminal category there is a distinctbinary feature indicating whether a node of that category intervenes)

deg whether a conjoined node intervenes between the quantied NPs

deg a list of the punctuation types that are immediately dominated by nodesintervening between the two NPs (again for each punctuation tag in thetreebank there is a distinct binary feature indicating whether suchpunctuation intervenes)

Figure 4 illustrates how these features would be used to encode the example in Fig-ure 3

The items of information included in the record as listed above are not the exactfactors that Kuno Takami and Wu (1999) suggest be taken into consideration in mak-ing scope predictions and they are certainly not sufcient to determine the properscope reading for all sentences completely Surely pragmatic factors and real-worldknowledge inuence our interpretations as well although these are not representedhere This list does however provide information that could potentially be useful inpredicting the best scope reading for a particular sentence For example information

7 We take a node to intervene between two other nodes macr and deg in a tree if and only if plusmn is the lowestnode dominating both macr and deg plusmn dominates or plusmn = and dominates either macr or deg

82

Computational Linguistics Volume 29 Number 1

Table 2Baseline performance summed over all ten test sets

Condition Correct Incorrect Percentage correct

First has wide scope 0 64 064 = 00Second has wide scope 0 281 0281 = 00

No scope interaction 545 0 545545 = 1000Total 545 345 545890 = 612

about whether one quantied NP in a given sentence c-commands the other corre-sponds to Kuno Takami and Wursquos observation that subject quantiers tend to takewide scope over object quantiers and topicalized quantiers tend to outscope ev-erything The identity of each lexical quantier clearly should allow our classiers tomake the generalization that each tends to take wide scope if this word is found inthe data and perhaps even learn the regularity underlying Kuno Takami and Wursquosobservation that universal quantiers tend to outscope existentials

43 Classier DesignIn this section we present the three types of model that we have trained to predictthe preferred quantier scoping on Penn Treebank sentences a naive Bayes classiera maximum-entropy classier and a single-layer perceptron8 In evaluating how wellthese models do in assigning the proper scope reading to each test sentence it is im-portant to have a baseline for comparison The baseline model for this task is one thatsimply guesses the most frequent category of the data (ldquono scope interactionrdquo) everytime This simplistic strategy already classies 612 of the test examples correctly asshown in Table 2

It may surprise some linguists that this third class of sentences in which there isno scopal interaction between the two quantiers is the largest In part this may bedue to special features of the Wall Street Journal text of which the corpus consists Forexample newspaper articles may contain more direct quotations than other genres Inthe process of tagging the data however it was also apparent that in a large proportionof cases the two quantiers were taking scope in different conjuncts of a conjoinedphrase This further tendency supports the idea that people may intentionally avoidconstructions in which there is even the possibility of quantier scope interactionsperhaps because of some hearer-oriented pragmatic principle Linguists may also beconcerned that this additional category in which there is no scope interaction betweenquantiers makes it difcult to compare the results of the present work with theoreticalaccounts of quantier scope that ignore this case and concentrate on instances in whichone quantier does take scope over another In response to such concerns howeverwe point out rst that we provide a model of scope prediction rather than scopegeneration and so it is in any case not directly comparable with work in theoreticallinguistics which has largely ignored scope preferences Second we point out thatthe empirical nature of this study requires that we take note of cases in which thequantiers simply do not interact

8 The implementations of these classiers are publicly available as Perl modules at hhttphumanitiesuchicagoedulinguisticsstudentsdchigginclassierstgzi

83

Higgins and Sadock Modeling Scope Preferences

Table 3Performance of the naive Bayes classier summed over all 10 test runs

Condition Correct Incorrect Percentage correct

First has wide scope 177 104 177281 = 630Second has wide scope 41 23 4164 = 641

No scope interaction 428 117 428545 = 785Total 646 244 646890 = 726

431 Naive Bayes Classier Our data D will consist of a vector of features (d0 dn)that represent aspects of the sentence under examination such as whether one quan-tied expression c-commands the other as described in Section 42 The fundamentalsimplifying assumption that we make in designing a naive Bayes classier is thatthese features are independent of one another and therefore can be aggregated as in-dependent sources of evidence about which class c curren a given sentence belongs to Thisindependence assumption is formalized in equations (1) and (2)

c curren = arg maxc

P(c)P(d0 dn j c) (1)

ordm arg maxc

P(c)nY

k = 0

P(dk j c) (2)

We constructed an empirical estimate of the prior probability P(c) by simply count-ing the frequency with which each class occurs in the training data We constructedeach P(dk j c) by counting how often each feature dk co-occurs with the class c toconstruct the empirical estimate P(dk j c) and interpolated this with the empiricalfrequency P(dk) of the feature dk not conditioned on the class c This interpolatedprobability model was used in order to smooth the probability distribution avoidingthe problems that can arise if certain feature-value pairs are assigned a probability ofzero

The performance of the naive Bayes classier is summarized in Table 3 For eachof the 10 test sets of 89 items taken from the corpus the remaining 804 of the total893 sentences were used to train the model The naive Bayes classier outperformedthe baseline by a considerable margin

In addition to the raw counts of test examples correctly classied though wewould like to know something of the internal structure of the model (ie what sort offeatures it has induced from the data) For this classier we can assume that a featuref is a good predictor of a class c curren when the value of P(f j c curren ) is signicantly largerthan the (geometric) mean value of P(f j c) for all other values of c Those featureswith the greatest ratio P(f ) pound P(f jccurren)

geommean( 8 c 6= ccurren [P(f jc)])are listed in Table 49

The rst-ranked feature in Table 4 shows that there is a tendency for quanti-ed elements not to interact when they are found in conjoined constituents and thesecond-ranked feature indicates a preference for quantiers not to interact when thereis an intervening comma (presumably an indicator of greater syntactic ldquodistancerdquo)Feature 3 indicates a preference for class 1 when there is an intervening S node

⁹ We include the term P(f) in the product in order to prevent sparsely instantiated features from showing up as highly ranked.


Table 4
Most active features from the naive Bayes classifier.

Rank   Feature                                                  Predicted class   Ratio
 1     There is an intervening conjunct node                    0                 1.63
 2     There is an intervening comma                            0                 1.51
 3     There is an intervening S node                           1                 1.33
 4     The first quantified NP does not c-command the second    0                 1.25
 5     Second quantifier is tagged QP                           1                 1.16
 6     There is an intervening S node                           0                 1.12
 …
15     The second quantified NP c-commands the first            2                 1.00

Feature 6, by contrast, indicates a preference for class 0 under the same conditions. Presumably this reflects a dispreference for the second quantifier to take wide scope when there is a clause boundary intervening between it and the first quantifier. The fourth-ranked feature in Table 4 indicates that if the first quantified NP does not c-command the second, it is less likely to take wide scope. This is not surprising, given the importance that c-command relations have had in theoretical discussions of quantifier scope. The fifth-ranked feature expresses a preference for quantified expressions of category QP to take narrow scope if they are the second of the two quantifiers under consideration. This may simply be reflective of the fact that class 1 is more common than class 2, and the measure expressions found in QP phrases in the Penn Treebank (such as more than three or about half) tend not to be logically independent of other quantifiers. Finally, feature 15 in Table 4 indicates a high correlation between the second quantified expression's c-commanding the first and the second quantifier's taking wide scope. We can easily see this as a translation into our feature set of Kuno, Takami, and Wu's claim that subjects tend to outscope objects and obliques, and topicalized elements tend to take wide scope. Some of these top-ranked features have to do with information found only in the written medium, but on the whole the features induced by the naive Bayes classifier seem consistent with those suggested by Kuno, Takami, and Wu, although they are distinct by necessity.

4.3.2 Maximum-Entropy Classifier. The maximum-entropy classifier is a sort of log-linear model, defining the joint probability of a class and a data vector (d_0, …, d_n) as the product of the prior probability of the class c with a set of features related to the data:¹⁰

    P(d_0, …, d_n, c) = (P(c) / Z) ∏_{k=0..n} α_k                        (3)

This classifier superficially resembles in form the naive Bayes classifier in equation (2), but it differs from that classifier in that the way in which values for each α are chosen does not assume that the features in the data are independent. For each of the 10 training sets, we used the generalized iterative scaling algorithm to train this classifier on 654 training examples, using 150 examples for validation to choose the best set of values for the αs.¹¹

¹⁰ Z in equation (3) is simply a normalizing constant that ensures that we end up with a probability distribution.


Table 5
Performance of the maximum-entropy classifier, summed over all 10 test runs.

Condition                Correct   Incorrect   Percentage correct
First has wide scope         148         133   148/281 = 52.7%
Second has wide scope         31          33    31/64  = 48.4%
No scope interaction         475          70   475/545 = 87.2%
Total                        654         236   654/890 = 73.5%

Table 6
Most active features from the maximum-entropy classifier.

Rank   Feature                                                  Predicted class   α_c^2.5
 1     Second quantifier is each                                2                 1.13
 2     There is an intervening comma                            0                 1.01
 3     There is an intervening conjunct node                    0                 1.00
 4     First quantified NP does not c-command the second        0                 0.99
 5     Second quantifier is every                               2                 0.98
 6     There is an intervening quotation mark (")               0                 0.95
 7     There is an intervening colon                            0                 0.95
 …
12     First quantified NP c-commands the second                1                 0.92
 …
25     There is no intervening comma                            1                 0.90

Test data could then be classified by choosing the class for the data that maximizes the joint probability in equation (3).
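The sketch below illustrates only the classification step of equation (3); the generalized iterative scaling procedure that estimates the α weights is not shown. Organizing the weights as a per-class dictionary over feature-value pairs is our own assumption about the data structures involved, not a detail taken from the article.

    def maxent_classify(feats, priors, alphas, classes):
        """Choose the class maximizing P(c) * prod_k alpha over active features.
        `alphas[c]` maps a (feature, value) pair to its weight under class c;
        the normalizing constant Z cancels out of the argmax and is omitted."""
        best, best_score = None, 0.0
        for c in classes:
            score = priors[c]
            for fv in feats.items():
                score *= alphas[c].get(fv, 1.0)   # unseen features get a neutral weight
            if best is None or score > best_score:
                best, best_score = c, score
        return best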

The results of training with the maximum-entropy classifier are shown in Table 5. The classifier showed slightly higher performance than the naive Bayes classifier, with the lowest error rate on the class of sentences having no scope interaction.

To determine exactly which features of the data the maximum-entropy classifier sees as relevant to the classification problem, we can simply look at the α values (from equation (3)) for each feature. Those features with higher values for α are weighted more heavily in determining the proper scoping. Some of the features with the highest values for α are listed in Table 6. Because of the way the classifier is built, predictor features for class 2 need to have higher loadings to overcome the lower prior probability of the class. Therefore, we actually rank the features in Table 6 according to α · P(c)^k (which we denote as α_c^k). P(c) represents the empirical prior probability of a class c, and k is simply a constant (2.5 in this case) chosen to try to get a mix of features for different classes at the top of the list.

The features ranked first and fifth in Table 6 express lexical preferences for certain quantifiers to take wide scope, even when they are the second of the two quantifiers according to linear order in the string of words. The tendency for each to take wide scope is stronger than for the other quantifier, which is in line with Kuno, Takami, and Wu's decision to list it as the only quantifier with a lexical preference for scoping.

¹¹ Overtraining is not a problem with the pure version of the generalized iterative scaling algorithm. For efficiency reasons, however, we chose to take the training corpus as representative of the event space rather than enumerating the space exhaustively (see Jelinek [1998] for details). For this reason it was necessary to employ validation in training.


Table 7
Performance of the single-layer perceptron, summed over all 10 test runs.

Condition                Correct   Incorrect   Percentage correct
First has wide scope         182          99   182/281 = 64.8%
Second has wide scope         35          29    35/64  = 54.7%
No scope interaction         468          77   468/545 = 85.9%
Total                        685         205   685/890 = 77.0%

Feature 2 makes the "no scope interaction" class more likely if a comma intervenes, and feature 25 makes a wide-scope reading for the first quantifier more likely if there is no intervening comma. The third-ranked feature expresses the tendency, mentioned above, for quantifiers in conjoined clauses not to interact. Features 4 and 12 indicate that if the first quantified expression c-commands the second, it is likely to take wide scope, and that if this is not the case, there is likely to be no scope interaction. Finally, the sixth- and seventh-ranked features in the table show that an intervening quotation mark or colon will make the classifier tend toward class 0, "no scope interaction," which is easy to understand: Quotations are often opaque to quantifier scope interactions. The top features found by the maximum-entropy classifier largely coincide with those found by the naive Bayes model, which indicates that these generalizations are robust and objectively present in the data.

4.3.3 Single-Layer Perceptron. For our neural network classifier, we employed a feed-forward single-layer perceptron, with the softmax function used to determine the activation of nodes at the output layer, because this is a one-of-n classification task (Bridle 1990). The data to be classified are presented as a vector of features at the input layer, and the output layer has three nodes, representing the three possible classes for the data: "first has wide scope," "second has wide scope," and "no scope interaction." The output node with the highest activation is interpreted as the class of the datum presented at the input layer.

For each of the 10 test sets of 89 examples, we trained the connection weights of the network using error backpropagation on 654 training sentences, reserving 150 sentences for validation in order to choose the weights from the training epoch with the highest classification performance. In Table 7 we present the results of the single-layer neural network in classifying our test sentences. As the table shows, the single-layer perceptron has much better classification performance than the naive Bayes classifier and maximum-entropy model, possibly because the training of the network aims to minimize error in the activation of the classification output nodes, which is directly related to the classification task at hand, whereas the other models do not directly make use of the notion of "classification error." The perceptron also uses a sort of weighted voting and could be interpreted as an implementation of Kuno, Takami, and Wu's proposal for scope determination. This clearly illustrates that the tenability of their proposal hinges on the exact details of its implementation, since all of our classifier models are reasonable interpretations of their approach, but they have very different performance results on our scope determination task.
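For concreteness, here is a minimal sketch of such a single-layer softmax network. The article trains its network with error backpropagation and selects weights by validation; the plain per-example gradient descent on cross-entropy below, and all hyperparameters, are stand-in assumptions of ours rather than the authors' configuration.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    class SoftmaxPerceptron:
        """Single-layer network: one weight vector per output class,
        softmax activation at the three-node output layer."""

        def __init__(self, n_features, n_classes=3, lr=0.1, seed=0):
            rng = np.random.default_rng(seed)
            self.W = rng.normal(scale=0.01, size=(n_classes, n_features))
            self.b = np.zeros(n_classes)
            self.lr = lr

        def predict_proba(self, x):
            return softmax(self.W @ x + self.b)

        def predict(self, x):
            # the output node with the highest activation gives the class
            return int(np.argmax(self.predict_proba(x)))

        def train(self, X, y, epochs=50):
            for _ in range(epochs):
                for x, target in zip(X, y):
                    p = self.predict_proba(x)
                    p[target] -= 1.0                  # gradient of cross-entropy wrt logits
                    self.W -= self.lr * np.outer(p, x)
                    self.b -= self.lr * p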

To determine exactly which features of the data the network sees as relevant to the classification problem, we can simply look at the connection weights for each feature-class pair. Higher connection weights indicate a greater correlation between input features and output classes. For one of the 10 networks we trained, some of the features with the highest connection weights are listed in Table 8.


Table 8
Most active features from the single-layer perceptron.

Rank   Feature                                                  Predicted class   Weight
 1     There is an intervening comma                            0                 4.31
 2     Second quantifier is all                                 0                 3.77
 3     There is an intervening colon                            0                 2.98
 4     There is an intervening conjunct node                    0                 2.72
 …
17     The first quantified NP c-commands the second            1                 1.69
18     Second quantifier is tagged RBS                          2                 1.69
19     There is an intervening S node                           1                 1.61
20     Second quantifier is each                                2                 1.50

Since class 0 is simply more frequent in the training data than the other two classes, the weights for this class tend to be higher. Therefore, we also list some of the best predictor features for classes 1 and 2 in the table.

The first- and third-ranked features in Table 8 show that an intervening comma or colon will make the classifier tend toward class 0, "no scope interaction." This finding by the classifier is similar to the maximum-entropy classifier's finding an intervening quotation mark relevant, and can be taken as an indication that quantifiers in distant syntactic subdomains are unlikely to interact. Similarly, the fourth-ranked feature indicates that quantifiers in separate conjuncts are unlikely to interact. The second-ranked feature in the table expresses a tendency for there to be no scope interaction between two quantifiers if the second of them is headed by all. This may be related to the independence of universal quantifiers (∀x∀y[φ] ⟺ ∀y∀x[φ]). Feature 17 in Table 8 indicates a high correlation between the first quantified expression's c-commanding the second and the first quantifier's taking wide scope, which again supports Kuno, Takami, and Wu's claim that scope preferences are related to syntactic superiority relations. Feature 18 expresses a preference for a quantified expression headed by most to take wide scope, even if it is the second of the two quantifiers (since most is the only quantifier in the corpus that bears the tag RBS). Feature 19 indicates that the first quantifier is more likely to take wide scope if there is a clause boundary intervening between the two quantifiers, which supports the notion that the syntactic distance between the quantifiers is relevant to scope preferences. Finally, feature 20 expresses the well-known tendency for quantified expressions headed by each to take wide scope.

4.4 Summary of Results
Table 9 summarizes the performance of the quantifier scope models we have presented here. All of the classifiers have test set accuracy above the baseline, which a paired t-test reveals to be significant at the .001 level. The differences between the naive Bayes, maximum-entropy, and single-layer perceptron classifiers are not statistically significant.

The classifiers performed significantly better on those sentences annotated consistently by both human coders at the beginning of the study, reinforcing the view that this subset of the data is somehow simpler and more representative of the basic regularities in scope preferences. For example, the single-layer perceptron classified 82.9% of these sentences correctly. To further investigate the nature of the variation between the two coders, we constructed a version of our single-layer network that was trained on the data on which both coders agreed and tested on the remaining sentences.


Table 9
Summary of classifier results.

                           Training data   Validation data   Test data
Baseline                   —               —                 61.2%
Naïve Bayes                76.7%           —                 72.6%
Maximum entropy            78.3%           75.5%             73.5%
Single-layer perceptron    84.7%           76.8%             77.0%

This classifier agreed with the reference coding (the coding of the first coder) 51.4% of the time and with the additional independent coder 35.8% of the time. The first coder constructed the annotation guidelines for this project and may have been more successful in applying them consistently. Alternatively, it is possible that different individuals use different strategies in determining scope preferences, and the strategy of the second coder may simply have been less similar than the strategy of the first coder to that of the single-layer network.

These three classifiers directly implement a sort of weighted voting, the method of aggregating evidence proposed by Kuno, Takami, and Wu (although the classifiers' implementation is slightly more sophisticated than the unweighted voting that is actually used in Kuno, Takami, and Wu's paper). Of course, since we do not use exactly the set of features suggested by Kuno, Takami, and Wu, our model should not be seen as a straightforward implementation of the theory outlined in their 1999 paper. Nevertheless, the results in Table 9 suggest that Kuno, Takami, and Wu's suggested design can be used with some success in modeling scope preferences. Moreover, the project undertaken here provides an answer to some of the objections that Aoun and Li (2000) raise to Kuno, Takami, and Wu. Aoun and Li claim that Kuno, Takami, and Wu's choice of experts is seemingly arbitrary and that it is unclear how the voting weights of each expert are to be set, but the machine learning approach we employ in this article is capable of addressing both of these potential problems. Supervised training of our classifiers is a straightforward approach to setting the weights, and it also constitutes our approach to selecting features (or "experts," in Kuno, Takami, and Wu's terminology). In the training process, any feature that is irrelevant to scoping preferences should receive weights that make its effect negligible.

5. Syntax and Scope

In this section we show how the classifier models of quantifier scope determination introduced in Section 4 may be integrated with a PCFG model of syntax. We compare two different ways in which the two components may be combined, which may loosely be termed serial and parallel, and argue for the latter on the basis of empirical results.

5.1 Modular Design
Our use of a phrase structure syntactic component and a quantifier scope component to define a combined language model is simplified by the fact that our classifiers are probabilistic and define a conditional probability distribution over quantifier scopings. The probability distributions that our classifiers define for quantifier scope structures are conditional on syntactic phrase structure, because they are computed on the basis of syntactically provided features, such as the number of nodes of a certain type that intervene between two quantifiers in a phrase structure tree.

Thus, the combined language model that we define in this article assigns probabilities according to the pairs of structures that may be assigned to a sentence by the Q-structure and phrase structure syntax modules. The probability of a word string w_{1-n} is therefore defined as in equation (4), where Q ranges over all possible Q-structures in the set Q and S ranges over all possible syntactic structures in the set S:

    P(w_{1-n}) = Σ_{S ∈ S, Q ∈ Q} P(S, Q | w_{1-n})                      (4)
               = Σ_{S ∈ S, Q ∈ Q} P(S | w_{1-n}) P(Q | S, w_{1-n})       (5)

Equation (5) shows how we can use the definition of conditional probability to break our calculation of the language model probability into two parts. The first of these parts, P(S | w_{1-n}), which we may abbreviate as simply P(S), is the probability of a particular syntactic tree structure's being assigned to a particular word string. We model this probability using a probabilistic phrase structure grammar (cf. Charniak [1993, 1996]). The second distribution on the right side of equation (5) is the conditional probability of a particular quantifier scope structure's being assigned to a particular word string, given the syntactic structure of that string. This probability is written as P(Q | S, w_{1-n}), or simply P(Q | S), and represents the quantity we estimated above in constructing classifiers to predict the scopal representation of a sentence based on aspects of its syntactic structure.

Thus, given a PCFG model of syntactic structure and a probabilistically defined classifier of the sort introduced in Section 4, it is simple to determine the probability of any pairing of two particular structures from each domain for a given sentence. We simply multiply the values of P(S) and P(Q | S) to obtain the joint probability P(Q, S). In the current section we examine two different models of combination for these components: one in which scope determination is applied to the optimal syntactic structure (the Viterbi parse) and one in which optimization is performed in the space of both modules to find the optimal pairing of syntactic and quantifier scope structures.
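In code, the multiplication amounts to no more than the following; the argument names and the representation of the classifier output as a dictionary over the three scope classes are our own assumptions.

    def pairing_probability(p_tree, scope_dist, reading):
        """Joint probability of a (tree, scope reading) pairing:
        P(S, Q | w) = P(S | w) * P(Q | S, w), as in equation (5).
        `p_tree` is the PCFG probability of the parse, and `scope_dist`
        maps each scope class (0, 1, 2) to the classifier's probability."""
        return p_tree * scope_dist[reading]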

5.2 The Syntactic Module
Before turning to the application of our multimodular approach to the problem of scope determination in Section 5.3, we present here a short overview of the phrase structure syntactic component used in these projects. As noted above, we model syntax as a probabilistic phrase structure grammar (PCFG), and in particular, we use a treebank grammar (Charniak 1996) trained on the Penn Treebank.

A PCFG defines the probability of a string of words as the sum of the probabilities of all admissible phrase structure parses (trees) for that string. The probability of a given tree is the product of the probability of all of the rule instances used in the construction of that tree, where rules take the form N → φ, with N a nonterminal symbol and φ a finite sequence of one or more terminals or nonterminals.

To take an example, Figure 5 illustrates a phrase structure tree for the sentence Susan might not believe you, which is admissible according to the grammar in Table 10. (All of the minimal subtrees in Figure 5 are instances of one of our rules.)


Figure 5
A simple phrase structure tree.

Table 10
A simple probabilistic phrase structure grammar.

Rule                 Probability
S   → NP VP          .7
S   → VP             .2
S   → V NP VP        .1
VP  → V VP           .3
VP  → ADV VP         .1
VP  → V              .1
VP  → V NP           .3
VP  → V NP NP        .2
NP  → Susan          .3
NP  → you            .4
NP  → Yves           .3
V   → might          .2
V   → believe        .3
V   → show           .3
V   → stay           .2
ADV → not            .5
ADV → always         .5


The probability of this tree, which we can indicate as τ, can be calculated as in equation (6):

    P(τ) = ∏_{ρ ∈ τ} P(ρ)                                                (6)
         = P(S → NP VP) × P(VP → V VP) × P(VP → ADV VP)
           × P(VP → V NP) × P(NP → Susan) × P(V → might)
           × P(ADV → not) × P(V → believe) × P(NP → you)                 (7)
         = .7 × .3 × .1 × .3 × .3 × .2 × .5 × .3 × .4 = 2.268 × 10^-5    (8)
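The same computation can be spelled out programmatically. The sketch below encodes the toy grammar of Table 10 (only the rules needed for the tree in Figure 5) and multiplies the probabilities of its rule instances; the representation of trees as nested (label, children) pairs is our own choice, not the article's.

    from functools import reduce

    # Rules from Table 10 that are used in the Figure 5 tree, with probabilities
    GRAMMAR = {
        ("S", ("NP", "VP")): 0.7,
        ("VP", ("V", "VP")): 0.3,
        ("VP", ("ADV", "VP")): 0.1,
        ("VP", ("V", "NP")): 0.3,
        ("NP", ("Susan",)): 0.3,
        ("NP", ("you",)): 0.4,
        ("V", ("might",)): 0.2,
        ("V", ("believe",)): 0.3,
        ("ADV", ("not",)): 0.5,
    }

    def rule_instances(tree):
        """Yield (mother, daughters) for every local subtree; a tree is a
        (label, children) pair, and a leaf child is a bare string."""
        label, children = tree
        daughters = tuple(c if isinstance(c, str) else c[0] for c in children)
        yield (label, daughters)
        for c in children:
            if not isinstance(c, str):
                yield from rule_instances(c)

    def tree_probability(tree):
        return reduce(lambda p, r: p * GRAMMAR[r], rule_instances(tree), 1.0)

    # The tree in Figure 5: "Susan might not believe you"
    figure5 = ("S", [("NP", ["Susan"]),
                     ("VP", [("V", ["might"]),
                             ("VP", [("ADV", ["not"]),
                                     ("VP", [("V", ["believe"]),
                                             ("NP", ["you"])])])])])

    print(tree_probability(figure5))   # approx. 2.268e-05, matching equation (8)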

The actual grammar rules and associated probabilities that we use in defining our syntactic module are derived from the WSJ corpus of the Penn Treebank by maximum-likelihood estimation. That is, for each rule N → φ used in the treebank, we add the rule to the grammar and set its probability to

    C(N → φ) / Σ_{φ'} C(N → φ'),

where C(·) denotes the "count" of a rule (i.e., the number of times it is used in the corpus). A grammar composed in this manner is referred to as a treebank grammar, because its rules are directly derived from those in a treebank corpus.

We used sections 00–20 of the WSJ corpus of the Penn Treebank for collecting the rules and associated probabilities of our PCFG, which is implemented as a bottom-up chart parser. Before constructing the grammar, the treebank was preprocessed using known procedures (cf. Krotov et al. [1998]; Belz [2001]) to facilitate the construction of a rule list. Functional and anaphoric annotations (basically anything following a "-" in a node label; cf. Santorini [1990], Bies et al. [1995]) were removed from nonterminal labels. Nodes that dominate only "empty categories," such as traces, were removed. In addition, unary-branching constructions were removed by replacing the mother category in such a structure with the daughter node. (For example, given an instance of the rule X → YZ, if the daughter category Y were expanded by the unary rule Y → W, our algorithm would induce the single rule X → WZ.) Finally, we discarded all rules that had more than 10 symbols on the right-hand side (an arbitrary limit of our parser implementation). This resulted in a total of 18,820 rules, of which 11,156 were discarded as hapax legomena, leaving 7,664 rules in our treebank grammar. Table 11 shows some of the rules in our grammar with the highest and lowest corpus counts.
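A sketch of the maximum-likelihood estimation step follows, under our own assumptions about the input (an iterable of (left-hand-side, right-hand-side tuple) rule instances collected from the already-preprocessed trees). The tag stripping, empty-category removal, and unary collapsing described above are not shown, and we assume here that hapax rules are discarded before the probabilities are normalized.

    from collections import Counter, defaultdict

    def estimate_pcfg(rule_instances, min_count=2, max_rhs=10):
        """Build a treebank grammar with relative-frequency estimates
        P(N -> phi) = C(N -> phi) / sum over phi' of C(N -> phi'),
        discarding overlong rules and hapax legomena."""
        counts = Counter(r for r in rule_instances if len(r[1]) <= max_rhs)
        counts = Counter({r: c for r, c in counts.items() if c >= min_count})
        lhs_totals = defaultdict(int)
        for (lhs, rhs), c in counts.items():
            lhs_totals[lhs] += c
        return {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}

    # rule_instances would be pairs such as ("PP", ("IN", "NP")) read off the trees.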

5.3 Unlabeled Scope Determination
In this section we describe an experiment designed to assess the performance of parallel and serial approaches to combining grammatical modules, focusing on the task of unlabeled scope determination. This task involves predicting the most likely Q-structure representation for a sentence, basically the same task we addressed in Section 4 in comparing the performance levels of each type of classifier. The experiment of this section differs, however, from the task presented in Section 4 in that instead of providing a syntactic tree from the Penn Treebank as input to the classifier, we provide the model only with a string of words (a sentence). Our dual-component model will search for the optimal syntactic and scopal structures for the sentence (the pairing (τ*, χ*)) and will be evaluated based on its success in identifying the correct scope reading χ*.

Our concern in this section will be to determine whether it is necessary to search the space of possible pairings (τ, χ) of syntactic and scopal structures, or whether it is sufficient to use our PCFG first to fix the syntactic tree τ and then to choose a scope reading to maximize the probability of the pairing.


Table 11
Rules derived from sections 00–20 of the Penn Treebank WSJ corpus. "TOP" is a special "start" symbol that may expand to any of the symbols found at the root of a tree in the corpus.

Rule                              Corpus count
PP → IN NP                        59,053
TOP → S                           34,614
NP → DT NN                        28,074
NP → NP PP                        25,192
S → NP VP                         14,032
S → NP VP                         12,901
VP → TO VP                        11,598

S → CC PP NNP NNP VP                   2
NP → DT `` NN NN NN ''                 2
NP → NP PP PP PP PP PP                 2
INTJ → UH UH                           2
NP → DT `` NN NNS                      2
SBARQ → `` WP VP ''                    2
S → PP NP VP ''                        2

That is, are syntax and quantifier scope mutually dependent components of grammar, or can scope relations be "read off of" syntax? The serial model suggests that the optimal syntactic structure τ* should be chosen on the basis of the syntactic module only, as in equation (9), and the optimal quantifier scope structure χ* then chosen on the basis of τ*, as in equation (10). The parallel model, on the other hand, suggests that the most likely pairing of structures must be chosen in the joint probability space of both components, as in equation (11):

    τ* = argmax_{τ ∈ S} P_S(τ | w_{1-n})                                 (9)
    χ* = argmax_{χ ∈ Q} P_Q(χ | τ*, w_{1-n})                             (10)
    τ* = {τ | (τ, χ) = argmax_{τ ∈ S, χ ∈ Q} P_S(τ | w_{1-n}) P_Q(χ | τ, w_{1-n})}   (11)

5.3.1 Experimental Design. For this experiment, we implement the scoping component as a single-layer feed-forward network, because the single-layer perceptron classifier had the best prediction rate among the three classifiers tested in Section 4. The softmax activation function we use for the output nodes of the classifier guarantees that the activations of all of the output nodes sum to one and can be interpreted as class probabilities. The syntactic component, of course, is determined by the treebank PCFG grammar described above.

Given these two models, which respectively define P_Q(χ | τ, w_{1-n}) and P_S(τ | w_{1-n}) from equation (11), it remains only to specify how to search the space of pairings (τ, χ) in performing this optimization to find χ*. Unfortunately, it is not feasible to examine all values τ ∈ S, since our PCFG will generally admit a huge number of trees for a sentence (especially given a mean sentence length of over 20 words in the WSJ corpus).¹²


Table 12
Performance of models on the unlabeled scope prediction task, summed over all 10 test runs.

Condition                Correct   Incorrect   Percentage correct
Parallel model
First has wide scope         168         113   168/281 = 59.8%
Second has wide scope         26          38    26/64  = 40.6%
No scope interaction         467          78   467/545 = 85.7%
Total                        661         229   661/890 = 74.3%

Serial model
First has wide scope         163         118   163/281 = 58.0%
Second has wide scope         27          37    27/64  = 42.2%
No scope interaction         461          84   461/545 = 84.6%
Total                        651         239   651/890 = 73.1%

Our solution to this search problem is to make the simplifying assumption that the syntactic tree that is used in the optimal set of structures (τ*, χ*) will always be among the top few trees τ for which P_S(τ | w_{1-n}) is the greatest. That is, although we suppose that quantifier scope information is relevant to parsing, we do not suppose that it is so strong a determinant as to completely override syntactic factors. In practice, this means that our parser will return the top 10 parses for each sentence, along with the probabilities assigned to them, and these are the only parses that are considered in looking for the optimal set of linguistic structures.

We again used 10-fold cross-validation in evaluating the competing models, dividing the scope-tagged corpus into 10 test sections of 89 sentences each, and we used the same version of the treebank grammar for our PCFG. The first model retrieved the top 10 syntactic parses (τ_0, …, τ_9) for each sentence and computed the probability P(τ, χ) for each τ ∈ {τ_0, …, τ_9}, χ ∈ {0, 1, 2}, choosing that scopal representation χ that was found in the maximum-probability pairing. We call this the parallel model, because the properties of each probabilistic model may influence the optimal structure chosen by the other. The second model retrieved only the Viterbi parse τ_0 from the PCFG and chose the scopal representation χ for which the pairing (τ_0, χ) took on the highest probability. We call this the serial model, because it represents syntactic phrase structure as independent of other components of grammar (in this case, quantifier scope), though other components are dependent upon it.
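The two search strategies can be sketched as follows, assuming a parser that returns the 10 best trees with their probabilities and a classifier function that maps a tree to a probability distribution over the three scope classes; these interfaces are our own assumptions rather than the authors' code.

    def parallel_scope(parses, scope_probs):
        """Pick the scope reading from the pairing (tree, reading) with the
        highest joint probability P_S(tree | w) * P_Q(reading | tree, w),
        searching all top parses (the parallel model, equation (11))."""
        best_pair, best_p = None, -1.0
        for tree, p_tree in parses:                           # e.g. the 10 best PCFG parses
            for reading, p_q in scope_probs(tree).items():    # readings 0, 1, 2
                if p_tree * p_q > best_p:
                    best_pair, best_p = (tree, reading), p_tree * p_q
        return best_pair

    def serial_scope(parses, scope_probs):
        """Fix the Viterbi parse first (equation (9)), then choose the most
        probable scope reading given that single tree (equation (10))."""
        tree, _ = max(parses, key=lambda tp: tp[1])
        reading = max(scope_probs(tree).items(), key=lambda kv: kv[1])[0]
        return tree, reading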

5.3.2 Results. There was an appreciable difference in performance between these two models on the quantifier scope test sets. As shown in Table 12, the parallel model narrowly outperformed the serial model, by 1.2%. A 10-fold paired t-test on the test sections of the scope-tagged corpus shows that the parallel model is significantly better (p < .05).

¹² Since we are allowing χ to range only over the three scope readings (0, 1, 2), however, it is possible to enumerate all values of χ to be paired with a given syntactic tree τ.


This result suggests that in determining the syntactic structure of a sentence, we must take aspects of structure into account that are not purely syntactic (such as quantifier scope). Searching both dimensions of the hypothesis space for our dual-component model allowed the composite model to handle the interdependencies between different aspects of grammatical structure, whereas fixing a phrase structure tree purely on the basis of syntactic considerations led to suboptimal performance in using that structure as a basis for determining quantifier scope.

6. Conclusion

In this article we have taken a statistical, corpus-based approach to the modeling of quantifier scope preferences, a subject that has previously been addressed only with systems of ad hoc rules derived from linguists' intuitive judgments. Our model takes its theoretical inspiration from Kuno, Takami, and Wu (1999), who suggest an "expert system" approach to scope preferences, and follows many other projects in the machine learning of natural language that combine information from multiple sources in solving linguistic problems.

Our results are generally supportive of the design that Kuno, Takami, and Wu propose for the quantifier scope component of grammar, and some of the features induced by our models find clear parallels in the factors that Kuno, Takami, and Wu claim to be relevant to scoping. In addition, our final experiment, in which we combine our quantifier scope module with a PCFG model of syntactic phrase structure, provides evidence of a grammatical architecture in which different aspects of structure mutually constrain one another. This result casts doubt on approaches in which syntactic processing is completed prior to the determination of other grammatical properties of a sentence, such as quantifier scope relations.

Appendix: Selected Codes Used to Annotate Syntactic Categories in the Penn Treebank, from Marcus et al. (1993) and Bies et al. (1995)

Part-of-speech tags

Tag    Meaning                    Tag    Meaning
CC     Conjunction                RB     Adverb
CD     Cardinal number            RBR    Comparative adverb
DT     Determiner                 RBS    Superlative adverb
IN     Preposition                TO     "to"
JJ     Adjective                  UH     Interjection
JJR    Comparative adjective      VB     Verb in base form
JJS    Superlative adjective      VBD    Past-tense verb
NN     Singular or mass noun      VBG    Gerundive verb
NNS    Plural noun                VBN    Past participial verb
NNP    Singular proper noun       VBP    Non-3sg present-tense verb
NNPS   Plural proper noun         VBZ    3sg present-tense verb
PDT    Predeterminer              WP     WH pronoun
PRP    Personal pronoun           WP$    Possessive WH pronoun
PRP$   Possessive pronoun


Phrasal categories

Code    Meaning                                            Code    Meaning
ADJP    Adjective phrase                                   SBAR    Clause introduced by a subordinating conjunction
ADVP    Adverb phrase                                      SBARQ   Clause introduced by a WH phrase
INTJ    Interjection                                       SINV    Inverted declarative sentence
NP      Noun phrase                                        SQ      Inverted yes/no question following the WH phrase in SBARQ
PP      Prepositional phrase
QP      Quantifier phrase (i.e., measure/amount phrase)
S       Declarative clause
VP      Verb phrase

Acknowledgments
The authors are grateful for an Academic Technology Innovation Grant from the University of Chicago, which helped to make this work possible, and to John Goldsmith, Terry Regier, Anne Pycha, and Bob Moore, whose advice and collaboration have considerably aided the research reported in this article. Any remaining errors are, of course, our own.

References

Aoun, Joseph and Yen-hui Audrey Li. 1993. The Syntax of Scope. MIT Press, Cambridge.

Aoun, Joseph and Yen-hui Audrey Li. 2000. Scope, structure, and expert systems: A reply to Kuno et al. Language, 76(1):133–155.

Belz, Anja. 2001. Optimisation of corpus-derived probabilistic grammars. In Proceedings of Corpus Linguistics 2001, pages 46–57.

Berger, Adam L., Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.

Bies, Ann, Mark Ferguson, Karen Katz, and Robert MacIntyre. 1995. Bracketing guidelines for Treebank II style. Technical report, Penn Treebank Project, University of Pennsylvania.

Bishop, Christopher M. 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.

Bridle, John S. 1990. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fougelman-Soulie and J. Herault, editors, Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag, Berlin, pages 227–236.

Brill, Eric. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565.

Carden, Guy. 1976. English Quantifiers: Logical Structure and Linguistic Variation. Academic Press, New York.

Carletta, Jean. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249–254.

Charniak, Eugene. 1993. Statistical Language Learning. MIT Press, Cambridge.

Charniak, Eugene. 1996. Tree-bank grammars. In AAAI/IAAI, vol. 2, pages 1031–1036.

Cohen, Jacob. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46.

Cooper, Robin. 1983. Quantification and Syntactic Theory. Reidel, Dordrecht.

Fleiss, Joseph L. 1981. Statistical Methods for Rates and Proportions. John Wiley & Sons, New York.

Hobbs, Jerry R. and Stuart M. Shieber. 1987. An algorithm for generating quantifier scopings. Computational Linguistics, 13:47–63.

Hornstein, Norbert. 1995. Logical Form: From GB to Minimalism. Blackwell, Oxford and Cambridge.

Jelinek, Frederick. 1998. Statistical Methods for Speech Recognition. MIT Press, Cambridge.

Jurafsky, Daniel and James H. Martin. 2000. Speech and Language Processing. Prentice Hall, Upper Saddle River, New Jersey.

Krippendorff, Klaus. 1980. Content Analysis: An Introduction to Its Methodology. Sage Publications, Beverly Hills, California.

Krotov, Alexander, Mark Hepple, Robert J. Gaizauskas, and Yorick Wilks. 1998. Compacting the Penn treebank grammar. In COLING-ACL, pages 699–703.

Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 1999. Quantifier scope in English, Chinese, and Japanese. Language, 75(1):63–111.

Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 2001. Response to Aoun and Li. Language, 77(1):134–143.

Manning, Christopher D. and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge.

Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Martin, Paul, Douglas Appelt, and Fernando Pereira. 1986. Transportability and generality in a natural-language interface system. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 585–593.

May, Robert. 1985. Logical Form: Its Structure and Derivation. MIT Press, Cambridge.

McCawley, James D. 1998. The Syntactic Phenomena of English. University of Chicago Press, Chicago, second edition.

Mitchell, Tom M. 1996. Machine Learning. McGraw Hill, New York.

Montague, Richard. 1973. The proper treatment of quantification in ordinary English. In J. Hintikka et al., editors, Approaches to Natural Language. Reidel, Dordrecht, pages 221–242.

Moran, Douglas B. 1988. Quantifier scoping in the SRI core language engine. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics (ACL'88), pages 33–40.

Moran, Douglas B. and Fernando C. N. Pereira. 1992. Quantifier scoping. In Hiyan Alshawi, editor, The Core Language Engine. MIT Press, Cambridge, pages 149–172.

Nerbonne, John. 1993. A feature-based syntax/semantics interface. In A. Manaster-Ramer and W. Zadrozsny, editors, Annals of Mathematics and Artificial Intelligence (Special Issue on Mathematics of Language), 8(1–2):107–132. Also published as DFKI Research Report RR-92-42.

Park, Jong C. 1995. Quantifier scope and constituency. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL'95), pages 205–212.

Pedersen, Ted. 2000. A simple approach to building ensembles of naïve Bayesian classifiers for word sense disambiguation. In Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2000), pages 63–69.

Pereira, Fernando. 1990. Categorial semantics and scoping. Computational Linguistics, 16(1):1–10.

Pollard, Carl. 1989. The syntax-semantics interface in a unification-based phrase structure grammar. In S. Busemann, C. Hauenschild, and C. Umbach, editors, Views of the Syntax/Semantics Interface, KIT-FAST Report 74, Technical University of Berlin, pages 167–185.

Pollard, Carl and Eun Jung Yoo. 1998. A unified theory of scope for quantifiers and WH-phrases. Journal of Linguistics, 34(2):415–446.

Ratnaparkhi, Adwait. 1997. A simple introduction to maximum entropy models for natural language processing. Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania.

Santorini, Beatrice. 1990. Part-of-speech tagging guidelines for the Penn Treebank project. Technical Report MS-CIS-90-47, Department of Computer and Information Science, University of Pennsylvania.

Soon, Wee Meng, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544.

van Halteren, Hans, Jakub Zavrel, and Walter Daelemans. 2001. Improving accuracy in word class tagging through the combination of machine learning systems. Computational Linguistics, 27(2):199–229.

VanLehn, Kurt A. 1978. Determining the scope of English quantifiers. Technical Report AITR-483, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Cambridge.

Woods, William A. 1986. Semantics and quantification in natural language question answering. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 205–248.


It may surprise some linguists that this third class of sentences in which there isno scopal interaction between the two quantiers is the largest In part this may bedue to special features of the Wall Street Journal text of which the corpus consists Forexample newspaper articles may contain more direct quotations than other genres Inthe process of tagging the data however it was also apparent that in a large proportionof cases the two quantiers were taking scope in different conjuncts of a conjoinedphrase This further tendency supports the idea that people may intentionally avoidconstructions in which there is even the possibility of quantier scope interactionsperhaps because of some hearer-oriented pragmatic principle Linguists may also beconcerned that this additional category in which there is no scope interaction betweenquantiers makes it difcult to compare the results of the present work with theoreticalaccounts of quantier scope that ignore this case and concentrate on instances in whichone quantier does take scope over another In response to such concerns howeverwe point out rst that we provide a model of scope prediction rather than scopegeneration and so it is in any case not directly comparable with work in theoreticallinguistics which has largely ignored scope preferences Second we point out thatthe empirical nature of this study requires that we take note of cases in which thequantiers simply do not interact

8 The implementations of these classiers are publicly available as Perl modules at hhttphumanitiesuchicagoedulinguisticsstudentsdchigginclassierstgzi

83

Higgins and Sadock Modeling Scope Preferences

Table 3Performance of the naive Bayes classier summed over all 10 test runs

Condition Correct Incorrect Percentage correct

First has wide scope 177 104 177281 = 630Second has wide scope 41 23 4164 = 641

No scope interaction 428 117 428545 = 785Total 646 244 646890 = 726

431 Naive Bayes Classier Our data D will consist of a vector of features (d0 dn)that represent aspects of the sentence under examination such as whether one quan-tied expression c-commands the other as described in Section 42 The fundamentalsimplifying assumption that we make in designing a naive Bayes classier is thatthese features are independent of one another and therefore can be aggregated as in-dependent sources of evidence about which class c curren a given sentence belongs to Thisindependence assumption is formalized in equations (1) and (2)

c curren = arg maxc

P(c)P(d0 dn j c) (1)

ordm arg maxc

P(c)nY

k = 0

P(dk j c) (2)

We constructed an empirical estimate of the prior probability P(c) by simply count-ing the frequency with which each class occurs in the training data We constructedeach P(dk j c) by counting how often each feature dk co-occurs with the class c toconstruct the empirical estimate P(dk j c) and interpolated this with the empiricalfrequency P(dk) of the feature dk not conditioned on the class c This interpolatedprobability model was used in order to smooth the probability distribution avoidingthe problems that can arise if certain feature-value pairs are assigned a probability ofzero

The performance of the naive Bayes classier is summarized in Table 3 For eachof the 10 test sets of 89 items taken from the corpus the remaining 804 of the total893 sentences were used to train the model The naive Bayes classier outperformedthe baseline by a considerable margin

In addition to the raw counts of test examples correctly classied though wewould like to know something of the internal structure of the model (ie what sort offeatures it has induced from the data) For this classier we can assume that a featuref is a good predictor of a class c curren when the value of P(f j c curren ) is signicantly largerthan the (geometric) mean value of P(f j c) for all other values of c Those featureswith the greatest ratio P(f ) pound P(f jccurren)

geommean( 8 c 6= ccurren [P(f jc)])are listed in Table 49

The rst-ranked feature in Table 4 shows that there is a tendency for quanti-ed elements not to interact when they are found in conjoined constituents and thesecond-ranked feature indicates a preference for quantiers not to interact when thereis an intervening comma (presumably an indicator of greater syntactic ldquodistancerdquo)Feature 3 indicates a preference for class 1 when there is an intervening S node

9 We include the term P(f ) in the product in order to prevent sparsely instantiated features fromshowing up as highly-ranked

84

Computational Linguistics Volume 29 Number 1

Table 4Most active features from naive Bayes classier

Rank Feature Predicted Ratioclass

1 There is an intervening conjunct node 0 1632 There is an intervening comma 0 1513 There is an intervening S node 1 1334 The rst quantied NP does not c-command the second 0 1255 Second quantier is tagged QP 1 1166 There is an intervening S node 0 112

15 The second quantied NP c-commands the rst 2 100

whereas feature 6 indicates a preference for class 0 under the same conditions Pre-sumably this reects a dispreference for the second quantier to take wide scopewhen there is a clause boundary intervening between it and the rst quantier Thefourth-ranked feature in Table 4 indicates that if the rst quantied NP does notc-command the second it is less likely to take wide scope This is not surprisinggiven the importance that c-command relations have had in theoretical discussionsof quantier scope The fth-ranked feature expresses a preference for quantied ex-pressions of category QP to take narrow scope if they are the second of the twoquantiers under consideration This may simply be reective of the fact that class1 is more common than class 2 and the measure expressions found in QP phrasesin the Penn Treebank (such as more than three or about half ) tend not to be logicallyindependent of other quantiers Finally the feature 15 in Table 4 indicates a highcorrelation between the second quantied expressionrsquos c-commanding the rst andthe second quantierrsquos taking wide scope We can easily see this as a translation intoour feature set of Kuno Takami and Wursquos claim that subjects tend to outscope ob-jects and obliques and topicalized elements tend to take wide scope Some of thesetop-ranked features have to do with information found only in the written mediumbut on the whole the features induced by the naive Bayes classier seem consis-tent with those suggested by Kuno Takami and Wu although they are distinct bynecessity

4.3.2 Maximum-Entropy Classifier. The maximum-entropy classifier is a sort of log-linear model, defining the joint probability of a class and a data vector (d_0, ..., d_n) as the product of the prior probability of the class c with a set of features related to the data:¹⁰

    P(d_0, ..., d_n, c) = (P(c) / Z) ∏_{k=0}^{n} α_k        (3)

This classifier superficially resembles in form the naive Bayes classifier in equation (2), but it differs from that classifier in that the way in which values for each α are chosen does not assume that the features in the data are independent. For each of the 10 training sets, we used the generalized iterative scaling algorithm to train this classifier on 654 training examples, using 150 examples for validation to choose the best set of

10. Z in equation (3) is simply a normalizing constant that ensures that we end up with a probability distribution.


Table 5
Performance of the maximum-entropy classifier, summed over all 10 test runs.

Condition               Correct  Incorrect  Percentage correct
First has wide scope    148      133        148/281 = 52.7%
Second has wide scope   31       33         31/64 = 48.4%
No scope interaction    475      70         475/545 = 87.2%
Total                   654      236        654/890 = 73.5%

Table 6
Most active features from maximum-entropy classifier.

Rank  Feature                                              Predicted class  α_c^.25
1     Second quantifier is each                            2                1.13
2     There is an intervening comma                        0                1.01
3     There is an intervening conjunct node                0                1.00
4     First quantified NP does not c-command the second    0                0.99
5     Second quantifier is every                           2                0.98
6     There is an intervening quotation mark (")           0                0.95
7     There is an intervening colon                        0                0.95
...
12    First quantified NP c-commands the second            1                0.92
25    There is no intervening comma                        1                0.90

values for the αs.¹¹ Test data could then be classified by choosing the class for the data that maximizes the joint probability in equation (3).
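Classification with equation (3) reduces to an argmax over classes in which the normalizing constant Z can be ignored, since it is identical for every class. A minimal Python sketch, assuming the weights produced by generalized iterative scaling are stored in a dictionary alpha keyed by (feature, class); the data layout is an assumption, not the authors' code.

    def maxent_classify(prior, alpha, active_features, classes):
        # score(c) is proportional to the joint probability in equation (3)
        def score(c):
            s = prior[c]
            for f in active_features:
                s *= alpha.get((f, c), 1.0)   # a weight of 1.0 leaves the score unchanged
            return s
        return max(classes, key=score)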

The results of training with the maximum-entropy classifier are shown in Table 5. The classifier showed slightly higher performance than the naive Bayes classifier, with the lowest error rate on the class of sentences having no scope interaction.

To determine exactly which features of the data the maximum-entropy classifier sees as relevant to the classification problem, we can simply look at the α values (from equation (3)) for each feature. Those features with higher values for α are weighted more heavily in determining the proper scoping. Some of the features with the highest values for α are listed in Table 6. Because of the way the classifier is built, predictor features for class 2 need to have higher loadings to overcome the lower prior probability of the class. Therefore, we actually rank the features in Table 6 according to αP(c)^k (which we denote as α_c^k). P(c) represents the empirical prior probability of a class c, and k is simply a constant (.25 in this case) chosen to try to get a mix of features for different classes at the top of the list.

The features ranked first and fifth in Table 6 express lexical preferences for certain quantifiers to take wide scope, even when they are the second of the two quantifiers according to linear order in the string of words. The tendency for each to take wide scope is stronger than for the other quantifier, which is in line with Kuno, Takami, and Wu's decision to list it as the only quantifier with a lexical preference for scoping. Feature 2 makes the "no scope interaction" class more likely if a comma intervenes, and

11. Overtraining is not a problem with the pure version of the generalized iterative scaling algorithm. For efficiency reasons, however, we chose to take the training corpus as representative of the event space rather than enumerating the space exhaustively (see Jelinek [1998] for details). For this reason it was necessary to employ validation in training.


Table 7
Performance of the single-layer perceptron, summed over all 10 test runs.

Condition               Correct  Incorrect  Percentage correct
First has wide scope    182      99         182/281 = 64.8%
Second has wide scope   35       29         35/64 = 54.7%
No scope interaction    468      77         468/545 = 85.9%
Total                   685      205        685/890 = 77.0%

feature 25 makes a wide-scope reading for the first quantifier more likely if there is no intervening comma. The third-ranked feature expresses the tendency mentioned above for quantifiers in conjoined clauses not to interact. Features 4 and 12 indicate that if the first quantified expression c-commands the second, it is likely to take wide scope, and that if this is not the case, there is likely to be no scope interaction. Finally, the sixth- and seventh-ranked features in the table show that an intervening quotation mark or colon will make the classifier tend toward class 0, "no scope interaction," which is easy to understand: Quotations are often opaque to quantifier scope interactions. The top features found by the maximum-entropy classifier largely coincide with those found by the naive Bayes model, which indicates that these generalizations are robust and objectively present in the data.

4.3.3 Single-Layer Perceptron. For our neural network classifier, we employed a feed-forward single-layer perceptron, with the softmax function used to determine the activation of nodes at the output layer, because this is a one-of-n classification task (Bridle 1990). The data to be classified are presented as a vector of features at the input layer, and the output layer has three nodes, representing the three possible classes for the data: "first has wide scope," "second has wide scope," and "no scope interaction." The output node with the highest activation is interpreted as the class of the datum presented at the input layer.
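The architecture just described amounts to a single weight matrix followed by a softmax. A minimal Python sketch using NumPy (a convenience assumed here, not necessarily how the original classifiers were written):

    import numpy as np

    def softmax(z):
        z = z - z.max()               # subtract the max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    def forward(W, b, x):
        # W: (3, n_features) connection weights, b: (3,) biases,
        # x: feature vector; returns the activations of the three output nodes
        return softmax(W @ x + b)

    def predict(W, b, x):
        # class 0, 1, or 2: the output node with the highest activation
        return int(np.argmax(forward(W, b, x)))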

For each of the 10 test sets of 89 examples, we trained the connection weights of the network using error backpropagation on 654 training sentences, reserving 150 sentences for validation in order to choose the weights from the training epoch with the highest classification performance. In Table 7 we present the results of the single-layer neural network in classifying our test sentences. As the table shows, the single-layer perceptron has much better classification performance than the naive Bayes classifier and maximum-entropy model, possibly because the training of the network aims to minimize error in the activation of the classification output nodes, which is directly related to the classification task at hand, whereas the other models do not directly make use of the notion of "classification error." The perceptron also uses a sort of weighted voting and could be interpreted as an implementation of Kuno, Takami, and Wu's proposal for scope determination. This clearly illustrates that the tenability of their proposal hinges on the exact details of its implementation, since all of our classifier models are reasonable interpretations of their approach, but they have very different performance results on our scope determination task.
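For a single softmax layer, error backpropagation reduces to the delta rule sketched below, reusing the forward function from the previous sketch. The use of a cross-entropy error and the learning rate are assumptions on our part, and validation_accuracy is a placeholder.

    def sgd_step(W, b, x, target_class, lr=0.1):
        p = forward(W, b, x)
        y = np.zeros_like(p)
        y[target_class] = 1.0
        W -= lr * np.outer(p - y, x)   # gradient of cross-entropy error w.r.t. W
        b -= lr * (p - y)

    # after each pass over the 654 training sentences, score the 150 validation
    # sentences and keep the weights from the best-scoring epoch, e.g.:
    # best_epoch = max(epochs, key=lambda e: validation_accuracy(e["W"], e["b"]))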

To determine exactly which features of the data the network sees as relevant to the classification problem, we can simply look at the connection weights for each feature-class pair. Higher connection weights indicate a greater correlation between input features and output classes. For one of the 10 networks we trained, some of the features with the highest connection weights are listed in Table 8. Since class 0 is


Table 8
Most active features from single-layer perceptron.

Rank  Feature                                           Predicted class  Weight
1     There is an intervening comma                     0                4.31
2     Second quantifier is all                          0                3.77
3     There is an intervening colon                     0                2.98
4     There is an intervening conjunct node             0                2.72
...
17    The first quantified NP c-commands the second     1                1.69
18    Second quantifier is tagged RBS                   2                1.69
19    There is an intervening S node                    1                1.61
20    Second quantifier is each                         2                1.50

simply more frequent in the training data than the other two classes, the weights for this class tend to be higher. Therefore, we also list some of the best predictor features for classes 1 and 2 in the table.

The first- and third-ranked features in Table 8 show that an intervening comma or colon will make the classifier tend toward class 0, "no scope interaction." This finding by the classifier is similar to the maximum-entropy classifier's finding an intervening quotation mark relevant, and can be taken as an indication that quantifiers in distant syntactic subdomains are unlikely to interact. Similarly, the fourth-ranked feature indicates that quantifiers in separate conjuncts are unlikely to interact. The second-ranked feature in the table expresses a tendency for there to be no scope interaction between two quantifiers if the second of them is headed by all. This may be related to the independence of universal quantifiers (∀x∀y[φ] ⟺ ∀y∀x[φ]). Feature 17 in Table 8 indicates a high correlation between the first quantified expression's c-commanding the second and the first quantifier's taking wide scope, which again supports Kuno, Takami, and Wu's claim that scope preferences are related to syntactic superiority relations. Feature 18 expresses a preference for a quantified expression headed by most to take wide scope, even if it is the second of the two quantifiers (since most is the only quantifier in the corpus that bears the tag RBS). Feature 19 indicates that the first quantifier is more likely to take wide scope if there is a clause boundary intervening between the two quantifiers, which supports the notion that the syntactic distance between the quantifiers is relevant to scope preferences. Finally, feature 20 expresses the well-known tendency for quantified expressions headed by each to take wide scope.

4.4 Summary of Results
Table 9 summarizes the performance of the quantifier scope models we have presented here. All of the classifiers have test set accuracy above the baseline, which a paired t-test reveals to be significant at the .001 level. The differences between the naive Bayes, maximum-entropy, and single-layer perceptron classifiers are not statistically significant.
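The significance test reported here can be reproduced from per-fold accuracies; a minimal sketch using SciPy, in which the argument names are placeholders for the ten per-fold test accuracies of the two systems being compared.

    from scipy import stats

    def paired_fold_test(model_acc, baseline_acc):
        # model_acc, baseline_acc: per-fold test accuracies from the same 10 splits
        return stats.ttest_rel(model_acc, baseline_acc)   # returns (t statistic, p value)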

The classifiers performed significantly better on those sentences annotated consistently by both human coders at the beginning of the study, reinforcing the view that this subset of the data is somehow simpler and more representative of the basic regularities in scope preferences. For example, the single-layer perceptron classified 82.9% of these sentences correctly. To further investigate the nature of the variation between the two coders, we constructed a version of our single-layer network that was trained


Table 9
Summary of classifier results.

                         Training data  Validation data  Test data
Baseline                 --             --               61.2%
Naive Bayes              76.7%          --               72.6%
Maximum entropy          78.3%          75.5%            73.5%
Single-layer perceptron  84.7%          76.8%            77.0%

on the data on which both coders agreed and tested on the remaining sentences. This classifier agreed with the reference coding (the coding of the first coder) 51.4% of the time and with the additional independent coder 35.8% of the time. The first coder constructed the annotation guidelines for this project and may have been more successful in applying them consistently. Alternatively, it is possible that different individuals use different strategies in determining scope preferences, and the strategy of the second coder may simply have been less similar than the strategy of the first coder to that of the single-layer network.

These three classifiers directly implement a sort of weighted voting, the method of aggregating evidence proposed by Kuno, Takami, and Wu (although the classifiers' implementation is slightly more sophisticated than the unweighted voting that is actually used in Kuno, Takami, and Wu's paper). Of course, since we do not use exactly the set of features suggested by Kuno, Takami, and Wu, our model should not be seen as a straightforward implementation of the theory outlined in their 1999 paper. Nevertheless, the results in Table 9 suggest that Kuno, Takami, and Wu's suggested design can be used with some success in modeling scope preferences. Moreover, the project undertaken here provides an answer to some of the objections that Aoun and Li (2000) raise to Kuno, Takami, and Wu. Aoun and Li claim that Kuno, Takami, and Wu's choice of experts is seemingly arbitrary and that it is unclear how the voting weights of each expert are to be set, but the machine learning approach we employ in this article is capable of addressing both of these potential problems. Supervised training of our classifiers is a straightforward approach to setting the weights and also constitutes our approach to selecting features (or "experts," in Kuno, Takami, and Wu's terminology). In the training process, any feature that is irrelevant to scoping preferences should receive weights that make its effect negligible.

5. Syntax and Scope

In this section we show how the classifier models of quantifier scope determination introduced in Section 4 may be integrated with a PCFG model of syntax. We compare two different ways in which the two components may be combined, which may loosely be termed serial and parallel, and argue for the latter on the basis of empirical results.

5.1 Modular Design
Our use of a phrase structure syntactic component and a quantifier scope component to define a combined language model is simplified by the fact that our classifiers are probabilistic and define a conditional probability distribution over quantifier scopings. The probability distributions that our classifiers define for quantifier scope structures are conditional on syntactic phrase structure, because they are computed on the basis


of syntactically provided features, such as the number of nodes of a certain type that intervene between two quantifiers in a phrase structure tree.

Thus the combined language model that we define in this article assigns probabilities according to the pairs of structures that may be assigned to a sentence by the Q-structure and phrase structure syntax modules. The probability of a word string w_{1-n} is therefore defined as in equation (4), where Q ranges over all possible Q-structures in the set Q and S ranges over all possible syntactic structures in the set S:

    P(w_{1-n}) = Σ_{S ∈ S, Q ∈ Q} P(S, Q | w_{1-n})                         (4)

               = Σ_{S ∈ S, Q ∈ Q} P(S | w_{1-n}) P(Q | S, w_{1-n})          (5)

Equation (5) shows how we can use the definition of conditional probability to break our calculation of the language model probability into two parts. The first of these parts, P(S | w_{1-n}), which we may abbreviate as simply P(S), is the probability of a particular syntactic tree structure's being assigned to a particular word string. We model this probability using a probabilistic phrase structure grammar (cf. Charniak [1993, 1996]). The second distribution on the right side of equation (5) is the conditional probability of a particular quantifier scope structure's being assigned to a particular word string, given the syntactic structure of that string. This probability is written as P(Q | S, w_{1-n}), or simply P(Q | S), and represents the quantity we estimated above in constructing classifiers to predict the scopal representation of a sentence based on aspects of its syntactic structure.

Thus, given a PCFG model of syntactic structure and a probabilistically defined classifier of the sort introduced in Section 4, it is simple to determine the probability of any pairing of two particular structures from each domain for a given sentence: We simply multiply the values of P(S) and P(Q | S) to obtain the joint probability P(Q, S). In the current section, we examine two different models of combination for these components: one in which scope determination is applied to the optimal syntactic structure (the Viterbi parse) and one in which optimization is performed in the space of both modules to find the optimal pairing of syntactic and quantifier scope structures.
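In code, this combination is simply a product; pcfg_prob and scope_prob below are placeholders for the two modules described in the remainder of this section, not names from the original implementation.

    def joint_probability(tree, scope_class, pcfg_prob, scope_prob):
        # P(Q, S) = P(S) * P(Q | S), with S a parse tree and Q in {0, 1, 2}
        return pcfg_prob(tree) * scope_prob(scope_class, tree)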

5.2 The Syntactic Module
Before turning to the application of our multimodular approach to the problem of scope determination in Section 5.3, we present here a short overview of the phrase structure syntactic component used in these projects. As noted above, we model syntax as a probabilistic phrase structure grammar (PCFG), and in particular we use a treebank grammar (Charniak 1996) trained on the Penn Treebank.

A PCFG defines the probability of a string of words as the sum of the probabilities of all admissible phrase structure parses (trees) for that string. The probability of a given tree is the product of the probability of all of the rule instances used in the construction of that tree, where rules take the form N → φ, with N a nonterminal symbol and φ a finite sequence of one or more terminals or nonterminals.

To take an example, Figure 5 illustrates a phrase structure tree for the sentence Susan might not believe you, which is admissible according to the grammar in Table 10. (All of the minimal subtrees in Figure 5 are instances of one of our rules.) The probability


Figure 5
A simple phrase structure tree.

Table 10
A simple probabilistic phrase structure grammar.

Rule            Probability
S → NP VP       .7
S → VP          .2
S → V NP VP     .1
VP → V VP       .3
VP → ADV VP     .1
VP → V          .1
VP → V NP       .3
VP → V NP NP    .2
NP → Susan      .3
NP → you        .4
NP → Yves       .3
V → might       .2
V → believe     .3
V → show        .3
V → stay        .2
ADV → not       .5
ADV → always    .5


of this tree, which we can indicate as τ, can be calculated as in equation (6):

    P(τ) = ∏_{ξ ∈ R(τ)} P(ξ)                                              (6)

         = P(S → NP VP) × P(VP → V VP) × P(VP → ADV VP)
           × P(VP → V NP) × P(NP → Susan) × P(V → might)
           × P(ADV → not) × P(V → believe) × P(NP → you)                  (7)

         = .7 × .3 × .1 × .3 × .3 × .2 × .5 × .3 × .4 = 2.268 × 10⁻⁵      (8)
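The arithmetic of equations (6)-(8) can be reproduced mechanically. In the short Python sketch below, trees are nested (label, child, ...) tuples and the rule probabilities come from Table 10; the data layout is ours, not the authors'.

    rule_prob = {
        ("S", ("NP", "VP")): .7, ("VP", ("V", "VP")): .3, ("VP", ("ADV", "VP")): .1,
        ("VP", ("V", "NP")): .3, ("NP", ("Susan",)): .3, ("V", ("might",)): .2,
        ("ADV", ("not",)): .5, ("V", ("believe",)): .3, ("NP", ("you",)): .4,
    }

    def tree_prob(tree):
        # multiply the probabilities of every rule instance in the tree
        label, children = tree[0], tree[1:]
        daughters = tuple(c[0] if isinstance(c, tuple) else c for c in children)
        p = rule_prob[(label, daughters)]
        for child in children:
            if isinstance(child, tuple):
                p *= tree_prob(child)
        return p

    tree = ("S", ("NP", "Susan"),
                 ("VP", ("V", "might"),
                        ("VP", ("ADV", "not"),
                               ("VP", ("V", "believe"), ("NP", "you")))))
    # tree_prob(tree) == .7 * .3 * .1 * .3 * .3 * .2 * .5 * .3 * .4 ≈ 2.268e-05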

The actual grammar rules and associated probabilities that we use in defining our syntactic module are derived from the WSJ corpus of the Penn Treebank by maximum-likelihood estimation. That is, for each rule N → φ used in the treebank, we add the rule to the grammar and set its probability to

    C(N → φ) / Σ_ψ C(N → ψ)

where C(·) denotes the "count" of a rule (i.e., the number of times it is used in the corpus). A grammar composed in this manner is referred to as a treebank grammar, because its rules are directly derived from those in a treebank corpus.
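A sketch of the maximum-likelihood step, assuming the rule instances have already been collected from the preprocessed trees as (mother, daughters) pairs:

    from collections import Counter, defaultdict

    def estimate_pcfg(rule_instances):
        # rule_instances: iterable of (mother, daughters) pairs, one per rule use
        rule_counts = Counter(rule_instances)
        lhs_totals = defaultdict(int)
        for (mother, _), count in rule_counts.items():
            lhs_totals[mother] += count
        # P(N -> phi) = C(N -> phi) / sum over psi of C(N -> psi)
        return {rule: count / lhs_totals[rule[0]] for rule, count in rule_counts.items()}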

We used sections 00–20 of the WSJ corpus of the Penn Treebank for collecting the rules and associated probabilities of our PCFG, which is implemented as a bottom-up chart parser. Before constructing the grammar, the treebank was preprocessed using known procedures (cf. Krotov et al. [1998]; Belz [2001]) to facilitate the construction of a rule list. Functional and anaphoric annotations (basically anything following a "-" in a node label; cf. Santorini [1990], Bies et al. [1995]) were removed from nonterminal labels. Nodes that dominate only "empty categories," such as traces, were removed. In addition, unary-branching constructions were removed by replacing the mother category in such a structure with the daughter node. (For example, given an instance of the rule X → YZ, if the daughter category Y were expanded by the unary rule Y → W, our algorithm would induce the single rule X → WZ.) Finally, we discarded all rules that had more than 10 symbols on the right-hand side (an arbitrary limit of our parser implementation). This resulted in a total of 18,820 rules, of which 11,156 were discarded as hapax legomena, leaving 7,664 rules in our treebank grammar. Table 11 shows some of the rules in our grammar with the highest and lowest corpus counts.
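These preprocessing steps can be sketched over the same nested-tuple trees used above. Treating -NONE- as the marker of empty categories follows Treebank conventions; collapsing only nonterminal unary branches (leaving part-of-speech tags over words intact) and the final length and hapax filters are our reading of the description above, not the authors' code.

    def is_empty(tree):
        # True if the subtree dominates only empty categories (traces etc.)
        if isinstance(tree, str):
            return False
        if tree[0] == "-NONE-":
            return True
        return all(is_empty(child) for child in tree[1:])

    def strip_label(label):
        # "NP-SBJ-2" -> "NP"; leave special labels such as -NONE- or -LRB- alone
        if label.startswith("-"):
            return label
        return label.split("-")[0].split("=")[0]

    def preprocess(tree):
        if isinstance(tree, str):
            return tree
        children = [preprocess(c) for c in tree[1:] if not is_empty(c)]
        if len(children) == 1 and isinstance(children[0], tuple):
            return children[0]        # unary branch: the daughter replaces the mother
        return (strip_label(tree[0]),) + tuple(children)

    def keep_rule(rule, count):
        mother, daughters = rule
        return len(daughters) <= 10 and count >= 2   # drop long rules and hapaxes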

5.3 Unlabeled Scope Determination
In this section we describe an experiment designed to assess the performance of parallel and serial approaches to combining grammatical modules, focusing on the task of unlabeled scope determination. This task involves predicting the most likely Q-structure representation for a sentence, basically the same task we addressed in Section 4 in comparing the performance levels of each type of classifier. The experiment of this section differs, however, from the task presented in Section 4 in that instead of providing a syntactic tree from the Penn Treebank as input to the classifier, we provide the model only with a string of words (a sentence). Our dual-component model will search for the optimal syntactic and scopal structures for the sentence (the pairing (τ*, χ*)) and will be evaluated based on its success in identifying the correct scope reading χ*.

Our concern in this section will be to determine whether it is necessary to search the space of possible pairings (τ, χ) of syntactic and scopal structures, or whether it is sufficient to use our PCFG first to fix the syntactic tree τ and then to choose


Table 11
Rules derived from sections 00–20 of the Penn Treebank WSJ corpus. "TOP" is a special "start" symbol that may expand to any of the symbols found at the root of a tree in the corpus.

Rule                         Corpus count
PP → IN NP                   59,053
TOP → S                      34,614
NP → DT NN                   28,074
NP → NP PP                   25,192
S → NP VP                    14,032
S → NP VP                    12,901
VP → TO VP                   11,598
...
S → CC PP NNP NNP VP         2
NP → DT `` NN NN NN ''       2
NP → NP PP PP PP PP PP       2
INTJ → UH UH                 2
NP → DT `` NN NNS            2
SBARQ → `` WP VP ''          2
S → PP NP VP ''              2

a scope reading to maximize the probability of the pairing. That is, are syntax and quantifier scope mutually dependent components of grammar, or can scope relations be "read off of" syntax? The serial model suggests that the optimal syntactic structure τ* should be chosen on the basis of the syntactic module only, as in equation (9), and the optimal quantifier scope structure χ* then chosen on the basis of τ*, as in equation (10). The parallel model, on the other hand, suggests that the most likely pairing of structures must be chosen in the joint probability space of both components, as in equation (11).

    τ* = argmax_{τ ∈ S} P_S(τ | w_{1-n})                                              (9)

    χ* = argmax_{χ ∈ Q} P_Q(χ | τ*, w_{1-n})                                          (10)

    τ* = {τ | (τ, χ) = argmax_{τ ∈ S, χ ∈ Q} P_S(τ | w_{1-n}) P_Q(χ | τ, w_{1-n})}    (11)

5.3.1 Experimental Design. For this experiment we implement the scoping component as a single-layer feed-forward network, because the single-layer perceptron classifier had the best prediction rate among the three classifiers tested in Section 4. The softmax activation function we use for the output nodes of the classifier guarantees that the activations of all of the output nodes sum to one and can be interpreted as class probabilities. The syntactic component, of course, is determined by the treebank PCFG grammar described above.

Given these two models, which respectively define P_Q(χ | τ, w_{1-n}) and P_S(τ | w_{1-n}) from equation (11), it remains only to specify how to search the space of pairings (τ, χ) in performing this optimization to find χ*. Unfortunately, it is not feasible to examine all values τ ∈ S, since our PCFG will generally admit a huge number of


Table 12
Performance of models on the unlabeled scope prediction task, summed over all 10 test runs.

Parallel model
Condition               Correct  Incorrect  Percentage correct
First has wide scope    168      113        168/281 = 59.8%
Second has wide scope   26       38         26/64 = 40.6%
No scope interaction    467      78         467/545 = 85.7%
Total                   661      229        661/890 = 74.3%

Serial model
Condition               Correct  Incorrect  Percentage correct
First has wide scope    163      118        163/281 = 58.0%
Second has wide scope   27       37         27/64 = 42.2%
No scope interaction    461      84         461/545 = 84.6%
Total                   651      239        651/890 = 73.1%

trees for a sentence (especially given a mean sentence length of over 20 words in the WSJ corpus).¹² Our solution to this search problem is to make the simplifying assumption that the syntactic tree that is used in the optimal set of structures (τ*, χ*) will always be among the top few trees τ for which P_S(τ | w_{1-n}) is the greatest. That is, although we suppose that quantifier scope information is relevant to parsing, we do not suppose that it is so strong a determinant as to completely override syntactic factors. In practice, this means that our parser will return the top 10 parses for each sentence, along with the probabilities assigned to them, and these are the only parses that are considered in looking for the optimal set of linguistic structures.

We again used 10-fold cross-validation in evaluating the competing models, dividing the scope-tagged corpus into 10 test sections of 89 sentences each, and we used the same version of the treebank grammar for our PCFG. The first model retrieved the top 10 syntactic parses (τ_0, ..., τ_9) for each sentence and computed the probability P(τ, χ) for each τ ∈ {τ_0, ..., τ_9}, χ ∈ {0, 1, 2}, choosing that scopal representation χ that was found in the maximum-probability pairing. We call this the parallel model, because the properties of each probabilistic model may influence the optimal structure chosen by the other. The second model retrieved only the Viterbi parse τ_0 from the PCFG and chose the scopal representation χ for which the pairing (τ_0, χ) took on the highest probability. We call this the serial model, because it represents syntactic phrase structure as independent of other components of grammar (in this case, quantifier scope), though other components are dependent upon it.
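A sketch of the two strategies, assuming parser_top_k(sentence, k) returns (tree, P_S(tree | sentence)) pairs in order of decreasing probability and scope_probs(tree) returns the perceptron's distribution over the three scope classes; both names are placeholders for the components described above.

    def serial_choice(sentence, parser_top_k, scope_probs):
        # fix the Viterbi parse first, then pick the best scoping for that parse
        viterbi_tree, _ = parser_top_k(sentence, 1)[0]
        dist = scope_probs(viterbi_tree)
        return max(range(3), key=lambda q: dist[q])

    def parallel_choice(sentence, parser_top_k, scope_probs, k=10):
        # pick the scoping found in the highest-probability (tree, scoping) pair
        best_q, best_p = None, -1.0
        for tree, p_tree in parser_top_k(sentence, k):
            dist = scope_probs(tree)
            for q in range(3):
                if p_tree * dist[q] > best_p:
                    best_q, best_p = q, p_tree * dist[q]
        return best_q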

5.3.2 Results. There was an appreciable difference in performance between these two models on the quantifier scope test sets. As shown in Table 12, the parallel model narrowly outperformed the serial model, by 1.2%. A 10-fold paired t-test on the test sections of the scope-tagged corpus shows that the parallel model is significantly better (p < .05).

12. Since we are allowing χ to range only over the three scope readings (0, 1, 2), however, it is possible to enumerate all values of χ to be paired with a given syntactic tree τ.


This result suggests that in determining the syntactic structure of a sentence, we must take aspects of structure into account that are not purely syntactic (such as quantifier scope). Searching both dimensions of the hypothesis space for our dual-component model allowed the composite model to handle the interdependencies between different aspects of grammatical structure, whereas fixing a phrase structure tree purely on the basis of syntactic considerations led to suboptimal performance in using that structure as a basis for determining quantifier scope.

6. Conclusion

In this article we have taken a statistical, corpus-based approach to the modeling of quantifier scope preferences, a subject that has previously been addressed only with systems of ad hoc rules derived from linguists' intuitive judgments. Our model takes its theoretical inspiration from Kuno, Takami, and Wu (1999), who suggest an "expert system" approach to scope preferences, and follows many other projects in the machine learning of natural language that combine information from multiple sources in solving linguistic problems.

Our results are generally supportive of the design that Kuno, Takami, and Wu propose for the quantifier scope component of grammar, and some of the features induced by our models find clear parallels in the factors that Kuno, Takami, and Wu claim to be relevant to scoping. In addition, our final experiment, in which we combine our quantifier scope module with a PCFG model of syntactic phrase structure, provides evidence of a grammatical architecture in which different aspects of structure mutually constrain one another. This result casts doubt on approaches in which syntactic processing is completed prior to the determination of other grammatical properties of a sentence, such as quantifier scope relations.

Appendix: Selected Codes Used to Annotate Syntactic Categories in the Penn Treebank, from Marcus et al. (1993) and Bies et al. (1995)

Part-of-speech tags

Tag    Meaning
CC     Conjunction
CD     Cardinal number
DT     Determiner
IN     Preposition
JJ     Adjective
JJR    Comparative adjective
JJS    Superlative adjective
NN     Singular or mass noun
NNS    Plural noun
NNP    Singular proper noun
NNPS   Plural proper noun
PDT    Predeterminer
PRP    Personal pronoun
PRP$   Possessive pronoun
RB     Adverb
RBR    Comparative adverb
RBS    Superlative adverb
TO     "to"
UH     Interjection
VB     Verb in base form
VBD    Past-tense verb
VBG    Gerundive verb
VBN    Past participial verb
VBP    Non-3sg present-tense verb
VBZ    3sg present-tense verb
WP     WH pronoun
WP$    Possessive WH pronoun


Phrasal categories

Code    Meaning
ADJP    Adjective phrase
ADVP    Adverb phrase
INTJ    Interjection
NP      Noun phrase
PP      Prepositional phrase
QP      Quantifier phrase (i.e., measure/amount phrase)
S       Declarative clause
SBAR    Clause introduced by a subordinating conjunction
SBARQ   Clause introduced by a WH phrase
SINV    Inverted declarative sentence
SQ      Inverted yes/no question following the WH phrase in SBARQ
VP      Verb phrase

Acknowledgments
The authors are grateful for an Academic Technology Innovation Grant from the University of Chicago, which helped to make this work possible, and to John Goldsmith, Terry Regier, Anne Pycha, and Bob Moore, whose advice and collaboration have considerably aided the research reported in this article. Any remaining errors are, of course, our own.

References

Aoun, Joseph and Yen-hui Audrey Li. 1993. The Syntax of Scope. MIT Press, Cambridge.
Aoun, Joseph and Yen-hui Audrey Li. 2000. Scope, structure, and expert systems: A reply to Kuno et al. Language, 76(1):133–155.
Belz, Anja. 2001. Optimisation of corpus-derived probabilistic grammars. In Proceedings of Corpus Linguistics 2001, pages 46–57.
Berger, Adam L., Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.
Bies, Ann, Mark Ferguson, Karen Katz, and Robert MacIntyre. 1995. Bracketing guidelines for Treebank II style. Technical report, Penn Treebank Project, University of Pennsylvania.
Bishop, Christopher M. 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bridle, John S. 1990. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fougelman-Soulie and J. Herault, editors, Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag, Berlin, pages 227–236.
Brill, Eric. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565.
Carden, Guy. 1976. English Quantifiers: Logical Structure and Linguistic Variation. Academic Press, New York.
Carletta, Jean. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249–254.
Charniak, Eugene. 1993. Statistical Language Learning. MIT Press, Cambridge.
Charniak, Eugene. 1996. Tree-bank grammars. In AAAI/IAAI, Vol. 2, pages 1031–1036.
Cohen, Jacob. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46.
Cooper, Robin. 1983. Quantification and Syntactic Theory. Reidel, Dordrecht.
Fleiss, Joseph L. 1981. Statistical Methods for Rates and Proportions. John Wiley & Sons, New York.
Hobbs, Jerry R. and Stuart M. Shieber. 1987. An algorithm for generating quantifier scopings. Computational Linguistics, 13:47–63.
Hornstein, Norbert. 1995. Logical Form: From GB to Minimalism. Blackwell, Oxford and Cambridge.
Jelinek, Frederick. 1998. Statistical Methods for Speech Recognition. MIT Press, Cambridge.
Jurafsky, Daniel and James H. Martin. 2000. Speech and Language Processing. Prentice Hall, Upper Saddle River, New Jersey.
Krippendorff, Klaus. 1980. Content Analysis: An Introduction to Its Methodology. Sage Publications, Beverly Hills, California.
Krotov, Alexander, Mark Hepple, Robert J. Gaizauskas, and Yorick Wilks. 1998. Compacting the Penn treebank grammar. In COLING-ACL, pages 699–703.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 1999. Quantifier scope in English, Chinese, and Japanese. Language, 75(1):63–111.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 2001. Response to Aoun and Li. Language, 77(1):134–143.
Manning, Christopher D. and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge.
Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
Martin, Paul, Douglas Appelt, and Fernando Pereira. 1986. Transportability and generality in a natural-language interface system. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 585–593.
May, Robert. 1985. Logical Form: Its Structure and Derivation. MIT Press, Cambridge.
McCawley, James D. 1998. The Syntactic Phenomena of English. University of Chicago Press, Chicago, second edition.
Mitchell, Tom M. 1996. Machine Learning. McGraw Hill, New York.
Montague, Richard. 1973. The proper treatment of quantification in ordinary English. In J. Hintikka et al., editors, Approaches to Natural Language. Reidel, Dordrecht, pages 221–242.
Moran, Douglas B. 1988. Quantifier scoping in the SRI core language engine. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics (ACL'88), pages 33–40.
Moran, Douglas B. and Fernando C. N. Pereira. 1992. Quantifier scoping. In Hiyan Alshawi, editor, The Core Language Engine. MIT Press, Cambridge, pages 149–172.
Nerbonne, John. 1993. A feature-based syntax/semantics interface. In A. Manaster-Ramer and W. Zadrozsny, editors, Annals of Mathematics and Artificial Intelligence (Special Issue on Mathematics of Language), 8(1–2):107–132. Also published as DFKI Research Report RR-92-42.
Park, Jong C. 1995. Quantifier scope and constituency. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL'95), pages 205–212.
Pedersen, Ted. 2000. A simple approach to building ensembles of naive Bayesian classifiers for word sense disambiguation. In Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2000), pages 63–69.
Pereira, Fernando. 1990. Categorial semantics and scoping. Computational Linguistics, 16(1):1–10.
Pollard, Carl. 1989. The syntax-semantics interface in a unification-based phrase structure grammar. In S. Busemann, C. Hauenschild, and C. Umbach, editors, Views of the Syntax/Semantics Interface, KIT-FAST Report 74, Technical University of Berlin, pages 167–185.
Pollard, Carl and Eun Jung Yoo. 1998. A unified theory of scope for quantifiers and WH-phrases. Journal of Linguistics, 34(2):415–446.
Ratnaparkhi, Adwait. 1997. A simple introduction to maximum entropy models for natural language processing. Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania.
Santorini, Beatrice. 1990. Part-of-speech tagging guidelines for the Penn Treebank project. Technical Report MS-CIS-90-47, Department of Computer and Information Science, University of Pennsylvania.
Soon, Wee Meng, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544.
van Halteren, Hans, Jakub Zavrel, and Walter Daelemans. 2001. Improving accuracy in word class tagging through the combination of machine learning systems. Computational Linguistics, 27(2):199–229.
VanLehn, Kurt A. 1978. Determining the scope of English quantifiers. Technical Report AITR-483, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Cambridge.
Woods, William A. 1986. Semantics and quantification in natural language question answering. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 205–248.


We constructed an empirical estimate of the prior probability P(c) by simply count-ing the frequency with which each class occurs in the training data We constructedeach P(dk j c) by counting how often each feature dk co-occurs with the class c toconstruct the empirical estimate P(dk j c) and interpolated this with the empiricalfrequency P(dk) of the feature dk not conditioned on the class c This interpolatedprobability model was used in order to smooth the probability distribution avoidingthe problems that can arise if certain feature-value pairs are assigned a probability ofzero

The performance of the naive Bayes classier is summarized in Table 3 For eachof the 10 test sets of 89 items taken from the corpus the remaining 804 of the total893 sentences were used to train the model The naive Bayes classier outperformedthe baseline by a considerable margin

In addition to the raw counts of test examples correctly classied though wewould like to know something of the internal structure of the model (ie what sort offeatures it has induced from the data) For this classier we can assume that a featuref is a good predictor of a class c curren when the value of P(f j c curren ) is signicantly largerthan the (geometric) mean value of P(f j c) for all other values of c Those featureswith the greatest ratio P(f ) pound P(f jccurren)

geommean( 8 c 6= ccurren [P(f jc)])are listed in Table 49

The rst-ranked feature in Table 4 shows that there is a tendency for quanti-ed elements not to interact when they are found in conjoined constituents and thesecond-ranked feature indicates a preference for quantiers not to interact when thereis an intervening comma (presumably an indicator of greater syntactic ldquodistancerdquo)Feature 3 indicates a preference for class 1 when there is an intervening S node

9 We include the term P(f ) in the product in order to prevent sparsely instantiated features fromshowing up as highly-ranked

84

Computational Linguistics Volume 29 Number 1

Table 4Most active features from naive Bayes classier

Rank Feature Predicted Ratioclass

1 There is an intervening conjunct node 0 1632 There is an intervening comma 0 1513 There is an intervening S node 1 1334 The rst quantied NP does not c-command the second 0 1255 Second quantier is tagged QP 1 1166 There is an intervening S node 0 112

15 The second quantied NP c-commands the rst 2 100

whereas feature 6 indicates a preference for class 0 under the same conditions Pre-sumably this reects a dispreference for the second quantier to take wide scopewhen there is a clause boundary intervening between it and the rst quantier Thefourth-ranked feature in Table 4 indicates that if the rst quantied NP does notc-command the second it is less likely to take wide scope This is not surprisinggiven the importance that c-command relations have had in theoretical discussionsof quantier scope The fth-ranked feature expresses a preference for quantied ex-pressions of category QP to take narrow scope if they are the second of the twoquantiers under consideration This may simply be reective of the fact that class1 is more common than class 2 and the measure expressions found in QP phrasesin the Penn Treebank (such as more than three or about half ) tend not to be logicallyindependent of other quantiers Finally the feature 15 in Table 4 indicates a highcorrelation between the second quantied expressionrsquos c-commanding the rst andthe second quantierrsquos taking wide scope We can easily see this as a translation intoour feature set of Kuno Takami and Wursquos claim that subjects tend to outscope ob-jects and obliques and topicalized elements tend to take wide scope Some of thesetop-ranked features have to do with information found only in the written mediumbut on the whole the features induced by the naive Bayes classier seem consis-tent with those suggested by Kuno Takami and Wu although they are distinct bynecessity

432 Maximum-Entropy Classier The maximum-entropy classier is a sort of log-linear model dening the joint probability of a class and a data vector (d0 dn) asthe product of the prior probability of the class c with a set of features related to thedata10

P(d0 dn c) =P(c)

Z

nY

k = 0

not k (3)

This classier supercially resembles in form the naive Bayes classier in equation (2)but it differs from that classier in that the way in which values for each not are chosendoes not assume that the features in the data are independent For each of the 10training sets we used the generalized iterative scaling algorithm to train this classieron 654 training examples using 150 examples for validation to choose the best set of

10 Z in Equation 3 is simply a normalizing constant that ensures that we end up with a probabilitydistribution

85

Higgins and Sadock Modeling Scope Preferences

Table 5Performance of the maximum-entropy classier summed over all 10 test runs

Condition Correct Incorrect Percentage correct

First has wide scope 148 133 148281 = 527Second has wide scope 31 33 3164 = 484

No scope interaction 475 70 475545 = 872Total 654 236 654890 = 735

Table 6Most active features from maximum-entropy classier

Rank Feature Predicted c25

class

1 Second quantier is each 2 1132 There is an intervening comma 0 1013 There is an intervening conjunct node 0 1004 First quantied NP does not c-command the second 0 0995 Second quantier is every 2 0986 There is an intervening quotation mark (rdquo) 0 0957 There is an intervening colon 0 095

12 First quantied NP c-commands the second 1 09225 There is no intervening comma 1 090

values for the not s11 Test data could then be classied by choosing the class for the datathat maximizes the joint probability in equation (3)

The results of training with the maximum-entropy classier are shown in Table 5The classier showed slightly higher performance than the naive Bayes classier withthe lowest error rate on the class of sentences having no scope interaction

To determine exactly which features of the data the maximum-entropy classiersees as relevant to the classication problem we can simply look at the not values (fromequation (3)) for each feature Those features with higher values for not are weightedmore heavily in determining the proper scoping Some of the features with the highestvalues for not are listed in Table 6 Because of the way the classier is built predictorfeatures for class 2 need to have higher loadings to overcome the lower prior probabil-ity of the class Therefore we actually rank the features in Table 6 according to not P(c)k

(which we denote as not ck) P(c) represents the empirical prior probability of a class cand k is simply a constant (25 in this case) chosen to try to get a mix of features fordifferent classes at the top of the list

The features ranked rst and fth in Table 6 express lexical preferences for certainquantiers to take wide scope even when they are the second of the two quantiersaccording to linear order in the string of words The tendency for each to take widescope is stronger than for the other quantier which is in line with Kuno Takamiand Wursquos decision to list it as the only quantier with a lexical preference for scopingFeature 2 makes the ldquono scope interactionrdquo class more likely if a comma intervenes and

11 Overtraining is not a problem with the pure version of the generalized iterative scaling algorithm Forefciency reasons however we chose to take the training corpus as representative of the event spacerather than enumerating the space exhaustively (see Jelinek [1998] for details) For this reason it wasnecessary to employ validation in training

86

Computational Linguistics Volume 29 Number 1

Table 7Performance of the single-layer perceptron summed over all 10 test runs

Condition Correct Incorrect Percentage correct

First has wide scope 182 99 182281 = 648Second has wide scope 35 29 3564 = 547

No scope interaction 468 77 468545 = 859Total 685 205 685890 = 770

feature 25 makes a wide-scope reading for the rst quantier more likely if there is nointervening comma The third-ranked feature expresses the tendency mentioned abovefor quantiers in conjoined clauses not to interact Features 4 and 12 indicate that if therst quantied expression c-commands the second it is likely to take wide scope andthat if this is not the case there is likely to be no scope interaction Finally the sixth-and seventh-ranked features in the table show that an intervening quotation mark orcolon will make the classier tend toward class 0 ldquono scope interactionrdquo which is easyto understand Quotations are often opaque to quantier scope interactions The topfeatures found by the maximum-entropy classier largely coincide with those foundby the naive Bayes model which indicates that these generalizations are robust andobjectively present in the data

433 Single-Layer Perceptron For our neural network classier we employed a feed-forward single-layer perceptron with the softmax function used to determine the acti-vation of nodes at the output layer because this is a one-of-n classication task (Bridle1990) The data to be classied are presented as a vector of features at the input layerand the output layer has three nodes representing the three possible classes for thedata ldquorst has wide scoperdquo ldquosecond has wide scoperdquo and ldquono scope interactionrdquoThe output node with the highest activation is interpreted as the class of the datumpresented at the input layer

For each of the 10 test sets of 89 examples we trained the connection weightsof the network using error backpropagation on 654 training sentences reserving 150sentences for validation in order to choose the weights from the training epoch with thehighest classication performance In Table 7 we present the results of the single-layerneural network in classifying our test sentences As the table shows the single-layerperceptron has much better classication performance than the naive Bayes classierand maximum-entropy model possibly because the training of the network aims tominimize error in the activation of the classication output nodes which is directlyrelated to the classication task at hand whereas the other models do not directlymake use of the notion of ldquoclassication errorrdquo The perceptron also uses a sort ofweighted voting and could be interpreted as an implementation of Kuno Takamiand Wursquos proposal for scope determination This clearly illustrates that the tenabilityof their proposal hinges on the exact details of its implementation since all of ourclassier models are reasonable interpretations of their approach but they have verydifferent performance results on our scope determination task

To determine exactly which features of the data the network sees as relevant to the classification problem, we can simply look at the connection weights for each feature-class pair. Higher connection weights indicate a greater correlation between input features and output classes. For one of the 10 networks we trained, some of the features with the highest connection weights are listed in Table 8. Since class 0 is

Table 8
Most active features from single-layer perceptron.

Rank   Feature                                          Predicted class   Weight
1      There is an intervening comma                    0                 .431
2      Second quantifier is all                         0                 .377
3      There is an intervening colon                    0                 .298
4      There is an intervening conjunct node            0                 .272
...
17     The first quantified NP c-commands the second    1                 .169
18     Second quantifier is tagged RBS                  2                 .169
19     There is an intervening S node                   1                 .161
20     Second quantifier is each                        2                 .150

simply more frequent in the training data than the other two classes, the weights for this class tend to be higher. Therefore we also list some of the best predictor features for classes 1 and 2 in the table.
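
Reading off the most strongly weighted feature-class pairs from such a network amounts to sorting the entries of the learned weight matrix; a minimal sketch (assuming the SoftmaxPerceptron above and a parallel list of feature names, both hypothetical) follows.

    def top_feature_weights(model, feature_names, k=10):
        # collect (weight, feature, predicted class) triples and sort by weight
        triples = [(model.W[c, f], feature_names[f], c)
                   for c in range(model.W.shape[0])
                   for f in range(model.W.shape[1])]
        return sorted(triples, key=lambda t: t[0], reverse=True)[:k]

    # A ranking like Table 8 would then show entries such as
    # (0.431, "there is an intervening comma", 0) near the top.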

The first- and third-ranked features in Table 8 show that an intervening comma or colon will make the classifier tend toward class 0, "no scope interaction." This finding by the classifier is similar to the maximum-entropy classifier's finding an intervening quotation mark relevant, and can be taken as an indication that quantifiers in distant syntactic subdomains are unlikely to interact. Similarly, the fourth-ranked feature indicates that quantifiers in separate conjuncts are unlikely to interact. The second-ranked feature in the table expresses a tendency for there to be no scope interaction between two quantifiers if the second of them is headed by all. This may be related to the independence of universal quantifiers (∀x∀y[φ] ⇔ ∀y∀x[φ]). Feature 17 in Table 8 indicates a high correlation between the first quantified expression's c-commanding the second and the first quantifier's taking wide scope, which again supports Kuno, Takami, and Wu's claim that scope preferences are related to syntactic superiority relations. Feature 18 expresses a preference for a quantified expression headed by most to take wide scope even if it is the second of the two quantifiers (since most is the only quantifier in the corpus that bears the tag RBS). Feature 19 indicates that the first quantifier is more likely to take wide scope if there is a clause boundary intervening between the two quantifiers, which supports the notion that the syntactic distance between the quantifiers is relevant to scope preferences. Finally, feature 20 expresses the well-known tendency for quantified expressions headed by each to take wide scope.

4.4 Summary of Results

Table 9 summarizes the performance of the quantifier scope models we have presented here. All of the classifiers have test set accuracy above the baseline, which a paired t-test reveals to be significant at the .001 level. The differences between the naive Bayes, maximum-entropy, and single-layer perceptron classifiers are not statistically significant.
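
The significance claims here rest on paired t-tests over the 10 cross-validation folds. A sketch of such a test, using SciPy and invented per-fold accuracies purely for illustration (the article does not report fold-by-fold figures), is given below.

    from scipy.stats import ttest_rel

    # hypothetical per-fold test accuracies for a classifier and for the baseline
    classifier_acc = [0.79, 0.75, 0.78, 0.76, 0.80, 0.74, 0.77, 0.78, 0.70, 0.78]
    baseline_acc   = [0.62, 0.60, 0.63, 0.61, 0.62, 0.59, 0.61, 0.63, 0.60, 0.61]

    t_stat, p_value = ttest_rel(classifier_acc, baseline_acc)
    print(f"t = {t_stat:.2f}, p = {p_value:.4g}")   # p < .001 would match the claim in the text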

The classifiers performed significantly better on those sentences annotated consistently by both human coders at the beginning of the study, reinforcing the view that this subset of the data is somehow simpler and more representative of the basic regularities in scope preferences. For example, the single-layer perceptron classified 82.9% of these sentences correctly. To further investigate the nature of the variation between the two coders, we constructed a version of our single-layer network that was trained

Table 9
Summary of classifier results.

Model                     Training data   Validation data   Test data
Baseline                  —               —                 61.2%
Naïve Bayes               76.7%           —                 72.6%
Maximum entropy           78.3%           75.5%             73.5%
Single-layer perceptron   84.7%           76.8%             77.0%

on the data on which both coders agreed and tested on the remaining sentences. This classifier agreed with the reference coding (the coding of the first coder) 51.4% of the time, and with the additional independent coder 35.8% of the time. The first coder constructed the annotation guidelines for this project and may have been more successful in applying them consistently. Alternatively, it is possible that different individuals use different strategies in determining scope preferences, and the strategy of the second coder may simply have been less similar than the strategy of the first coder to that of the single-layer network.

These three classifiers directly implement a sort of weighted voting, the method of aggregating evidence proposed by Kuno, Takami, and Wu (although the classifiers' implementation is slightly more sophisticated than the unweighted voting that is actually used in Kuno, Takami, and Wu's paper). Of course, since we do not use exactly the set of features suggested by Kuno, Takami, and Wu, our model should not be seen as a straightforward implementation of the theory outlined in their 1999 paper. Nevertheless, the results in Table 9 suggest that Kuno, Takami, and Wu's suggested design can be used with some success in modeling scope preferences. Moreover, the project undertaken here provides an answer to some of the objections that Aoun and Li (2000) raise to Kuno, Takami, and Wu. Aoun and Li claim that Kuno, Takami, and Wu's choice of experts is seemingly arbitrary and that it is unclear how the voting weights of each expert are to be set, but the machine learning approach we employ in this article is capable of addressing both of these potential problems. Supervised training of our classifiers is a straightforward approach to setting the weights, and it also constitutes our approach to selecting features (or "experts," in Kuno, Takami, and Wu's terminology). In the training process, any feature that is irrelevant to scoping preferences should receive weights that make its effect negligible.

5. Syntax and Scope

In this section we show how the classifier models of quantifier scope determination introduced in Section 4 may be integrated with a PCFG model of syntax. We compare two different ways in which the two components may be combined, which may loosely be termed serial and parallel, and argue for the latter on the basis of empirical results.

5.1 Modular Design

Our use of a phrase structure syntactic component and a quantifier scope component to define a combined language model is simplified by the fact that our classifiers are probabilistic and define a conditional probability distribution over quantifier scopings. The probability distributions that our classifiers define for quantifier scope structures are conditional on syntactic phrase structure, because they are computed on the basis of syntactically provided features, such as the number of nodes of a certain type that intervene between two quantifiers in a phrase structure tree.

Thus, the combined language model that we define in this article assigns probabilities according to the pairs of structures that may be assigned to a sentence by the Q-structure and phrase structure syntax modules. The probability of a word string w_{1-n} is therefore defined as in equation (4), where Q ranges over all possible Q-structures in the set 𝒬 and S ranges over all possible syntactic structures in the set 𝒮.

P(w_{1-n}) = \sum_{S \in \mathcal{S},\, Q \in \mathcal{Q}} P(S, Q \mid w_{1-n})    (4)

           = \sum_{S \in \mathcal{S},\, Q \in \mathcal{Q}} P(S \mid w_{1-n}) \, P(Q \mid S, w_{1-n})    (5)

Equation (5) shows how we can use the definition of conditional probability to break our calculation of the language model probability into two parts. The first of these parts, P(S | w_{1-n}), which we may abbreviate as simply P(S), is the probability of a particular syntactic tree structure's being assigned to a particular word string. We model this probability using a probabilistic phrase structure grammar (cf. Charniak [1993, 1996]). The second distribution on the right side of equation (5) is the conditional probability of a particular quantifier scope structure's being assigned to a particular word string, given the syntactic structure of that string. This probability is written as P(Q | S, w_{1-n}), or simply P(Q | S), and represents the quantity we estimated above in constructing classifiers to predict the scopal representation of a sentence based on aspects of its syntactic structure.

Thus, given a PCFG model of syntactic structure and a probabilistically defined classifier of the sort introduced in Section 4, it is simple to determine the probability of any pairing of two particular structures from each domain for a given sentence. We simply multiply the values of P(S) and P(Q | S) to obtain the joint probability P(Q, S). In the current section we examine two different models of combination for these components: one in which scope determination is applied to the optimal syntactic structure (the Viterbi parse), and one in which optimization is performed in the space of both modules to find the optimal pairing of syntactic and quantifier scope structures.
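
Concretely, the combination step is a single multiplication per candidate pairing. The sketch below assumes hypothetical parser and scope_classifier objects exposing the two conditional distributions; neither interface is specified in the article.

    def joint_probability(parser, scope_classifier, words, tree, scope_reading):
        # P(S, Q | w) = P(S | w) * P(Q | S, w)
        p_tree = parser.prob(tree, words)                            # P(S | w), from the PCFG
        p_scope = scope_classifier.prob(scope_reading, tree, words)  # P(Q | S, w), from Section 4
        return p_tree * p_scope

Summing this product over the candidate trees and the three scope readings for a sentence yields the language-model probability of equation (4).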

5.2 The Syntactic Module

Before turning to the application of our multimodular approach to the problem of scope determination in Section 5.3, we present here a short overview of the phrase structure syntactic component used in these projects. As noted above, we model syntax as a probabilistic phrase structure grammar (PCFG), and in particular we use a treebank grammar (Charniak 1996) trained on the Penn Treebank.

A PCFG defines the probability of a string of words as the sum of the probabilities of all admissible phrase structure parses (trees) for that string. The probability of a given tree is the product of the probabilities of all of the rule instances used in the construction of that tree, where rules take the form N → φ, with N a nonterminal symbol and φ a finite sequence of one or more terminals or nonterminals.

To take an example, Figure 5 illustrates a phrase structure tree for the sentence Susan might not believe you, which is admissible according to the grammar in Table 10. (All of the minimal subtrees in Figure 5 are instances of one of our rules.) The probability

Figure 5. A simple phrase structure tree: [S [NP Susan] [VP [V might] [VP [ADV not] [VP [V believe] [NP you]]]]].

Table 10
A simple probabilistic phrase structure grammar.

Rule                 Probability
S   → NP VP          .7
S   → VP             .2
S   → V NP VP        .1
VP  → V VP           .3
VP  → ADV VP         .1
VP  → V              .1
VP  → V NP           .3
VP  → V NP NP        .2
NP  → Susan          .3
NP  → you            .4
NP  → Yves           .3
V   → might          .2
V   → believe        .3
V   → show           .3
V   → stay           .2
ADV → not            .5
ADV → always         .5

of this tree, which we can indicate as τ, can be calculated as in equation (6), where R(τ) is the set of rule instances used in the construction of τ:

P(\tau) = \prod_{\xi \in R(\tau)} P(\xi)    (6)

        = P(\text{S} \to \text{NP VP}) \times P(\text{VP} \to \text{V VP}) \times P(\text{VP} \to \text{ADV VP})
          \times P(\text{VP} \to \text{V NP}) \times P(\text{NP} \to \text{Susan}) \times P(\text{V} \to \text{might})
          \times P(\text{ADV} \to \text{not}) \times P(\text{V} \to \text{believe}) \times P(\text{NP} \to \text{you})    (7)

        = .7 \times .3 \times .1 \times .3 \times .3 \times .2 \times .5 \times .3 \times .4 = 2.268 \times 10^{-5}    (8)
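
The arithmetic in equations (6) through (8) can be verified mechanically. The sketch below encodes the toy grammar of Table 10 as a dictionary and multiplies the probabilities of the rule instances that make up the tree in Figure 5; the representation of rules as (mother, daughters) pairs is our own convenience, not a format used in the article.

    # probabilities from Table 10, keyed by (mother, daughters)
    grammar = {
        ("S",   ("NP", "VP")):  0.7,
        ("VP",  ("V", "VP")):   0.3,
        ("VP",  ("ADV", "VP")): 0.1,
        ("VP",  ("V", "NP")):   0.3,
        ("NP",  ("Susan",)):    0.3,
        ("NP",  ("you",)):      0.4,
        ("V",   ("might",)):    0.2,
        ("V",   ("believe",)):  0.3,
        ("ADV", ("not",)):      0.5,
    }

    # rule instances used in the tree for "Susan might not believe you"
    tree_rules = [
        ("S",   ("NP", "VP")),
        ("VP",  ("V", "VP")),
        ("VP",  ("ADV", "VP")),
        ("VP",  ("V", "NP")),
        ("NP",  ("Susan",)),
        ("V",   ("might",)),
        ("ADV", ("not",)),
        ("V",   ("believe",)),
        ("NP",  ("you",)),
    ]

    p = 1.0
    for rule in tree_rules:
        p *= grammar[rule]
    print(p)   # 2.268e-05, as in equation (8)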

The actual grammar rules and associated probabilities that we use in defining our syntactic module are derived from the WSJ corpus of the Penn Treebank by maximum-likelihood estimation. That is, for each rule N → φ used in the treebank, we add the rule to the grammar and set its probability to

P(N \to \phi) = \frac{C(N \to \phi)}{\sum_{\psi} C(N \to \psi)}

where C(·) denotes the "count" of a rule (i.e., the number of times it is used in the corpus). A grammar composed in this manner is referred to as a treebank grammar, because its rules are directly derived from those in a treebank corpus.

We used sections 00–20 of the WSJ corpus of the Penn Treebank for collecting the rules and associated probabilities of our PCFG, which is implemented as a bottom-up chart parser. Before constructing the grammar, the treebank was preprocessed using known procedures (cf. Krotov et al. [1998]; Belz [2001]) to facilitate the construction of a rule list. Functional and anaphoric annotations (basically anything following a "-" in a node label; cf. Santorini [1990], Bies et al. [1995]) were removed from nonterminal labels. Nodes that dominate only "empty categories," such as traces, were removed. In addition, unary-branching constructions were removed by replacing the mother category in such a structure with the daughter node. (For example, given an instance of the rule X → Y Z, if the daughter category Y were expanded by the unary rule Y → W, our algorithm would induce the single rule X → W Z.) Finally, we discarded all rules that had more than 10 symbols on the right-hand side (an arbitrary limit of our parser implementation). This resulted in a total of 18,820 rules, of which 11,156 were discarded as hapax legomena, leaving 7,664 rules in our treebank grammar. Table 11 shows some of the rules in our grammar with the highest and lowest corpus counts.
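
A stripped-down version of this rule-collection and maximum-likelihood step might look as follows; the tree encoding is hypothetical, and the preprocessing described above (removing functional tags, empty categories, and unary branches) is assumed to have been done already.

    from collections import Counter

    def collect_rules(tree, counts):
        """Count (mother, daughters) rule instances in a tree given as
        (label, [children]), where terminal children are plain strings."""
        label, children = tree
        daughters = tuple(c if isinstance(c, str) else c[0] for c in children)
        counts[(label, daughters)] += 1
        for child in children:
            if not isinstance(child, str):
                collect_rules(child, counts)

    def treebank_grammar(trees, max_rhs=10, min_count=2):
        counts = Counter()
        for t in trees:
            collect_rules(t, counts)
        # discard overlong right-hand sides and hapax legomena
        counts = {r: c for r, c in counts.items()
                  if len(r[1]) <= max_rhs and c >= min_count}
        # relative-frequency (maximum-likelihood) estimate per mother category
        totals = Counter()
        for (mother, _), c in counts.items():
            totals[mother] += c
        return {rule: c / totals[rule[0]] for rule, c in counts.items()}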

5.3 Unlabeled Scope Determination

In this section we describe an experiment designed to assess the performance of parallel and serial approaches to combining grammatical modules, focusing on the task of unlabeled scope determination. This task involves predicting the most likely Q-structure representation for a sentence, basically the same task we addressed in Section 4 in comparing the performance levels of each type of classifier. The experiment of this section differs, however, from the task presented in Section 4 in that instead of providing a syntactic tree from the Penn Treebank as input to the classifier, we provide the model only with a string of words (a sentence). Our dual-component model will search for the optimal syntactic and scopal structures for the sentence (the pairing (τ*, χ*)) and will be evaluated based on its success in identifying the correct scope reading χ*.

Our concern in this section will be to determine whether it is necessary to search the space of possible pairings (τ, χ) of syntactic and scopal structures, or whether it is sufficient to use our PCFG first to fix the syntactic tree τ and then to choose

Table 11
Rules derived from sections 00–20 of the Penn Treebank WSJ corpus. "TOP" is a special "start" symbol that may expand to any of the symbols found at the root of a tree in the corpus.

Rule                                 Corpus count
PP → IN NP                           59,053
TOP → S                              34,614
NP → DT NN                           28,074
NP → NP PP                           25,192
S → NP VP .                          14,032
S → NP VP                            12,901
VP → TO VP                           11,598
...
S → CC PP NNP NNP VP                 2
NP → DT `` NN NN NN ''               2
NP → NP PP PP PP PP PP               2
INTJ → UH UH                         2
NP → DT `` NN NNS                    2
SBARQ → `` WP VP ''                  2
S → PP NP VP ''                      2

a scope reading to maximize the probability of the pairing. That is, are syntax and quantifier scope mutually dependent components of grammar, or can scope relations be "read off of" syntax? The serial model suggests that the optimal syntactic structure τ* should be chosen on the basis of the syntactic module only, as in equation (9), and the optimal quantifier scope structure χ* then chosen on the basis of τ*, as in equation (10). The parallel model, on the other hand, suggests that the most likely pairing of structures must be chosen in the joint probability space of both components, as in equation (11).

\tau^* = \arg\max_{\tau \in \mathcal{S}} P_S(\tau \mid w_{1-n})    (9)

\chi^* = \arg\max_{\chi \in \mathcal{Q}} P_Q(\chi \mid \tau^*, w_{1-n})    (10)

\tau^* = \{\, \tau \mid (\tau, \chi) = \arg\max_{\tau \in \mathcal{S},\, \chi \in \mathcal{Q}} P_S(\tau \mid w_{1-n}) \, P_Q(\chi \mid \tau, w_{1-n}) \,\}    (11)

5.3.1 Experimental Design. For this experiment, we implement the scoping component as a single-layer feed-forward network, because the single-layer perceptron classifier had the best prediction rate among the three classifiers tested in Section 4. The softmax activation function we use for the output nodes of the classifier guarantees that the activations of all of the output nodes sum to one and can be interpreted as class probabilities. The syntactic component, of course, is determined by the treebank PCFG grammar described above.

Given these two models, which respectively define P_Q(χ | τ, w_{1-n}) and P_S(τ | w_{1-n}) from equation (11), it remains only to specify how to search the space of pairings (τ, χ) in performing this optimization to find χ*. Unfortunately, it is not feasible to examine all values τ ∈ 𝒮, since our PCFG will generally admit a huge number of

Table 12
Performance of models on the unlabeled scope prediction task, summed over all 10 test runs.

Parallel model
Condition               Correct   Incorrect   Percentage correct
First has wide scope    168       113         168/281 = 59.8%
Second has wide scope   26        38          26/64   = 40.6%
No scope interaction    467       78          467/545 = 85.7%
Total                   661       229         661/890 = 74.3%

Serial model
Condition               Correct   Incorrect   Percentage correct
First has wide scope    163       118         163/281 = 58.0%
Second has wide scope   27        37          27/64   = 42.2%
No scope interaction    461       84          461/545 = 84.6%
Total                   651       239         651/890 = 73.1%

trees for a sentence (especially given a mean sentence length of over 20 words in the WSJ corpus).12 Our solution to this search problem is to make the simplifying assumption that the syntactic tree that is used in the optimal set of structures (τ*, χ*) will always be among the top few trees τ for which P_S(τ | w_{1-n}) is the greatest. That is, although we suppose that quantifier scope information is relevant to parsing, we do not suppose that it is so strong a determinant as to completely override syntactic factors. In practice, this means that our parser will return the top 10 parses for each sentence, along with the probabilities assigned to them, and these are the only parses that are considered in looking for the optimal set of linguistic structures.

We again used 10-fold cross-validation in evaluating the competing models, dividing the scope-tagged corpus into 10 test sections of 89 sentences each, and we used the same version of the treebank grammar for our PCFG. The first model retrieved the top 10 syntactic parses (τ_0, ..., τ_9) for each sentence and computed the probability P(τ, χ) for each τ ∈ {τ_0, ..., τ_9}, χ ∈ {0, 1, 2}, choosing that scopal representation χ that was found in the maximum-probability pairing. We call this the parallel model, because the properties of each probabilistic model may influence the optimal structure chosen by the other. The second model retrieved only the Viterbi parse τ_0 from the PCFG and chose the scopal representation χ for which the pairing (τ_0, χ) took on the highest probability. We call this the serial model, because it represents syntactic phrase structure as independent of other components of grammar (in this case quantifier scope), though other components are dependent upon it.
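
In code, the serial and parallel procedures differ only in where the argmax is taken. The sketch below assumes a parser with an n-best interface and a scope classifier returning a distribution over the three readings; both interfaces are our own invention for illustration.

    def serial_scope(parser, scope_classifier, words):
        # fix the Viterbi parse first, then pick the best scope reading for it
        best_tree, _p = parser.nbest(words, n=1)[0]
        scope_probs = scope_classifier.probs(best_tree, words)   # {0: p0, 1: p1, 2: p2}
        return max(scope_probs, key=scope_probs.get)

    def parallel_scope(parser, scope_classifier, words, n=10):
        # search the joint space of the top-n parses and the three scope readings
        best_p, best_reading = -1.0, None
        for tree, p_tree in parser.nbest(words, n=n):
            scope_probs = scope_classifier.probs(tree, words)
            for reading, p_scope in scope_probs.items():
                if p_tree * p_scope > best_p:
                    best_p, best_reading = p_tree * p_scope, reading
        return best_reading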

5.3.2 Results. There was an appreciable difference in performance between these two models on the quantifier scope test sets. As shown in Table 12, the parallel model narrowly outperformed the serial model, by 1.2%. A 10-fold paired t-test on the test sections of the scope-tagged corpus shows that the parallel model is significantly better (p < .05).

12. Since we are allowing χ to range only over the three scope readings (0, 1, 2), however, it is possible to enumerate all values of χ to be paired with a given syntactic tree τ.

This result suggests that in determining the syntactic structure of a sentence, we must take aspects of structure into account that are not purely syntactic (such as quantifier scope). Searching both dimensions of the hypothesis space for our dual-component model allowed the composite model to handle the interdependencies between different aspects of grammatical structure, whereas fixing a phrase structure tree purely on the basis of syntactic considerations led to suboptimal performance in using that structure as a basis for determining quantifier scope.

6. Conclusion

In this article we have taken a statistical, corpus-based approach to the modeling of quantifier scope preferences, a subject that has previously been addressed only with systems of ad hoc rules derived from linguists' intuitive judgments. Our model takes its theoretical inspiration from Kuno, Takami, and Wu (1999), who suggest an "expert system" approach to scope preferences, and follows many other projects in the machine learning of natural language that combine information from multiple sources in solving linguistic problems.

Our results are generally supportive of the design that Kuno, Takami, and Wu propose for the quantifier scope component of grammar, and some of the features induced by our models find clear parallels in the factors that Kuno, Takami, and Wu claim to be relevant to scoping. In addition, our final experiment, in which we combine our quantifier scope module with a PCFG model of syntactic phrase structure, provides evidence of a grammatical architecture in which different aspects of structure mutually constrain one another. This result casts doubt on approaches in which syntactic processing is completed prior to the determination of other grammatical properties of a sentence, such as quantifier scope relations.

Appendix: Selected Codes Used to Annotate Syntactic Categories in the Penn Treebank, from Marcus et al. (1993) and Bies et al. (1995)

Part-of-speech tags

Tag    Meaning                      Tag    Meaning
CC     Conjunction                  RB     Adverb
CD     Cardinal number              RBR    Comparative adverb
DT     Determiner                   RBS    Superlative adverb
IN     Preposition                  TO     "to"
JJ     Adjective                    UH     Interjection
JJR    Comparative adjective        VB     Verb in base form
JJS    Superlative adjective        VBD    Past-tense verb
NN     Singular or mass noun        VBG    Gerundive verb
NNS    Plural noun                  VBN    Past participial verb
NNP    Singular proper noun         VBP    Non-3sg present-tense verb
NNPS   Plural proper noun           VBZ    3sg present-tense verb
PDT    Predeterminer                WP     WH pronoun
PRP    Personal pronoun             WP$    Possessive WH pronoun
PRP$   Possessive pronoun

Phrasal categories

Code    Meaning
ADJP    Adjective phrase
ADVP    Adverb phrase
INTJ    Interjection
NP      Noun phrase
PP      Prepositional phrase
QP      Quantifier phrase (i.e., measure/amount phrase)
S       Declarative clause
SBAR    Clause introduced by a subordinating conjunction
SBARQ   Clause introduced by a WH phrase
SINV    Inverted declarative sentence
SQ      Inverted yes/no question following the WH phrase in SBARQ
VP      Verb phrase

Acknowledgments
The authors are grateful for an Academic Technology Innovation Grant from the University of Chicago, which helped to make this work possible, and to John Goldsmith, Terry Regier, Anne Pycha, and Bob Moore, whose advice and collaboration have considerably aided the research reported in this article. Any remaining errors are, of course, our own.

References
Aoun, Joseph and Yen-hui Audrey Li. 1993. The Syntax of Scope. MIT Press, Cambridge.
Aoun, Joseph and Yen-hui Audrey Li. 2000. Scope, structure, and expert systems: A reply to Kuno et al. Language, 76(1):133–155.
Belz, Anja. 2001. Optimisation of corpus-derived probabilistic grammars. In Proceedings of Corpus Linguistics 2001, pages 46–57.
Berger, Adam L., Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.
Bies, Ann, Mark Ferguson, Karen Katz, and Robert MacIntyre. 1995. Bracketing guidelines for Treebank II style. Technical report, Penn Treebank Project, University of Pennsylvania.
Bishop, Christopher M. 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bridle, John S. 1990. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fougelman-Soulie and J. Herault, editors, Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag, Berlin, pages 227–236.
Brill, Eric. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565.
Carden, Guy. 1976. English Quantifiers: Logical Structure and Linguistic Variation. Academic Press, New York.
Carletta, Jean. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249–254.
Charniak, Eugene. 1993. Statistical Language Learning. MIT Press, Cambridge.
Charniak, Eugene. 1996. Tree-bank grammars. In AAAI/IAAI, vol. 2, pages 1031–1036.
Cohen, Jacob. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46.
Cooper, Robin. 1983. Quantification and Syntactic Theory. Reidel, Dordrecht.
Fleiss, Joseph L. 1981. Statistical Methods for Rates and Proportions. John Wiley & Sons, New York.
Hobbs, Jerry R. and Stuart M. Shieber. 1987. An algorithm for generating quantifier scopings. Computational Linguistics, 13:47–63.
Hornstein, Norbert. 1995. Logical Form: From GB to Minimalism. Blackwell, Oxford and Cambridge.
Jelinek, Frederick. 1998. Statistical Methods for Speech Recognition. MIT Press, Cambridge.
Jurafsky, Daniel and James H. Martin. 2000. Speech and Language Processing. Prentice Hall, Upper Saddle River, New Jersey.
Krippendorff, Klaus. 1980. Content Analysis: An Introduction to Its Methodology. Sage Publications, Beverly Hills, California.
Krotov, Alexander, Mark Hepple, Robert J. Gaizauskas, and Yorick Wilks. 1998. Compacting the Penn treebank grammar. In COLING-ACL, pages 699–703.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 1999. Quantifier scope in English, Chinese, and Japanese. Language, 75(1):63–111.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 2001. Response to Aoun and Li. Language, 77(1):134–143.
Manning, Christopher D. and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge.
Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
Martin, Paul, Douglas Appelt, and Fernando Pereira. 1986. Transportability and generality in a natural-language interface system. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 585–593.
May, Robert. 1985. Logical Form: Its Structure and Derivation. MIT Press, Cambridge.
McCawley, James D. 1998. The Syntactic Phenomena of English. University of Chicago Press, Chicago, second edition.
Mitchell, Tom M. 1996. Machine Learning. McGraw Hill, New York.
Montague, Richard. 1973. The proper treatment of quantification in ordinary English. In J. Hintikka et al., editors, Approaches to Natural Language. Reidel, Dordrecht, pages 221–242.
Moran, Douglas B. 1988. Quantifier scoping in the SRI core language engine. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics (ACL'88), pages 33–40.
Moran, Douglas B. and Fernando C. N. Pereira. 1992. Quantifier scoping. In Hiyan Alshawi, editor, The Core Language Engine. MIT Press, Cambridge, pages 149–172.
Nerbonne, John. 1993. A feature-based syntax/semantics interface. In A. Manaster-Ramer and W. Zadrozny, editors, Annals of Mathematics and Artificial Intelligence (Special Issue on Mathematics of Language), 8(1–2):107–132. Also published as DFKI Research Report RR-92-42.
Park, Jong C. 1995. Quantifier scope and constituency. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL'95), pages 205–212.
Pedersen, Ted. 2000. A simple approach to building ensembles of naive Bayesian classifiers for word sense disambiguation. In Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2000), pages 63–69.
Pereira, Fernando. 1990. Categorial semantics and scoping. Computational Linguistics, 16(1):1–10.
Pollard, Carl. 1989. The syntax-semantics interface in a unification-based phrase structure grammar. In S. Busemann, C. Hauenschild, and C. Umbach, editors, Views of the Syntax/Semantics Interface. KIT-FAST Report 74, Technical University of Berlin, pages 167–185.
Pollard, Carl and Eun Jung Yoo. 1998. A unified theory of scope for quantifiers and WH-phrases. Journal of Linguistics, 34(2):415–446.
Ratnaparkhi, Adwait. 1997. A simple introduction to maximum entropy models for natural language processing. Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania.
Santorini, Beatrice. 1990. Part-of-speech tagging guidelines for the Penn Treebank project. Technical Report MS-CIS-90-47, Department of Computer and Information Science, University of Pennsylvania.
Soon, Wee Meng, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544.
van Halteren, Hans, Jakub Zavrel, and Walter Daelemans. 2001. Improving accuracy in word class tagging through the combination of machine learning systems. Computational Linguistics, 27(2):199–229.
VanLehn, Kurt A. 1978. Determining the scope of English quantifiers. Technical Report AITR-483, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Cambridge.
Woods, William A. 1986. Semantics and quantification in natural language question answering. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 205–248.

The features ranked rst and fth in Table 6 express lexical preferences for certainquantiers to take wide scope even when they are the second of the two quantiersaccording to linear order in the string of words The tendency for each to take widescope is stronger than for the other quantier which is in line with Kuno Takamiand Wursquos decision to list it as the only quantier with a lexical preference for scopingFeature 2 makes the ldquono scope interactionrdquo class more likely if a comma intervenes and

11 Overtraining is not a problem with the pure version of the generalized iterative scaling algorithm Forefciency reasons however we chose to take the training corpus as representative of the event spacerather than enumerating the space exhaustively (see Jelinek [1998] for details) For this reason it wasnecessary to employ validation in training

86

Computational Linguistics Volume 29 Number 1

Table 7Performance of the single-layer perceptron summed over all 10 test runs

Condition Correct Incorrect Percentage correct

First has wide scope 182 99 182281 = 648Second has wide scope 35 29 3564 = 547

No scope interaction 468 77 468545 = 859Total 685 205 685890 = 770

feature 25 makes a wide-scope reading for the rst quantier more likely if there is nointervening comma The third-ranked feature expresses the tendency mentioned abovefor quantiers in conjoined clauses not to interact Features 4 and 12 indicate that if therst quantied expression c-commands the second it is likely to take wide scope andthat if this is not the case there is likely to be no scope interaction Finally the sixth-and seventh-ranked features in the table show that an intervening quotation mark orcolon will make the classier tend toward class 0 ldquono scope interactionrdquo which is easyto understand Quotations are often opaque to quantier scope interactions The topfeatures found by the maximum-entropy classier largely coincide with those foundby the naive Bayes model which indicates that these generalizations are robust andobjectively present in the data

433 Single-Layer Perceptron For our neural network classier we employed a feed-forward single-layer perceptron with the softmax function used to determine the acti-vation of nodes at the output layer because this is a one-of-n classication task (Bridle1990) The data to be classied are presented as a vector of features at the input layerand the output layer has three nodes representing the three possible classes for thedata ldquorst has wide scoperdquo ldquosecond has wide scoperdquo and ldquono scope interactionrdquoThe output node with the highest activation is interpreted as the class of the datumpresented at the input layer

For each of the 10 test sets of 89 examples we trained the connection weightsof the network using error backpropagation on 654 training sentences reserving 150sentences for validation in order to choose the weights from the training epoch with thehighest classication performance In Table 7 we present the results of the single-layerneural network in classifying our test sentences As the table shows the single-layerperceptron has much better classication performance than the naive Bayes classierand maximum-entropy model possibly because the training of the network aims tominimize error in the activation of the classication output nodes which is directlyrelated to the classication task at hand whereas the other models do not directlymake use of the notion of ldquoclassication errorrdquo The perceptron also uses a sort ofweighted voting and could be interpreted as an implementation of Kuno Takamiand Wursquos proposal for scope determination This clearly illustrates that the tenabilityof their proposal hinges on the exact details of its implementation since all of ourclassier models are reasonable interpretations of their approach but they have verydifferent performance results on our scope determination task

To determine exactly which features of the data the network sees as relevant tothe classication problem we can simply look at the connection weights for eachfeature-class pair Higher connection weights indicate a greater correlation betweeninput features and output classes For one of the 10 networks we trained some ofthe features with the highest connection weights are listed in Table 8 Since class 0 is

87

Higgins and Sadock Modeling Scope Preferences

Table 8Most active features from single-layer perceptron

Rank Feature Predicted Weightclass

1 There is an intervening comma 0 4312 Second quantier is all 0 3773 There is an intervening colon 0 2984 There is an intervening conjunct node 0 272

17 The rst quantied NP c-commands the second 1 16918 Second quantier is tagged RBS 2 16919 There is an intervening S node 1 16120 Second quantier is each 2 150

simply more frequent in the training data than the other two classes the weights forthis class tend to be higher Therefore we also list some of the best predictor featuresfor classes 1 and 2 in the table

The rst- and third-ranked features in Table 8 show that an intervening comma orcolon will make the classier tend toward class 0 ldquono scope interactionrdquo This ndingby the classier is similar to the maximum-entropy classierrsquos nding an interveningquotation mark relevant and can be taken as an indication that quantiers in distantsyntactic subdomains are unlikely to interact Similarly the fourth-ranked feature indi-cates that quantiers in separate conjuncts are unlikely to interact The second-rankedfeature in the table expresses a tendency for there to be no scope interaction betweentwo quantiers if the second of them is headed by all This may be related to theindependence of universal quantiers (8x8y[iquest ] () 8y8x[iquest ]) Feature 17 in Table 8indicates a high correlation between the rst quantied expressionrsquos c-commandingthe second and the rst quantierrsquos taking wide scope which again supports KunoTakami and Wursquos claim that scope preferences are related to syntactic superiority re-lations Feature 18 expresses a preference for a quantied expression headed by mostto take wide scope even if it is the second of the two quantiers (since most is theonly quantier in the corpus that bears the tag RBS) Feature 19 indicates that therst quantier is more likely to take wide scope if there is a clause boundary in-tervening between the two quantiers which supports the notion that the syntacticdistance between the quantiers is relevant to scope preferences Finally feature 20expresses the well-known tendency for quantied expressions headed by each to takewide scope

44 Summary of ResultsTable 9 summarizes the performance of the quantier scope models we have presentedhere All of the classiers have test set accuracy above the baseline which a pairedt-test reveals to be signicant at the 001 level The differences between the naiveBayes maximum-entropy and single-layer perceptron classiers are not statisticallysignicant

The classiers performed signicantly better on those sentences annotated consis-tently by both human coders at the beginning of the study reinforcing the view thatthis subset of the data is somehow simpler and more representative of the basic regu-larities in scope preferences For example the single-layer perceptron classied 829of these sentences correctly To further investigate the nature of the variation betweenthe two coders we constructed a version of our single-layer network that was trained

88

Computational Linguistics Volume 29 Number 1

Table 9Summary of classier results

Training data Validation data Test data

Baseline mdash mdash 612Na otilde ve Bayes 767 mdash 726

Maximum entropy 783 755 735Single-layerperceptron 847 768 770

on the data on which both coders agreed and tested on the remaining sentences Thisclassier agreed with the reference coding (the coding of the rst coder) 514 of thetime and with the additional independent coder 358 of the time The rst coder con-structed the annotation guidelines for this project and may have been more successfulin applying them consistently Alternatively it is possible that different individuals usedifferent strategies in determining scope preferences and the strategy of the secondcoder may simply have been less similar than the strategy of the rst coder to that ofthe single-layer network

These three classiers directly implement a sort of weighted voting the methodof aggregating evidence proposed by Kuno Takami and Wu (although the classiersrsquoimplementation is slightly more sophisticated than the unweighted voting that is ac-tually used in Kuno Takami and Wursquos paper) Of course since we do not use exactlythe set of features suggested by Kuno Takami and Wu our model should not beseen as a straightforward implementation of the theory outlined in their 1999 paperNevertheless the results in Table 9 suggest that Kuno Takami and Wursquos suggesteddesign can be used with some success in modeling scope preferences Moreover theproject undertaken here provides an answer to some of the objections that Aoun andLi (2000) raise to Kuno Takami and Wu Aoun and Li claim that Kuno Takami andWursquos choice of experts is seemingly arbitrary and that it is unclear how the votingweights of each expert are to be set but the machine learning approach we employin this article is capable of addressing both of these potential problems Supervisedtraining of our classiers is a straightforward approach to setting the weights andalso constitutes our approach to selecting features (or ldquoexpertsrdquo in Kuno Takami andWursquos terminology) In the training process any feature that is irrelevant to scopingpreferences should receive weights that make its effect negligible

5 Syntax and Scope

In this section we show how the classier models of quantier scope determinationintroduced in Section 4 may be integrated with a PCFG model of syntax We com-pare two different ways in which the two components may be combined which mayloosely be termed serial and parallel and argue for the latter on the basis of empiricalresults

51 Modular DesignOur use of a phrase structure syntactic component and a quantier scope componentto dene a combined language model is simplied by the fact that our classiers areprobabilistic and dene a conditional probability distribution over quantier scopingsThe probability distributions that our classiers dene for quantier scope structuresare conditional on syntactic phrase structure because they are computed on the basis

89

Higgins and Sadock Modeling Scope Preferences

of syntactically provided features such as the number of nodes of a certain type thatintervene between two quantiers in a phrase structure tree

Thus the combined language model that we dene in this article assigns probabil-ities according to the pairs of structures that may be assigned to a sentence by the Q-structure and phrase structure syntax modules The probability of a word string w1iexclnis therefore dened as in equation (4) where Q ranges over all possible Q-structuresin the set Q and S ranges over all possible syntactic structures in the set S

P(w1iexcln) =X

S2 SQ 2 QP(S Q j w1iexcln) (4)

=X

S2 SQ 2 QP(S j w1iexcln)P(Q j S w1iexcln) (5)

Equation (5) shows how we can use the denition of conditional probability tobreak our calculation of the language model probability into two parts The rst ofthese parts P(S j w1iexcln) which we may abbreviate as simply P(S) is the probabilityof a particular syntactic tree structurersquos being assigned to a particular word string Wemodel this probability using a probabilistic phrase structure grammar (cf Charniak[1993 1996]) The second distribution on the right side of equation (5) is the conditionalprobability of a particular quantier scope structurersquos being assigned to a particularword string given the syntactic structure of that string This probability is written asP(Q j S w1iexcln) or simply P(Q j S) and represents the quantity we estimated abovein constructing classiers to predict the scopal representation of a sentence based onaspects of its syntactic structure

Thus, given a PCFG model of syntactic structure and a probabilistically defined classifier of the sort introduced in Section 4, it is simple to determine the probability of any pairing of two particular structures from each domain for a given sentence. We simply multiply the values of P(S) and P(Q | S) to obtain the joint probability P(Q, S). In the current section we examine two different models of combination for these components: one in which scope determination is applied to the optimal syntactic structure (the Viterbi parse) and one in which optimization is performed in the space of both modules to find the optimal pairing of syntactic and quantifier scope structures.
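The following sketch illustrates this combination, assuming we already have the PCFG's distribution over candidate parses and the classifier's conditional distribution over scopings for each parse; the probability values below are invented for illustration, not taken from the trained models.

    # A minimal sketch of scoring (parse, scoping) pairs by multiplying the two modules'
    # probabilities, as in P(Q, S) = P(S) * P(Q | S). The numbers are invented.
    p_parse = {"tree_a": 0.6, "tree_b": 0.4}            # P(S | w) from the PCFG
    p_scope_given_parse = {                             # P(Q | S, w) from the scope classifier
        "tree_a": {0: 0.7, 1: 0.2, 2: 0.1},
        "tree_b": {0: 0.3, 1: 0.6, 2: 0.1},
    }

    def joint(parse, scoping):
        """Joint probability of one pairing of structures."""
        return p_parse[parse] * p_scope_given_parse[parse][scoping]

    pairs = [(s, q) for s in p_parse for q in (0, 1, 2)]
    best = max(pairs, key=lambda p: joint(*p))
    print(best, joint(*best))    # ('tree_a', 0) with probability 0.42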

5.2 The Syntactic Module
Before turning to the application of our multimodular approach to the problem of scope determination in Section 5.3, we present here a short overview of the phrase structure syntactic component used in these projects. As noted above, we model syntax as a probabilistic phrase structure grammar (PCFG), and in particular we use a treebank grammar (Charniak 1996) trained on the Penn Treebank.

A PCFG defines the probability of a string of words as the sum of the probabilities of all admissible phrase structure parses (trees) for that string. The probability of a given tree is the product of the probabilities of all of the rule instances used in the construction of that tree, where rules take the form N → φ, with N a nonterminal symbol and φ a finite sequence of one or more terminals or nonterminals.

To take an example, Figure 5 illustrates a phrase structure tree for the sentence Susan might not believe you, which is admissible according to the grammar in Table 10. (All of the minimal subtrees in Figure 5 are instances of one of our rules.)



Figure 5. A simple phrase structure tree: [S [NP Susan] [VP [V might] [VP [ADV not] [VP [V believe] [NP you]]]]].

Table 10. A simple probabilistic phrase structure grammar.

Rule Probability

S → NP VP        .7
S → VP           .2
S → V NP VP      .1

VP → V VP        .3
VP → ADV VP      .1
VP → V           .1
VP → V NP        .3
VP → V NP NP     .2

NP → Susan       .3
NP → you         .4
NP → Yves        .3

V → might        .2
V → believe      .3
V → show         .3
V → stay         .2

ADV → not        .5
ADV → always     .5



The probability of this tree, which we can indicate as τ, can be calculated as in equation (6):

    P(\tau) = \prod_{\xi \in \tau} P(\xi)                                                     (6)

            = P(S \to NP\ VP) \times P(VP \to V\ VP) \times P(VP \to ADV\ VP)
              \times P(VP \to V\ NP) \times P(NP \to Susan) \times P(V \to might)
              \times P(ADV \to not) \times P(V \to believe) \times P(NP \to you)              (7)

            = .7 \times .3 \times .1 \times .3 \times .3 \times .2 \times .5 \times .3 \times .4 = 2.268 \times 10^{-5}    (8)
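The same computation can be carried out mechanically; the sketch below encodes the Table 10 rule probabilities and the Figure 5 tree and recovers the value in equation (8).

    # Computing the probability of the Figure 5 tree from the Table 10 grammar,
    # as the product of the probabilities of its rule instances (equation (6)).
    rule_probs = {
        ("S", ("NP", "VP")): 0.7,
        ("VP", ("V", "VP")): 0.3,
        ("VP", ("ADV", "VP")): 0.1,
        ("VP", ("V", "NP")): 0.3,
        ("NP", ("Susan",)): 0.3,
        ("V", ("might",)): 0.2,
        ("ADV", ("not",)): 0.5,
        ("V", ("believe",)): 0.3,
        ("NP", ("you",)): 0.4,
    }

    # The tree for "Susan might not believe you", as nested (label, children) pairs.
    tree = ("S",
            [("NP", [("Susan", [])]),
             ("VP", [("V", [("might", [])]),
                     ("VP", [("ADV", [("not", [])]),
                             ("VP", [("V", [("believe", [])]),
                                     ("NP", [("you", [])])])])])])

    def tree_probability(node):
        label, children = node
        if not children:                  # a terminal (word) contributes no rule
            return 1.0
        rhs = tuple(child[0] for child in children)
        p = rule_probs[(label, rhs)]
        for child in children:
            p *= tree_probability(child)
        return p

    print(tree_probability(tree))         # 2.268e-05, matching equation (8)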

The actual grammar rules and associated probabilities that we use in defining our syntactic module are derived from the WSJ corpus of the Penn Treebank by maximum-likelihood estimation. That is, for each rule N → φ used in the treebank, we add the rule to the grammar and set its probability to

    C(N \to \varphi) \,/\, \sum_{\psi} C(N \to \psi),

where C(·) denotes the "count" of a rule (i.e., the number of times it is used in the corpus). A grammar composed in this manner is referred to as a treebank grammar, because its rules are directly derived from those in a treebank corpus.

We used sections 00–20 of the WSJ corpus of the Penn Treebank for collecting the rules and associated probabilities of our PCFG, which is implemented as a bottom-up chart parser. Before constructing the grammar, the treebank was preprocessed using known procedures (cf. Krotov et al. [1998]; Belz [2001]) to facilitate the construction of a rule list. Functional and anaphoric annotations (basically anything following a "-" in a node label; cf. Santorini [1990]; Bies et al. [1995]) were removed from nonterminal labels. Nodes that dominate only "empty categories," such as traces, were removed. In addition, unary-branching constructions were removed by replacing the mother category in such a structure with the daughter node. (For example, given an instance of the rule X → Y Z, if the daughter category Y were expanded by the unary rule Y → W, our algorithm would induce the single rule X → W Z.) Finally, we discarded all rules that had more than 10 symbols on the right-hand side (an arbitrary limit of our parser implementation). This resulted in a total of 18,820 rules, of which 11,156 were discarded as hapax legomena, leaving 7,664 rules in our treebank grammar. Table 11 shows some of the rules in our grammar with the highest and lowest corpus counts.
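A simplified sketch of this extraction procedure is given below; it omits the removal of empty categories and the hapax legomena filter, and the example tree is invented rather than taken from the treebank.

    # A simplified sketch of treebank-grammar extraction along the lines described above:
    # strip functional annotations, collapse unary branches, count rules, and convert
    # counts to maximum-likelihood probabilities. Trees are (label, children) pairs with
    # words as bare strings.
    from collections import defaultdict

    def strip_annotation(label):
        """Drop functional/anaphoric annotation from a node label: 'NP-SBJ' -> 'NP'."""
        return label.split("-")[0]

    def normalize(node):
        """Strip annotations and collapse unary branches (the daughter replaces the mother)."""
        if isinstance(node, str):          # a word
            return node
        label, children = node
        children = [normalize(c) for c in children]
        if len(children) == 1 and not isinstance(children[0], str):
            return children[0]             # unary phrasal branch: keep only the daughter
        return (strip_annotation(label), children)

    def count_rules(node, counts):
        """Count one rule instance per internal node of the cleaned tree."""
        if isinstance(node, str):
            return
        label, children = node
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        if len(rhs) <= 10:                 # discard overly long right-hand sides
            counts[(label, rhs)] += 1
        for c in children:
            count_rules(c, counts)

    def mle_probabilities(counts):
        """P(N -> phi) = C(N -> phi) / sum of C(N -> psi) over all expansions psi of N."""
        totals = defaultdict(int)
        for (lhs, _), c in counts.items():
            totals[lhs] += c
        return {rule: count / totals[rule[0]] for rule, count in counts.items()}

    # One invented tree: a function tag on the subject NP and two unary branches.
    tree = ("S", [("NP-SBJ", [("NNP", ["Susan"])]),
                  ("VP", [("VBD", ["left"]), ("ADVP", [("RB", ["early"])])])])
    counts = defaultdict(int)
    count_rules(normalize(tree), counts)
    print(mle_probabilities(counts))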

5.3 Unlabeled Scope Determination
In this section we describe an experiment designed to assess the performance of parallel and serial approaches to combining grammatical modules, focusing on the task of unlabeled scope determination. This task involves predicting the most likely Q-structure representation for a sentence, basically the same task we addressed in Section 4 in comparing the performance levels of each type of classifier. The experiment of this section differs, however, from the task presented in Section 4 in that instead of providing a syntactic tree from the Penn Treebank as input to the classifier, we provide the model only with a string of words (a sentence). Our dual-component model will search for the optimal syntactic and scopal structures for the sentence (the pairing (τ*, χ*)) and will be evaluated based on its success in identifying the correct scope reading χ*.

Our concern in this section will be to determine whether it is necessary to search the space of possible pairings (τ, χ) of syntactic and scopal structures, or whether it is sufficient to use our PCFG first to fix the syntactic tree τ and then to choose a scope reading to maximize the probability of the pairing.



Table 11. Rules derived from sections 00–20 of the Penn Treebank WSJ corpus. "TOP" is a special "start" symbol that may expand to any of the symbols found at the root of a tree in the corpus.

Rule Corpus count

PP → IN NP                        59,053
TOP → S                           34,614
NP → DT NN                        28,074
NP → NP PP                        25,192
S → NP VP .                       14,032
S → NP VP                         12,901
VP → TO VP                        11,598

S → CC PP NNP NNP VP                   2
NP → DT `` NN NN NN ''                 2
NP → NP PP PP PP PP PP                 2
INTJ → UH UH                           2
NP → DT `` NN NNS                      2
SBARQ → `` WP VP . ''                  2
S → PP NP VP . ''                      2

That is, are syntax and quantifier scope mutually dependent components of grammar, or can scope relations be "read off of" syntax? The serial model suggests that the optimal syntactic structure τ* should be chosen on the basis of the syntactic module only, as in equation (9), and the optimal quantifier scope structure χ* then chosen on the basis of τ*, as in equation (10). The parallel model, on the other hand, suggests that the most likely pairing of structures must be chosen in the joint probability space of both components, as in equation (11).

    \tau^{*} = \arg\max_{\tau \in \mathcal{S}} P_{S}(\tau \mid w_{1-n})                                                            (9)

    \chi^{*} = \arg\max_{\chi \in \mathcal{Q}} P_{Q}(\chi \mid \tau^{*}, w_{1-n})                                                  (10)

    \tau^{*} = \{\, \tau \mid (\tau, \chi) = \arg\max_{\tau \in \mathcal{S},\, \chi \in \mathcal{Q}} P_{S}(\tau \mid w_{1-n})\, P_{Q}(\chi \mid \tau, w_{1-n}) \,\}    (11)

5.3.1 Experimental Design. For this experiment we implement the scoping component as a single-layer feed-forward network, because the single-layer perceptron classifier had the best prediction rate among the three classifiers tested in Section 4. The softmax activation function we use for the output nodes of the classifier guarantees that the activations of all of the output nodes sum to one and can be interpreted as class probabilities. The syntactic component, of course, is determined by the treebank PCFG grammar described above.
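The following sketch shows the form of this output layer: a single layer of weights followed by a softmax, whose three activations can be read as the class probabilities P_Q(χ | τ, w_{1-n}). The weight matrix and feature vector here are invented placeholders, not trained values.

    # A minimal sketch of a single-layer network with a softmax output layer.
    import numpy as np

    def softmax(z):
        z = z - z.max()                      # subtract the max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    n_features, n_classes = 4, 3             # classes 0, 1, 2, as in Section 4
    rng = np.random.default_rng(0)
    W = rng.normal(size=(n_classes, n_features))   # connection weights (invented here)
    x = np.array([1.0, 0.0, 1.0, 1.0])             # feature vector for one sentence (invented)

    p = softmax(W @ x)
    print(p, p.sum())                        # three class probabilities summing to 1.0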

Given these two models, which respectively define P_Q(χ | τ, w_{1-n}) and P_S(τ | w_{1-n}) from equation (11), it remains only to specify how to search the space of pairings (τ, χ) in performing this optimization to find χ*. Unfortunately, it is not feasible to examine all values τ ∈ S, since our PCFG will generally admit a huge number of trees for a sentence (especially given a mean sentence length of over 20 words in the WSJ corpus).12



Table 12. Performance of models on the unlabeled scope prediction task, summed over all 10 test runs.

Condition Correct Incorrect Percentage correct

Parallel model
First has wide scope        168       113       168/281 = 59.8%
Second has wide scope        26        38        26/64  = 40.6%
No scope interaction        467        78       467/545 = 85.7%
Total                       661       229       661/890 = 74.3%

Serial model
First has wide scope        163       118       163/281 = 58.0%
Second has wide scope        27        37        27/64  = 42.2%
No scope interaction        461        84       461/545 = 84.6%
Total                       651       239       651/890 = 73.1%

Our solution to this search problem is to make the simplifying assumption that the syntactic tree that is used in the optimal set of structures (τ*, χ*) will always be among the top few trees τ for which P_S(τ | w_{1-n}) is the greatest. That is, although we suppose that quantifier scope information is relevant to parsing, we do not suppose that it is so strong a determinant as to completely override syntactic factors. In practice, this means that our parser will return the top 10 parses for each sentence, along with the probabilities assigned to them, and these are the only parses that are considered in looking for the optimal set of linguistic structures.

We again used 10-fold cross-validation in evaluating the competing models, dividing the scope-tagged corpus into 10 test sections of 89 sentences each, and we used the same version of the treebank grammar for our PCFG. The first model retrieved the top 10 syntactic parses (τ_0, ..., τ_9) for each sentence and computed the probability P(τ, χ) for each τ ∈ {τ_0, ..., τ_9}, χ ∈ {0, 1, 2}, choosing the scopal representation χ that was found in the maximum-probability pairing. We call this the parallel model, because the properties of each probabilistic model may influence the optimal structure chosen by the other. The second model retrieved only the Viterbi parse τ_0 from the PCFG and chose the scopal representation χ for which the pairing (τ_0, χ) took on the highest probability. We call this the serial model, because it represents syntactic phrase structure as independent of other components of grammar (in this case, quantifier scope), though other components are dependent upon it.
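The two search strategies can be stated compactly as below, following equations (9)-(11) but restricted to the candidate parses returned by the parser; the parse probabilities and scope distributions are invented stand-ins, chosen so that the two strategies disagree.

    # A sketch of the serial and parallel strategies over the top parses returned by the
    # PCFG. `parses` pairs each candidate tree with its probability P_S(tau | w);
    # `p_scope(tau)` stands in for the classifier's distribution P_Q(chi | tau, w).
    SCOPINGS = (0, 1, 2)

    def serial_choice(parses, p_scope):
        """Fix the Viterbi parse first, then pick the best scoping for that parse."""
        viterbi, _ = max(parses, key=lambda item: item[1])
        return max(SCOPINGS, key=lambda chi: p_scope(viterbi)[chi])

    def parallel_choice(parses, p_scope):
        """Pick the scoping found in the highest-probability (parse, scoping) pair."""
        best = max(((tau, chi, p_tau * p_scope(tau)[chi])
                    for tau, p_tau in parses for chi in SCOPINGS),
                   key=lambda triple: triple[2])
        return best[1]

    # Invented example: the second-best parse supports a wide-scope reading strongly
    # enough that the parallel model chooses differently from the serial one.
    parses = [("tau_0", 0.30), ("tau_1", 0.28)]
    scope_table = {"tau_0": {0: 0.40, 1: 0.35, 2: 0.25},
                   "tau_1": {0: 0.10, 1: 0.85, 2: 0.05}}

    def p_scope(tau):
        return scope_table[tau]

    print(serial_choice(parses, p_scope))     # 0
    print(parallel_choice(parses, p_scope))   # 1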

5.3.2 Results. There was an appreciable difference in performance between these two models on the quantifier scope test sets. As shown in Table 12, the parallel model narrowly outperformed the serial model, by 1.2%. A 10-fold paired t-test on the test sections of the scope-tagged corpus shows that the parallel model is significantly better (p < .05).

12 Since we are allowing χ to range only over the three scope readings (0, 1, 2), however, it is possible to enumerate all values of χ to be paired with a given syntactic tree τ.
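The significance test can be reproduced with a standard paired t-test over the per-fold accuracies of the two models, for example with scipy; the fold accuracies below are invented placeholders, not the actual per-fold results.

    # A sketch of the 10-fold paired t-test comparing the two models, assuming we have
    # per-fold accuracies for each (the numbers are invented for illustration).
    from scipy import stats

    parallel_acc = [0.75, 0.73, 0.76, 0.74, 0.72, 0.75, 0.74, 0.76, 0.73, 0.75]
    serial_acc   = [0.73, 0.72, 0.75, 0.73, 0.71, 0.74, 0.73, 0.74, 0.72, 0.74]

    res = stats.ttest_rel(parallel_acc, serial_acc)
    p_one_sided = res.pvalue / 2 if res.statistic > 0 else 1 - res.pvalue / 2
    print(res.statistic, p_one_sided)   # "significantly better" if p_one_sided < .05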



This result suggests that in determining the syntactic structure of a sentence, we must take aspects of structure into account that are not purely syntactic (such as quantifier scope). Searching both dimensions of the hypothesis space for our dual-component model allowed the composite model to handle the interdependencies between different aspects of grammatical structure, whereas fixing a phrase structure tree purely on the basis of syntactic considerations led to suboptimal performance in using that structure as a basis for determining quantifier scope.

6 Conclusion

In this article we have taken a statistical, corpus-based approach to the modeling of quantifier scope preferences, a subject that has previously been addressed only with systems of ad hoc rules derived from linguists' intuitive judgments. Our model takes its theoretical inspiration from Kuno, Takami, and Wu (1999), who suggest an "expert system" approach to scope preferences, and follows many other projects in the machine learning of natural language that combine information from multiple sources in solving linguistic problems.

Our results are generally supportive of the design that Kuno, Takami, and Wu propose for the quantifier scope component of grammar, and some of the features induced by our models find clear parallels in the factors that Kuno, Takami, and Wu claim to be relevant to scoping. In addition, our final experiment, in which we combine our quantifier scope module with a PCFG model of syntactic phrase structure, provides evidence of a grammatical architecture in which different aspects of structure mutually constrain one another. This result casts doubt on approaches in which syntactic processing is completed prior to the determination of other grammatical properties of a sentence, such as quantifier scope relations.

Appendix: Selected Codes Used to Annotate Syntactic Categories in the Penn Treebank, from Marcus et al. (1993) and Bies et al. (1995)

Part-of-speech tags

Tag    Meaning                      Tag    Meaning
CC     Conjunction                  RB     Adverb
CD     Cardinal number              RBR    Comparative adverb
DT     Determiner                   RBS    Superlative adverb
IN     Preposition                  TO     "to"
JJ     Adjective                    UH     Interjection
JJR    Comparative adjective        VB     Verb in base form
JJS    Superlative adjective        VBD    Past-tense verb
NN     Singular or mass noun        VBG    Gerundive verb
NNS    Plural noun                  VBN    Past participial verb
NNP    Singular proper noun         VBP    Non-3sg present-tense verb
NNPS   Plural proper noun           VBZ    3sg present-tense verb
PDT    Predeterminer                WP     WH pronoun
PRP    Personal pronoun             WP$    Possessive WH pronoun
PRP$   Possessive pronoun


Phrasal categories

Code    Meaning                                           Code     Meaning
ADJP    Adjective phrase                                  SBAR     Clause introduced by a subordinating conjunction
ADVP    Adverb phrase                                     SBARQ    Clause introduced by a WH phrase
INTJ    Interjection                                      SINV     Inverted declarative sentence
NP      Noun phrase                                       SQ       Inverted yes/no question following the WH phrase in SBARQ
PP      Prepositional phrase
QP      Quantifier phrase (i.e., measure/amount phrase)
S       Declarative clause
VP      Verb phrase

Acknowledgments
The authors are grateful for an Academic Technology Innovation Grant from the University of Chicago, which helped to make this work possible, and to John Goldsmith, Terry Regier, Anne Pycha, and Bob Moore, whose advice and collaboration have considerably aided the research reported in this article. Any remaining errors are, of course, our own.

References

Aoun, Joseph and Yen-hui Audrey Li. 1993. The Syntax of Scope. MIT Press, Cambridge.
Aoun, Joseph and Yen-hui Audrey Li. 2000. Scope, structure, and expert systems: A reply to Kuno et al. Language, 76(1):133–155.
Belz, Anja. 2001. Optimisation of corpus-derived probabilistic grammars. In Proceedings of Corpus Linguistics 2001, pages 46–57.
Berger, Adam L., Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.
Bies, Ann, Mark Ferguson, Karen Katz, and Robert MacIntyre. 1995. Bracketing guidelines for Treebank II style. Technical report, Penn Treebank Project, University of Pennsylvania.
Bishop, Christopher M. 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bridle, John S. 1990. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fougelman-Soulie and J. Herault, editors, Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag, Berlin, pages 227–236.
Brill, Eric. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565.
Carden, Guy. 1976. English Quantifiers: Logical Structure and Linguistic Variation. Academic Press, New York.
Carletta, Jean. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249–254.
Charniak, Eugene. 1993. Statistical Language Learning. MIT Press, Cambridge.
Charniak, Eugene. 1996. Tree-bank grammars. In AAAI/IAAI, vol. 2, pages 1031–1036.
Cohen, Jacob. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46.
Cooper, Robin. 1983. Quantification and Syntactic Theory. Reidel, Dordrecht.
Fleiss, Joseph L. 1981. Statistical Methods for Rates and Proportions. John Wiley & Sons, New York.
Hobbs, Jerry R. and Stuart M. Shieber. 1987. An algorithm for generating quantifier scopings. Computational Linguistics, 13:47–63.
Hornstein, Norbert. 1995. Logical Form: From GB to Minimalism. Blackwell, Oxford and Cambridge.
Jelinek, Frederick. 1998. Statistical Methods for Speech Recognition. MIT Press, Cambridge.
Jurafsky, Daniel and James H. Martin. 2000. Speech and Language Processing. Prentice Hall, Upper Saddle River, New Jersey.
Krippendorff, Klaus. 1980. Content Analysis: An Introduction to Its Methodology. Sage Publications, Beverly Hills, California.
Krotov, Alexander, Mark Hepple, Robert J. Gaizauskas, and Yorick Wilks. 1998. Compacting the Penn treebank grammar. In COLING-ACL, pages 699–703.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 1999. Quantifier scope in English, Chinese, and Japanese. Language, 75(1):63–111.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 2001. Response to Aoun and Li. Language, 77(1):134–143.
Manning, Christopher D. and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge.
Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
Martin, Paul, Douglas Appelt, and Fernando Pereira. 1986. Transportability and generality in a natural-language interface system. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 585–593.
May, Robert. 1985. Logical Form: Its Structure and Derivation. MIT Press, Cambridge.
McCawley, James D. 1998. The Syntactic Phenomena of English. University of Chicago Press, Chicago, second edition.
Mitchell, Tom M. 1996. Machine Learning. McGraw Hill, New York.
Montague, Richard. 1973. The proper treatment of quantification in ordinary English. In J. Hintikka et al., editors, Approaches to Natural Language. Reidel, Dordrecht, pages 221–242.
Moran, Douglas B. 1988. Quantifier scoping in the SRI core language engine. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics (ACL'88), pages 33–40.
Moran, Douglas B. and Fernando C. N. Pereira. 1992. Quantifier scoping. In Hiyan Alshawi, editor, The Core Language Engine. MIT Press, Cambridge, pages 149–172.
Nerbonne, John. 1993. A feature-based syntax/semantics interface. In A. Manaster-Ramer and W. Zadrozsny, editors, Annals of Mathematics and Artificial Intelligence (Special Issue on Mathematics of Language), 8(1–2):107–132. Also published as DFKI Research Report RR-92-42.
Park, Jong C. 1995. Quantifier scope and constituency. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL'95), pages 205–212.
Pedersen, Ted. 2000. A simple approach to building ensembles of naïve Bayesian classifiers for word sense disambiguation. In Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2000), pages 63–69.
Pereira, Fernando. 1990. Categorial semantics and scoping. Computational Linguistics, 16(1):1–10.
Pollard, Carl. 1989. The syntax-semantics interface in a unification-based phrase structure grammar. In S. Busemann, C. Hauenschild, and C. Umbach, editors, Views of the Syntax/Semantics Interface, KIT-FAST Report 74, Technical University of Berlin, pages 167–185.
Pollard, Carl and Eun Jung Yoo. 1998. A unified theory of scope for quantifiers and WH-phrases. Journal of Linguistics, 34(2):415–446.
Ratnaparkhi, Adwait. 1997. A simple introduction to maximum entropy models for natural language processing. Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania.
Santorini, Beatrice. 1990. Part-of-speech tagging guidelines for the Penn Treebank project. Technical Report MS-CIS-90-47, Department of Computer and Information Science, University of Pennsylvania.
Soon, Wee Meng, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544.
van Halteren, Hans, Jakub Zavrel, and Walter Daelemans. 2001. Improving accuracy in word class tagging through the combination of machine learning systems. Computational Linguistics, 27(2):199–229.
VanLehn, Kurt A. 1978. Determining the scope of English quantifiers. Technical Report AITR-483, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Cambridge.
Woods, William A. 1986. Semantics and quantification in natural language question answering. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 205–248.


Charniak Eugene 1993 Statistical languagelearning MIT Press Cambridge

Charniak Eugene 1996 Tree-bankgrammars In AAAI IAAI vol 2pages 1031ndash1036

Cohen Jacob 1960 A coefcient ofagreement for nominal scales Educationaland Psychological Measurement 2037ndash46

Cooper Robin 1983 Quantication andSyntactic Theory Reidel Dordrecht

Fleiss Joseph L 1981 Statistical Methods forRates and Proportions John Wiley amp SonsNew York

Hobbs Jerry R and Stuart M Shieber 1987An algorithm for generating quantierscopings Computational Linguistics1347ndash63

Hornstein Norbert 1995 Logical Form FromGB to Minimalism Blackwell Oxford andCambridge

Jelinek Frederick 1998 Statistical Methods forSpeech Recognition MIT Press Cambridge

Jurafsky Daniel and James H Martin 2000Speech and Language Processing PrenticeHall Upper Saddle River New Jersey

Krippendorff Klaus 1980 Content AnalysisAn Introduction to Its Methodology Sage

96

Computational Linguistics Volume 29 Number 1

Publications Beverly Hills CaliforniaKrotov Alexander Mark Hepple Robert J

Gaizauskas and Yorick Wilks 1998Compacting the Penn treebank grammarIn COLING-ACL pages 699ndash703

Kuno Susumu Ken-Ichi Takami and YuruWu 1999 Quantier scope in EnglishChinese and Japanese Language75(1)63ndash111

Kuno Susumu Ken-Ichi Takami and YuruWu 2001 Response to Aoun and LiLanguage 77(1)134ndash143

Manning Christopher D and HinrichSchutze 1999 Foundations of StatisticalNatural Language Processing MIT PressCambridge

Marcus Mitchell P Beatrice Santorini andMary Ann Marcinkiewicz 1993 Buildinga large annotated corpus of English ThePenn Treebank Computational Linguistics19(2)313ndash330

Martin Paul Douglas Appelt andFernando Pereira 1986 Transportabilityand generality in a natural-languageinterface system In B J Grosz K SparckJones and B L Webber editors NaturalLanguage Processing Kaufmann Los AltosCalifornia pages 585ndash593

May Robert 1985 Logical Form Its Structureand Derivation MIT Press Cambridge

McCawley James D 1998 The SyntacticPhenomena of English University ofChicago Press Chicago second edition

Mitchell Tom M 1996 Machine LearningMcGraw Hill New York

Montague Richard 1973 The propertreatment of quantication in ordinaryEnglish In J Hintikka et al editorsApproaches to Natural Language ReidelDordrecht pages 221ndash242

Moran Douglas B 1988 Quantier scopingin the SRI core language engine InProceedings of the 26th Annual Meeting of theAssociation for Computational Linguistics(ACLrsquo88) pages 33ndash40

Moran Douglas B and Fernando C NPereira 1992 Quantier scoping InHiyan Alshawi editor The Core LanguageEngine MIT Press Cambridgepages 149ndash172

Nerbonne John 1993 A feature-basedsyntaxsemantics interface InA Manaster-Ramer and W Zadrozsnyeditors Annals of Mathematics and ArticialIntelligence (Special Issue on Mathematics ofLanguage) 8(1ndash2)107ndash132 Also publishedas DFKI Research Report RR-92-42

Park Jong C 1995 Quantier scope and

constituency In Proceedings of the 33rdAnnual Meeting of the Association forComputational Linguistics (ACLrsquo95)pages 205ndash212

Pedersen Ted 2000 A simple approach tobuilding ensembles of naotilde ve Bayesianclassiers for word sense disambiguationIn Proceedings of the First Meeting of theNorth American Chapter of the Association forComputational Linguistics (NAACL 2000)pages 63ndash69

Pereira Fernando 1990 Categorialsemantics and scoping ComputationalLinguistics 16(1)1ndash10

Pollard Carl 1989 The syntax-semanticsinterface in a unication-based phrasestructure grammar In S BusemannC Hauenschild and C Umbach editorsViews of the Syntax Semantics InterfaceKIT-FAST Report 74 Technical Universityof Berlin pages 167ndash185

Pollard Carl and Eun Jung Yoo 1998 Aunied theory of scope for quantiersand WH-phrases Journal of Linguistics34(2)415ndash446

Ratnaparkhi Adwait 1997 A simpleintroduction to maximum entropy modelsfor natural language processing TechnicalReport 97-08 Institute for Research inCognitive Science University ofPennsylvania

Santorini Beatrice 1990 Part-of-speechtagging guidelines for the Penn Treebankproject Technical Report MS-CIS-90-47Department of Computer and InformationScience University of Pennsylvania

Soon Wee Meng Hwee Tou Ng and DanielChung Yong Lim 2001 A machinelearning approach to coreferenceresolution of noun phrases ComputationalLinguistics 27(4)521ndash544

van Halteren Hans Jakub Zavrel andWalter Daelemans 2001 Improvingaccuracy in word class tagging throughthe combination of machine learningsystems Computational Linguistics27(2)199ndash229

VanLehn Kurt A 1978 Determining thescope of English quantiers TechnicalReport AITR-483 Massachusetts Instituteof Technology Articial IntelligenceLaboratory Cambridge

Woods William A 1986 Semantics andquantication in natural languagequestion answering In B J GroszK Sparck Jones and B L Webber editorsNatural Language Processing KaufmannLos Altos California pages 205ndash248

Page 10: A Machine Learning Approach to Modeling Scope Preferences · 2003-11-13 · ®c2003 Association for Computational Linguistics A Machine Learning Approach to Modeling Scope Preferences

82

Computational Linguistics Volume 29 Number 1

Table 2
Baseline performance, summed over all ten test sets.

Condition               Correct   Incorrect   Percentage correct
First has wide scope          0         281   0/281 = 0.0%
Second has wide scope         0          64   0/64 = 0.0%
No scope interaction        545           0   545/545 = 100.0%
Total                       545         345   545/890 = 61.2%

about whether one quantified NP in a given sentence c-commands the other corresponds to Kuno, Takami, and Wu's observation that subject quantifiers tend to take wide scope over object quantifiers and topicalized quantifiers tend to outscope everything. The identity of each lexical quantifier clearly should allow our classifiers to make the generalization that each tends to take wide scope, if this word is found in the data, and perhaps even learn the regularity underlying Kuno, Takami, and Wu's observation that universal quantifiers tend to outscope existentials.
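To make the shape of these predictors concrete, the following is a minimal sketch of how features of this kind could be read off a phrase structure tree. The nested-list tree encoding, the helper names, and the particular definition of "intervening" used here are illustrative assumptions, not the feature extractor actually used in our experiments.

def constituents(tree, path=()):
    """Yield (path, label) for every constituent in a nested [label, child, ...] tree."""
    if isinstance(tree, list):
        yield path, tree[0]
        for i, child in enumerate(tree[1:]):
            yield from constituents(child, path + (i,))

def dominates(a, b):
    """True if the node at path a properly dominates the node at path b."""
    return len(a) < len(b) and b[:len(a)] == a

def c_commands(a, b):
    """Simplified c-command: neither node dominates the other, and the mother of a dominates b."""
    if a == b or not a or dominates(a, b) or dominates(b, a):
        return False
    return dominates(a[:-1], b)

def count_intervening(tree, label, np1, np2):
    """Count nodes with the given label that dominate the second quantified NP
    but not the first (a rough stand-in for 'intervening' structure)."""
    return sum(1 for path, lab in constituents(tree)
               if lab == label and dominates(path, np2) and not dominates(path, np1))

def scope_features(tree, np1, np2, q1, q2):
    """Assemble a feature dictionary for one sentence; np1/np2 are the tree paths of
    the two quantified NPs, and q1/q2 are their head quantifiers (e.g. 'each', 'two')."""
    return {
        "first_c_commands_second": c_commands(np1, np2),
        "second_c_commands_first": c_commands(np2, np1),
        "intervening_S_nodes": count_intervening(tree, "S", np1, np2),
        "intervening_commas": count_intervening(tree, ",", np1, np2),
        "first_quantifier": q1,
        "second_quantifier": q2,
    }

# Example: "Every student read two books"
tree = ["S", ["NP", ["DT", "every"], ["NN", "student"]],
             ["VP", ["VBD", "read"], ["NP", ["CD", "two"], ["NNS", "books"]]]]
print(scope_features(tree, (0,), (1, 1), "every", "two"))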

4.3 Classifier Design
In this section we present the three types of model that we have trained to predict the preferred quantifier scoping on Penn Treebank sentences: a naive Bayes classifier, a maximum-entropy classifier, and a single-layer perceptron.8 In evaluating how well these models do in assigning the proper scope reading to each test sentence, it is important to have a baseline for comparison. The baseline model for this task is one that simply guesses the most frequent category of the data ("no scope interaction") every time. This simplistic strategy already classifies 61.2% of the test examples correctly, as shown in Table 2.

It may surprise some linguists that this third class of sentences, in which there is no scopal interaction between the two quantifiers, is the largest. In part this may be due to special features of the Wall Street Journal text of which the corpus consists. For example, newspaper articles may contain more direct quotations than other genres. In the process of tagging the data, however, it was also apparent that in a large proportion of cases the two quantifiers were taking scope in different conjuncts of a conjoined phrase. This further tendency supports the idea that people may intentionally avoid constructions in which there is even the possibility of quantifier scope interactions, perhaps because of some hearer-oriented pragmatic principle. Linguists may also be concerned that this additional category, in which there is no scope interaction between quantifiers, makes it difficult to compare the results of the present work with theoretical accounts of quantifier scope that ignore this case and concentrate on instances in which one quantifier does take scope over another. In response to such concerns, however, we point out first that we provide a model of scope prediction rather than scope generation, and so it is in any case not directly comparable with work in theoretical linguistics, which has largely ignored scope preferences. Second, we point out that the empirical nature of this study requires that we take note of cases in which the quantifiers simply do not interact.

8 The implementations of these classifiers are publicly available as Perl modules at ⟨http://humanities.uchicago.edu/linguistics/students/dchiggin/classifiers.tgz⟩.


Table 3
Performance of the naive Bayes classifier, summed over all 10 test runs.

Condition               Correct   Incorrect   Percentage correct
First has wide scope        177         104   177/281 = 63.0%
Second has wide scope        41          23   41/64 = 64.1%
No scope interaction        428         117   428/545 = 78.5%
Total                       646         244   646/890 = 72.6%

4.3.1 Naive Bayes Classifier. Our data D will consist of a vector of features (d_0, ..., d_n) that represent aspects of the sentence under examination, such as whether one quantified expression c-commands the other, as described in Section 4.2. The fundamental simplifying assumption that we make in designing a naive Bayes classifier is that these features are independent of one another and therefore can be aggregated as independent sources of evidence about which class c* a given sentence belongs to. This independence assumption is formalized in equations (1) and (2).

c* = argmax_c P(c) P(d_0, ..., d_n | c)                (1)
   ≈ argmax_c P(c) ∏_{k=0}^{n} P(d_k | c)              (2)

We constructed an empirical estimate of the prior probability P(c) by simply counting the frequency with which each class occurs in the training data. We constructed each P(d_k | c) by counting how often each feature d_k co-occurs with the class c to construct the empirical estimate P(d_k | c), and interpolated this with the empirical frequency P(d_k) of the feature d_k, not conditioned on the class c. This interpolated probability model was used in order to smooth the probability distribution, avoiding the problems that can arise if certain feature-value pairs are assigned a probability of zero.
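The following is a minimal sketch of the training and prediction scheme just described, assuming feature dictionaries like those sketched above. The interpolation weight lam is an assumption for illustration; the article does not specify its value, and the published Perl implementation may differ in detail.

import math
from collections import Counter, defaultdict

class InterpolatedNaiveBayes:
    """Naive Bayes in which each P(d_k | c) is interpolated with the unconditioned
    frequency P(d_k), as described in the text."""

    def __init__(self, lam=0.8):
        self.lam = lam                            # illustrative interpolation weight

    def train(self, examples):
        # examples: list of (feature_dict, class_label) pairs
        self.n = len(examples)
        self.class_counts = Counter(c for _, c in examples)
        self.feat_class = defaultdict(Counter)    # per-class (feature, value) counts
        self.feat_total = Counter()               # overall (feature, value) counts
        for feats, c in examples:
            for fv in feats.items():
                self.feat_class[c][fv] += 1
                self.feat_total[fv] += 1

    def _p(self, fv, c):
        # interpolated estimate of P(d_k | c)
        p_cond = self.feat_class[c][fv] / self.class_counts[c]
        p_marg = self.feat_total[fv] / self.n
        return self.lam * p_cond + (1 - self.lam) * p_marg

    def predict(self, feats):
        # equation (2): argmax_c P(c) * prod_k P(d_k | c), computed in log space
        best, best_score = None, float("-inf")
        for c, count in self.class_counts.items():
            score = math.log(count / self.n)
            for fv in feats.items():
                p = self._p(fv, c)
                score += math.log(p) if p > 0.0 else float("-inf")
            if score > best_score:
                best, best_score = c, score
        return best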

The performance of the naive Bayes classifier is summarized in Table 3. For each of the 10 test sets of 89 items taken from the corpus, the remaining 804 of the total 893 sentences were used to train the model. The naive Bayes classifier outperformed the baseline by a considerable margin.

In addition to the raw counts of test examples correctly classified, though, we would like to know something of the internal structure of the model (i.e., what sort of features it has induced from the data). For this classifier, we can assume that a feature f is a good predictor of a class c* when the value of P(f | c*) is significantly larger than the (geometric) mean value of P(f | c) for all other values of c. Those features with the greatest ratio

P(f) × P(f | c*) / geommean(∀c ≠ c* [P(f | c)])

are listed in Table 4.9
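A short sketch of this diagnostic, assuming the relevant probabilities have already been estimated (and are nonzero):

import math

def predictor_score(p_f, p_f_given_c, classes, c_star):
    """P(f) * P(f | c*) divided by the geometric mean of P(f | c) over the
    remaining classes, as used to rank the features in Table 4."""
    others = [p_f_given_c[c] for c in classes if c != c_star]
    geo_mean = math.exp(sum(math.log(p) for p in others) / len(others))
    return p_f * p_f_given_c[c_star] / geo_mean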

The first-ranked feature in Table 4 shows that there is a tendency for quantified elements not to interact when they are found in conjoined constituents, and the second-ranked feature indicates a preference for quantifiers not to interact when there is an intervening comma (presumably an indicator of greater syntactic "distance"). Feature 3 indicates a preference for class 1 when there is an intervening S node, whereas feature 6 indicates a preference for class 0 under the same conditions.

9 We include the term P(f) in the product in order to prevent sparsely instantiated features from showing up as highly ranked.


Table 4
Most active features from naive Bayes classifier.

Rank   Feature                                                  Predicted class   Ratio
1      There is an intervening conjunct node                    0                 1.63
2      There is an intervening comma                            0                 1.51
3      There is an intervening S node                           1                 1.33
4      The first quantified NP does not c-command the second    0                 1.25
5      Second quantifier is tagged QP                           1                 1.16
6      There is an intervening S node                           0                 1.12
...
15     The second quantified NP c-commands the first            2                 1.00

Presumably this reflects a dispreference for the second quantifier to take wide scope when there is a clause boundary intervening between it and the first quantifier. The fourth-ranked feature in Table 4 indicates that if the first quantified NP does not c-command the second, it is less likely to take wide scope. This is not surprising, given the importance that c-command relations have had in theoretical discussions of quantifier scope. The fifth-ranked feature expresses a preference for quantified expressions of category QP to take narrow scope if they are the second of the two quantifiers under consideration. This may simply be reflective of the fact that class 1 is more common than class 2, and the measure expressions found in QP phrases in the Penn Treebank (such as more than three or about half) tend not to be logically independent of other quantifiers. Finally, feature 15 in Table 4 indicates a high correlation between the second quantified expression's c-commanding the first and the second quantifier's taking wide scope. We can easily see this as a translation into our feature set of Kuno, Takami, and Wu's claim that subjects tend to outscope objects and obliques, and topicalized elements tend to take wide scope. Some of these top-ranked features have to do with information found only in the written medium, but on the whole the features induced by the naive Bayes classifier seem consistent with those suggested by Kuno, Takami, and Wu, although they are distinct by necessity.

4.3.2 Maximum-Entropy Classifier. The maximum-entropy classifier is a sort of log-linear model, defining the joint probability of a class and a data vector (d_0, ..., d_n) as the product of the prior probability of the class c with a set of features related to the data:10

P(d_0, ..., d_n, c) = (P(c) / Z) ∏_{k=0}^{n} α_k               (3)

This classifier superficially resembles in form the naive Bayes classifier in equation (2), but it differs from that classifier in that the way in which values for each α are chosen does not assume that the features in the data are independent. For each of the 10 training sets, we used the generalized iterative scaling algorithm to train this classifier on 654 training examples, using 150 examples for validation to choose the best set of values for the αs.11

10 Z in equation (3) is simply a normalizing constant that ensures that we end up with a probability distribution.


Table 5
Performance of the maximum-entropy classifier, summed over all 10 test runs.

Condition               Correct   Incorrect   Percentage correct
First has wide scope        148         133   148/281 = 52.7%
Second has wide scope        31          33   31/64 = 48.4%
No scope interaction        475          70   475/545 = 87.2%
Total                       654         236   654/890 = 73.5%

Table 6
Most active features from maximum-entropy classifier.

Rank   Feature                                                Predicted class   α_c^.25
1      Second quantifier is each                              2                 1.13
2      There is an intervening comma                          0                 1.01
3      There is an intervening conjunct node                  0                 1.00
4      First quantified NP does not c-command the second      0                 0.99
5      Second quantifier is every                             2                 0.98
6      There is an intervening quotation mark (”)             0                 0.95
7      There is an intervening colon                          0                 0.95
...
12     First quantified NP c-commands the second              1                 0.92
25     There is no intervening comma                          1                 0.90

Test data could then be classified by choosing the class for the data that maximizes the joint probability in equation (3).
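The classification step can be sketched as follows; the αs are assumed to have been set by generalized iterative scaling (not reimplemented here), and indexing the weights by feature-class pair is our reading of what the subscript k ranges over.

def maxent_joint(feats, c, prior, alpha, Z=1.0):
    """Joint probability of a feature vector and class, following equation (3):
    P(d_0..d_n, c) = P(c)/Z * product of the alpha weights for the active
    feature-class pairs. `alpha` maps ((feature, value), class) to a weight;
    unseen pairs default to 1, leaving the product unchanged."""
    p = prior[c] / Z
    for fv in feats.items():
        p *= alpha.get((fv, c), 1.0)
    return p

def classify(feats, classes, prior, alpha):
    """Choose the class that maximizes the joint probability; Z cancels out."""
    return max(classes, key=lambda c: maxent_joint(feats, c, prior, alpha))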

The results of training with the maximum-entropy classifier are shown in Table 5. The classifier showed slightly higher performance than the naive Bayes classifier, with the lowest error rate on the class of sentences having no scope interaction.

To determine exactly which features of the data the maximum-entropy classifier sees as relevant to the classification problem, we can simply look at the α values (from equation (3)) for each feature. Those features with higher values for α are weighted more heavily in determining the proper scoping. Some of the features with the highest values for α are listed in Table 6. Because of the way the classifier is built, predictor features for class 2 need to have higher loadings to overcome the lower prior probability of the class. Therefore, we actually rank the features in Table 6 according to α · P(c)^k (which we denote as α_c^k). P(c) represents the empirical prior probability of a class c, and k is simply a constant (.25 in this case) chosen to try to get a mix of features for different classes at the top of the list.

The features ranked first and fifth in Table 6 express lexical preferences for certain quantifiers to take wide scope even when they are the second of the two quantifiers according to linear order in the string of words. The tendency for each to take wide scope is stronger than for the other quantifier, which is in line with Kuno, Takami, and Wu's decision to list it as the only quantifier with a lexical preference for scoping. Feature 2 makes the "no scope interaction" class more likely if a comma intervenes, and feature 25 makes a wide-scope reading for the first quantifier more likely if there is no intervening comma.

11 Overtraining is not a problem with the pure version of the generalized iterative scaling algorithm. For efficiency reasons, however, we chose to take the training corpus as representative of the event space, rather than enumerating the space exhaustively (see Jelinek [1998] for details). For this reason it was necessary to employ validation in training.


Table 7
Performance of the single-layer perceptron, summed over all 10 test runs.

Condition               Correct   Incorrect   Percentage correct
First has wide scope        182          99   182/281 = 64.8%
Second has wide scope        35          29   35/64 = 54.7%
No scope interaction        468          77   468/545 = 85.9%
Total                       685         205   685/890 = 77.0%

The third-ranked feature expresses the tendency mentioned above for quantifiers in conjoined clauses not to interact. Features 4 and 12 indicate that if the first quantified expression c-commands the second, it is likely to take wide scope, and that if this is not the case, there is likely to be no scope interaction. Finally, the sixth- and seventh-ranked features in the table show that an intervening quotation mark or colon will make the classifier tend toward class 0, "no scope interaction," which is easy to understand: Quotations are often opaque to quantifier scope interactions. The top features found by the maximum-entropy classifier largely coincide with those found by the naive Bayes model, which indicates that these generalizations are robust and objectively present in the data.

4.3.3 Single-Layer Perceptron. For our neural network classifier, we employed a feed-forward single-layer perceptron, with the softmax function used to determine the activation of nodes at the output layer, because this is a one-of-n classification task (Bridle 1990). The data to be classified are presented as a vector of features at the input layer, and the output layer has three nodes, representing the three possible classes for the data: "first has wide scope," "second has wide scope," and "no scope interaction." The output node with the highest activation is interpreted as the class of the datum presented at the input layer.

For each of the 10 test sets of 89 examples, we trained the connection weights of the network using error backpropagation on 654 training sentences, reserving 150 sentences for validation in order to choose the weights from the training epoch with the highest classification performance. In Table 7 we present the results of the single-layer neural network in classifying our test sentences. As the table shows, the single-layer perceptron has much better classification performance than the naive Bayes classifier and maximum-entropy model, possibly because the training of the network aims to minimize error in the activation of the classification output nodes, which is directly related to the classification task at hand, whereas the other models do not directly make use of the notion of "classification error." The perceptron also uses a sort of weighted voting and could be interpreted as an implementation of Kuno, Takami, and Wu's proposal for scope determination. This clearly illustrates that the tenability of their proposal hinges on the exact details of its implementation, since all of our classifier models are reasonable interpretations of their approach, but they have very different performance results on our scope determination task.
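A minimal sketch of such a network, using gradient descent on the cross-entropy loss as one concrete reading of "error backpropagation" for a single layer; the learning rate, epoch count, and validation handling are illustrative assumptions.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(W, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias input
    return np.argmax(Xb @ W.T, axis=1)

def train_perceptron(X, y, n_classes=3, epochs=100, lr=0.1, X_val=None, y_val=None):
    """X: (n_examples, n_features) array of 0/1 features; y: integer class labels.
    Returns the weight matrix from the epoch with the best validation accuracy."""
    n, d = X.shape
    W = np.zeros((n_classes, d + 1))                 # one weight row per output node
    Xb = np.hstack([X, np.ones((n, 1))])
    best_W, best_acc = W.copy(), -1.0
    for _ in range(epochs):
        for x, t in zip(Xb, y):
            p = softmax(W @ x)                       # output activations
            p[t] -= 1.0                              # gradient of cross-entropy wrt logits
            W -= lr * np.outer(p, x)
        if X_val is not None:
            acc = np.mean(predict(W, X_val) == y_val)
            if acc > best_acc:
                best_acc, best_W = acc, W.copy()
    return best_W if X_val is not None else W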

To determine exactly which features of the data the network sees as relevant to the classification problem, we can simply look at the connection weights for each feature-class pair. Higher connection weights indicate a greater correlation between input features and output classes. For one of the 10 networks we trained, some of the features with the highest connection weights are listed in Table 8.


Table 8
Most active features from single-layer perceptron.

Rank   Feature                                              Predicted class   Weight
1      There is an intervening comma                        0                 4.31
2      Second quantifier is all                             0                 3.77
3      There is an intervening colon                        0                 2.98
4      There is an intervening conjunct node                0                 2.72
...
17     The first quantified NP c-commands the second        1                 1.69
18     Second quantifier is tagged RBS                      2                 1.69
19     There is an intervening S node                       1                 1.61
20     Second quantifier is each                            2                 1.50

Since class 0 is simply more frequent in the training data than the other two classes, the weights for this class tend to be higher. Therefore we also list some of the best predictor features for classes 1 and 2 in the table.

The first- and third-ranked features in Table 8 show that an intervening comma or colon will make the classifier tend toward class 0, "no scope interaction." This finding by the classifier is similar to the maximum-entropy classifier's finding an intervening quotation mark relevant, and can be taken as an indication that quantifiers in distant syntactic subdomains are unlikely to interact. Similarly, the fourth-ranked feature indicates that quantifiers in separate conjuncts are unlikely to interact. The second-ranked feature in the table expresses a tendency for there to be no scope interaction between two quantifiers if the second of them is headed by all. This may be related to the independence of universal quantifiers (∀x∀y[φ] ⟺ ∀y∀x[φ]). Feature 17 in Table 8 indicates a high correlation between the first quantified expression's c-commanding the second and the first quantifier's taking wide scope, which again supports Kuno, Takami, and Wu's claim that scope preferences are related to syntactic superiority relations. Feature 18 expresses a preference for a quantified expression headed by most to take wide scope, even if it is the second of the two quantifiers (since most is the only quantifier in the corpus that bears the tag RBS). Feature 19 indicates that the first quantifier is more likely to take wide scope if there is a clause boundary intervening between the two quantifiers, which supports the notion that the syntactic distance between the quantifiers is relevant to scope preferences. Finally, feature 20 expresses the well-known tendency for quantified expressions headed by each to take wide scope.

4.4 Summary of Results
Table 9 summarizes the performance of the quantifier scope models we have presented here. All of the classifiers have test set accuracy above the baseline, which a paired t-test reveals to be significant at the .001 level. The differences between the naive Bayes, maximum-entropy, and single-layer perceptron classifiers are not statistically significant.

The classifiers performed significantly better on those sentences annotated consistently by both human coders at the beginning of the study, reinforcing the view that this subset of the data is somehow simpler and more representative of the basic regularities in scope preferences. For example, the single-layer perceptron classified 82.9% of these sentences correctly. To further investigate the nature of the variation between the two coders, we constructed a version of our single-layer network that was trained on the data on which both coders agreed and tested on the remaining sentences.


Table 9
Summary of classifier results.

                          Training data   Validation data   Test data
Baseline                  —               —                 61.2%
Naïve Bayes               76.7%           —                 72.6%
Maximum entropy           78.3%           75.5%             73.5%
Single-layer perceptron   84.7%           76.8%             77.0%

This classifier agreed with the reference coding (the coding of the first coder) 51.4% of the time and with the additional independent coder 35.8% of the time. The first coder constructed the annotation guidelines for this project and may have been more successful in applying them consistently. Alternatively, it is possible that different individuals use different strategies in determining scope preferences, and the strategy of the second coder may simply have been less similar than the strategy of the first coder to that of the single-layer network.

These three classifiers directly implement a sort of weighted voting, the method of aggregating evidence proposed by Kuno, Takami, and Wu (although the classifiers' implementation is slightly more sophisticated than the unweighted voting that is actually used in Kuno, Takami, and Wu's paper). Of course, since we do not use exactly the set of features suggested by Kuno, Takami, and Wu, our model should not be seen as a straightforward implementation of the theory outlined in their 1999 paper. Nevertheless, the results in Table 9 suggest that Kuno, Takami, and Wu's suggested design can be used with some success in modeling scope preferences. Moreover, the project undertaken here provides an answer to some of the objections that Aoun and Li (2000) raise to Kuno, Takami, and Wu. Aoun and Li claim that Kuno, Takami, and Wu's choice of experts is seemingly arbitrary and that it is unclear how the voting weights of each expert are to be set, but the machine learning approach we employ in this article is capable of addressing both of these potential problems. Supervised training of our classifiers is a straightforward approach to setting the weights and also constitutes our approach to selecting features (or "experts," in Kuno, Takami, and Wu's terminology). In the training process, any feature that is irrelevant to scoping preferences should receive weights that make its effect negligible.

5 Syntax and Scope

In this section we show how the classifier models of quantifier scope determination introduced in Section 4 may be integrated with a PCFG model of syntax. We compare two different ways in which the two components may be combined, which may loosely be termed serial and parallel, and argue for the latter on the basis of empirical results.

5.1 Modular Design
Our use of a phrase structure syntactic component and a quantifier scope component to define a combined language model is simplified by the fact that our classifiers are probabilistic and define a conditional probability distribution over quantifier scopings. The probability distributions that our classifiers define for quantifier scope structures are conditional on syntactic phrase structure, because they are computed on the basis of syntactically provided features, such as the number of nodes of a certain type that intervene between two quantifiers in a phrase structure tree.


Thus the combined language model that we define in this article assigns probabilities according to the pairs of structures that may be assigned to a sentence by the Q-structure and phrase structure syntax modules. The probability of a word string w_{1-n} is therefore defined as in equation (4), where Q ranges over all possible Q-structures in the set Q and S ranges over all possible syntactic structures in the set S.

P(w_{1-n}) = Σ_{S ∈ S, Q ∈ Q} P(S, Q | w_{1-n})                              (4)
           = Σ_{S ∈ S, Q ∈ Q} P(S | w_{1-n}) P(Q | S, w_{1-n})               (5)

Equation (5) shows how we can use the definition of conditional probability to break our calculation of the language model probability into two parts. The first of these parts, P(S | w_{1-n}), which we may abbreviate as simply P(S), is the probability of a particular syntactic tree structure's being assigned to a particular word string. We model this probability using a probabilistic phrase structure grammar (cf. Charniak [1993, 1996]). The second distribution on the right side of equation (5) is the conditional probability of a particular quantifier scope structure's being assigned to a particular word string, given the syntactic structure of that string. This probability is written as P(Q | S, w_{1-n}), or simply P(Q | S), and represents the quantity we estimated above in constructing classifiers to predict the scopal representation of a sentence based on aspects of its syntactic structure.

Thus, given a PCFG model of syntactic structure and a probabilistically defined classifier of the sort introduced in Section 4, it is simple to determine the probability of any pairing of two particular structures from each domain for a given sentence. We simply multiply the values of P(S) and P(Q | S) to obtain the joint probability P(Q, S). In the current section, we examine two different models of combination for these components: one in which scope determination is applied to the optimal syntactic structure (the Viterbi parse), and one in which optimization is performed in the space of both modules to find the optimal pairing of syntactic and quantifier scope structures.

5.2 The Syntactic Module
Before turning to the application of our multimodular approach to the problem of scope determination in Section 5.3, we present here a short overview of the phrase structure syntactic component used in these projects. As noted above, we model syntax as a probabilistic phrase structure grammar (PCFG), and in particular we use a treebank grammar (Charniak 1996) trained on the Penn Treebank.

A PCFG defines the probability of a string of words as the sum of the probabilities of all admissible phrase structure parses (trees) for that string. The probability of a given tree is the product of the probability of all of the rule instances used in the construction of that tree, where rules take the form N → φ, with N a nonterminal symbol and φ a finite sequence of one or more terminals or nonterminals.

To take an example, Figure 5 illustrates a phrase structure tree for the sentence Susan might not believe you, which is admissible according to the grammar in Table 10. (All of the minimal subtrees in Figure 5 are instances of one of our rules.)


Figure 5
A simple phrase structure tree: [S [NP Susan] [VP [V might] [VP [ADV not] [VP [V believe] [NP you]]]]].

Table 10
A simple probabilistic phrase structure grammar.

Rule              Probability
S → NP VP         .7
S → VP            .2
S → V NP VP       .1
VP → V VP         .3
VP → ADV VP       .1
VP → V            .1
VP → V NP         .3
VP → V NP NP      .2
NP → Susan        .3
NP → you          .4
NP → Yves         .3
V → might         .2
V → believe       .3
V → show          .3
V → stay          .2
ADV → not         .5
ADV → always      .5


The probability of this tree, which we can indicate as τ, can be calculated as in equation (6), where R(τ) is the set of rule instances used in the construction of τ:

P(τ) = ∏_{ρ ∈ R(τ)} P(ρ)                                                     (6)
     = P(S → NP VP) × P(VP → V VP) × P(VP → ADV VP)
       × P(VP → V NP) × P(NP → Susan) × P(V → might)
       × P(ADV → not) × P(V → believe) × P(NP → you)                         (7)
     = .7 × .3 × .1 × .3 × .3 × .2 × .5 × .3 × .4 = 2.268 × 10^-5            (8)
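The same computation, spelled out over the toy grammar of Table 10; the dictionary encoding of rules and the explicit list of rule instances are presentational choices, not part of the article's implementation.

grammar = {
    ("S", ("NP", "VP")): .7,  ("VP", ("V", "VP")): .3,  ("VP", ("ADV", "VP")): .1,
    ("VP", ("V", "NP")): .3,  ("NP", ("Susan",)): .3,   ("V", ("might",)): .2,
    ("ADV", ("not",)): .5,    ("V", ("believe",)): .3,  ("NP", ("you",)): .4,
}

def tree_probability(rule_instances, grammar):
    """Multiply the probabilities of all rule instances used in the tree (equation (6))."""
    p = 1.0
    for rule in rule_instances:
        p *= grammar[rule]
    return p

# Rule instances of the Figure 5 tree, in the order of equation (7)
figure5_rules = [
    ("S", ("NP", "VP")), ("VP", ("V", "VP")), ("VP", ("ADV", "VP")),
    ("VP", ("V", "NP")), ("NP", ("Susan",)), ("V", ("might",)),
    ("ADV", ("not",)), ("V", ("believe",)), ("NP", ("you",)),
]
print(tree_probability(figure5_rules, grammar))   # approximately 2.268e-05, as in equation (8)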

The actual grammar rules and associated probabilities that we use in defining our syntactic module are derived from the WSJ corpus of the Penn Treebank by maximum-likelihood estimation. That is, for each rule N → φ used in the treebank, we add the rule to the grammar and set its probability to

C(N → φ) / Σ_ψ C(N → ψ),

where C(·) denotes the "count" of a rule (i.e., the number of times it is used in the corpus). A grammar composed in this manner is referred to as a treebank grammar, because its rules are directly derived from those in a treebank corpus.

We used sections 00–20 of the WSJ corpus of the Penn Treebank for collecting the rules and associated probabilities of our PCFG, which is implemented as a bottom-up chart parser. Before constructing the grammar, the treebank was preprocessed using known procedures (cf. Krotov et al. [1998]; Belz [2001]) to facilitate the construction of a rule list. Functional and anaphoric annotations (basically anything following a "-" in a node label; cf. Santorini [1990], Bies et al. [1995]) were removed from nonterminal labels. Nodes that dominate only "empty categories," such as traces, were removed. In addition, unary-branching constructions were removed by replacing the mother category in such a structure with the daughter node. (For example, given an instance of the rule X → YZ, if the daughter category Y were expanded by the unary rule Y → W, our algorithm would induce the single rule X → WZ.) Finally, we discarded all rules that had more than 10 symbols on the right-hand side (an arbitrary limit of our parser implementation). This resulted in a total of 18,820 rules, of which 11,156 were discarded as hapax legomena, leaving 7,664 rules in our treebank grammar. Table 11 shows some of the rules in our grammar with the highest and lowest corpus counts.
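A compressed sketch of this extraction procedure; the nested-list tree encoding, the omission of lexical expansions, and normalization over the retained rules are simplifying assumptions made for illustration.

from collections import Counter, defaultdict

def normalize(tree):
    """Strip functional/anaphoric annotation, drop empty-category material, and
    collapse unary-branching constructions (a node with a single nonterminal
    daughter is replaced by that daughter)."""
    if not isinstance(tree, list):                   # a terminal word
        return tree
    if tree[0] == "-NONE-":                          # empty category
        return None
    label = tree[0].split("-")[0].split("=")[0]
    children = [c for c in (normalize(ch) for ch in tree[1:]) if c is not None]
    if not children:                                 # dominated only empty material
        return None
    if len(children) == 1 and isinstance(children[0], list):
        return children[0]
    return [label] + children

def count_rules(trees):
    """Count phrasal rule instances; lexical expansions are ignored in this sketch."""
    counts = Counter()
    def visit(node):
        if isinstance(node, list):
            if node[1:] and all(isinstance(c, list) for c in node[1:]):
                rhs = tuple(c[0] for c in node[1:])
                counts[(node[0], rhs)] += 1
            for c in node[1:]:
                visit(c)
    for t in trees:
        t = normalize(t)
        if t is not None:
            visit(t)
    return counts

def estimate_pcfg(counts, min_count=2, max_rhs=10):
    """Relative-frequency estimates, discarding hapax rules and overlong right-hand
    sides; probabilities are normalized over the retained rules for simplicity."""
    kept = {r: c for r, c in counts.items() if c >= min_count and len(r[1]) <= max_rhs}
    lhs_totals = defaultdict(int)
    for (lhs, _), c in kept.items():
        lhs_totals[lhs] += c
    return {r: c / lhs_totals[r[0]] for r, c in kept.items()}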

5.3 Unlabeled Scope Determination
In this section we describe an experiment designed to assess the performance of parallel and serial approaches to combining grammatical modules, focusing on the task of unlabeled scope determination. This task involves predicting the most likely Q-structure representation for a sentence, basically the same task we addressed in Section 4 in comparing the performance levels of each type of classifier. The experiment of this section differs, however, from the task presented in Section 4 in that instead of providing a syntactic tree from the Penn Treebank as input to the classifier, we provide the model only with a string of words (a sentence). Our dual-component model will search for the optimal syntactic and scopal structures for the sentence (the pairing (τ*, χ*)) and will be evaluated based on its success in identifying the correct scope reading χ*.

Our concern in this section will be to determine whether it is necessary to search the space of possible pairings (τ, χ) of syntactic and scopal structures, or whether it is sufficient to use our PCFG first to fix the syntactic tree τ and then to choose a scope reading to maximize the probability of the pairing.


Table 11
Rules derived from sections 00–20 of the Penn Treebank WSJ corpus. "TOP" is a special "start" symbol that may expand to any of the symbols found at the root of a tree in the corpus.

Rule                            Corpus count
PP → IN NP                      59,053
TOP → S                         34,614
NP → DT NN                      28,074
NP → NP PP                      25,192
S → NP VP                       14,032
S → NP VP                       12,901
VP → TO VP                      11,598

S → CC PP NNP NNP VP            2
NP → DT `` NN NN NN '' ''       2
NP → NP PP PP PP PP PP          2
INTJ → UH UH                    2
NP → DT `` NN NNS               2
SBARQ → `` WP VP '' ''          2
S → PP NP VP '' ''              2

That is, are syntax and quantifier scope mutually dependent components of grammar, or can scope relations be "read off of" syntax? The serial model suggests that the optimal syntactic structure τ* should be chosen on the basis of the syntactic module only, as in equation (9), and the optimal quantifier scope structure χ* then chosen on the basis of τ*, as in equation (10). The parallel model, on the other hand, suggests that the most likely pairing of structures must be chosen in the joint probability space of both components, as in equation (11).

τ* = argmax_{τ ∈ S} P_S(τ | w_{1-n})                                         (9)

χ* = argmax_{χ ∈ Q} P_Q(χ | τ*, w_{1-n})                                     (10)

τ* = { τ | (τ, χ) = argmax_{τ ∈ S, χ ∈ Q} P_S(τ | w_{1-n}) P_Q(χ | τ, w_{1-n}) }   (11)

5.3.1 Experimental Design. For this experiment, we implement the scoping component as a single-layer feed-forward network, because the single-layer perceptron classifier had the best prediction rate among the three classifiers tested in Section 4. The softmax activation function we use for the output nodes of the classifier guarantees that the activations of all of the output nodes sum to one and can be interpreted as class probabilities. The syntactic component, of course, is determined by the treebank PCFG grammar described above.

Given these two models, which respectively define P_Q(χ | τ, w_{1-n}) and P_S(τ | w_{1-n}) from equation (11), it remains only to specify how to search the space of pairings (τ, χ) in performing this optimization to find χ*. Unfortunately, it is not feasible to examine all values τ ∈ S, since our PCFG will generally admit a huge number of trees for a sentence (especially given a mean sentence length of over 20 words in the WSJ corpus).12


Table 12
Performance of models on the unlabeled scope prediction task, summed over all 10 test runs.

Condition               Correct   Incorrect   Percentage correct
Parallel model
First has wide scope        168         113   168/281 = 59.8%
Second has wide scope        26          38   26/64 = 40.6%
No scope interaction        467          78   467/545 = 85.7%
Total                       661         229   661/890 = 74.3%
Serial model
First has wide scope        163         118   163/281 = 58.0%
Second has wide scope        27          37   27/64 = 42.2%
No scope interaction        461          84   461/545 = 84.6%
Total                       651         239   651/890 = 73.1%

Our solution to this search problem is to make the simplifying assumption that the syntactic tree that is used in the optimal set of structures (τ*, χ*) will always be among the top few trees τ for which P_S(τ | w_{1-n}) is the greatest. That is, although we suppose that quantifier scope information is relevant to parsing, we do not suppose that it is so strong a determinant as to completely override syntactic factors. In practice, this means that our parser will return the top 10 parses for each sentence, along with the probabilities assigned to them, and these are the only parses that are considered in looking for the optimal set of linguistic structures.

We again used 10-fold cross-validation in evaluating the competing models, dividing the scope-tagged corpus into 10 test sections of 89 sentences each, and we used the same version of the treebank grammar for our PCFG. The first model retrieved the top 10 syntactic parses (τ_0, ..., τ_9) for each sentence and computed the probability P(τ, χ) for each τ ∈ {τ_0, ..., τ_9}, χ ∈ {0, 1, 2}, choosing that scopal representation χ that was found in the maximum-probability pairing. We call this the parallel model, because the properties of each probabilistic model may influence the optimal structure chosen by the other. The second model retrieved only the Viterbi parse τ_0 from the PCFG and chose the scopal representation χ for which the pairing (τ_0, χ) took on the highest probability. We call this the serial model, because it represents syntactic phrase structure as independent of other components of grammar (in this case quantifier scope), though other components are dependent upon it.
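In code, the two decoding strategies differ only in how much of the joint space they search. The sketch below assumes a hypothetical parser interface returning the n best (tree, probability) pairs and a scope classifier returning P(χ | τ, sentence); both interfaces are illustrative, not our actual implementation.

def decode_parallel(parser, scope_model, sentence, n_best=10, scopes=(0, 1, 2)):
    """Parallel model: search the n-best trees and the three scope readings jointly,
    returning the pairing with maximum probability (equation (11))."""
    best, best_p = None, -1.0
    for tree, p_tree in parser.n_best_parses(sentence, n_best):
        for chi in scopes:
            p = p_tree * scope_model.prob(chi, tree, sentence)
            if p > best_p:
                best_p, best = p, (tree, chi)
    return best

def decode_serial(parser, scope_model, sentence, scopes=(0, 1, 2)):
    """Serial model: fix the Viterbi parse first (equation (9)), then pick the scope
    reading that is most probable given that single tree (equation (10))."""
    tree, _ = parser.n_best_parses(sentence, 1)[0]
    chi = max(scopes, key=lambda s: scope_model.prob(s, tree, sentence))
    return tree, chi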

5.3.2 Results. There was an appreciable difference in performance between these two models on the quantifier scope test sets. As shown in Table 12, the parallel model narrowly outperformed the serial model, by 1.2%. A 10-fold paired t-test on the test sections of the scope-tagged corpus shows that the parallel model is significantly better (p < .05).

12 Since we are allowing χ to range only over the three scope readings (0, 1, 2), however, it is possible to enumerate all values of χ to be paired with a given syntactic tree τ.


This result suggests that in determining the syntactic structure of a sentence, we must take aspects of structure into account that are not purely syntactic (such as quantifier scope). Searching both dimensions of the hypothesis space for our dual-component model allowed the composite model to handle the interdependencies between different aspects of grammatical structure, whereas fixing a phrase structure tree purely on the basis of syntactic considerations led to suboptimal performance in using that structure as a basis for determining quantifier scope.

6 Conclusion

In this article we have taken a statistical, corpus-based approach to the modeling of quantifier scope preferences, a subject that has previously been addressed only with systems of ad hoc rules derived from linguists' intuitive judgments. Our model takes its theoretical inspiration from Kuno, Takami, and Wu (1999), who suggest an "expert system" approach to scope preferences, and follows many other projects in the machine learning of natural language that combine information from multiple sources in solving linguistic problems.

Our results are generally supportive of the design that Kuno, Takami, and Wu propose for the quantifier scope component of grammar, and some of the features induced by our models find clear parallels in the factors that Kuno, Takami, and Wu claim to be relevant to scoping. In addition, our final experiment, in which we combine our quantifier scope module with a PCFG model of syntactic phrase structure, provides evidence of a grammatical architecture in which different aspects of structure mutually constrain one another. This result casts doubt on approaches in which syntactic processing is completed prior to the determination of other grammatical properties of a sentence, such as quantifier scope relations.

Appendix: Selected Codes Used to Annotate Syntactic Categories in the Penn Treebank, from Marcus et al. (1993) and Bies et al. (1995)

Part-of-speech tags

Tag    Meaning                     Tag    Meaning
CC     Conjunction                 RB     Adverb
CD     Cardinal number             RBR    Comparative adverb
DT     Determiner                  RBS    Superlative adverb
IN     Preposition                 TO     "to"
JJ     Adjective                   UH     Interjection
JJR    Comparative adjective       VB     Verb in base form
JJS    Superlative adjective       VBD    Past-tense verb
NN     Singular or mass noun       VBG    Gerundive verb
NNS    Plural noun                 VBN    Past participial verb
NNP    Singular proper noun        VBP    Non-3sg present-tense verb
NNPS   Plural proper noun          VBZ    3sg present-tense verb
PDT    Predeterminer               WP     WH pronoun
PRP    Personal pronoun            WP$    Possessive WH pronoun
PRP$   Possessive pronoun


Phrasal categories

Code     Meaning
ADJP     Adjective phrase
ADVP     Adverb phrase
INTJ     Interjection
NP       Noun phrase
PP       Prepositional phrase
QP       Quantifier phrase (i.e., measure/amount phrase)
S        Declarative clause
VP       Verb phrase
SBAR     Clause introduced by a subordinating conjunction
SBARQ    Clause introduced by a WH phrase
SINV     Inverted declarative sentence
SQ       Inverted yes/no question following the WH phrase in SBARQ

Acknowledgments
The authors are grateful for an Academic Technology Innovation Grant from the University of Chicago, which helped to make this work possible, and to John Goldsmith, Terry Regier, Anne Pycha, and Bob Moore, whose advice and collaboration have considerably aided the research reported in this article. Any remaining errors are, of course, our own.

References
Aoun, Joseph and Yen-hui Audrey Li. 1993. The Syntax of Scope. MIT Press, Cambridge.
Aoun, Joseph and Yen-hui Audrey Li. 2000. Scope, structure, and expert systems: A reply to Kuno et al. Language, 76(1):133–155.
Belz, Anja. 2001. Optimisation of corpus-derived probabilistic grammars. In Proceedings of Corpus Linguistics 2001, pages 46–57.
Berger, Adam L., Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.
Bies, Ann, Mark Ferguson, Karen Katz, and Robert MacIntyre. 1995. Bracketing guidelines for Treebank II style. Technical report, Penn Treebank Project, University of Pennsylvania.
Bishop, Christopher M. 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bridle, John S. 1990. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fougelman-Soulie and J. Herault, editors, Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag, Berlin, pages 227–236.
Brill, Eric. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565.
Carden, Guy. 1976. English Quantifiers: Logical Structure and Linguistic Variation. Academic Press, New York.
Carletta, Jean. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249–254.
Charniak, Eugene. 1993. Statistical Language Learning. MIT Press, Cambridge.
Charniak, Eugene. 1996. Tree-bank grammars. In AAAI/IAAI, vol. 2, pages 1031–1036.
Cohen, Jacob. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46.
Cooper, Robin. 1983. Quantification and Syntactic Theory. Reidel, Dordrecht.
Fleiss, Joseph L. 1981. Statistical Methods for Rates and Proportions. John Wiley & Sons, New York.
Hobbs, Jerry R. and Stuart M. Shieber. 1987. An algorithm for generating quantifier scopings. Computational Linguistics, 13:47–63.
Hornstein, Norbert. 1995. Logical Form: From GB to Minimalism. Blackwell, Oxford and Cambridge.
Jelinek, Frederick. 1998. Statistical Methods for Speech Recognition. MIT Press, Cambridge.
Jurafsky, Daniel and James H. Martin. 2000. Speech and Language Processing. Prentice Hall, Upper Saddle River, New Jersey.
Krippendorff, Klaus. 1980. Content Analysis: An Introduction to Its Methodology. Sage Publications, Beverly Hills, California.


Krotov, Alexander, Mark Hepple, Robert J. Gaizauskas, and Yorick Wilks. 1998. Compacting the Penn treebank grammar. In COLING-ACL, pages 699–703.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 1999. Quantifier scope in English, Chinese, and Japanese. Language, 75(1):63–111.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 2001. Response to Aoun and Li. Language, 77(1):134–143.
Manning, Christopher D. and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge.
Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
Martin, Paul, Douglas Appelt, and Fernando Pereira. 1986. Transportability and generality in a natural-language interface system. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 585–593.
May, Robert. 1985. Logical Form: Its Structure and Derivation. MIT Press, Cambridge.
McCawley, James D. 1998. The Syntactic Phenomena of English. University of Chicago Press, Chicago, second edition.
Mitchell, Tom M. 1996. Machine Learning. McGraw Hill, New York.
Montague, Richard. 1973. The proper treatment of quantification in ordinary English. In J. Hintikka et al., editors, Approaches to Natural Language. Reidel, Dordrecht, pages 221–242.
Moran, Douglas B. 1988. Quantifier scoping in the SRI core language engine. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics (ACL'88), pages 33–40.
Moran, Douglas B. and Fernando C. N. Pereira. 1992. Quantifier scoping. In Hiyan Alshawi, editor, The Core Language Engine. MIT Press, Cambridge, pages 149–172.
Nerbonne, John. 1993. A feature-based syntax/semantics interface. In A. Manaster-Ramer and W. Zadrozsny, editors, Annals of Mathematics and Artificial Intelligence (Special Issue on Mathematics of Language), 8(1–2):107–132. Also published as DFKI Research Report RR-92-42.
Park, Jong C. 1995. Quantifier scope and constituency. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL'95), pages 205–212.
Pedersen, Ted. 2000. A simple approach to building ensembles of naïve Bayesian classifiers for word sense disambiguation. In Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2000), pages 63–69.
Pereira, Fernando. 1990. Categorial semantics and scoping. Computational Linguistics, 16(1):1–10.
Pollard, Carl. 1989. The syntax-semantics interface in a unification-based phrase structure grammar. In S. Busemann, C. Hauenschild, and C. Umbach, editors, Views of the Syntax/Semantics Interface, KIT-FAST Report 74, Technical University of Berlin, pages 167–185.
Pollard, Carl and Eun Jung Yoo. 1998. A unified theory of scope for quantifiers and WH-phrases. Journal of Linguistics, 34(2):415–446.
Ratnaparkhi, Adwait. 1997. A simple introduction to maximum entropy models for natural language processing. Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania.
Santorini, Beatrice. 1990. Part-of-speech tagging guidelines for the Penn Treebank project. Technical Report MS-CIS-90-47, Department of Computer and Information Science, University of Pennsylvania.
Soon, Wee Meng, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544.
van Halteren, Hans, Jakub Zavrel, and Walter Daelemans. 2001. Improving accuracy in word class tagging through the combination of machine learning systems. Computational Linguistics, 27(2):199–229.
VanLehn, Kurt A. 1978. Determining the scope of English quantifiers. Technical Report AITR-483, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Cambridge.
Woods, William A. 1986. Semantics and quantification in natural language question answering. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 205–248.

Page 11: A Machine Learning Approach to Modeling Scope Preferences · 2003-11-13 · ®c2003 Association for Computational Linguistics A Machine Learning Approach to Modeling Scope Preferences

83

Higgins and Sadock Modeling Scope Preferences

Table 3Performance of the naive Bayes classier summed over all 10 test runs

Condition Correct Incorrect Percentage correct

First has wide scope 177 104 177281 = 630Second has wide scope 41 23 4164 = 641

No scope interaction 428 117 428545 = 785Total 646 244 646890 = 726

431 Naive Bayes Classier Our data D will consist of a vector of features (d0 dn)that represent aspects of the sentence under examination such as whether one quan-tied expression c-commands the other as described in Section 42 The fundamentalsimplifying assumption that we make in designing a naive Bayes classier is thatthese features are independent of one another and therefore can be aggregated as in-dependent sources of evidence about which class c curren a given sentence belongs to Thisindependence assumption is formalized in equations (1) and (2)

c curren = arg maxc

P(c)P(d0 dn j c) (1)

ordm arg maxc

P(c)nY

k = 0

P(dk j c) (2)

We constructed an empirical estimate of the prior probability P(c) by simply count-ing the frequency with which each class occurs in the training data We constructedeach P(dk j c) by counting how often each feature dk co-occurs with the class c toconstruct the empirical estimate P(dk j c) and interpolated this with the empiricalfrequency P(dk) of the feature dk not conditioned on the class c This interpolatedprobability model was used in order to smooth the probability distribution avoidingthe problems that can arise if certain feature-value pairs are assigned a probability ofzero

The performance of the naive Bayes classier is summarized in Table 3 For eachof the 10 test sets of 89 items taken from the corpus the remaining 804 of the total893 sentences were used to train the model The naive Bayes classier outperformedthe baseline by a considerable margin

In addition to the raw counts of test examples correctly classied though wewould like to know something of the internal structure of the model (ie what sort offeatures it has induced from the data) For this classier we can assume that a featuref is a good predictor of a class c curren when the value of P(f j c curren ) is signicantly largerthan the (geometric) mean value of P(f j c) for all other values of c Those featureswith the greatest ratio P(f ) pound P(f jccurren)

geommean( 8 c 6= ccurren [P(f jc)])are listed in Table 49

The rst-ranked feature in Table 4 shows that there is a tendency for quanti-ed elements not to interact when they are found in conjoined constituents and thesecond-ranked feature indicates a preference for quantiers not to interact when thereis an intervening comma (presumably an indicator of greater syntactic ldquodistancerdquo)Feature 3 indicates a preference for class 1 when there is an intervening S node

9 We include the term P(f ) in the product in order to prevent sparsely instantiated features fromshowing up as highly-ranked

84

Computational Linguistics Volume 29 Number 1

Table 4Most active features from naive Bayes classier

Rank Feature Predicted Ratioclass

1 There is an intervening conjunct node 0 1632 There is an intervening comma 0 1513 There is an intervening S node 1 1334 The rst quantied NP does not c-command the second 0 1255 Second quantier is tagged QP 1 1166 There is an intervening S node 0 112

15 The second quantied NP c-commands the rst 2 100

whereas feature 6 indicates a preference for class 0 under the same conditions Pre-sumably this reects a dispreference for the second quantier to take wide scopewhen there is a clause boundary intervening between it and the rst quantier Thefourth-ranked feature in Table 4 indicates that if the rst quantied NP does notc-command the second it is less likely to take wide scope This is not surprisinggiven the importance that c-command relations have had in theoretical discussionsof quantier scope The fth-ranked feature expresses a preference for quantied ex-pressions of category QP to take narrow scope if they are the second of the twoquantiers under consideration This may simply be reective of the fact that class1 is more common than class 2 and the measure expressions found in QP phrasesin the Penn Treebank (such as more than three or about half ) tend not to be logicallyindependent of other quantiers Finally the feature 15 in Table 4 indicates a highcorrelation between the second quantied expressionrsquos c-commanding the rst andthe second quantierrsquos taking wide scope We can easily see this as a translation intoour feature set of Kuno Takami and Wursquos claim that subjects tend to outscope ob-jects and obliques and topicalized elements tend to take wide scope Some of thesetop-ranked features have to do with information found only in the written mediumbut on the whole the features induced by the naive Bayes classier seem consis-tent with those suggested by Kuno Takami and Wu although they are distinct bynecessity

4.3.2 Maximum-Entropy Classifier. The maximum-entropy classifier is a sort of log-linear model, defining the joint probability of a class and a data vector (d_0, ..., d_n) as the product of the prior probability of the class c with a set of features related to the data:¹⁰

P(d_0, ..., d_n, c) = (P(c) / Z) ∏_{k=0}^{n} α_k    (3)

This classifier superficially resembles in form the naive Bayes classifier in equation (2), but it differs from that classifier in that the way in which values for each α are chosen does not assume that the features in the data are independent. For each of the 10 training sets, we used the generalized iterative scaling algorithm to train this classifier on 654 training examples, using 150 examples for validation to choose the best set of

10. Z in equation (3) is simply a normalizing constant that ensures that we end up with a probability distribution.


Table 5
Performance of the maximum-entropy classifier, summed over all 10 test runs.

Condition               Correct  Incorrect  Percentage correct
First has wide scope    148      133        148/281 = 52.7%
Second has wide scope   31       33         31/64  = 48.4%
No scope interaction    475      70         475/545 = 87.2%
Total                   654      236        654/890 = 73.5%

Table 6
Most active features from the maximum-entropy classifier.

Rank  Feature                                               Predicted class  α · P(c)^.25
1     Second quantifier is each                             2                1.13
2     There is an intervening comma                         0                1.01
3     There is an intervening conjunct node                 0                1.00
4     First quantified NP does not c-command the second     0                0.99
5     Second quantifier is every                            2                0.98
6     There is an intervening quotation mark (")            0                0.95
7     There is an intervening colon                         0                0.95
...
12    First quantified NP c-commands the second             1                0.92
25    There is no intervening comma                         1                0.90

values for the αs.¹¹ Test data could then be classified by choosing the class for the data that maximizes the joint probability in equation (3).
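A minimal sketch of the classification step under equation (3) follows. The weight-training step (generalized iterative scaling) is omitted, and the weights here are assumed to be indexed by both class and feature, which is one common way of setting up such a model even though equation (3) as printed does not show the class index; the argument names are placeholders.

```python
import math

def maxent_classify(feats, priors, alphas):
    """Pick the class c maximizing P(c) * prod_k alpha[c][k] over active features (eq. 3, up to Z).

    feats  : dict mapping feature index k -> 0/1 value
    priors : dict mapping class -> empirical prior P(c)
    alphas : dict mapping class -> dict of feature index -> weight alpha_k
    """
    best_c, best_score = None, float("-inf")
    for c, prior in priors.items():
        # Work in log space; Z is constant across classes, so it can be ignored here.
        score = math.log(prior)
        for k, value in feats.items():
            if value:                               # only active features contribute
                score += math.log(alphas[c].get(k, 1.0))
                # a weight of 1.0 for an unseen feature leaves the score unchanged
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```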

The results of training with the maximum-entropy classifier are shown in Table 5. The classifier showed slightly higher performance than the naive Bayes classifier, with the lowest error rate on the class of sentences having no scope interaction.

To determine exactly which features of the data the maximum-entropy classifier sees as relevant to the classification problem, we can simply look at the α values (from equation (3)) for each feature. Those features with higher values for α are weighted more heavily in determining the proper scoping. Some of the features with the highest values for α are listed in Table 6. Because of the way the classifier is built, predictor features for class 2 need to have higher loadings to overcome the lower prior probability of the class. Therefore we actually rank the features in Table 6 according to α · P(c)^k, where P(c) represents the empirical prior probability of a class c and k is simply a constant (.25 in this case) chosen to try to get a mix of features for different classes at the top of the list.

The features ranked first and fifth in Table 6 express lexical preferences for certain quantifiers to take wide scope, even when they are the second of the two quantifiers according to linear order in the string of words. The tendency for each to take wide scope is stronger than for the other quantifier, which is in line with Kuno, Takami, and Wu's decision to list it as the only quantifier with a lexical preference for scoping. Feature 2 makes the "no scope interaction" class more likely if a comma intervenes, and

11. Overtraining is not a problem with the pure version of the generalized iterative scaling algorithm. For efficiency reasons, however, we chose to take the training corpus as representative of the event space, rather than enumerating the space exhaustively (see Jelinek [1998] for details). For this reason it was necessary to employ validation in training.


Table 7
Performance of the single-layer perceptron, summed over all 10 test runs.

Condition               Correct  Incorrect  Percentage correct
First has wide scope    182      99         182/281 = 64.8%
Second has wide scope   35       29         35/64  = 54.7%
No scope interaction    468      77         468/545 = 85.9%
Total                   685      205        685/890 = 77.0%

feature 25 makes a wide-scope reading for the first quantifier more likely if there is no intervening comma. The third-ranked feature expresses the tendency, mentioned above, for quantifiers in conjoined clauses not to interact. Features 4 and 12 indicate that if the first quantified expression c-commands the second, it is likely to take wide scope, and that if this is not the case, there is likely to be no scope interaction. Finally, the sixth- and seventh-ranked features in the table show that an intervening quotation mark or colon will make the classifier tend toward class 0, "no scope interaction," which is easy to understand: Quotations are often opaque to quantifier scope interactions. The top features found by the maximum-entropy classifier largely coincide with those found by the naive Bayes model, which indicates that these generalizations are robust and objectively present in the data.

4.3.3 Single-Layer Perceptron. For our neural network classifier, we employed a feed-forward single-layer perceptron, with the softmax function used to determine the activation of nodes at the output layer, because this is a one-of-n classification task (Bridle 1990). The data to be classified are presented as a vector of features at the input layer, and the output layer has three nodes representing the three possible classes for the data: "first has wide scope," "second has wide scope," and "no scope interaction." The output node with the highest activation is interpreted as the class of the datum presented at the input layer.
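A minimal sketch of such a network is given below. The initialization scale, learning rate, and data layout are assumptions; with no hidden layer, error backpropagation reduces to the standard cross-entropy gradient for a softmax linear classifier, which is what the update rule implements.

```python
import numpy as np

class SoftmaxPerceptron:
    """Single-layer feed-forward network with softmax outputs over three scope classes."""

    def __init__(self, n_features, n_classes=3, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.01, size=(n_classes, n_features))
        self.b = np.zeros(n_classes)
        self.lr = lr

    def _softmax(self, z):
        z = z - z.max()                       # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    def predict_proba(self, x):
        return self._softmax(self.W @ x + self.b)

    def predict(self, x):
        return int(np.argmax(self.predict_proba(x)))

    def train_epoch(self, X, y):
        # One pass of error backpropagation; for a single layer this is the usual
        # gradient of the cross-entropy loss with softmax outputs.
        for x, target in zip(X, y):
            p = self.predict_proba(x)
            err = p.copy()
            err[target] -= 1.0                # dL/dz for softmax + cross-entropy
            self.W -= self.lr * np.outer(err, x)
            self.b -= self.lr * err
```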

For each of the 10 test sets of 89 examples, we trained the connection weights of the network using error backpropagation on 654 training sentences, reserving 150 sentences for validation in order to choose the weights from the training epoch with the highest classification performance. In Table 7 we present the results of the single-layer neural network in classifying our test sentences. As the table shows, the single-layer perceptron has much better classification performance than the naive Bayes classifier and maximum-entropy model, possibly because the training of the network aims to minimize error in the activation of the classification output nodes, which is directly related to the classification task at hand, whereas the other models do not directly make use of the notion of "classification error." The perceptron also uses a sort of weighted voting and could be interpreted as an implementation of Kuno, Takami, and Wu's proposal for scope determination. This clearly illustrates that the tenability of their proposal hinges on the exact details of its implementation, since all of our classifier models are reasonable interpretations of their approach, but they have very different performance results on our scope determination task.

To determine exactly which features of the data the network sees as relevant to the classification problem, we can simply look at the connection weights for each feature-class pair. Higher connection weights indicate a greater correlation between input features and output classes. For one of the 10 networks we trained, some of the features with the highest connection weights are listed in Table 8. Since class 0 is


Table 8
Most active features from the single-layer perceptron.

Rank  Feature                                            Predicted class  Weight
1     There is an intervening comma                      0                4.31
2     Second quantifier is all                           0                3.77
3     There is an intervening colon                      0                2.98
4     There is an intervening conjunct node              0                2.72
...
17    The first quantified NP c-commands the second      1                1.69
18    Second quantifier is tagged RBS                    2                1.69
19    There is an intervening S node                     1                1.61
20    Second quantifier is each                          2                1.50

simply more frequent in the training data than the other two classes, the weights for this class tend to be higher. Therefore, we also list some of the best predictor features for classes 1 and 2 in the table.

The first- and third-ranked features in Table 8 show that an intervening comma or colon will make the classifier tend toward class 0, "no scope interaction." This finding by the classifier is similar to the maximum-entropy classifier's finding an intervening quotation mark relevant, and can be taken as an indication that quantifiers in distant syntactic subdomains are unlikely to interact. Similarly, the fourth-ranked feature indicates that quantifiers in separate conjuncts are unlikely to interact. The second-ranked feature in the table expresses a tendency for there to be no scope interaction between two quantifiers if the second of them is headed by all. This may be related to the independence of universal quantifiers (∀x∀y[φ] ⟺ ∀y∀x[φ]). Feature 17 in Table 8 indicates a high correlation between the first quantified expression's c-commanding the second and the first quantifier's taking wide scope, which again supports Kuno, Takami, and Wu's claim that scope preferences are related to syntactic superiority relations. Feature 18 expresses a preference for a quantified expression headed by most to take wide scope, even if it is the second of the two quantifiers (since most is the only quantifier in the corpus that bears the tag RBS). Feature 19 indicates that the first quantifier is more likely to take wide scope if there is a clause boundary intervening between the two quantifiers, which supports the notion that the syntactic distance between the quantifiers is relevant to scope preferences. Finally, feature 20 expresses the well-known tendency for quantified expressions headed by each to take wide scope.

4.4 Summary of Results
Table 9 summarizes the performance of the quantifier scope models we have presented here. All of the classifiers have test set accuracy above the baseline, which a paired t-test reveals to be significant at the .001 level. The differences between the naive Bayes, maximum-entropy, and single-layer perceptron classifiers are not statistically significant.

The classifiers performed significantly better on those sentences annotated consistently by both human coders at the beginning of the study, reinforcing the view that this subset of the data is somehow simpler and more representative of the basic regularities in scope preferences. For example, the single-layer perceptron classified 82.9% of these sentences correctly. To further investigate the nature of the variation between the two coders, we constructed a version of our single-layer network that was trained


Table 9
Summary of classifier results.

                          Training data   Validation data   Test data
Baseline                  —               —                 61.2%
Naive Bayes               76.7%           —                 72.6%
Maximum entropy           78.3%           75.5%             73.5%
Single-layer perceptron   84.7%           76.8%             77.0%

on the data on which both coders agreed and tested on the remaining sentences. This classifier agreed with the reference coding (the coding of the first coder) 51.4% of the time, and with the additional independent coder 35.8% of the time. The first coder constructed the annotation guidelines for this project and may have been more successful in applying them consistently. Alternatively, it is possible that different individuals use different strategies in determining scope preferences, and the strategy of the second coder may simply have been less similar than the strategy of the first coder to that of the single-layer network.

These three classifiers directly implement a sort of weighted voting, the method of aggregating evidence proposed by Kuno, Takami, and Wu (although the classifiers' implementation is slightly more sophisticated than the unweighted voting that is actually used in Kuno, Takami, and Wu's paper). Of course, since we do not use exactly the set of features suggested by Kuno, Takami, and Wu, our model should not be seen as a straightforward implementation of the theory outlined in their 1999 paper. Nevertheless, the results in Table 9 suggest that Kuno, Takami, and Wu's suggested design can be used with some success in modeling scope preferences. Moreover, the project undertaken here provides an answer to some of the objections that Aoun and Li (2000) raise to Kuno, Takami, and Wu. Aoun and Li claim that Kuno, Takami, and Wu's choice of experts is seemingly arbitrary and that it is unclear how the voting weights of each expert are to be set, but the machine learning approach we employ in this article is capable of addressing both of these potential problems. Supervised training of our classifiers is a straightforward approach to setting the weights and also constitutes our approach to selecting features (or "experts," in Kuno, Takami, and Wu's terminology). In the training process, any feature that is irrelevant to scoping preferences should receive weights that make its effect negligible.

5. Syntax and Scope

In this section we show how the classifier models of quantifier scope determination introduced in Section 4 may be integrated with a PCFG model of syntax. We compare two different ways in which the two components may be combined, which may loosely be termed serial and parallel, and argue for the latter on the basis of empirical results.

5.1 Modular Design
Our use of a phrase structure syntactic component and a quantifier scope component to define a combined language model is simplified by the fact that our classifiers are probabilistic and define a conditional probability distribution over quantifier scopings. The probability distributions that our classifiers define for quantifier scope structures are conditional on syntactic phrase structure, because they are computed on the basis


of syntactically provided features, such as the number of nodes of a certain type that intervene between two quantifiers in a phrase structure tree.

Thus, the combined language model that we define in this article assigns probabilities according to the pairs of structures that may be assigned to a sentence by the Q-structure and phrase structure syntax modules. The probability of a word string w_{1-n} is therefore defined as in equation (4), where Q ranges over the set 𝒬 of all possible Q-structures and S ranges over the set 𝒮 of all possible syntactic structures:

P(w_{1-n}) = Σ_{S ∈ 𝒮, Q ∈ 𝒬} P(S, Q | w_{1-n})    (4)

           = Σ_{S ∈ 𝒮, Q ∈ 𝒬} P(S | w_{1-n}) P(Q | S, w_{1-n})    (5)

Equation (5) shows how we can use the definition of conditional probability to break our calculation of the language model probability into two parts. The first of these parts, P(S | w_{1-n}), which we may abbreviate as simply P(S), is the probability of a particular syntactic tree structure's being assigned to a particular word string. We model this probability using a probabilistic phrase structure grammar (cf. Charniak [1993, 1996]). The second distribution on the right side of equation (5) is the conditional probability of a particular quantifier scope structure's being assigned to a particular word string, given the syntactic structure of that string. This probability is written as P(Q | S, w_{1-n}), or simply P(Q | S), and represents the quantity we estimated above in constructing classifiers to predict the scopal representation of a sentence based on aspects of its syntactic structure.

Thus, given a PCFG model of syntactic structure and a probabilistically defined classifier of the sort introduced in Section 4, it is simple to determine the probability of any pairing of two particular structures from each domain for a given sentence. We simply multiply the values of P(S) and P(Q | S) to obtain the joint probability P(Q, S). In the current section we examine two different models of combination for these components: one in which scope determination is applied to the optimal syntactic structure (the Viterbi parse), and one in which optimization is performed in the space of both modules to find the optimal pairing of syntactic and quantifier scope structures.
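The multiplication itself is the entire interface between the two modules; a minimal sketch is shown below, assuming a hypothetical scope_classifier function that maps a tree to a distribution over the three scope classes and a parse probability supplied by the PCFG.

```python
def joint_probability(parse_prob, tree, scope_class, scope_classifier):
    """P(Q, S) = P(S) * P(Q | S): PCFG parse probability times the classifier's conditional."""
    q_dist = scope_classifier(tree)        # e.g. {0: 0.7, 1: 0.2, 2: 0.1}
    return parse_prob * q_dist[scope_class]
```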

5.2 The Syntactic Module
Before turning to the application of our multimodular approach to the problem of scope determination in Section 5.3, we present here a short overview of the phrase structure syntactic component used in these projects. As noted above, we model syntax as a probabilistic phrase structure grammar (PCFG), and in particular we use a treebank grammar (Charniak 1996) trained on the Penn Treebank.

A PCFG defines the probability of a string of words as the sum of the probabilities of all admissible phrase structure parses (trees) for that string. The probability of a given tree is the product of the probability of all of the rule instances used in the construction of that tree, where rules take the form N → φ, with N a nonterminal symbol and φ a finite sequence of one or more terminals or nonterminals.

To take an example, Figure 5 illustrates a phrase structure tree for the sentence Susan might not believe you, which is admissible according to the grammar in Table 10. (All of the minimal subtrees in Figure 5 are instances of one of our rules.) The probability


Figure 5
A simple phrase structure tree.

Table 10
A simple probabilistic phrase structure grammar.

Rule              Probability
S → NP VP         .7
S → VP            .2
S → V NP VP       .1
VP → V VP         .3
VP → ADV VP       .1
VP → V             .1
VP → V NP         .3
VP → V NP NP      .2
NP → Susan        .3
NP → you          .4
NP → Yves         .3
V → might         .2
V → believe       .3
V → show          .3
V → stay          .2
ADV → not         .5
ADV → always      .5


of this tree, which we can indicate as τ, can be calculated as in equation (6):

P(τ) = ∏_{ξ ∈ τ} P(ξ)    (6)

     = P(S → NP VP) × P(VP → V VP) × P(VP → ADV VP)
       × P(VP → V NP) × P(NP → Susan) × P(V → might)
       × P(ADV → not) × P(V → believe) × P(NP → you)    (7)

     = .7 × .3 × .1 × .3 × .3 × .2 × .5 × .3 × .4 = 2.268 × 10^-5    (8)

The actual grammar rules and associated probabilities that we use in defining our syntactic module are derived from the WSJ corpus of the Penn Treebank by maximum-likelihood estimation. That is, for each rule N → φ used in the treebank, we add the rule to the grammar and set its probability to

C(N → φ) / Σ_ψ C(N → ψ),

where C(·) denotes the "count" of a rule (i.e., the number of times it is used in the corpus). A grammar composed in this manner is referred to as a treebank grammar, because its rules are directly derived from those in a treebank corpus.

We used sections 00–20 of the WSJ corpus of the Penn Treebank for collecting the rules and associated probabilities of our PCFG, which is implemented as a bottom-up chart parser. Before constructing the grammar, the treebank was preprocessed using known procedures (cf. Krotov et al. [1998]; Belz [2001]) to facilitate the construction of a rule list. Functional and anaphoric annotations (basically anything following a "-" in a node label; cf. Santorini [1990], Bies et al. [1995]) were removed from nonterminal labels. Nodes that dominate only "empty categories," such as traces, were removed. In addition, unary-branching constructions were removed by replacing the mother category in such a structure with the daughter node. (For example, given an instance of the rule X → Y Z, if the daughter category Y were expanded by the unary rule Y → W, our algorithm would induce the single rule X → W Z.) Finally, we discarded all rules that had more than 10 symbols on the right-hand side (an arbitrary limit of our parser implementation). This resulted in a total of 18,820 rules, of which 11,156 were discarded as hapax legomena, leaving 7,664 rules in our treebank grammar. Table 11 shows some of the rules in our grammar with the highest and lowest corpus counts.

5.3 Unlabeled Scope Determination
In this section we describe an experiment designed to assess the performance of parallel and serial approaches to combining grammatical modules, focusing on the task of unlabeled scope determination. This task involves predicting the most likely Q-structure representation for a sentence, basically the same task we addressed in Section 4 in comparing the performance levels of each type of classifier. The experiment of this section differs, however, from the task presented in Section 4 in that instead of providing a syntactic tree from the Penn Treebank as input to the classifier, we provide the model only with a string of words (a sentence). Our dual-component model will search for the optimal syntactic and scopal structures for the sentence (the pairing (τ*, χ*)) and will be evaluated based on its success in identifying the correct scope reading χ*.

Our concern in this section will be to determine whether it is necessary to search the space of possible pairings (τ, χ) of syntactic and scopal structures, or whether it is sufficient to use our PCFG first to fix the syntactic tree τ and then to choose


Table 11
Rules derived from sections 00–20 of the Penn Treebank WSJ corpus. "TOP" is a special "start" symbol that may expand to any of the symbols found at the root of a tree in the corpus.

Rule                               Corpus count
PP → IN NP                         59,053
TOP → S                            34,614
NP → DT NN                         28,074
NP → NP PP                         25,192
S → NP VP                          14,032
S → NP VP                          12,901
VP → TO VP                         11,598

S → CC PP NNP NNP VP               2
NP → DT `` NN NN NN ''             2
NP → NP PP PP PP PP PP             2
INTJ → UH UH                       2
NP → DT `` NN NNS                  2
SBARQ → `` WP VP ''                2
S → PP NP VP ''                    2

a scope reading to maximize the probability of the pairing. That is, are syntax and quantifier scope mutually dependent components of grammar, or can scope relations be "read off of" syntax? The serial model suggests that the optimal syntactic structure τ* should be chosen on the basis of the syntactic module only, as in equation (9), and the optimal quantifier scope structure χ* then chosen on the basis of τ*, as in equation (10). The parallel model, on the other hand, suggests that the most likely pairing of structures must be chosen in the joint probability space of both components, as in equation (11):

τ* = argmax_{τ ∈ 𝒮} P_S(τ | w_{1-n})    (9)

χ* = argmax_{χ ∈ 𝒬} P_Q(χ | τ*, w_{1-n})    (10)

τ* = {τ | (τ, χ) = argmax_{τ ∈ 𝒮, χ ∈ 𝒬} P_S(τ | w_{1-n}) P_Q(χ | τ, w_{1-n})}    (11)

5.3.1 Experimental Design. For this experiment, we implement the scoping component as a single-layer feed-forward network, because the single-layer perceptron classifier had the best prediction rate among the three classifiers tested in Section 4. The softmax activation function we use for the output nodes of the classifier guarantees that the activations of all of the output nodes sum to one and can be interpreted as class probabilities. The syntactic component, of course, is determined by the treebank PCFG grammar described above.

Given these two models, which respectively define P_Q(χ | τ, w_{1-n}) and P_S(τ | w_{1-n}) from equation (11), it remains only to specify how to search the space of pairings (τ, χ) in performing this optimization to find χ*. Unfortunately, it is not feasible to examine all values τ ∈ 𝒮, since our PCFG will generally admit a huge number of


Table 12
Performance of models on the unlabeled scope prediction task, summed over all 10 test runs.

Condition               Correct  Incorrect  Percentage correct

Parallel model
First has wide scope    168      113        168/281 = 59.8%
Second has wide scope   26       38         26/64  = 40.6%
No scope interaction    467      78         467/545 = 85.7%
Total                   661      229        661/890 = 74.3%

Serial model
First has wide scope    163      118        163/281 = 58.0%
Second has wide scope   27       37         27/64  = 42.2%
No scope interaction    461      84         461/545 = 84.6%
Total                   651      239        651/890 = 73.1%

trees for a sentence (especially given a mean sentence length of over 20 words in the WSJ corpus).¹² Our solution to this search problem is to make the simplifying assumption that the syntactic tree that is used in the optimal set of structures (τ*, χ*) will always be among the top few trees τ for which P_S(τ | w_{1-n}) is the greatest. That is, although we suppose that quantifier scope information is relevant to parsing, we do not suppose that it is so strong a determinant as to completely override syntactic factors. In practice, this means that our parser will return the top 10 parses for each sentence, along with the probabilities assigned to them, and these are the only parses that are considered in looking for the optimal set of linguistic structures.

We again used 10-fold cross-validation in evaluating the competing models, dividing the scope-tagged corpus into 10 test sections of 89 sentences each, and we used the same version of the treebank grammar for our PCFG. The first model retrieved the top 10 syntactic parses (τ_0, ..., τ_9) for each sentence and computed the probability P(τ, χ) for each τ ∈ {τ_0, ..., τ_9}, χ ∈ {0, 1, 2}, choosing the scopal representation χ that was found in the maximum-probability pairing. We call this the parallel model, because the properties of each probabilistic model may influence the optimal structure chosen by the other. The second model retrieved only the Viterbi parse τ_0 from the PCFG and chose the scopal representation χ for which the pairing (τ_0, χ) took on the highest probability. We call this the serial model, because it represents syntactic phrase structure as independent of other components of grammar (in this case, quantifier scope), though other components are dependent upon it.
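The two decision rules can be sketched as follows, assuming a hypothetical parser function that returns the n best (tree, probability) pairs for a sentence and a scope_classifier function that returns P(χ | τ) over the three scope classes; both names are placeholders rather than the authors' actual code.

```python
def predict_scope_parallel(sentence, parser, scope_classifier, n_best=10):
    """Parallel model: maximize P_S(tree) * P_Q(scope | tree) over the n-best parses."""
    best_scope, best_joint = None, -1.0
    for tree, p_tree in parser(sentence, n_best=n_best):
        q_dist = scope_classifier(tree)              # e.g. {0: p0, 1: p1, 2: p2}
        for scope_class, p_scope in q_dist.items():
            if p_tree * p_scope > best_joint:
                best_joint = p_tree * p_scope
                best_scope = scope_class
    return best_scope

def predict_scope_serial(sentence, parser, scope_classifier):
    """Serial model: fix the Viterbi parse first, then pick the best scope reading for it."""
    (viterbi_tree, _), *rest = parser(sentence, n_best=1)
    q_dist = scope_classifier(viterbi_tree)
    return max(q_dist, key=q_dist.get)
```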

5.3.2 Results. There was an appreciable difference in performance between these two models on the quantifier scope test sets. As shown in Table 12, the parallel model narrowly outperformed the serial model, by 1.2%. A 10-fold paired t-test on the test sections of the scope-tagged corpus shows that the parallel model is significantly better (p < .05).

12. Since we are allowing χ to range only over the three scope readings (0, 1, 2), however, it is possible to enumerate all values of χ to be paired with a given syntactic tree τ.


This result suggests that in determining the syntactic structure of a sentence, we must take aspects of structure into account that are not purely syntactic (such as quantifier scope). Searching both dimensions of the hypothesis space for our dual-component model allowed the composite model to handle the interdependencies between different aspects of grammatical structure, whereas fixing a phrase structure tree purely on the basis of syntactic considerations led to suboptimal performance in using that structure as a basis for determining quantifier scope.

6. Conclusion

In this article we have taken a statistical, corpus-based approach to the modeling of quantifier scope preferences, a subject that has previously been addressed only with systems of ad hoc rules derived from linguists' intuitive judgments. Our model takes its theoretical inspiration from Kuno, Takami, and Wu (1999), who suggest an "expert system" approach to scope preferences, and follows many other projects in the machine learning of natural language that combine information from multiple sources in solving linguistic problems.

Our results are generally supportive of the design that Kuno, Takami, and Wu propose for the quantifier scope component of grammar, and some of the features induced by our models find clear parallels in the factors that Kuno, Takami, and Wu claim to be relevant to scoping. In addition, our final experiment, in which we combine our quantifier scope module with a PCFG model of syntactic phrase structure, provides evidence of a grammatical architecture in which different aspects of structure mutually constrain one another. This result casts doubt on approaches in which syntactic processing is completed prior to the determination of other grammatical properties of a sentence, such as quantifier scope relations.

Appendix: Selected Codes Used to Annotate Syntactic Categories in the Penn Treebank, from Marcus et al. (1993) and Bies et al. (1995)

Part-of-speech tags

Tag    Meaning
CC     Conjunction
CD     Cardinal number
DT     Determiner
IN     Preposition
JJ     Adjective
JJR    Comparative adjective
JJS    Superlative adjective
NN     Singular or mass noun
NNS    Plural noun
NNP    Singular proper noun
NNPS   Plural proper noun
PDT    Predeterminer
PRP    Personal pronoun
PRP$   Possessive pronoun
RB     Adverb
RBR    Comparative adverb
RBS    Superlative adverb
TO     "to"
UH     Interjection
VB     Verb in base form
VBD    Past-tense verb
VBG    Gerundive verb
VBN    Past participial verb
VBP    Non-3sg present-tense verb
VBZ    3sg present-tense verb
WP     WH pronoun
WP$    Possessive WH pronoun


Phrasal categories

Code    Meaning
ADJP    Adjective phrase
ADVP    Adverb phrase
INTJ    Interjection
NP      Noun phrase
PP      Prepositional phrase
QP      Quantifier phrase (i.e., measure/amount phrase)
S       Declarative clause
SBAR    Clause introduced by a subordinating conjunction
SBARQ   Clause introduced by a WH phrase
SINV    Inverted declarative sentence
SQ      Inverted yes/no question following the WH phrase in SBARQ
VP      Verb phrase

Acknowledgments
The authors are grateful for an Academic Technology Innovation Grant from the University of Chicago, which helped to make this work possible, and to John Goldsmith, Terry Regier, Anne Pycha, and Bob Moore, whose advice and collaboration have considerably aided the research reported in this article. Any remaining errors are, of course, our own.

References
Aoun, Joseph and Yen-hui Audrey Li. 1993. The Syntax of Scope. MIT Press, Cambridge.
Aoun, Joseph and Yen-hui Audrey Li. 2000. Scope, structure, and expert systems: A reply to Kuno et al. Language, 76(1):133–155.
Belz, Anja. 2001. Optimisation of corpus-derived probabilistic grammars. In Proceedings of Corpus Linguistics 2001, pages 46–57.
Berger, Adam L., Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.
Bies, Ann, Mark Ferguson, Karen Katz, and Robert MacIntyre. 1995. Bracketing guidelines for Treebank II style. Technical report, Penn Treebank Project, University of Pennsylvania.
Bishop, Christopher M. 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bridle, John S. 1990. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fougelman-Soulie and J. Herault, editors, Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag, Berlin, pages 227–236.
Brill, Eric. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565.
Carden, Guy. 1976. English Quantifiers: Logical Structure and Linguistic Variation. Academic Press, New York.
Carletta, Jean. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249–254.
Charniak, Eugene. 1993. Statistical Language Learning. MIT Press, Cambridge.
Charniak, Eugene. 1996. Tree-bank grammars. In AAAI/IAAI, vol. 2, pages 1031–1036.
Cohen, Jacob. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46.
Cooper, Robin. 1983. Quantification and Syntactic Theory. Reidel, Dordrecht.
Fleiss, Joseph L. 1981. Statistical Methods for Rates and Proportions. John Wiley & Sons, New York.
Hobbs, Jerry R. and Stuart M. Shieber. 1987. An algorithm for generating quantifier scopings. Computational Linguistics, 13:47–63.
Hornstein, Norbert. 1995. Logical Form: From GB to Minimalism. Blackwell, Oxford and Cambridge.
Jelinek, Frederick. 1998. Statistical Methods for Speech Recognition. MIT Press, Cambridge.
Jurafsky, Daniel and James H. Martin. 2000. Speech and Language Processing. Prentice Hall, Upper Saddle River, New Jersey.
Krippendorff, Klaus. 1980. Content Analysis: An Introduction to Its Methodology. Sage Publications, Beverly Hills, California.
Krotov, Alexander, Mark Hepple, Robert J. Gaizauskas, and Yorick Wilks. 1998. Compacting the Penn treebank grammar. In COLING-ACL, pages 699–703.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 1999. Quantifier scope in English, Chinese, and Japanese. Language, 75(1):63–111.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 2001. Response to Aoun and Li. Language, 77(1):134–143.
Manning, Christopher D. and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge.
Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
Martin, Paul, Douglas Appelt, and Fernando Pereira. 1986. Transportability and generality in a natural-language interface system. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 585–593.
May, Robert. 1985. Logical Form: Its Structure and Derivation. MIT Press, Cambridge.
McCawley, James D. 1998. The Syntactic Phenomena of English. University of Chicago Press, Chicago, second edition.
Mitchell, Tom M. 1996. Machine Learning. McGraw Hill, New York.
Montague, Richard. 1973. The proper treatment of quantification in ordinary English. In J. Hintikka et al., editors, Approaches to Natural Language. Reidel, Dordrecht, pages 221–242.
Moran, Douglas B. 1988. Quantifier scoping in the SRI core language engine. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics (ACL'88), pages 33–40.
Moran, Douglas B. and Fernando C. N. Pereira. 1992. Quantifier scoping. In Hiyan Alshawi, editor, The Core Language Engine. MIT Press, Cambridge, pages 149–172.
Nerbonne, John. 1993. A feature-based syntax/semantics interface. In A. Manaster-Ramer and W. Zadrozsny, editors, Annals of Mathematics and Artificial Intelligence (Special Issue on Mathematics of Language), 8(1–2):107–132. Also published as DFKI Research Report RR-92-42.
Park, Jong C. 1995. Quantifier scope and constituency. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL'95), pages 205–212.
Pedersen, Ted. 2000. A simple approach to building ensembles of naive Bayesian classifiers for word sense disambiguation. In Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2000), pages 63–69.
Pereira, Fernando. 1990. Categorial semantics and scoping. Computational Linguistics, 16(1):1–10.
Pollard, Carl. 1989. The syntax-semantics interface in a unification-based phrase structure grammar. In S. Busemann, C. Hauenschild, and C. Umbach, editors, Views of the Syntax/Semantics Interface, KIT-FAST Report 74, Technical University of Berlin, pages 167–185.
Pollard, Carl and Eun Jung Yoo. 1998. A unified theory of scope for quantifiers and WH-phrases. Journal of Linguistics, 34(2):415–446.
Ratnaparkhi, Adwait. 1997. A simple introduction to maximum entropy models for natural language processing. Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania.
Santorini, Beatrice. 1990. Part-of-speech tagging guidelines for the Penn Treebank project. Technical Report MS-CIS-90-47, Department of Computer and Information Science, University of Pennsylvania.
Soon, Wee Meng, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544.
van Halteren, Hans, Jakub Zavrel, and Walter Daelemans. 2001. Improving accuracy in word class tagging through the combination of machine learning systems. Computational Linguistics, 27(2):199–229.
VanLehn, Kurt A. 1978. Determining the scope of English quantifiers. Technical Report AITR-483, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Cambridge.
Woods, William A. 1986. Semantics and quantification in natural language question answering. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 205–248.

Page 12: A Machine Learning Approach to Modeling Scope Preferences · 2003-11-13 · ®c2003 Association for Computational Linguistics A Machine Learning Approach to Modeling Scope Preferences

84

Computational Linguistics Volume 29 Number 1

Table 4Most active features from naive Bayes classier

Rank Feature Predicted Ratioclass

1 There is an intervening conjunct node 0 1632 There is an intervening comma 0 1513 There is an intervening S node 1 1334 The rst quantied NP does not c-command the second 0 1255 Second quantier is tagged QP 1 1166 There is an intervening S node 0 112

15 The second quantied NP c-commands the rst 2 100

whereas feature 6 indicates a preference for class 0 under the same conditions Pre-sumably this reects a dispreference for the second quantier to take wide scopewhen there is a clause boundary intervening between it and the rst quantier Thefourth-ranked feature in Table 4 indicates that if the rst quantied NP does notc-command the second it is less likely to take wide scope This is not surprisinggiven the importance that c-command relations have had in theoretical discussionsof quantier scope The fth-ranked feature expresses a preference for quantied ex-pressions of category QP to take narrow scope if they are the second of the twoquantiers under consideration This may simply be reective of the fact that class1 is more common than class 2 and the measure expressions found in QP phrasesin the Penn Treebank (such as more than three or about half ) tend not to be logicallyindependent of other quantiers Finally the feature 15 in Table 4 indicates a highcorrelation between the second quantied expressionrsquos c-commanding the rst andthe second quantierrsquos taking wide scope We can easily see this as a translation intoour feature set of Kuno Takami and Wursquos claim that subjects tend to outscope ob-jects and obliques and topicalized elements tend to take wide scope Some of thesetop-ranked features have to do with information found only in the written mediumbut on the whole the features induced by the naive Bayes classier seem consis-tent with those suggested by Kuno Takami and Wu although they are distinct bynecessity

432 Maximum-Entropy Classier The maximum-entropy classier is a sort of log-linear model dening the joint probability of a class and a data vector (d0 dn) asthe product of the prior probability of the class c with a set of features related to thedata10

P(d0 dn c) =P(c)

Z

nY

k = 0

not k (3)

This classier supercially resembles in form the naive Bayes classier in equation (2)but it differs from that classier in that the way in which values for each not are chosendoes not assume that the features in the data are independent For each of the 10training sets we used the generalized iterative scaling algorithm to train this classieron 654 training examples using 150 examples for validation to choose the best set of

10 Z in Equation 3 is simply a normalizing constant that ensures that we end up with a probabilitydistribution

85

Higgins and Sadock Modeling Scope Preferences

Table 5Performance of the maximum-entropy classier summed over all 10 test runs

Condition Correct Incorrect Percentage correct

First has wide scope 148 133 148281 = 527Second has wide scope 31 33 3164 = 484

No scope interaction 475 70 475545 = 872Total 654 236 654890 = 735

Table 6Most active features from maximum-entropy classier

Rank Feature Predicted c25

class

1 Second quantier is each 2 1132 There is an intervening comma 0 1013 There is an intervening conjunct node 0 1004 First quantied NP does not c-command the second 0 0995 Second quantier is every 2 0986 There is an intervening quotation mark (rdquo) 0 0957 There is an intervening colon 0 095

12 First quantied NP c-commands the second 1 09225 There is no intervening comma 1 090

values for the not s11 Test data could then be classied by choosing the class for the datathat maximizes the joint probability in equation (3)

The results of training with the maximum-entropy classier are shown in Table 5The classier showed slightly higher performance than the naive Bayes classier withthe lowest error rate on the class of sentences having no scope interaction

To determine exactly which features of the data the maximum-entropy classiersees as relevant to the classication problem we can simply look at the not values (fromequation (3)) for each feature Those features with higher values for not are weightedmore heavily in determining the proper scoping Some of the features with the highestvalues for not are listed in Table 6 Because of the way the classier is built predictorfeatures for class 2 need to have higher loadings to overcome the lower prior probabil-ity of the class Therefore we actually rank the features in Table 6 according to not P(c)k

(which we denote as not ck) P(c) represents the empirical prior probability of a class cand k is simply a constant (25 in this case) chosen to try to get a mix of features fordifferent classes at the top of the list

The features ranked rst and fth in Table 6 express lexical preferences for certainquantiers to take wide scope even when they are the second of the two quantiersaccording to linear order in the string of words The tendency for each to take widescope is stronger than for the other quantier which is in line with Kuno Takamiand Wursquos decision to list it as the only quantier with a lexical preference for scopingFeature 2 makes the ldquono scope interactionrdquo class more likely if a comma intervenes and

11 Overtraining is not a problem with the pure version of the generalized iterative scaling algorithm Forefciency reasons however we chose to take the training corpus as representative of the event spacerather than enumerating the space exhaustively (see Jelinek [1998] for details) For this reason it wasnecessary to employ validation in training

86

Computational Linguistics Volume 29 Number 1

Table 7Performance of the single-layer perceptron summed over all 10 test runs

Condition Correct Incorrect Percentage correct

First has wide scope 182 99 182281 = 648Second has wide scope 35 29 3564 = 547

No scope interaction 468 77 468545 = 859Total 685 205 685890 = 770

feature 25 makes a wide-scope reading for the rst quantier more likely if there is nointervening comma The third-ranked feature expresses the tendency mentioned abovefor quantiers in conjoined clauses not to interact Features 4 and 12 indicate that if therst quantied expression c-commands the second it is likely to take wide scope andthat if this is not the case there is likely to be no scope interaction Finally the sixth-and seventh-ranked features in the table show that an intervening quotation mark orcolon will make the classier tend toward class 0 ldquono scope interactionrdquo which is easyto understand Quotations are often opaque to quantier scope interactions The topfeatures found by the maximum-entropy classier largely coincide with those foundby the naive Bayes model which indicates that these generalizations are robust andobjectively present in the data

433 Single-Layer Perceptron For our neural network classier we employed a feed-forward single-layer perceptron with the softmax function used to determine the acti-vation of nodes at the output layer because this is a one-of-n classication task (Bridle1990) The data to be classied are presented as a vector of features at the input layerand the output layer has three nodes representing the three possible classes for thedata ldquorst has wide scoperdquo ldquosecond has wide scoperdquo and ldquono scope interactionrdquoThe output node with the highest activation is interpreted as the class of the datumpresented at the input layer

For each of the 10 test sets of 89 examples we trained the connection weightsof the network using error backpropagation on 654 training sentences reserving 150sentences for validation in order to choose the weights from the training epoch with thehighest classication performance In Table 7 we present the results of the single-layerneural network in classifying our test sentences As the table shows the single-layerperceptron has much better classication performance than the naive Bayes classierand maximum-entropy model possibly because the training of the network aims tominimize error in the activation of the classication output nodes which is directlyrelated to the classication task at hand whereas the other models do not directlymake use of the notion of ldquoclassication errorrdquo The perceptron also uses a sort ofweighted voting and could be interpreted as an implementation of Kuno Takamiand Wursquos proposal for scope determination This clearly illustrates that the tenabilityof their proposal hinges on the exact details of its implementation since all of ourclassier models are reasonable interpretations of their approach but they have verydifferent performance results on our scope determination task

To determine exactly which features of the data the network sees as relevant tothe classication problem we can simply look at the connection weights for eachfeature-class pair Higher connection weights indicate a greater correlation betweeninput features and output classes For one of the 10 networks we trained some ofthe features with the highest connection weights are listed in Table 8 Since class 0 is

87

Higgins and Sadock Modeling Scope Preferences

Table 8Most active features from single-layer perceptron

Rank Feature Predicted Weightclass

1 There is an intervening comma 0 4312 Second quantier is all 0 3773 There is an intervening colon 0 2984 There is an intervening conjunct node 0 272

17 The rst quantied NP c-commands the second 1 16918 Second quantier is tagged RBS 2 16919 There is an intervening S node 1 16120 Second quantier is each 2 150

simply more frequent in the training data than the other two classes the weights forthis class tend to be higher Therefore we also list some of the best predictor featuresfor classes 1 and 2 in the table

The rst- and third-ranked features in Table 8 show that an intervening comma orcolon will make the classier tend toward class 0 ldquono scope interactionrdquo This ndingby the classier is similar to the maximum-entropy classierrsquos nding an interveningquotation mark relevant and can be taken as an indication that quantiers in distantsyntactic subdomains are unlikely to interact Similarly the fourth-ranked feature indi-cates that quantiers in separate conjuncts are unlikely to interact The second-rankedfeature in the table expresses a tendency for there to be no scope interaction betweentwo quantiers if the second of them is headed by all This may be related to theindependence of universal quantiers (8x8y[iquest ] () 8y8x[iquest ]) Feature 17 in Table 8indicates a high correlation between the rst quantied expressionrsquos c-commandingthe second and the rst quantierrsquos taking wide scope which again supports KunoTakami and Wursquos claim that scope preferences are related to syntactic superiority re-lations Feature 18 expresses a preference for a quantied expression headed by mostto take wide scope even if it is the second of the two quantiers (since most is theonly quantier in the corpus that bears the tag RBS) Feature 19 indicates that therst quantier is more likely to take wide scope if there is a clause boundary in-tervening between the two quantiers which supports the notion that the syntacticdistance between the quantiers is relevant to scope preferences Finally feature 20expresses the well-known tendency for quantied expressions headed by each to takewide scope

44 Summary of ResultsTable 9 summarizes the performance of the quantier scope models we have presentedhere All of the classiers have test set accuracy above the baseline which a pairedt-test reveals to be signicant at the 001 level The differences between the naiveBayes maximum-entropy and single-layer perceptron classiers are not statisticallysignicant

The classiers performed signicantly better on those sentences annotated consis-tently by both human coders at the beginning of the study reinforcing the view thatthis subset of the data is somehow simpler and more representative of the basic regu-larities in scope preferences For example the single-layer perceptron classied 829of these sentences correctly To further investigate the nature of the variation betweenthe two coders we constructed a version of our single-layer network that was trained

88

Computational Linguistics Volume 29 Number 1

Table 9Summary of classier results

Training data Validation data Test data

Baseline mdash mdash 612Na otilde ve Bayes 767 mdash 726

Maximum entropy 783 755 735Single-layerperceptron 847 768 770

on the data on which both coders agreed and tested on the remaining sentences Thisclassier agreed with the reference coding (the coding of the rst coder) 514 of thetime and with the additional independent coder 358 of the time The rst coder con-structed the annotation guidelines for this project and may have been more successfulin applying them consistently Alternatively it is possible that different individuals usedifferent strategies in determining scope preferences and the strategy of the secondcoder may simply have been less similar than the strategy of the rst coder to that ofthe single-layer network

These three classiers directly implement a sort of weighted voting the methodof aggregating evidence proposed by Kuno Takami and Wu (although the classiersrsquoimplementation is slightly more sophisticated than the unweighted voting that is ac-tually used in Kuno Takami and Wursquos paper) Of course since we do not use exactlythe set of features suggested by Kuno Takami and Wu our model should not beseen as a straightforward implementation of the theory outlined in their 1999 paperNevertheless the results in Table 9 suggest that Kuno Takami and Wursquos suggesteddesign can be used with some success in modeling scope preferences Moreover theproject undertaken here provides an answer to some of the objections that Aoun andLi (2000) raise to Kuno Takami and Wu Aoun and Li claim that Kuno Takami andWursquos choice of experts is seemingly arbitrary and that it is unclear how the votingweights of each expert are to be set but the machine learning approach we employin this article is capable of addressing both of these potential problems Supervisedtraining of our classiers is a straightforward approach to setting the weights andalso constitutes our approach to selecting features (or ldquoexpertsrdquo in Kuno Takami andWursquos terminology) In the training process any feature that is irrelevant to scopingpreferences should receive weights that make its effect negligible

5 Syntax and Scope

In this section we show how the classier models of quantier scope determinationintroduced in Section 4 may be integrated with a PCFG model of syntax We com-pare two different ways in which the two components may be combined which mayloosely be termed serial and parallel and argue for the latter on the basis of empiricalresults

51 Modular DesignOur use of a phrase structure syntactic component and a quantier scope componentto dene a combined language model is simplied by the fact that our classiers areprobabilistic and dene a conditional probability distribution over quantier scopingsThe probability distributions that our classiers dene for quantier scope structuresare conditional on syntactic phrase structure because they are computed on the basis

89

Higgins and Sadock Modeling Scope Preferences

of syntactically provided features such as the number of nodes of a certain type thatintervene between two quantiers in a phrase structure tree

Thus the combined language model that we dene in this article assigns probabil-ities according to the pairs of structures that may be assigned to a sentence by the Q-structure and phrase structure syntax modules The probability of a word string w1iexclnis therefore dened as in equation (4) where Q ranges over all possible Q-structuresin the set Q and S ranges over all possible syntactic structures in the set S

P(w1iexcln) =X

S2 SQ 2 QP(S Q j w1iexcln) (4)

=X

S2 SQ 2 QP(S j w1iexcln)P(Q j S w1iexcln) (5)

Equation (5) shows how we can use the denition of conditional probability tobreak our calculation of the language model probability into two parts The rst ofthese parts P(S j w1iexcln) which we may abbreviate as simply P(S) is the probabilityof a particular syntactic tree structurersquos being assigned to a particular word string Wemodel this probability using a probabilistic phrase structure grammar (cf Charniak[1993 1996]) The second distribution on the right side of equation (5) is the conditionalprobability of a particular quantier scope structurersquos being assigned to a particularword string given the syntactic structure of that string This probability is written asP(Q j S w1iexcln) or simply P(Q j S) and represents the quantity we estimated abovein constructing classiers to predict the scopal representation of a sentence based onaspects of its syntactic structure

Thus given a PCFG model of syntactic structure and a probabilistically denedclassier of the sort introduced in Section 4 it is simple to determine the probabilityof any pairing of two particular structures from each domain for a given sentenceWe simply multiply the values of P(S) and P(Q j S) to obtain the joint probabilityP(Q S) In the current section we examine two different models of combination forthese components one in which scope determination is applied to the optimal syn-tactic structure (the Viterbi parse) and one in which optimization is performed in thespace of both modules to nd the optimal pairing of syntactic and quantier scopestructures

52 The Syntactic ModuleBefore turning to the application of our multimodular approach to the problem ofscope determination in Section 53 we present here a short overview of the phrasestructure syntactic component used in these projects As noted above we model syn-tax as a probabilistic phrase structure grammar (PCFG) and in particular we use atreebank grammar (Charniak 1996) trained on the Penn Treebank

A PCFG denes the probability of a string of words as the sum of the probabilitiesof all admissible phrase structure parses (trees) for that string The probability of agiven tree is the product of the probability of all of the rule instances used in theconstruction of that tree where rules take the form N iquest with N a nonterminalsymbol and iquest a nite sequence of one or more terminals or nonterminals

To take an example Figure 5 illustrates a phrase structure tree for the sentence Su-san might not believe you which is admissible according to the grammar in Table 10 (Allof the minimal subtrees in Figure 5 are instances of one of our rules) The probability

90

Computational Linguistics Volume 29 Number 1

Figure 5A simple phrase structure tree

Table 10A simple probabilistic phrase structure grammar

Rule Probability

S NP VP 7S VP 2S V NP VP 1

VP V VP 3VP ADV VP 1VP V 1VP V NP 3VP V NP NP 2

NP Susan 3NP you 4NP Yves 3

V might 2V believe 3V show 3V stay 2

ADV not 5ADV always 5

91

Higgins and Sadock Modeling Scope Preferences

of this tree which we can indicate as frac12 can be calculated as in equation (6)

P( frac12 ) =Y

raquo 2 ( frac12 )

P( raquo ) (6)

= P(S NP VP) pound P(VP V VP) pound P(VP ADV VP)

pound P(VP V NP) pound P(NP Susan) pound P(V might)

pound P(ADV not) pound P(V believe) pound P(NP you) (7)

= 7 pound 3 pound 1 pound 3 pound 3 pound 2 pound 5 pound 3 pound 4 = 2268 pound 10iexcl5 (8)

The actual grammar rules and associated probabilities that we use in dening oursyntactic module are derived from the WSJ corpus of the Penn Treebank by maximum-likelihood estimation That is for each rule N iquest used in the treebank we add therule to the grammar and set its probability to C(N iquest )P

AacuteC(N Aacute)

where C( ) denotes the

ldquocountrdquo or a rule (ie the number of times it is used in the corpus) A grammarcomposed in this manner is referred to as a treebank grammar because its rules aredirectly derived from those in a treebank corpus

We used sections 00ndash20 of the WSJ corpus of the Penn Treebank for collecting therules and associated probabilities of our PCFG which is implemented as a bottom-upchart parser Before constructing the grammar the treebank was preprocessed usingknown procedures (cf Krotov et al [1998] Belz [2001]) to facilitate the construction ofa rule list Functional and anaphoric annotations (basically anything following a ldquo-rdquoin a node label cf Santorini [1990] Bies et al [1995]) were removed from nonterminallabels Nodes that dominate only ldquoempty categoriesrdquo such as traces were removedIn addition unary-branching constructions were removed by replacing the mothercategory in such a structure with the daughter node (For example given an instanceof the rule X YZ if the daughter category Y were expanded by the unary ruleY W our algorithm would induce the single rule X WZ) Finally we discardedall rules that had more than 10 symbols on the right-hand side (an arbitrary limit ofour parser implementation) This resulted in a total of 18820 rules of which 11156were discarded as hapax legomena leaving 7664 rules in our treebank grammarTable 11 shows some of the rules in our grammar with the highest and lowest corpuscounts

53 Unlabeled Scope DeterminationIn this section we describe an experiment designed to assess the performance ofparallel and serial approaches to combining grammatical modules focusing on thetask of unlabeled scope determination This task involves predicting the most likelyQ-structure representation for a sentence basically the same task we addressed in Sec-tion 4 in comparing the performance levels of each type of classier The experimentof this section differs however from the task presented in Section 4 in that instead ofproviding a syntactic tree from the Penn Treebank as input to the classier we providethe model only with a string of words (a sentence) Our dual-component model willsearch for the optimal syntactic and scopal structures for the sentence (the pairing( frac12 curren Agrave curren )) and will be evaluated based on its success in identifying the correct scopereading Agrave curren

Our concern in this section will be to determine whether it is necessary to search the space of possible pairings $(\tau, \sigma)$ of syntactic and scopal structures, or whether it is sufficient to use our PCFG first to fix the syntactic tree $\tau$ and then to choose


Table 11: Rules derived from sections 00–20 of the Penn Treebank WSJ corpus. "TOP" is a special "start" symbol that may expand to any of the symbols found at the root of a tree in the corpus.

Rule                          Corpus count
PP → IN NP                    59,053
TOP → S                       34,614
NP → DT NN                    28,074
NP → NP PP                    25,192
S → NP VP                     14,032
S → NP VP                     12,901
VP → TO VP                    11,598

S → CC PP NNP NNP VP          2
NP → DT `` NN NN NN 0 0       2
NP → NP PP PP PP PP PP        2
INTJ → UH UH                  2
NP → DT `` NN NNS             2
SBARQ → `` WP VP 0 0          2
S → PP NP VP 0 0              2

a scope reading to maximize the probability of the pairing. That is, are syntax and quantifier scope mutually dependent components of grammar, or can scope relations be "read off of" syntax? The serial model suggests that the optimal syntactic structure $\tau^*$ should be chosen on the basis of the syntactic module only, as in equation (9), and the optimal quantifier scope structure $\sigma^*$ then chosen on the basis of $\tau^*$, as in equation (10). The parallel model, on the other hand, suggests that the most likely pairing of structures must be chosen in the joint probability space of both components, as in equation (11).

    \tau^*   = \arg\max_{\tau \in S} P_S(\tau \mid w_{1-n})                                        (9)

    \sigma^* = \arg\max_{\sigma \in Q} P_Q(\sigma \mid \tau^*, w_{1-n})                            (10)

    \tau^*   = \{ \tau \mid (\tau, \sigma) = \arg\max_{\tau \in S,\, \sigma \in Q} P_S(\tau \mid w_{1-n}) \, P_Q(\sigma \mid \tau, w_{1-n}) \}   (11)

5.3.1 Experimental Design. For this experiment we implement the scoping component as a single-layer feed-forward network, because the single-layer perceptron classifier had the best prediction rate among the three classifiers tested in Section 4. The softmax activation function we use for the output nodes of the classifier guarantees that the activations of all of the output nodes sum to one and can be interpreted as class probabilities. The syntactic component, of course, is determined by the treebank PCFG grammar described above.
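As a rough sketch of how the scoping component's output layer behaves, the softmax step can be written as follows; the feature extraction and trained weights are assumed to be supplied elsewhere, and the names below are ours.

import numpy as np

def scope_class_probabilities(features, W, b):
    """Single-layer network with softmax outputs.

    `features` is the feature vector extracted from a (tree, sentence) pair,
    and `W`, `b` are trained weights and biases for the three scope classes
    (0 = no interaction, 1 = first wide, 2 = second wide).  The softmax
    guarantees the three activations are nonnegative and sum to one, so they
    can be read as P_Q(sigma | tau, w_1-n).
    """
    z = W @ features + b
    z = z - z.max()               # subtract the max for numerical stability
    expz = np.exp(z)
    return expz / expz.sum()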

Given these two models, which respectively define $P_Q(\sigma \mid \tau, w_{1-n})$ and $P_S(\tau \mid w_{1-n})$ from equation (11), it remains only to specify how to search the space of pairings $(\tau, \sigma)$ in performing this optimization to find $\sigma^*$. Unfortunately, it is not feasible to examine all values $\tau \in S$, since our PCFG will generally admit a huge number of


Table 12: Performance of models on the unlabeled scope prediction task, summed over all 10 test runs.

Condition                 Correct   Incorrect   Percentage correct
Parallel model
  First has wide scope      168        113       167/281 = 59.4
  Second has wide scope      26         38        26/64  = 40.6
  No scope interaction      467         78       467/545 = 85.7
  Total                     661        229       661/890 = 74.3
Serial model
  First has wide scope      163        118       163/281 = 58.0
  Second has wide scope      27         37        27/64  = 42.2
  No scope interaction      461         84       461/545 = 84.6
  Total                     651        239       651/890 = 73.1

trees for a sentence (especially given a mean sentence length of over 20 words in the WSJ corpus; see footnote 12). Our solution to this search problem is to make the simplifying assumption that the syntactic tree that is used in the optimal set of structures $(\tau^*, \sigma^*)$ will always be among the top few trees $\tau$ for which $P_S(\tau \mid w_{1-n})$ is the greatest. That is, although we suppose that quantifier scope information is relevant to parsing, we do not suppose that it is so strong a determinant as to completely override syntactic factors. In practice, this means that our parser will return the top 10 parses for each sentence, along with the probabilities assigned to them, and these are the only parses that are considered in looking for the optimal set of linguistic structures.

We again used 10-fold cross-validation in evaluating the competing models, dividing the scope-tagged corpus into 10 test sections of 89 sentences each, and we used the same version of the treebank grammar for our PCFG. The first model retrieved the top 10 syntactic parses $(\tau_0, \ldots, \tau_9)$ for each sentence and computed the probability $P(\tau, \sigma)$ for each $\tau \in \{\tau_0, \ldots, \tau_9\}$, $\sigma \in \{0, 1, 2\}$, choosing that scopal representation $\sigma$ that was found in the maximum-probability pairing. We call this the parallel model, because the properties of each probabilistic model may influence the optimal structure chosen by the other. The second model retrieved only the Viterbi parse $\tau_0$ from the PCFG and chose the scopal representation $\sigma$ for which the pairing $(\tau_0, \sigma)$ took on the highest probability. We call this the serial model, because it represents syntactic phrase structure as independent of other components of grammar (in this case quantifier scope), though other components are dependent upon it.
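The two decision procedures in equations (9)-(11), restricted to the parser's 10-best list as just described, might be sketched as follows; parser.top_parses and scope_probabilities are hypothetical interfaces standing in for the PCFG parser and the trained scoping network described above, not the original code.

def serial_scope(sentence, parser, scope_probabilities):
    """Equations (9)-(10): fix the Viterbi parse, then choose the scope reading."""
    parses = parser.top_parses(sentence, n=1)      # assumed to return [(tree, P_S(tree | words))]
    tree, _ = parses[0]
    probs = scope_probabilities(tree, sentence)    # P_Q(sigma | tree, words) for sigma in {0, 1, 2}
    return max(range(3), key=lambda sigma: probs[sigma])

def parallel_scope(sentence, parser, scope_probabilities):
    """Equation (11): maximize the joint probability over (tree, sigma) pairs,
    with the tree restricted to the parser's 10 best parses."""
    best, best_sigma = float("-inf"), None
    for tree, p_syn in parser.top_parses(sentence, n=10):
        probs = scope_probabilities(tree, sentence)
        for sigma in range(3):
            if p_syn * probs[sigma] > best:
                best, best_sigma = p_syn * probs[sigma], sigma
    return best_sigma

Because the scope reading ranges over only three values, restricting the search to the 10-best parses keeps the parallel optimization inexpensive relative to parsing itself (cf. footnote 12).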

5.3.2 Results. There was an appreciable difference in performance between these two models on the quantifier scope test sets. As shown in Table 12, the parallel model narrowly outperformed the serial model, by 1.2%. A 10-fold paired t-test on the test sections of the scope-tagged corpus shows that the parallel model is significantly better ($p < .05$).

12 Since we are allowing $\sigma$ to range only over the three scope readings (0, 1, 2), however, it is possible to enumerate all values of $\sigma$ to be paired with a given syntactic tree $\tau$.


This result suggests that in determining the syntactic structure of a sentence, we must take aspects of structure into account that are not purely syntactic (such as quantifier scope). Searching both dimensions of the hypothesis space for our dual-component model allowed the composite model to handle the interdependencies between different aspects of grammatical structure, whereas fixing a phrase structure tree purely on the basis of syntactic considerations led to suboptimal performance in using that structure as a basis for determining quantifier scope.

6. Conclusion

In this article we have taken a statistical, corpus-based approach to the modeling of quantifier scope preferences, a subject that has previously been addressed only with systems of ad hoc rules derived from linguists' intuitive judgments. Our model takes its theoretical inspiration from Kuno, Takami, and Wu (1999), who suggest an "expert system" approach to scope preferences, and follows many other projects in the machine learning of natural language that combine information from multiple sources in solving linguistic problems.

Our results are generally supportive of the design that Kuno, Takami, and Wu propose for the quantifier scope component of grammar, and some of the features induced by our models find clear parallels in the factors that Kuno, Takami, and Wu claim to be relevant to scoping. In addition, our final experiment, in which we combine our quantifier scope module with a PCFG model of syntactic phrase structure, provides evidence of a grammatical architecture in which different aspects of structure mutually constrain one another. This result casts doubt on approaches in which syntactic processing is completed prior to the determination of other grammatical properties of a sentence, such as quantifier scope relations.

Appendix. Selected Codes Used to Annotate Syntactic Categories in the Penn Treebank, from Marcus et al. (1993) and Bies et al. (1995)

Part-of-speech tags

Tag     Meaning
CC      Conjunction
CD      Cardinal number
DT      Determiner
IN      Preposition
JJ      Adjective
JJR     Comparative adjective
JJS     Superlative adjective
NN      Singular or mass noun
NNS     Plural noun
NNP     Singular proper noun
NNPS    Plural proper noun
PDT     Predeterminer
PRP     Personal pronoun
PRP$    Possessive pronoun
RB      Adverb
RBR     Comparative adverb
RBS     Superlative adverb
TO      "to"
UH      Interjection
VB      Verb in base form
VBD     Past-tense verb
VBG     Gerundive verb
VBN     Past participial verb
VBP     Non-3sg present-tense verb
VBZ     3sg present-tense verb
WP      WH pronoun
WP$     Possessive WH pronoun


Phrasal categories

Code    Meaning
ADJP    Adjective phrase
ADVP    Adverb phrase
INTJ    Interjection
NP      Noun phrase
PP      Prepositional phrase
QP      Quantifier phrase (i.e., measure/amount phrase)
S       Declarative clause
VP      Verb phrase
SBAR    Clause introduced by a subordinating conjunction
SBARQ   Clause introduced by a WH phrase
SINV    Inverted declarative sentence
SQ      Inverted yes/no question following the WH phrase in SBARQ

Acknowledgments

The authors are grateful for an Academic Technology Innovation Grant from the University of Chicago, which helped to make this work possible, and to John Goldsmith, Terry Regier, Anne Pycha, and Bob Moore, whose advice and collaboration have considerably aided the research reported in this article. Any remaining errors are of course our own.

References

Aoun, Joseph and Yen-hui Audrey Li. 1993. The Syntax of Scope. MIT Press, Cambridge.
Aoun, Joseph and Yen-hui Audrey Li. 2000. Scope, structure, and expert systems: A reply to Kuno et al. Language, 76(1):133-155.
Belz, Anja. 2001. Optimisation of corpus-derived probabilistic grammars. In Proceedings of Corpus Linguistics 2001, pages 46-57.
Berger, Adam L., Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71.
Bies, Ann, Mark Ferguson, Karen Katz, and Robert MacIntyre. 1995. Bracketing guidelines for Treebank II style. Technical report, Penn Treebank Project, University of Pennsylvania.
Bishop, Christopher M. 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bridle, John S. 1990. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fougelman-Soulie and J. Herault, editors, Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag, Berlin, pages 227-236.
Brill, Eric. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543-565.
Carden, Guy. 1976. English Quantifiers: Logical Structure and Linguistic Variation. Academic Press, New York.
Carletta, Jean. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249-254.
Charniak, Eugene. 1993. Statistical Language Learning. MIT Press, Cambridge.
Charniak, Eugene. 1996. Tree-bank grammars. In AAAI/IAAI, vol. 2, pages 1031-1036.
Cohen, Jacob. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37-46.
Cooper, Robin. 1983. Quantification and Syntactic Theory. Reidel, Dordrecht.
Fleiss, Joseph L. 1981. Statistical Methods for Rates and Proportions. John Wiley & Sons, New York.
Hobbs, Jerry R. and Stuart M. Shieber. 1987. An algorithm for generating quantifier scopings. Computational Linguistics, 13:47-63.
Hornstein, Norbert. 1995. Logical Form: From GB to Minimalism. Blackwell, Oxford and Cambridge.
Jelinek, Frederick. 1998. Statistical Methods for Speech Recognition. MIT Press, Cambridge.
Jurafsky, Daniel and James H. Martin. 2000. Speech and Language Processing. Prentice Hall, Upper Saddle River, New Jersey.
Krippendorff, Klaus. 1980. Content Analysis: An Introduction to Its Methodology. Sage Publications, Beverly Hills, California.
Krotov, Alexander, Mark Hepple, Robert J. Gaizauskas, and Yorick Wilks. 1998. Compacting the Penn treebank grammar. In COLING-ACL, pages 699-703.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 1999. Quantifier scope in English, Chinese, and Japanese. Language, 75(1):63-111.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 2001. Response to Aoun and Li. Language, 77(1):134-143.
Manning, Christopher D. and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge.
Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330.
Martin, Paul, Douglas Appelt, and Fernando Pereira. 1986. Transportability and generality in a natural-language interface system. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 585-593.
May, Robert. 1985. Logical Form: Its Structure and Derivation. MIT Press, Cambridge.
McCawley, James D. 1998. The Syntactic Phenomena of English. University of Chicago Press, Chicago, second edition.
Mitchell, Tom M. 1996. Machine Learning. McGraw Hill, New York.
Montague, Richard. 1973. The proper treatment of quantification in ordinary English. In J. Hintikka et al., editors, Approaches to Natural Language. Reidel, Dordrecht, pages 221-242.
Moran, Douglas B. 1988. Quantifier scoping in the SRI core language engine. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics (ACL'88), pages 33-40.
Moran, Douglas B. and Fernando C. N. Pereira. 1992. Quantifier scoping. In Hiyan Alshawi, editor, The Core Language Engine. MIT Press, Cambridge, pages 149-172.
Nerbonne, John. 1993. A feature-based syntax/semantics interface. In A. Manaster-Ramer and W. Zadrozsny, editors, Annals of Mathematics and Artificial Intelligence (Special Issue on Mathematics of Language), 8(1-2):107-132. Also published as DFKI Research Report RR-92-42.
Park, Jong C. 1995. Quantifier scope and constituency. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL'95), pages 205-212.
Pedersen, Ted. 2000. A simple approach to building ensembles of naïve Bayesian classifiers for word sense disambiguation. In Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2000), pages 63-69.
Pereira, Fernando. 1990. Categorial semantics and scoping. Computational Linguistics, 16(1):1-10.
Pollard, Carl. 1989. The syntax-semantics interface in a unification-based phrase structure grammar. In S. Busemann, C. Hauenschild, and C. Umbach, editors, Views of the Syntax/Semantics Interface, KIT-FAST Report 74, Technical University of Berlin, pages 167-185.
Pollard, Carl and Eun Jung Yoo. 1998. A unified theory of scope for quantifiers and WH-phrases. Journal of Linguistics, 34(2):415-446.
Ratnaparkhi, Adwait. 1997. A simple introduction to maximum entropy models for natural language processing. Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania.
Santorini, Beatrice. 1990. Part-of-speech tagging guidelines for the Penn Treebank project. Technical Report MS-CIS-90-47, Department of Computer and Information Science, University of Pennsylvania.
Soon, Wee Meng, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521-544.
van Halteren, Hans, Jakub Zavrel, and Walter Daelemans. 2001. Improving accuracy in word class tagging through the combination of machine learning systems. Computational Linguistics, 27(2):199-229.
VanLehn, Kurt A. 1978. Determining the scope of English quantifiers. Technical Report AITR-483, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Cambridge.
Woods, William A. 1986. Semantics and quantification in natural language question answering. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 205-248.


To take an example Figure 5 illustrates a phrase structure tree for the sentence Su-san might not believe you which is admissible according to the grammar in Table 10 (Allof the minimal subtrees in Figure 5 are instances of one of our rules) The probability

90

Computational Linguistics Volume 29 Number 1

Figure 5A simple phrase structure tree

Table 10A simple probabilistic phrase structure grammar

Rule Probability

S NP VP 7S VP 2S V NP VP 1

VP V VP 3VP ADV VP 1VP V 1VP V NP 3VP V NP NP 2

NP Susan 3NP you 4NP Yves 3

V might 2V believe 3V show 3V stay 2

ADV not 5ADV always 5

91

Higgins and Sadock Modeling Scope Preferences

of this tree which we can indicate as frac12 can be calculated as in equation (6)

P( frac12 ) =Y

raquo 2 ( frac12 )

P( raquo ) (6)

= P(S NP VP) pound P(VP V VP) pound P(VP ADV VP)

pound P(VP V NP) pound P(NP Susan) pound P(V might)

pound P(ADV not) pound P(V believe) pound P(NP you) (7)

= 7 pound 3 pound 1 pound 3 pound 3 pound 2 pound 5 pound 3 pound 4 = 2268 pound 10iexcl5 (8)

The actual grammar rules and associated probabilities that we use in dening oursyntactic module are derived from the WSJ corpus of the Penn Treebank by maximum-likelihood estimation That is for each rule N iquest used in the treebank we add therule to the grammar and set its probability to C(N iquest )P

AacuteC(N Aacute)

where C( ) denotes the

ldquocountrdquo or a rule (ie the number of times it is used in the corpus) A grammarcomposed in this manner is referred to as a treebank grammar because its rules aredirectly derived from those in a treebank corpus

We used sections 00ndash20 of the WSJ corpus of the Penn Treebank for collecting therules and associated probabilities of our PCFG which is implemented as a bottom-upchart parser Before constructing the grammar the treebank was preprocessed usingknown procedures (cf Krotov et al [1998] Belz [2001]) to facilitate the construction ofa rule list Functional and anaphoric annotations (basically anything following a ldquo-rdquoin a node label cf Santorini [1990] Bies et al [1995]) were removed from nonterminallabels Nodes that dominate only ldquoempty categoriesrdquo such as traces were removedIn addition unary-branching constructions were removed by replacing the mothercategory in such a structure with the daughter node (For example given an instanceof the rule X YZ if the daughter category Y were expanded by the unary ruleY W our algorithm would induce the single rule X WZ) Finally we discardedall rules that had more than 10 symbols on the right-hand side (an arbitrary limit ofour parser implementation) This resulted in a total of 18820 rules of which 11156were discarded as hapax legomena leaving 7664 rules in our treebank grammarTable 11 shows some of the rules in our grammar with the highest and lowest corpuscounts

53 Unlabeled Scope DeterminationIn this section we describe an experiment designed to assess the performance ofparallel and serial approaches to combining grammatical modules focusing on thetask of unlabeled scope determination This task involves predicting the most likelyQ-structure representation for a sentence basically the same task we addressed in Sec-tion 4 in comparing the performance levels of each type of classier The experimentof this section differs however from the task presented in Section 4 in that instead ofproviding a syntactic tree from the Penn Treebank as input to the classier we providethe model only with a string of words (a sentence) Our dual-component model willsearch for the optimal syntactic and scopal structures for the sentence (the pairing( frac12 curren Agrave curren )) and will be evaluated based on its success in identifying the correct scopereading Agrave curren

Our concern in this section will be to determine whether it is necessary to searchthe space of possible pairings ( frac12 Agrave ) of syntactic and scopal structures or whetherit is sufcient to use our PCFG rst to x the syntactic tree frac12 and then to choose

92

Computational Linguistics Volume 29 Number 1

Table 11Rules derived from sections 00ndash20 of the Penn Treebank WSJ corpus ldquoTOPrdquo is a special ldquostartrdquosymbol that may expand to any of the symbols found at the root of a tree in the corpus

Rule Corpus count

PP IN NP 59053TOP S 34614NP DT NN 28074NP NP PP 25192S NP VP 14032S NP VP 12901VP TO VP 11598

S CC PP NNP NNP VP 2NP DT ldquo NN NN NN 0 0 2NP NP PP PP PP PP PP 2INTJ UH UH 2NP DT ldquo NN NNS 2SBARQ ldquo WP VP 0 0 2S PP NP VP 0 0 2

a scope reading to maximize the probability of the pairing That is are syntax andquantier scope mutually dependent components of grammar or can scope relationsbe ldquoread off ofrdquo syntax The serial model suggests that the optimal syntactic structurefrac12 curren should be chosen on the basis of the syntactic module only as in equation (9)and the optimal quantier scope structure Agrave curren then chosen on the basis of frac12 curren as inequation (10) The parallel model on the other hand suggests that the most likelypairing of structures must be chosen in the joint probability space of both componentsas in equation (11)

frac12 curren = arg maxfrac12 2 S

PS( frac12 j w1iexcln) (9)

Agrave curren = arg maxAgrave 2 Q

PQ( Agrave j frac12 curren w1iexcln) (10)

frac12 curren = f frac12 j ( frac12 Agrave ) = arg maxfrac12 2 S Agrave 2 Q

PS( frac12 j w1iexcln)PQ( Agrave j frac12 w1iexcln) g (11)

531 Experimental Design For this experiment we implement the scoping compo-nent as a single-layer feed-forward network because the single-layer perceptron clas-sier had the best prediction rate among the three classiers tested in Section 4 Thesoftmax activation function we use for the output nodes of the classier guaranteesthat the activations of all of the output nodes sum to one and can be interpreted asclass probabilities The syntactic component of course is determined by the treebankPCFG grammar described above

Given these two models which respectively dene PQ( Agrave j frac12 w1iexcln) and PS( frac12 j w1iexcln)from equation (11) it remains only to specify how to search the space of pairings( frac12 Agrave ) in performing this optimization to nd Agrave curren Unfortunately it is not feasible toexamine all values frac12 2 S since our PCFG will generally admit a huge number of

93

Higgins and Sadock Modeling Scope Preferences

Table 12Performance of models on the unlabeled scope prediction task summed over all 10 test runs

Condition Correct Incorrect Percentage correct

Parallel modelFirst has wide scope 168 113 167281 = 594Second has wide scope 26 38 2664 = 406No scope interaction 467 78 467545 = 857Total 661 229 661890 = 743

Serial modelFirst has wide scope 163 118 163281 = 580Second has wide scope 27 37 2764 = 422No scope interaction 461 84 461545 = 846Total 651 239 651890 = 731

trees for a sentence (especially given a mean sentence length of over 20 words inthe WSJ corpus)12 Our solution to this search problem is to make the simplifying as-sumption that the syntactic tree that is used in the optimal set of structures ( frac12 curren Agrave curren )will always be among the top few trees frac12 for which PS( frac12 j w1iexcln) is the greatestThat is although we suppose that quantier scope information is relevant to pars-ing we do not suppose that it is so strong a determinant as to completely over-ride syntactic factors In practice this means that our parser will return the top 10parses for each sentence along with the probabilities assigned to them and theseare the only parses that are considered in looking for the optimal set of linguisticstructures

We again used 10-fold cross-validation in evaluating the competing models di-viding the scope-tagged corpus into 10 test sections of 89 sentences each and weused the same version of the treebank grammar for our PCFG The rst model re-trieved the top 10 syntactic parses ( frac12 0 frac12 9) for each sentence and computed theprobability P( frac12 Agrave ) for each frac12 2 frac12 0 frac12 9 Agrave 2 0 1 2 choosing that scopal represen-tation Agrave that was found in the maximum-probability pairing We call this the par-allel model because the properties of each probabilistic model may inuence theoptimal structure chosen by the other The second model retrieved only the Viterbiparse frac12 0 from the PCFG and chose the scopal representation Agrave for which the pair-ing ( frac12 0 Agrave ) took on the highest probability We call this the serial model because itrepresents syntactic phrase structure as independent of other components of gram-mar (in this case quantier scope) though other components are dependentupon it

532 Results There was an appreciable difference in performance between these twomodels on the quantier scope test sets As shown in Table 12 the parallel modelnarrowly outperformed the serial model by 12 A 10-fold paired t-test on the testsections of the scope-tagged corpus shows that the parallel model is signicantly better(p lt 05)

12 Since we are allowing Acirc to range only over the three scope readings (0 1 2) however it is possible toenumerate all values of Acirc to be paired with a given syntactic tree iquest

94

Computational Linguistics Volume 29 Number 1

This result suggests that in determining the syntactic structure of a sentence wemust take aspects of structure into account that are not purely syntactic (such as quanti-er scope) Searching both dimensions of the hypothesis space for our dual-componentmodel allowed the composite model to handle the interdependencies between differ-ent aspects of grammatical structure whereas xing a phrase structure tree purelyon the basis of syntactic considerations led to suboptimal performance in using thatstructure as a basis for determining quantier scope

6 Conclusion

In this article we have taken a statistical corpus-based approach to the modelingof quantier scope preferences a subject that has previously been addressed onlywith systems of ad hoc rules derived from linguistsrsquo intuitive judgments Our modeltakes its theoretical inspiration from Kuno Takami and Wu (1999) who suggest anldquoexpert systemrdquo approach to scope preferences and follows many other projects in themachine learning of natural language that combine information from multiple sourcesin solving linguistic problems

0ur results are generally supportive of the design that Kuno Takami and Wu pro-pose for the quantier scope component of grammar and some of the features inducedby our models nd clear parallels in the factors that Kuno Takami and Wu claim tobe relevant to scoping In addition our nal experiment in which we combine ourquantier scope module with a PCFG model of syntactic phrase structure providesevidence of a grammatical architecture in which different aspects of structure mutu-ally constrain one another This result casts doubt on approaches in which syntacticprocessing is completed prior to the determination of other grammatical properties ofa sentence such as quantier scope relations

Appendix Selected Codes Used to Annotate Syntactic Categories in the Penn Tree-bank from Marcus et al (1993) and Bies et al (1995)

Part-of-speech tags

Tag Meaning Tag Meaning

CC Conjunction RB AdverbCD Cardinal number RBR Comparative adverbDT Determiner RBS Superlative adverbIN Preposition TO ldquotordquoJJ Adjective UH InterjectionJJR Comparative adjective VB Verb in base formJJS Superlative adjective VBD Past-tense verbNN Singular or mass noun VBG Gerundive verbNNS Plural noun VBN Past participial verbNNP Singular proper noun VBP Non-3sg present-NNPS Plural proper noun tense verbPDT Predeterminer VBZ 3sg present-tense

verbPRP Personal pronoun WP WH pronounPRP$ Possessive pronoun WP$ Possessive WH pronoun

95

Higgins and Sadock Modeling Scope Preferences

Phrasal categories

Code Meaning Code Meaning

ADJP Adjective phrase SBAR Clause introduced byADVP Adverb phrase a subordinatingINTJ Interjection conjunctionNP Noun phrase SBARQ Clause introduced byPP Prepositional phrase a WH phraseQP Quantier phrase (ie SINV Inverted declarative

measure=amount sentencephrase) SQ Inverted yes=no

S Declarative clause question following theWH phrase in SBARQ

VP Verb phrase

AcknowledgmentsThe authors are grateful for an AcademicTechnology Innovation Grant from theUniversity of Chicago which helped tomake this work possible and to JohnGoldsmith Terry Regier Anne Pycha andBob Moore whose advice and collaborationhave considerably aided the researchreported in this article Any remainingerrors are of course our own

ReferencesAoun Joseph and Yen-hui Audrey Li 1993

The Syntax of Scope MIT Press CambridgeAoun Joseph and Yen-hui Audrey Li 2000

Scope structure and expert systems Areply to Kuno et al Language76(1)133ndash155

Belz Anja 2001 Optimisation ofcorpus-derived probabilistic grammars InProceedings of Corpus Linguistics 2001pages 46ndash57

Berger Adam L Stephen A Della Pietraand Vincent J Della Pietra 1996 Amaximum entropy approach to naturallanguage processing ComputationalLinguistics 22(1)39ndash71

Bies Ann Mark Ferguson Karen Katz andRobert MacIntyre 1995 Bracketingguidelines for Treebank II style Technicalreport Penn Treebank Project Universityof Pennsylvania

Bishop Christopher M 1995 NeuralNetworks for Pattern Recognition OxfordUniversity Press Oxford

Bridle John S 1990 Probabilisticinterpretation of feedforwardclassication network outputs withrelationships to statistical patternrecognition In F Fougelman-Soulie and

J Herault editors NeurocomputingmdashAlgorithms Architectures and ApplicationsSpringer-Verlag Berlin pages 227ndash236

Brill Eric 1995 Transformation-basederror-driven learning and naturallanguage processing A case study inpart-of-speech tagging ComputationalLinguistics 21(4)543ndash565

Carden Guy 1976 English QuantiersLogical Structure and Linguistic VariationAcademic Press New York

Carletta Jean 1996 Assessing agreement onclassication tasks The kappa statisticComputational Linguistics 22(2)249ndash254

Charniak Eugene 1993 Statistical languagelearning MIT Press Cambridge

Charniak Eugene 1996 Tree-bankgrammars In AAAI IAAI vol 2pages 1031ndash1036

Cohen Jacob 1960 A coefcient ofagreement for nominal scales Educationaland Psychological Measurement 2037ndash46

Cooper Robin 1983 Quantication andSyntactic Theory Reidel Dordrecht

Fleiss Joseph L 1981 Statistical Methods forRates and Proportions John Wiley amp SonsNew York

Hobbs Jerry R and Stuart M Shieber 1987An algorithm for generating quantierscopings Computational Linguistics1347ndash63

Hornstein Norbert 1995 Logical Form FromGB to Minimalism Blackwell Oxford andCambridge

Jelinek Frederick 1998 Statistical Methods forSpeech Recognition MIT Press Cambridge

Jurafsky Daniel and James H Martin 2000Speech and Language Processing PrenticeHall Upper Saddle River New Jersey

Krippendorff Klaus 1980 Content AnalysisAn Introduction to Its Methodology Sage

96

Computational Linguistics Volume 29 Number 1

Publications Beverly Hills CaliforniaKrotov Alexander Mark Hepple Robert J

Gaizauskas and Yorick Wilks 1998Compacting the Penn treebank grammarIn COLING-ACL pages 699ndash703

Kuno Susumu Ken-Ichi Takami and YuruWu 1999 Quantier scope in EnglishChinese and Japanese Language75(1)63ndash111

Kuno Susumu Ken-Ichi Takami and YuruWu 2001 Response to Aoun and LiLanguage 77(1)134ndash143

Manning Christopher D and HinrichSchutze 1999 Foundations of StatisticalNatural Language Processing MIT PressCambridge

Marcus Mitchell P Beatrice Santorini andMary Ann Marcinkiewicz 1993 Buildinga large annotated corpus of English ThePenn Treebank Computational Linguistics19(2)313ndash330

Martin Paul Douglas Appelt andFernando Pereira 1986 Transportabilityand generality in a natural-languageinterface system In B J Grosz K SparckJones and B L Webber editors NaturalLanguage Processing Kaufmann Los AltosCalifornia pages 585ndash593

May Robert 1985 Logical Form Its Structureand Derivation MIT Press Cambridge

McCawley James D 1998 The SyntacticPhenomena of English University ofChicago Press Chicago second edition

Mitchell Tom M 1996 Machine LearningMcGraw Hill New York

Montague Richard 1973 The propertreatment of quantication in ordinaryEnglish In J Hintikka et al editorsApproaches to Natural Language ReidelDordrecht pages 221ndash242

Moran Douglas B 1988 Quantier scopingin the SRI core language engine InProceedings of the 26th Annual Meeting of theAssociation for Computational Linguistics(ACLrsquo88) pages 33ndash40

Moran Douglas B and Fernando C NPereira 1992 Quantier scoping InHiyan Alshawi editor The Core LanguageEngine MIT Press Cambridgepages 149ndash172

Nerbonne John 1993 A feature-basedsyntaxsemantics interface InA Manaster-Ramer and W Zadrozsnyeditors Annals of Mathematics and ArticialIntelligence (Special Issue on Mathematics ofLanguage) 8(1ndash2)107ndash132 Also publishedas DFKI Research Report RR-92-42

Park Jong C 1995 Quantier scope and

constituency In Proceedings of the 33rdAnnual Meeting of the Association forComputational Linguistics (ACLrsquo95)pages 205ndash212

Pedersen Ted 2000 A simple approach tobuilding ensembles of naotilde ve Bayesianclassiers for word sense disambiguationIn Proceedings of the First Meeting of theNorth American Chapter of the Association forComputational Linguistics (NAACL 2000)pages 63ndash69

Pereira Fernando 1990 Categorialsemantics and scoping ComputationalLinguistics 16(1)1ndash10

Pollard Carl 1989 The syntax-semanticsinterface in a unication-based phrasestructure grammar In S BusemannC Hauenschild and C Umbach editorsViews of the Syntax Semantics InterfaceKIT-FAST Report 74 Technical Universityof Berlin pages 167ndash185

Pollard Carl and Eun Jung Yoo 1998 Aunied theory of scope for quantiersand WH-phrases Journal of Linguistics34(2)415ndash446

Ratnaparkhi Adwait 1997 A simpleintroduction to maximum entropy modelsfor natural language processing TechnicalReport 97-08 Institute for Research inCognitive Science University ofPennsylvania

Santorini Beatrice 1990 Part-of-speechtagging guidelines for the Penn Treebankproject Technical Report MS-CIS-90-47Department of Computer and InformationScience University of Pennsylvania

Soon Wee Meng Hwee Tou Ng and DanielChung Yong Lim 2001 A machinelearning approach to coreferenceresolution of noun phrases ComputationalLinguistics 27(4)521ndash544

van Halteren Hans Jakub Zavrel andWalter Daelemans 2001 Improvingaccuracy in word class tagging throughthe combination of machine learningsystems Computational Linguistics27(2)199ndash229

VanLehn Kurt A 1978 Determining thescope of English quantiers TechnicalReport AITR-483 Massachusetts Instituteof Technology Articial IntelligenceLaboratory Cambridge

Woods William A 1986 Semantics andquantication in natural languagequestion answering In B J GroszK Sparck Jones and B L Webber editorsNatural Language Processing KaufmannLos Altos California pages 205ndash248

Page 14: A Machine Learning Approach to Modeling Scope Preferences · 2003-11-13 · ®c2003 Association for Computational Linguistics A Machine Learning Approach to Modeling Scope Preferences

86

Computational Linguistics Volume 29 Number 1

Table 7Performance of the single-layer perceptron summed over all 10 test runs

Condition Correct Incorrect Percentage correct

First has wide scope 182 99 182281 = 648Second has wide scope 35 29 3564 = 547

No scope interaction 468 77 468545 = 859Total 685 205 685890 = 770

feature 25 makes a wide-scope reading for the rst quantier more likely if there is nointervening comma The third-ranked feature expresses the tendency mentioned abovefor quantiers in conjoined clauses not to interact Features 4 and 12 indicate that if therst quantied expression c-commands the second it is likely to take wide scope andthat if this is not the case there is likely to be no scope interaction Finally the sixth-and seventh-ranked features in the table show that an intervening quotation mark orcolon will make the classier tend toward class 0 ldquono scope interactionrdquo which is easyto understand Quotations are often opaque to quantier scope interactions The topfeatures found by the maximum-entropy classier largely coincide with those foundby the naive Bayes model which indicates that these generalizations are robust andobjectively present in the data

433 Single-Layer Perceptron For our neural network classier we employed a feed-forward single-layer perceptron with the softmax function used to determine the acti-vation of nodes at the output layer because this is a one-of-n classication task (Bridle1990) The data to be classied are presented as a vector of features at the input layerand the output layer has three nodes representing the three possible classes for thedata ldquorst has wide scoperdquo ldquosecond has wide scoperdquo and ldquono scope interactionrdquoThe output node with the highest activation is interpreted as the class of the datumpresented at the input layer

For each of the 10 test sets of 89 examples we trained the connection weightsof the network using error backpropagation on 654 training sentences reserving 150sentences for validation in order to choose the weights from the training epoch with thehighest classication performance In Table 7 we present the results of the single-layerneural network in classifying our test sentences As the table shows the single-layerperceptron has much better classication performance than the naive Bayes classierand maximum-entropy model possibly because the training of the network aims tominimize error in the activation of the classication output nodes which is directlyrelated to the classication task at hand whereas the other models do not directlymake use of the notion of ldquoclassication errorrdquo The perceptron also uses a sort ofweighted voting and could be interpreted as an implementation of Kuno Takamiand Wursquos proposal for scope determination This clearly illustrates that the tenabilityof their proposal hinges on the exact details of its implementation since all of ourclassier models are reasonable interpretations of their approach but they have verydifferent performance results on our scope determination task

To determine exactly which features of the data the network sees as relevant tothe classication problem we can simply look at the connection weights for eachfeature-class pair Higher connection weights indicate a greater correlation betweeninput features and output classes For one of the 10 networks we trained some ofthe features with the highest connection weights are listed in Table 8 Since class 0 is

87

Higgins and Sadock Modeling Scope Preferences

Table 8Most active features from single-layer perceptron

Rank Feature Predicted Weightclass

1 There is an intervening comma 0 4312 Second quantier is all 0 3773 There is an intervening colon 0 2984 There is an intervening conjunct node 0 272

17 The rst quantied NP c-commands the second 1 16918 Second quantier is tagged RBS 2 16919 There is an intervening S node 1 16120 Second quantier is each 2 150

simply more frequent in the training data than the other two classes the weights forthis class tend to be higher Therefore we also list some of the best predictor featuresfor classes 1 and 2 in the table

The rst- and third-ranked features in Table 8 show that an intervening comma orcolon will make the classier tend toward class 0 ldquono scope interactionrdquo This ndingby the classier is similar to the maximum-entropy classierrsquos nding an interveningquotation mark relevant and can be taken as an indication that quantiers in distantsyntactic subdomains are unlikely to interact Similarly the fourth-ranked feature indi-cates that quantiers in separate conjuncts are unlikely to interact The second-rankedfeature in the table expresses a tendency for there to be no scope interaction betweentwo quantiers if the second of them is headed by all This may be related to theindependence of universal quantiers (8x8y[iquest ] () 8y8x[iquest ]) Feature 17 in Table 8indicates a high correlation between the rst quantied expressionrsquos c-commandingthe second and the rst quantierrsquos taking wide scope which again supports KunoTakami and Wursquos claim that scope preferences are related to syntactic superiority re-lations Feature 18 expresses a preference for a quantied expression headed by mostto take wide scope even if it is the second of the two quantiers (since most is theonly quantier in the corpus that bears the tag RBS) Feature 19 indicates that therst quantier is more likely to take wide scope if there is a clause boundary in-tervening between the two quantiers which supports the notion that the syntacticdistance between the quantiers is relevant to scope preferences Finally feature 20expresses the well-known tendency for quantied expressions headed by each to takewide scope

44 Summary of ResultsTable 9 summarizes the performance of the quantier scope models we have presentedhere All of the classiers have test set accuracy above the baseline which a pairedt-test reveals to be signicant at the 001 level The differences between the naiveBayes maximum-entropy and single-layer perceptron classiers are not statisticallysignicant

The classiers performed signicantly better on those sentences annotated consis-tently by both human coders at the beginning of the study reinforcing the view thatthis subset of the data is somehow simpler and more representative of the basic regu-larities in scope preferences For example the single-layer perceptron classied 829of these sentences correctly To further investigate the nature of the variation betweenthe two coders we constructed a version of our single-layer network that was trained

88

Computational Linguistics Volume 29 Number 1

Table 9Summary of classier results

Training data Validation data Test data

Baseline mdash mdash 612Na otilde ve Bayes 767 mdash 726

Maximum entropy 783 755 735Single-layerperceptron 847 768 770

on the data on which both coders agreed and tested on the remaining sentences Thisclassier agreed with the reference coding (the coding of the rst coder) 514 of thetime and with the additional independent coder 358 of the time The rst coder con-structed the annotation guidelines for this project and may have been more successfulin applying them consistently Alternatively it is possible that different individuals usedifferent strategies in determining scope preferences and the strategy of the secondcoder may simply have been less similar than the strategy of the rst coder to that ofthe single-layer network

These three classifiers directly implement a sort of weighted voting, the method of aggregating evidence proposed by Kuno, Takami, and Wu (although the classifiers' implementation is slightly more sophisticated than the unweighted voting that is actually used in Kuno, Takami, and Wu's paper). Of course, since we do not use exactly the set of features suggested by Kuno, Takami, and Wu, our model should not be seen as a straightforward implementation of the theory outlined in their 1999 paper. Nevertheless, the results in Table 9 suggest that Kuno, Takami, and Wu's suggested design can be used with some success in modeling scope preferences. Moreover, the project undertaken here provides an answer to some of the objections that Aoun and Li (2000) raise to Kuno, Takami, and Wu. Aoun and Li claim that Kuno, Takami, and Wu's choice of experts is seemingly arbitrary and that it is unclear how the voting weights of each expert are to be set, but the machine learning approach we employ in this article is capable of addressing both of these potential problems. Supervised training of our classifiers is a straightforward approach to setting the weights and also constitutes our approach to selecting features (or "experts," in Kuno, Takami, and Wu's terminology). In the training process, any feature that is irrelevant to scoping preferences should receive weights that make its effect negligible.

5. Syntax and Scope

In this section we show how the classifier models of quantifier scope determination introduced in Section 4 may be integrated with a PCFG model of syntax. We compare two different ways in which the two components may be combined, which may loosely be termed serial and parallel, and argue for the latter on the basis of empirical results.

5.1 Modular Design

Our use of a phrase structure syntactic component and a quantifier scope component to define a combined language model is simplified by the fact that our classifiers are probabilistic and define a conditional probability distribution over quantifier scopings. The probability distributions that our classifiers define for quantifier scope structures are conditional on syntactic phrase structure, because they are computed on the basis


of syntactically provided features, such as the number of nodes of a certain type that intervene between two quantifiers in a phrase structure tree.

Thus, the combined language model that we define in this article assigns probabilities according to the pairs of structures that may be assigned to a sentence by the Q-structure and phrase structure syntax modules. The probability of a word string $w_{1-n}$ is therefore defined as in equation (4), where $Q$ ranges over all possible Q-structures in the set $\mathcal{Q}$ and $S$ ranges over all possible syntactic structures in the set $\mathcal{S}$:

$$P(w_{1-n}) = \sum_{S \in \mathcal{S},\, Q \in \mathcal{Q}} P(S, Q \mid w_{1-n}) \qquad (4)$$

$$= \sum_{S \in \mathcal{S},\, Q \in \mathcal{Q}} P(S \mid w_{1-n})\, P(Q \mid S, w_{1-n}) \qquad (5)$$

Equation (5) shows how we can use the definition of conditional probability to break our calculation of the language model probability into two parts. The first of these parts, $P(S \mid w_{1-n})$, which we may abbreviate as simply $P(S)$, is the probability of a particular syntactic tree structure's being assigned to a particular word string. We model this probability using a probabilistic phrase structure grammar (cf. Charniak [1993, 1996]). The second distribution on the right side of equation (5) is the conditional probability of a particular quantifier scope structure's being assigned to a particular word string, given the syntactic structure of that string. This probability is written as $P(Q \mid S, w_{1-n})$, or simply $P(Q \mid S)$, and represents the quantity we estimated above in constructing classifiers to predict the scopal representation of a sentence based on aspects of its syntactic structure.

Thus, given a PCFG model of syntactic structure and a probabilistically defined classifier of the sort introduced in Section 4, it is simple to determine the probability of any pairing of two particular structures from each domain for a given sentence. We simply multiply the values of $P(S)$ and $P(Q \mid S)$ to obtain the joint probability $P(Q, S)$. In the current section, we examine two different models of combination for these components: one in which scope determination is applied to the optimal syntactic structure (the Viterbi parse), and one in which optimization is performed in the space of both modules to find the optimal pairing of syntactic and quantifier scope structures.
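To make the multiplication step concrete, a minimal sketch follows; the interface names pcfg.parse_probability and scope_classifier.class_probs are hypothetical stand-ins introduced for illustration, not names from the implementation described in this article.

```python
# Sketch of combining the two modules: P(Q, S | w) = P(S | w) * P(Q | S, w).
def joint_probability(tree, scope_class, sentence, pcfg, scope_classifier):
    p_syntax = pcfg.parse_probability(tree, sentence)                    # P(S | w)
    p_scope = scope_classifier.class_probs(tree, sentence)[scope_class]  # P(Q | S, w)
    return p_syntax * p_scope
```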

5.2 The Syntactic Module

Before turning to the application of our multimodular approach to the problem of scope determination in Section 5.3, we present here a short overview of the phrase structure syntactic component used in these projects. As noted above, we model syntax as a probabilistic phrase structure grammar (PCFG), and in particular we use a treebank grammar (Charniak 1996) trained on the Penn Treebank.

A PCFG defines the probability of a string of words as the sum of the probabilities of all admissible phrase structure parses (trees) for that string. The probability of a given tree is the product of the probabilities of all of the rule instances used in the construction of that tree, where rules take the form N → φ, with N a nonterminal symbol and φ a finite sequence of one or more terminals or nonterminals.

To take an example, Figure 5 illustrates a phrase structure tree for the sentence Susan might not believe you, which is admissible according to the grammar in Table 10. (All of the minimal subtrees in Figure 5 are instances of one of our rules.) The probability


Figure 5: A simple phrase structure tree (diagram not shown).

Table 10
A simple probabilistic phrase structure grammar

Rule              Probability
S → NP VP         .7
S → VP            .2
S → V NP VP       .1
VP → V VP         .3
VP → ADV VP       .1
VP → V            .1
VP → V NP         .3
VP → V NP NP      .2
NP → Susan        .3
NP → you          .4
NP → Yves         .3
V → might         .2
V → believe       .3
V → show          .3
V → stay          .2
ADV → not         .5
ADV → always      .5


of this tree, which we can indicate as $\tau$, can be calculated as in equation (6), where $R(\tau)$ denotes the rule instances used in constructing $\tau$:

$$P(\tau) = \prod_{\xi \in R(\tau)} P(\xi) \qquad (6)$$

$$= P(\text{S} \to \text{NP VP}) \times P(\text{VP} \to \text{V VP}) \times P(\text{VP} \to \text{ADV VP})$$
$$\times P(\text{VP} \to \text{V NP}) \times P(\text{NP} \to \text{Susan}) \times P(\text{V} \to \text{might})$$
$$\times P(\text{ADV} \to \text{not}) \times P(\text{V} \to \text{believe}) \times P(\text{NP} \to \text{you}) \qquad (7)$$

$$= .7 \times .3 \times .1 \times .3 \times .3 \times .2 \times .5 \times .3 \times .4 = 2.268 \times 10^{-5} \qquad (8)$$
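To make equation (6) concrete, the following sketch (our illustration, not code from the article) computes the probability of the Figure 5 tree from the Table 10 rule probabilities; trees are represented as nested tuples.

```python
# Probability of a tree as the product of the probabilities of its rule instances.
RULE_PROBS = {
    ("S", ("NP", "VP")): 0.7,
    ("VP", ("V", "VP")): 0.3,
    ("VP", ("ADV", "VP")): 0.1,
    ("VP", ("V", "NP")): 0.3,
    ("NP", ("Susan",)): 0.3,
    ("NP", ("you",)): 0.4,
    ("V", ("might",)): 0.2,
    ("V", ("believe",)): 0.3,
    ("ADV", ("not",)): 0.5,
}

def tree_probability(tree):
    """tree = (label, child, ...); a child is either a word (str) or a subtree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = RULE_PROBS[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            prob *= tree_probability(child)
    return prob

# The tree of Figure 5 for "Susan might not believe you"
figure5 = ("S",
           ("NP", "Susan"),
           ("VP", ("V", "might"),
                  ("VP", ("ADV", "not"),
                         ("VP", ("V", "believe"), ("NP", "you")))))
print(tree_probability(figure5))   # ~2.268e-05, matching equation (8)
```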

The actual grammar rules and associated probabilities that we use in defining our syntactic module are derived from the WSJ corpus of the Penn Treebank by maximum-likelihood estimation. That is, for each rule N → φ used in the treebank, we add the rule to the grammar and set its probability to

$$\frac{C(N \to \phi)}{\sum_{\psi} C(N \to \psi)}$$

where $C(\cdot)$ denotes the "count" of a rule (i.e., the number of times it is used in the corpus). A grammar composed in this manner is referred to as a treebank grammar, because its rules are directly derived from those in a treebank corpus.

We used sections 00–20 of the WSJ corpus of the Penn Treebank for collecting the rules and associated probabilities of our PCFG, which is implemented as a bottom-up chart parser. Before constructing the grammar, the treebank was preprocessed using known procedures (cf. Krotov et al. [1998]; Belz [2001]) to facilitate the construction of a rule list. Functional and anaphoric annotations (basically anything following a "-" in a node label; cf. Santorini [1990], Bies et al. [1995]) were removed from nonterminal labels. Nodes that dominate only "empty categories," such as traces, were removed. In addition, unary-branching constructions were removed by replacing the mother category in such a structure with the daughter node. (For example, given an instance of the rule X → YZ, if the daughter category Y were expanded by the unary rule Y → W, our algorithm would induce the single rule X → WZ.) Finally, we discarded all rules that had more than 10 symbols on the right-hand side (an arbitrary limit of our parser implementation). This resulted in a total of 18,820 rules, of which 11,156 were discarded as hapax legomena, leaving 7,664 rules in our treebank grammar. Table 11 shows some of the rules in our grammar with the highest and lowest corpus counts.
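As a rough sketch of this extraction procedure (not the authors' code), the following uses NLTK's small Penn Treebank sample purely for illustration; empty-category removal and unary-branch collapsing are omitted here, and the counts will of course differ from those reported above.

```python
# Treebank grammar by maximum-likelihood estimation, with simplified preprocessing.
# Requires: nltk.download('treebank')
from collections import Counter
from nltk.corpus import treebank

rule_counts = Counter()
for tree in treebank.parsed_sents():
    # Strip functional/anaphoric annotation, e.g. "NP-SBJ" -> "NP".
    for subtree in tree.subtrees():
        label = subtree.label()
        subtree.set_label(label.split("-")[0] or label)
    for prod in tree.productions():
        rhs = tuple(str(sym) for sym in prod.rhs())
        if len(rhs) <= 10:                       # drop overly long right-hand sides
            rule_counts[(str(prod.lhs()), rhs)] += 1

# Keep rules seen more than once (hapax legomena discarded), then normalize per LHS.
kept = {rule: c for rule, c in rule_counts.items() if c > 1}
lhs_totals = Counter()
for (lhs, _), count in kept.items():
    lhs_totals[lhs] += count
rule_probs = {rule: count / lhs_totals[rule[0]] for rule, count in kept.items()}
```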

5.3 Unlabeled Scope Determination

In this section we describe an experiment designed to assess the performance of parallel and serial approaches to combining grammatical modules, focusing on the task of unlabeled scope determination. This task involves predicting the most likely Q-structure representation for a sentence, basically the same task we addressed in Section 4 in comparing the performance levels of each type of classifier. The experiment of this section differs, however, from the task presented in Section 4 in that instead of providing a syntactic tree from the Penn Treebank as input to the classifier, we provide the model only with a string of words (a sentence). Our dual-component model will search for the optimal syntactic and scopal structures for the sentence (the pairing $(\tau^*, \chi^*)$) and will be evaluated based on its success in identifying the correct scope reading $\chi^*$.

Our concern in this section will be to determine whether it is necessary to search the space of possible pairings $(\tau, \chi)$ of syntactic and scopal structures, or whether it is sufficient to use our PCFG first to fix the syntactic tree $\tau$ and then to choose


Table 11
Rules derived from sections 00–20 of the Penn Treebank WSJ corpus. "TOP" is a special "start" symbol that may expand to any of the symbols found at the root of a tree in the corpus.

Rule                          Corpus count
PP → IN NP                    59,053
TOP → S                       34,614
NP → DT NN                    28,074
NP → NP PP                    25,192
S → NP VP .                   14,032
S → NP VP                     12,901
VP → TO VP                    11,598

S → CC PP NNP NNP VP .        2
NP → DT `` NN NN NN ''        2
NP → NP PP PP PP PP PP        2
INTJ → UH UH                  2
NP → DT `` NN NNS             2
SBARQ → `` WP VP . ''         2
S → PP NP VP . ''             2

a scope reading to maximize the probability of the pairing. That is, are syntax and quantifier scope mutually dependent components of grammar, or can scope relations be "read off of" syntax? The serial model suggests that the optimal syntactic structure $\tau^*$ should be chosen on the basis of the syntactic module only, as in equation (9), and the optimal quantifier scope structure $\chi^*$ then chosen on the basis of $\tau^*$, as in equation (10). The parallel model, on the other hand, suggests that the most likely pairing of structures must be chosen in the joint probability space of both components, as in equation (11).

$$\tau^* = \arg\max_{\tau \in \mathcal{S}} P_S(\tau \mid w_{1-n}) \qquad (9)$$

$$\chi^* = \arg\max_{\chi \in \mathcal{Q}} P_Q(\chi \mid \tau^*, w_{1-n}) \qquad (10)$$

$$\tau^* = \{\tau \mid (\tau, \chi) = \arg\max_{\tau \in \mathcal{S},\, \chi \in \mathcal{Q}} P_S(\tau \mid w_{1-n})\, P_Q(\chi \mid \tau, w_{1-n})\} \qquad (11)$$

5.3.1 Experimental Design. For this experiment, we implement the scoping component as a single-layer feed-forward network, because the single-layer perceptron classifier had the best prediction rate among the three classifiers tested in Section 4. The softmax activation function we use for the output nodes of the classifier guarantees that the activations of all of the output nodes sum to one and can be interpreted as class probabilities. The syntactic component, of course, is determined by the treebank PCFG grammar described above.
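As an illustration of this architecture (a sketch under our own assumptions, not the authors' implementation), a single-layer network with softmax outputs over the three scope classes can be written as follows; the input x is assumed to be a binary feature vector of the kind described in Section 4.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

class ScopeNetwork:
    """Single layer of weights with softmax outputs over classes 0, 1, 2."""
    def __init__(self, n_features, n_classes=3, lr=0.1):
        self.W = np.zeros((n_classes, n_features))
        self.b = np.zeros(n_classes)
        self.lr = lr

    def class_probs(self, x):
        return softmax(self.W @ x + self.b)       # interpretable as P(class | features)

    def train_step(self, x, y):
        grad = self.class_probs(x)
        grad[y] -= 1.0                            # gradient of cross-entropy w.r.t. logits
        self.W -= self.lr * np.outer(grad, x)
        self.b -= self.lr * grad
```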

Given these two models, which respectively define $P_Q(\chi \mid \tau, w_{1-n})$ and $P_S(\tau \mid w_{1-n})$ from equation (11), it remains only to specify how to search the space of pairings $(\tau, \chi)$ in performing this optimization to find $\chi^*$. Unfortunately, it is not feasible to examine all values $\tau \in \mathcal{S}$, since our PCFG will generally admit a huge number of


Table 12
Performance of models on the unlabeled scope prediction task, summed over all 10 test runs

Condition                   Correct   Incorrect   Percentage correct
Parallel model
  First has wide scope      168       113         168/281 = 59.8%
  Second has wide scope     26        38          26/64 = 40.6%
  No scope interaction      467       78          467/545 = 85.7%
  Total                     661       229         661/890 = 74.3%
Serial model
  First has wide scope      163       118         163/281 = 58.0%
  Second has wide scope     27        37          27/64 = 42.2%
  No scope interaction      461       84          461/545 = 84.6%
  Total                     651       239         651/890 = 73.1%

trees for a sentence (especially given a mean sentence length of over 20 words in the WSJ corpus).¹² Our solution to this search problem is to make the simplifying assumption that the syntactic tree that is used in the optimal set of structures $(\tau^*, \chi^*)$ will always be among the top few trees $\tau$ for which $P_S(\tau \mid w_{1-n})$ is the greatest. That is, although we suppose that quantifier scope information is relevant to parsing, we do not suppose that it is so strong a determinant as to completely override syntactic factors. In practice, this means that our parser will return the top 10 parses for each sentence, along with the probabilities assigned to them, and these are the only parses that are considered in looking for the optimal set of linguistic structures.

We again used 10-fold cross-validation in evaluating the competing models, dividing the scope-tagged corpus into 10 test sections of 89 sentences each, and we used the same version of the treebank grammar for our PCFG. The first model retrieved the top 10 syntactic parses ($\tau_0, \ldots, \tau_9$) for each sentence and computed the probability $P(\tau, \chi)$ for each $\tau \in \{\tau_0, \ldots, \tau_9\}$, $\chi \in \{0, 1, 2\}$, choosing the scopal representation $\chi$ that was found in the maximum-probability pairing. We call this the parallel model, because the properties of each probabilistic model may influence the optimal structure chosen by the other. The second model retrieved only the Viterbi parse $\tau_0$ from the PCFG and chose the scopal representation $\chi$ for which the pairing $(\tau_0, \chi)$ took on the highest probability. We call this the serial model, because it represents syntactic phrase structure as independent of other components of grammar (in this case, quantifier scope), though other components are dependent upon it.
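The two procedures can be sketched as follows; pcfg.top_parses and scope_probs are hypothetical stand-ins for the chart parser's n-best list of (parse, probability) pairs and the network's softmax distribution over the three scope classes.

```python
def parallel_choice(sentence, pcfg, scope_probs, n_best=10):
    """Maximize P_S(tau | w) * P_Q(chi | tau, w) over the top-n parses and all scopes."""
    best = None
    for tree, p_syntax in pcfg.top_parses(sentence, n_best):
        for scope_class, p_scope in enumerate(scope_probs(tree, sentence)):
            joint = p_syntax * p_scope
            if best is None or joint > best[0]:
                best = (joint, tree, scope_class)
    return best[1], best[2]

def serial_choice(sentence, pcfg, scope_probs):
    """Fix the Viterbi parse first, then pick the most probable scope reading."""
    tree, _ = pcfg.top_parses(sentence, 1)[0]
    probs = scope_probs(tree, sentence)
    return tree, max(range(len(probs)), key=probs.__getitem__)
```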

5.3.2 Results. There was an appreciable difference in performance between these two models on the quantifier scope test sets. As shown in Table 12, the parallel model narrowly outperformed the serial model, by 1.2%. A 10-fold paired t-test on the test sections of the scope-tagged corpus shows that the parallel model is significantly better (p < .05).

12. Since we are allowing $\chi$ to range only over the three scope readings (0, 1, 2), however, it is possible to enumerate all values of $\chi$ to be paired with a given syntactic tree $\tau$.
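For concreteness, a 10-fold paired t-test of the kind reported here could be computed as below; the per-fold accuracies shown are placeholder values, not the actual fold-level results.

```python
from scipy import stats

# Hypothetical per-fold accuracies for the parallel and serial models (10 folds).
parallel_acc = [0.75, 0.73, 0.76, 0.74, 0.73, 0.75, 0.74, 0.76, 0.72, 0.75]
serial_acc   = [0.74, 0.72, 0.75, 0.72, 0.72, 0.74, 0.73, 0.75, 0.71, 0.74]

t_stat, p_value = stats.ttest_rel(parallel_acc, serial_acc)   # paired t-test
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")                 # significant if p < .05
```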


This result suggests that in determining the syntactic structure of a sentence, we must take aspects of structure into account that are not purely syntactic (such as quantifier scope). Searching both dimensions of the hypothesis space for our dual-component model allowed the composite model to handle the interdependencies between different aspects of grammatical structure, whereas fixing a phrase structure tree purely on the basis of syntactic considerations led to suboptimal performance in using that structure as a basis for determining quantifier scope.

6. Conclusion

In this article we have taken a statistical, corpus-based approach to the modeling of quantifier scope preferences, a subject that has previously been addressed only with systems of ad hoc rules derived from linguists' intuitive judgments. Our model takes its theoretical inspiration from Kuno, Takami, and Wu (1999), who suggest an "expert system" approach to scope preferences, and follows many other projects in the machine learning of natural language that combine information from multiple sources in solving linguistic problems.

Our results are generally supportive of the design that Kuno, Takami, and Wu propose for the quantifier scope component of grammar, and some of the features induced by our models find clear parallels in the factors that Kuno, Takami, and Wu claim to be relevant to scoping. In addition, our final experiment, in which we combine our quantifier scope module with a PCFG model of syntactic phrase structure, provides evidence of a grammatical architecture in which different aspects of structure mutually constrain one another. This result casts doubt on approaches in which syntactic processing is completed prior to the determination of other grammatical properties of a sentence, such as quantifier scope relations.

Appendix: Selected Codes Used to Annotate Syntactic Categories in the Penn Treebank, from Marcus et al. (1993) and Bies et al. (1995)

Part-of-speech tags

Tag    Meaning                     Tag    Meaning
CC     Conjunction                 RB     Adverb
CD     Cardinal number             RBR    Comparative adverb
DT     Determiner                  RBS    Superlative adverb
IN     Preposition                 TO     "to"
JJ     Adjective                   UH     Interjection
JJR    Comparative adjective       VB     Verb in base form
JJS    Superlative adjective       VBD    Past-tense verb
NN     Singular or mass noun       VBG    Gerundive verb
NNS    Plural noun                 VBN    Past participial verb
NNP    Singular proper noun        VBP    Non-3sg present-tense verb
NNPS   Plural proper noun          VBZ    3sg present-tense verb
PDT    Predeterminer               WP     WH pronoun
PRP    Personal pronoun            WP$    Possessive WH pronoun
PRP$   Possessive pronoun


Phrasal categories

Code    Meaning                                           Code     Meaning
ADJP    Adjective phrase                                  SBAR     Clause introduced by a subordinating conjunction
ADVP    Adverb phrase                                     SBARQ    Clause introduced by a WH phrase
INTJ    Interjection                                      SINV     Inverted declarative sentence
NP      Noun phrase                                       SQ       Inverted yes/no question following the WH phrase in SBARQ
PP      Prepositional phrase
QP      Quantifier phrase (i.e., measure/amount phrase)
S       Declarative clause
VP      Verb phrase

Acknowledgments

The authors are grateful for an Academic Technology Innovation Grant from the University of Chicago, which helped to make this work possible, and to John Goldsmith, Terry Regier, Anne Pycha, and Bob Moore, whose advice and collaboration have considerably aided the research reported in this article. Any remaining errors are, of course, our own.

References

Aoun, Joseph and Yen-hui Audrey Li. 1993. The Syntax of Scope. MIT Press, Cambridge.
Aoun, Joseph and Yen-hui Audrey Li. 2000. Scope, structure, and expert systems: A reply to Kuno et al. Language, 76(1):133–155.
Belz, Anja. 2001. Optimisation of corpus-derived probabilistic grammars. In Proceedings of Corpus Linguistics 2001, pages 46–57.
Berger, Adam L., Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.
Bies, Ann, Mark Ferguson, Karen Katz, and Robert MacIntyre. 1995. Bracketing guidelines for Treebank II style. Technical report, Penn Treebank Project, University of Pennsylvania.
Bishop, Christopher M. 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bridle, John S. 1990. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fougelman-Soulie and J. Herault, editors, Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag, Berlin, pages 227–236.
Brill, Eric. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565.
Carden, Guy. 1976. English Quantifiers: Logical Structure and Linguistic Variation. Academic Press, New York.
Carletta, Jean. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249–254.
Charniak, Eugene. 1993. Statistical Language Learning. MIT Press, Cambridge.
Charniak, Eugene. 1996. Tree-bank grammars. In AAAI/IAAI, vol. 2, pages 1031–1036.
Cohen, Jacob. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46.
Cooper, Robin. 1983. Quantification and Syntactic Theory. Reidel, Dordrecht.
Fleiss, Joseph L. 1981. Statistical Methods for Rates and Proportions. John Wiley & Sons, New York.
Hobbs, Jerry R. and Stuart M. Shieber. 1987. An algorithm for generating quantifier scopings. Computational Linguistics, 13:47–63.
Hornstein, Norbert. 1995. Logical Form: From GB to Minimalism. Blackwell, Oxford and Cambridge.
Jelinek, Frederick. 1998. Statistical Methods for Speech Recognition. MIT Press, Cambridge.
Jurafsky, Daniel and James H. Martin. 2000. Speech and Language Processing. Prentice Hall, Upper Saddle River, New Jersey.
Krippendorff, Klaus. 1980. Content Analysis: An Introduction to Its Methodology. Sage Publications, Beverly Hills, California.
Krotov, Alexander, Mark Hepple, Robert J. Gaizauskas, and Yorick Wilks. 1998. Compacting the Penn treebank grammar. In COLING-ACL, pages 699–703.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 1999. Quantifier scope in English, Chinese, and Japanese. Language, 75(1):63–111.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 2001. Response to Aoun and Li. Language, 77(1):134–143.
Manning, Christopher D. and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge.
Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
Martin, Paul, Douglas Appelt, and Fernando Pereira. 1986. Transportability and generality in a natural-language interface system. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 585–593.
May, Robert. 1985. Logical Form: Its Structure and Derivation. MIT Press, Cambridge.
McCawley, James D. 1998. The Syntactic Phenomena of English. University of Chicago Press, Chicago, second edition.
Mitchell, Tom M. 1996. Machine Learning. McGraw Hill, New York.
Montague, Richard. 1973. The proper treatment of quantification in ordinary English. In J. Hintikka et al., editors, Approaches to Natural Language. Reidel, Dordrecht, pages 221–242.
Moran, Douglas B. 1988. Quantifier scoping in the SRI Core Language Engine. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics (ACL'88), pages 33–40.
Moran, Douglas B. and Fernando C. N. Pereira. 1992. Quantifier scoping. In Hiyan Alshawi, editor, The Core Language Engine. MIT Press, Cambridge, pages 149–172.
Nerbonne, John. 1993. A feature-based syntax/semantics interface. In A. Manaster-Ramer and W. Zadrozsny, editors, Annals of Mathematics and Artificial Intelligence (Special Issue on Mathematics of Language), 8(1–2):107–132. Also published as DFKI Research Report RR-92-42.
Park, Jong C. 1995. Quantifier scope and constituency. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL'95), pages 205–212.
Pedersen, Ted. 2000. A simple approach to building ensembles of naïve Bayesian classifiers for word sense disambiguation. In Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2000), pages 63–69.
Pereira, Fernando. 1990. Categorial semantics and scoping. Computational Linguistics, 16(1):1–10.
Pollard, Carl. 1989. The syntax-semantics interface in a unification-based phrase structure grammar. In S. Busemann, C. Hauenschild, and C. Umbach, editors, Views of the Syntax/Semantics Interface, KIT-FAST Report 74, Technical University of Berlin, pages 167–185.
Pollard, Carl and Eun Jung Yoo. 1998. A unified theory of scope for quantifiers and WH-phrases. Journal of Linguistics, 34(2):415–446.
Ratnaparkhi, Adwait. 1997. A simple introduction to maximum entropy models for natural language processing. Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania.
Santorini, Beatrice. 1990. Part-of-speech tagging guidelines for the Penn Treebank project. Technical Report MS-CIS-90-47, Department of Computer and Information Science, University of Pennsylvania.
Soon, Wee Meng, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544.
van Halteren, Hans, Jakub Zavrel, and Walter Daelemans. 2001. Improving accuracy in word class tagging through the combination of machine learning systems. Computational Linguistics, 27(2):199–229.
VanLehn, Kurt A. 1978. Determining the scope of English quantifiers. Technical Report AITR-483, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Cambridge.
Woods, William A. 1986. Semantics and quantification in natural language question answering. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 205–248.

Page 15: A Machine Learning Approach to Modeling Scope Preferences · 2003-11-13 · ®c2003 Association for Computational Linguistics A Machine Learning Approach to Modeling Scope Preferences

87

Higgins and Sadock Modeling Scope Preferences

Table 8Most active features from single-layer perceptron

Rank Feature Predicted Weightclass

1 There is an intervening comma 0 4312 Second quantier is all 0 3773 There is an intervening colon 0 2984 There is an intervening conjunct node 0 272

17 The rst quantied NP c-commands the second 1 16918 Second quantier is tagged RBS 2 16919 There is an intervening S node 1 16120 Second quantier is each 2 150

simply more frequent in the training data than the other two classes the weights forthis class tend to be higher Therefore we also list some of the best predictor featuresfor classes 1 and 2 in the table

The rst- and third-ranked features in Table 8 show that an intervening comma orcolon will make the classier tend toward class 0 ldquono scope interactionrdquo This ndingby the classier is similar to the maximum-entropy classierrsquos nding an interveningquotation mark relevant and can be taken as an indication that quantiers in distantsyntactic subdomains are unlikely to interact Similarly the fourth-ranked feature indi-cates that quantiers in separate conjuncts are unlikely to interact The second-rankedfeature in the table expresses a tendency for there to be no scope interaction betweentwo quantiers if the second of them is headed by all This may be related to theindependence of universal quantiers (8x8y[iquest ] () 8y8x[iquest ]) Feature 17 in Table 8indicates a high correlation between the rst quantied expressionrsquos c-commandingthe second and the rst quantierrsquos taking wide scope which again supports KunoTakami and Wursquos claim that scope preferences are related to syntactic superiority re-lations Feature 18 expresses a preference for a quantied expression headed by mostto take wide scope even if it is the second of the two quantiers (since most is theonly quantier in the corpus that bears the tag RBS) Feature 19 indicates that therst quantier is more likely to take wide scope if there is a clause boundary in-tervening between the two quantiers which supports the notion that the syntacticdistance between the quantiers is relevant to scope preferences Finally feature 20expresses the well-known tendency for quantied expressions headed by each to takewide scope

44 Summary of ResultsTable 9 summarizes the performance of the quantier scope models we have presentedhere All of the classiers have test set accuracy above the baseline which a pairedt-test reveals to be signicant at the 001 level The differences between the naiveBayes maximum-entropy and single-layer perceptron classiers are not statisticallysignicant

The classiers performed signicantly better on those sentences annotated consis-tently by both human coders at the beginning of the study reinforcing the view thatthis subset of the data is somehow simpler and more representative of the basic regu-larities in scope preferences For example the single-layer perceptron classied 829of these sentences correctly To further investigate the nature of the variation betweenthe two coders we constructed a version of our single-layer network that was trained

88

Computational Linguistics Volume 29 Number 1

Table 9Summary of classier results

Training data Validation data Test data

Baseline mdash mdash 612Na otilde ve Bayes 767 mdash 726

Maximum entropy 783 755 735Single-layerperceptron 847 768 770

on the data on which both coders agreed and tested on the remaining sentences Thisclassier agreed with the reference coding (the coding of the rst coder) 514 of thetime and with the additional independent coder 358 of the time The rst coder con-structed the annotation guidelines for this project and may have been more successfulin applying them consistently Alternatively it is possible that different individuals usedifferent strategies in determining scope preferences and the strategy of the secondcoder may simply have been less similar than the strategy of the rst coder to that ofthe single-layer network

These three classiers directly implement a sort of weighted voting the methodof aggregating evidence proposed by Kuno Takami and Wu (although the classiersrsquoimplementation is slightly more sophisticated than the unweighted voting that is ac-tually used in Kuno Takami and Wursquos paper) Of course since we do not use exactlythe set of features suggested by Kuno Takami and Wu our model should not beseen as a straightforward implementation of the theory outlined in their 1999 paperNevertheless the results in Table 9 suggest that Kuno Takami and Wursquos suggesteddesign can be used with some success in modeling scope preferences Moreover theproject undertaken here provides an answer to some of the objections that Aoun andLi (2000) raise to Kuno Takami and Wu Aoun and Li claim that Kuno Takami andWursquos choice of experts is seemingly arbitrary and that it is unclear how the votingweights of each expert are to be set but the machine learning approach we employin this article is capable of addressing both of these potential problems Supervisedtraining of our classiers is a straightforward approach to setting the weights andalso constitutes our approach to selecting features (or ldquoexpertsrdquo in Kuno Takami andWursquos terminology) In the training process any feature that is irrelevant to scopingpreferences should receive weights that make its effect negligible

5 Syntax and Scope

In this section we show how the classier models of quantier scope determinationintroduced in Section 4 may be integrated with a PCFG model of syntax We com-pare two different ways in which the two components may be combined which mayloosely be termed serial and parallel and argue for the latter on the basis of empiricalresults

51 Modular DesignOur use of a phrase structure syntactic component and a quantier scope componentto dene a combined language model is simplied by the fact that our classiers areprobabilistic and dene a conditional probability distribution over quantier scopingsThe probability distributions that our classiers dene for quantier scope structuresare conditional on syntactic phrase structure because they are computed on the basis

89

Higgins and Sadock Modeling Scope Preferences

of syntactically provided features such as the number of nodes of a certain type thatintervene between two quantiers in a phrase structure tree

Thus the combined language model that we dene in this article assigns probabil-ities according to the pairs of structures that may be assigned to a sentence by the Q-structure and phrase structure syntax modules The probability of a word string w1iexclnis therefore dened as in equation (4) where Q ranges over all possible Q-structuresin the set Q and S ranges over all possible syntactic structures in the set S

P(w1iexcln) =X

S2 SQ 2 QP(S Q j w1iexcln) (4)

=X

S2 SQ 2 QP(S j w1iexcln)P(Q j S w1iexcln) (5)

Equation (5) shows how we can use the denition of conditional probability tobreak our calculation of the language model probability into two parts The rst ofthese parts P(S j w1iexcln) which we may abbreviate as simply P(S) is the probabilityof a particular syntactic tree structurersquos being assigned to a particular word string Wemodel this probability using a probabilistic phrase structure grammar (cf Charniak[1993 1996]) The second distribution on the right side of equation (5) is the conditionalprobability of a particular quantier scope structurersquos being assigned to a particularword string given the syntactic structure of that string This probability is written asP(Q j S w1iexcln) or simply P(Q j S) and represents the quantity we estimated abovein constructing classiers to predict the scopal representation of a sentence based onaspects of its syntactic structure

Thus given a PCFG model of syntactic structure and a probabilistically denedclassier of the sort introduced in Section 4 it is simple to determine the probabilityof any pairing of two particular structures from each domain for a given sentenceWe simply multiply the values of P(S) and P(Q j S) to obtain the joint probabilityP(Q S) In the current section we examine two different models of combination forthese components one in which scope determination is applied to the optimal syn-tactic structure (the Viterbi parse) and one in which optimization is performed in thespace of both modules to nd the optimal pairing of syntactic and quantier scopestructures

52 The Syntactic ModuleBefore turning to the application of our multimodular approach to the problem ofscope determination in Section 53 we present here a short overview of the phrasestructure syntactic component used in these projects As noted above we model syn-tax as a probabilistic phrase structure grammar (PCFG) and in particular we use atreebank grammar (Charniak 1996) trained on the Penn Treebank

A PCFG denes the probability of a string of words as the sum of the probabilitiesof all admissible phrase structure parses (trees) for that string The probability of agiven tree is the product of the probability of all of the rule instances used in theconstruction of that tree where rules take the form N iquest with N a nonterminalsymbol and iquest a nite sequence of one or more terminals or nonterminals

To take an example Figure 5 illustrates a phrase structure tree for the sentence Su-san might not believe you which is admissible according to the grammar in Table 10 (Allof the minimal subtrees in Figure 5 are instances of one of our rules) The probability

90

Computational Linguistics Volume 29 Number 1

Figure 5A simple phrase structure tree

Table 10A simple probabilistic phrase structure grammar

Rule Probability

S NP VP 7S VP 2S V NP VP 1

VP V VP 3VP ADV VP 1VP V 1VP V NP 3VP V NP NP 2

NP Susan 3NP you 4NP Yves 3

V might 2V believe 3V show 3V stay 2

ADV not 5ADV always 5

91

Higgins and Sadock Modeling Scope Preferences

of this tree which we can indicate as frac12 can be calculated as in equation (6)

P( frac12 ) =Y

raquo 2 ( frac12 )

P( raquo ) (6)

= P(S NP VP) pound P(VP V VP) pound P(VP ADV VP)

pound P(VP V NP) pound P(NP Susan) pound P(V might)

pound P(ADV not) pound P(V believe) pound P(NP you) (7)

= 7 pound 3 pound 1 pound 3 pound 3 pound 2 pound 5 pound 3 pound 4 = 2268 pound 10iexcl5 (8)

The actual grammar rules and associated probabilities that we use in dening oursyntactic module are derived from the WSJ corpus of the Penn Treebank by maximum-likelihood estimation That is for each rule N iquest used in the treebank we add therule to the grammar and set its probability to C(N iquest )P

AacuteC(N Aacute)

where C( ) denotes the

ldquocountrdquo or a rule (ie the number of times it is used in the corpus) A grammarcomposed in this manner is referred to as a treebank grammar because its rules aredirectly derived from those in a treebank corpus

We used sections 00ndash20 of the WSJ corpus of the Penn Treebank for collecting therules and associated probabilities of our PCFG which is implemented as a bottom-upchart parser Before constructing the grammar the treebank was preprocessed usingknown procedures (cf Krotov et al [1998] Belz [2001]) to facilitate the construction ofa rule list Functional and anaphoric annotations (basically anything following a ldquo-rdquoin a node label cf Santorini [1990] Bies et al [1995]) were removed from nonterminallabels Nodes that dominate only ldquoempty categoriesrdquo such as traces were removedIn addition unary-branching constructions were removed by replacing the mothercategory in such a structure with the daughter node (For example given an instanceof the rule X YZ if the daughter category Y were expanded by the unary ruleY W our algorithm would induce the single rule X WZ) Finally we discardedall rules that had more than 10 symbols on the right-hand side (an arbitrary limit ofour parser implementation) This resulted in a total of 18820 rules of which 11156were discarded as hapax legomena leaving 7664 rules in our treebank grammarTable 11 shows some of the rules in our grammar with the highest and lowest corpuscounts

53 Unlabeled Scope DeterminationIn this section we describe an experiment designed to assess the performance ofparallel and serial approaches to combining grammatical modules focusing on thetask of unlabeled scope determination This task involves predicting the most likelyQ-structure representation for a sentence basically the same task we addressed in Sec-tion 4 in comparing the performance levels of each type of classier The experimentof this section differs however from the task presented in Section 4 in that instead ofproviding a syntactic tree from the Penn Treebank as input to the classier we providethe model only with a string of words (a sentence) Our dual-component model willsearch for the optimal syntactic and scopal structures for the sentence (the pairing( frac12 curren Agrave curren )) and will be evaluated based on its success in identifying the correct scopereading Agrave curren

Our concern in this section will be to determine whether it is necessary to searchthe space of possible pairings ( frac12 Agrave ) of syntactic and scopal structures or whetherit is sufcient to use our PCFG rst to x the syntactic tree frac12 and then to choose

92

Computational Linguistics Volume 29 Number 1

Table 11Rules derived from sections 00ndash20 of the Penn Treebank WSJ corpus ldquoTOPrdquo is a special ldquostartrdquosymbol that may expand to any of the symbols found at the root of a tree in the corpus

Rule Corpus count

PP IN NP 59053TOP S 34614NP DT NN 28074NP NP PP 25192S NP VP 14032S NP VP 12901VP TO VP 11598

S CC PP NNP NNP VP 2NP DT ldquo NN NN NN 0 0 2NP NP PP PP PP PP PP 2INTJ UH UH 2NP DT ldquo NN NNS 2SBARQ ldquo WP VP 0 0 2S PP NP VP 0 0 2

a scope reading to maximize the probability of the pairing That is are syntax andquantier scope mutually dependent components of grammar or can scope relationsbe ldquoread off ofrdquo syntax The serial model suggests that the optimal syntactic structurefrac12 curren should be chosen on the basis of the syntactic module only as in equation (9)and the optimal quantier scope structure Agrave curren then chosen on the basis of frac12 curren as inequation (10) The parallel model on the other hand suggests that the most likelypairing of structures must be chosen in the joint probability space of both componentsas in equation (11)

frac12 curren = arg maxfrac12 2 S

PS( frac12 j w1iexcln) (9)

Agrave curren = arg maxAgrave 2 Q

PQ( Agrave j frac12 curren w1iexcln) (10)

frac12 curren = f frac12 j ( frac12 Agrave ) = arg maxfrac12 2 S Agrave 2 Q

PS( frac12 j w1iexcln)PQ( Agrave j frac12 w1iexcln) g (11)

531 Experimental Design For this experiment we implement the scoping compo-nent as a single-layer feed-forward network because the single-layer perceptron clas-sier had the best prediction rate among the three classiers tested in Section 4 Thesoftmax activation function we use for the output nodes of the classier guaranteesthat the activations of all of the output nodes sum to one and can be interpreted asclass probabilities The syntactic component of course is determined by the treebankPCFG grammar described above

Given these two models which respectively dene PQ( Agrave j frac12 w1iexcln) and PS( frac12 j w1iexcln)from equation (11) it remains only to specify how to search the space of pairings( frac12 Agrave ) in performing this optimization to nd Agrave curren Unfortunately it is not feasible toexamine all values frac12 2 S since our PCFG will generally admit a huge number of

93

Higgins and Sadock Modeling Scope Preferences

Table 12Performance of models on the unlabeled scope prediction task summed over all 10 test runs

Condition Correct Incorrect Percentage correct

Parallel modelFirst has wide scope 168 113 167281 = 594Second has wide scope 26 38 2664 = 406No scope interaction 467 78 467545 = 857Total 661 229 661890 = 743

Serial modelFirst has wide scope 163 118 163281 = 580Second has wide scope 27 37 2764 = 422No scope interaction 461 84 461545 = 846Total 651 239 651890 = 731

trees for a sentence (especially given a mean sentence length of over 20 words inthe WSJ corpus)12 Our solution to this search problem is to make the simplifying as-sumption that the syntactic tree that is used in the optimal set of structures ( frac12 curren Agrave curren )will always be among the top few trees frac12 for which PS( frac12 j w1iexcln) is the greatestThat is although we suppose that quantier scope information is relevant to pars-ing we do not suppose that it is so strong a determinant as to completely over-ride syntactic factors In practice this means that our parser will return the top 10parses for each sentence along with the probabilities assigned to them and theseare the only parses that are considered in looking for the optimal set of linguisticstructures

We again used 10-fold cross-validation in evaluating the competing models di-viding the scope-tagged corpus into 10 test sections of 89 sentences each and weused the same version of the treebank grammar for our PCFG The rst model re-trieved the top 10 syntactic parses ( frac12 0 frac12 9) for each sentence and computed theprobability P( frac12 Agrave ) for each frac12 2 frac12 0 frac12 9 Agrave 2 0 1 2 choosing that scopal represen-tation Agrave that was found in the maximum-probability pairing We call this the par-allel model because the properties of each probabilistic model may inuence theoptimal structure chosen by the other The second model retrieved only the Viterbiparse frac12 0 from the PCFG and chose the scopal representation Agrave for which the pair-ing ( frac12 0 Agrave ) took on the highest probability We call this the serial model because itrepresents syntactic phrase structure as independent of other components of gram-mar (in this case quantier scope) though other components are dependentupon it

532 Results There was an appreciable difference in performance between these twomodels on the quantier scope test sets As shown in Table 12 the parallel modelnarrowly outperformed the serial model by 12 A 10-fold paired t-test on the testsections of the scope-tagged corpus shows that the parallel model is signicantly better(p lt 05)

12 Since we are allowing Acirc to range only over the three scope readings (0 1 2) however it is possible toenumerate all values of Acirc to be paired with a given syntactic tree iquest

94

Computational Linguistics Volume 29 Number 1

This result suggests that in determining the syntactic structure of a sentence wemust take aspects of structure into account that are not purely syntactic (such as quanti-er scope) Searching both dimensions of the hypothesis space for our dual-componentmodel allowed the composite model to handle the interdependencies between differ-ent aspects of grammatical structure whereas xing a phrase structure tree purelyon the basis of syntactic considerations led to suboptimal performance in using thatstructure as a basis for determining quantier scope

6 Conclusion

In this article we have taken a statistical corpus-based approach to the modelingof quantier scope preferences a subject that has previously been addressed onlywith systems of ad hoc rules derived from linguistsrsquo intuitive judgments Our modeltakes its theoretical inspiration from Kuno Takami and Wu (1999) who suggest anldquoexpert systemrdquo approach to scope preferences and follows many other projects in themachine learning of natural language that combine information from multiple sourcesin solving linguistic problems

0ur results are generally supportive of the design that Kuno Takami and Wu pro-pose for the quantier scope component of grammar and some of the features inducedby our models nd clear parallels in the factors that Kuno Takami and Wu claim tobe relevant to scoping In addition our nal experiment in which we combine ourquantier scope module with a PCFG model of syntactic phrase structure providesevidence of a grammatical architecture in which different aspects of structure mutu-ally constrain one another This result casts doubt on approaches in which syntacticprocessing is completed prior to the determination of other grammatical properties ofa sentence such as quantier scope relations

Appendix Selected Codes Used to Annotate Syntactic Categories in the Penn Tree-bank from Marcus et al (1993) and Bies et al (1995)

Part-of-speech tags

Tag Meaning Tag Meaning

CC Conjunction RB AdverbCD Cardinal number RBR Comparative adverbDT Determiner RBS Superlative adverbIN Preposition TO ldquotordquoJJ Adjective UH InterjectionJJR Comparative adjective VB Verb in base formJJS Superlative adjective VBD Past-tense verbNN Singular or mass noun VBG Gerundive verbNNS Plural noun VBN Past participial verbNNP Singular proper noun VBP Non-3sg present-NNPS Plural proper noun tense verbPDT Predeterminer VBZ 3sg present-tense

verbPRP Personal pronoun WP WH pronounPRP$ Possessive pronoun WP$ Possessive WH pronoun

95

Higgins and Sadock Modeling Scope Preferences

Phrasal categories

Code Meaning Code Meaning

ADJP Adjective phrase SBAR Clause introduced byADVP Adverb phrase a subordinatingINTJ Interjection conjunctionNP Noun phrase SBARQ Clause introduced byPP Prepositional phrase a WH phraseQP Quantier phrase (ie SINV Inverted declarative

measure=amount sentencephrase) SQ Inverted yes=no

S Declarative clause question following theWH phrase in SBARQ

VP Verb phrase

AcknowledgmentsThe authors are grateful for an AcademicTechnology Innovation Grant from theUniversity of Chicago which helped tomake this work possible and to JohnGoldsmith Terry Regier Anne Pycha andBob Moore whose advice and collaborationhave considerably aided the researchreported in this article Any remainingerrors are of course our own

ReferencesAoun Joseph and Yen-hui Audrey Li 1993

The Syntax of Scope MIT Press CambridgeAoun Joseph and Yen-hui Audrey Li 2000

Scope structure and expert systems Areply to Kuno et al Language76(1)133ndash155

Belz Anja 2001 Optimisation ofcorpus-derived probabilistic grammars InProceedings of Corpus Linguistics 2001pages 46ndash57

Berger Adam L Stephen A Della Pietraand Vincent J Della Pietra 1996 Amaximum entropy approach to naturallanguage processing ComputationalLinguistics 22(1)39ndash71

Bies Ann Mark Ferguson Karen Katz andRobert MacIntyre 1995 Bracketingguidelines for Treebank II style Technicalreport Penn Treebank Project Universityof Pennsylvania

Bishop Christopher M 1995 NeuralNetworks for Pattern Recognition OxfordUniversity Press Oxford

Bridle John S 1990 Probabilisticinterpretation of feedforwardclassication network outputs withrelationships to statistical patternrecognition In F Fougelman-Soulie and

J Herault editors NeurocomputingmdashAlgorithms Architectures and ApplicationsSpringer-Verlag Berlin pages 227ndash236

Brill Eric 1995 Transformation-basederror-driven learning and naturallanguage processing A case study inpart-of-speech tagging ComputationalLinguistics 21(4)543ndash565

Carden Guy 1976 English QuantiersLogical Structure and Linguistic VariationAcademic Press New York

Carletta Jean 1996 Assessing agreement onclassication tasks The kappa statisticComputational Linguistics 22(2)249ndash254

Charniak Eugene 1993 Statistical languagelearning MIT Press Cambridge

Charniak Eugene 1996 Tree-bankgrammars In AAAI IAAI vol 2pages 1031ndash1036

Cohen Jacob 1960 A coefcient ofagreement for nominal scales Educationaland Psychological Measurement 2037ndash46

Cooper Robin 1983 Quantication andSyntactic Theory Reidel Dordrecht

Fleiss Joseph L 1981 Statistical Methods forRates and Proportions John Wiley amp SonsNew York

Hobbs Jerry R and Stuart M Shieber 1987An algorithm for generating quantierscopings Computational Linguistics1347ndash63

Hornstein Norbert 1995 Logical Form FromGB to Minimalism Blackwell Oxford andCambridge

Jelinek Frederick 1998 Statistical Methods forSpeech Recognition MIT Press Cambridge

Jurafsky Daniel and James H Martin 2000Speech and Language Processing PrenticeHall Upper Saddle River New Jersey

Krippendorff Klaus 1980 Content AnalysisAn Introduction to Its Methodology Sage

96

Computational Linguistics Volume 29 Number 1

Publications Beverly Hills CaliforniaKrotov Alexander Mark Hepple Robert J

Gaizauskas and Yorick Wilks 1998Compacting the Penn treebank grammarIn COLING-ACL pages 699ndash703

Kuno Susumu Ken-Ichi Takami and YuruWu 1999 Quantier scope in EnglishChinese and Japanese Language75(1)63ndash111

Kuno Susumu Ken-Ichi Takami and YuruWu 2001 Response to Aoun and LiLanguage 77(1)134ndash143

Manning Christopher D and HinrichSchutze 1999 Foundations of StatisticalNatural Language Processing MIT PressCambridge

Marcus Mitchell P Beatrice Santorini andMary Ann Marcinkiewicz 1993 Buildinga large annotated corpus of English ThePenn Treebank Computational Linguistics19(2)313ndash330

Martin Paul Douglas Appelt andFernando Pereira 1986 Transportabilityand generality in a natural-languageinterface system In B J Grosz K SparckJones and B L Webber editors NaturalLanguage Processing Kaufmann Los AltosCalifornia pages 585ndash593

May Robert 1985 Logical Form Its Structureand Derivation MIT Press Cambridge

McCawley James D 1998 The SyntacticPhenomena of English University ofChicago Press Chicago second edition

Mitchell Tom M 1996 Machine LearningMcGraw Hill New York

Montague Richard 1973 The propertreatment of quantication in ordinaryEnglish In J Hintikka et al editorsApproaches to Natural Language ReidelDordrecht pages 221ndash242

Moran Douglas B 1988 Quantier scopingin the SRI core language engine InProceedings of the 26th Annual Meeting of theAssociation for Computational Linguistics(ACLrsquo88) pages 33ndash40

Moran Douglas B and Fernando C NPereira 1992 Quantier scoping InHiyan Alshawi editor The Core LanguageEngine MIT Press Cambridgepages 149ndash172

Nerbonne John 1993 A feature-basedsyntaxsemantics interface InA Manaster-Ramer and W Zadrozsnyeditors Annals of Mathematics and ArticialIntelligence (Special Issue on Mathematics ofLanguage) 8(1ndash2)107ndash132 Also publishedas DFKI Research Report RR-92-42

Park Jong C 1995 Quantier scope and

constituency In Proceedings of the 33rdAnnual Meeting of the Association forComputational Linguistics (ACLrsquo95)pages 205ndash212

Pedersen Ted 2000 A simple approach tobuilding ensembles of naotilde ve Bayesianclassiers for word sense disambiguationIn Proceedings of the First Meeting of theNorth American Chapter of the Association forComputational Linguistics (NAACL 2000)pages 63ndash69

Pereira Fernando 1990 Categorialsemantics and scoping ComputationalLinguistics 16(1)1ndash10

Pollard Carl 1989 The syntax-semanticsinterface in a unication-based phrasestructure grammar In S BusemannC Hauenschild and C Umbach editorsViews of the Syntax Semantics InterfaceKIT-FAST Report 74 Technical Universityof Berlin pages 167ndash185

Pollard Carl and Eun Jung Yoo 1998 Aunied theory of scope for quantiersand WH-phrases Journal of Linguistics34(2)415ndash446

Ratnaparkhi Adwait 1997 A simpleintroduction to maximum entropy modelsfor natural language processing TechnicalReport 97-08 Institute for Research inCognitive Science University ofPennsylvania

Santorini Beatrice 1990 Part-of-speechtagging guidelines for the Penn Treebankproject Technical Report MS-CIS-90-47Department of Computer and InformationScience University of Pennsylvania

Soon Wee Meng Hwee Tou Ng and DanielChung Yong Lim 2001 A machinelearning approach to coreferenceresolution of noun phrases ComputationalLinguistics 27(4)521ndash544

van Halteren Hans Jakub Zavrel andWalter Daelemans 2001 Improvingaccuracy in word class tagging throughthe combination of machine learningsystems Computational Linguistics27(2)199ndash229

VanLehn Kurt A 1978 Determining thescope of English quantiers TechnicalReport AITR-483 Massachusetts Instituteof Technology Articial IntelligenceLaboratory Cambridge

Woods William A 1986 Semantics andquantication in natural languagequestion answering In B J GroszK Sparck Jones and B L Webber editorsNatural Language Processing KaufmannLos Altos California pages 205ndash248

Page 16: A Machine Learning Approach to Modeling Scope Preferences · 2003-11-13 · ®c2003 Association for Computational Linguistics A Machine Learning Approach to Modeling Scope Preferences

88

Computational Linguistics Volume 29 Number 1

Table 9Summary of classier results

Training data Validation data Test data

Baseline mdash mdash 612Na otilde ve Bayes 767 mdash 726

Maximum entropy 783 755 735Single-layerperceptron 847 768 770

on the data on which both coders agreed and tested on the remaining sentences Thisclassier agreed with the reference coding (the coding of the rst coder) 514 of thetime and with the additional independent coder 358 of the time The rst coder con-structed the annotation guidelines for this project and may have been more successfulin applying them consistently Alternatively it is possible that different individuals usedifferent strategies in determining scope preferences and the strategy of the secondcoder may simply have been less similar than the strategy of the rst coder to that ofthe single-layer network

These three classiers directly implement a sort of weighted voting the methodof aggregating evidence proposed by Kuno Takami and Wu (although the classiersrsquoimplementation is slightly more sophisticated than the unweighted voting that is ac-tually used in Kuno Takami and Wursquos paper) Of course since we do not use exactlythe set of features suggested by Kuno Takami and Wu our model should not beseen as a straightforward implementation of the theory outlined in their 1999 paperNevertheless the results in Table 9 suggest that Kuno Takami and Wursquos suggesteddesign can be used with some success in modeling scope preferences Moreover theproject undertaken here provides an answer to some of the objections that Aoun andLi (2000) raise to Kuno Takami and Wu Aoun and Li claim that Kuno Takami andWursquos choice of experts is seemingly arbitrary and that it is unclear how the votingweights of each expert are to be set but the machine learning approach we employin this article is capable of addressing both of these potential problems Supervisedtraining of our classiers is a straightforward approach to setting the weights andalso constitutes our approach to selecting features (or ldquoexpertsrdquo in Kuno Takami andWursquos terminology) In the training process any feature that is irrelevant to scopingpreferences should receive weights that make its effect negligible

5 Syntax and Scope

In this section we show how the classier models of quantier scope determinationintroduced in Section 4 may be integrated with a PCFG model of syntax We com-pare two different ways in which the two components may be combined which mayloosely be termed serial and parallel and argue for the latter on the basis of empiricalresults

51 Modular DesignOur use of a phrase structure syntactic component and a quantier scope componentto dene a combined language model is simplied by the fact that our classiers areprobabilistic and dene a conditional probability distribution over quantier scopingsThe probability distributions that our classiers dene for quantier scope structuresare conditional on syntactic phrase structure because they are computed on the basis

89

Higgins and Sadock Modeling Scope Preferences

of syntactically provided features such as the number of nodes of a certain type thatintervene between two quantiers in a phrase structure tree

Thus the combined language model that we dene in this article assigns probabil-ities according to the pairs of structures that may be assigned to a sentence by the Q-structure and phrase structure syntax modules The probability of a word string w1iexclnis therefore dened as in equation (4) where Q ranges over all possible Q-structuresin the set Q and S ranges over all possible syntactic structures in the set S

P(w1iexcln) =X

S2 SQ 2 QP(S Q j w1iexcln) (4)

=X

S2 SQ 2 QP(S j w1iexcln)P(Q j S w1iexcln) (5)

Equation (5) shows how we can use the denition of conditional probability tobreak our calculation of the language model probability into two parts The rst ofthese parts P(S j w1iexcln) which we may abbreviate as simply P(S) is the probabilityof a particular syntactic tree structurersquos being assigned to a particular word string Wemodel this probability using a probabilistic phrase structure grammar (cf Charniak[1993 1996]) The second distribution on the right side of equation (5) is the conditionalprobability of a particular quantier scope structurersquos being assigned to a particularword string given the syntactic structure of that string This probability is written asP(Q j S w1iexcln) or simply P(Q j S) and represents the quantity we estimated abovein constructing classiers to predict the scopal representation of a sentence based onaspects of its syntactic structure

Thus, given a PCFG model of syntactic structure and a probabilistically defined classifier of the sort introduced in Section 4, it is simple to determine the probability of any pairing of two particular structures from each domain for a given sentence: we simply multiply the values of $P(S)$ and $P(Q \mid S)$ to obtain the joint probability $P(Q, S)$. In the current section we examine two different models of combination for these components, one in which scope determination is applied to the optimal syntactic structure (the Viterbi parse) and one in which optimization is performed in the space of both modules to find the optimal pairing of syntactic and quantifier scope structures.
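
For concreteness, the combination step can be sketched in a few lines of code. The fragment below is only illustrative (the candidate parses, the scope classifier, and their probabilities are placeholders, not the models trained in this article), but it shows how a joint score $P(S) \cdot P(Q \mid S)$ would be computed for every candidate pairing:

# Illustrative sketch: score (syntactic tree, scope reading) pairings by
# multiplying P(S | words) from a parser with P(Q | S, words) from a scope
# classifier.  Both inputs below stand in for the real trained models.
SCOPE_READINGS = (0, 1, 2)   # the three Q-structure class labels used in the article

def joint_scores(candidate_parses, scope_classifier, words):
    """Return {(parse_id, reading): P(S | w) * P(Q | S, w)} for all pairings.

    candidate_parses: list of (parse_id, parse_probability) pairs
    scope_classifier: callable mapping (parse_id, words) to a dict
                      {reading: conditional probability}
    """
    scores = {}
    for parse_id, p_syntax in candidate_parses:
        p_scope = scope_classifier(parse_id, words)      # P(Q | S, w)
        for reading in SCOPE_READINGS:
            scores[(parse_id, reading)] = p_syntax * p_scope[reading]
    return scores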

5.2 The Syntactic Module
Before turning to the application of our multimodular approach to the problem of scope determination in Section 5.3, we present here a short overview of the phrase structure syntactic component used in these projects. As noted above, we model syntax as a probabilistic phrase structure grammar (PCFG), and in particular we use a treebank grammar (Charniak 1996) trained on the Penn Treebank.

A PCFG defines the probability of a string of words as the sum of the probabilities of all admissible phrase structure parses (trees) for that string. The probability of a given tree is the product of the probability of all of the rule instances used in the construction of that tree, where rules take the form $N \to \phi$, with $N$ a nonterminal symbol and $\phi$ a finite sequence of one or more terminals or nonterminals.

To take an example, Figure 5 illustrates a phrase structure tree for the sentence Susan might not believe you, which is admissible according to the grammar in Table 10. (All of the minimal subtrees in Figure 5 are instances of one of our rules.)

Figure 5
A simple phrase structure tree.

Table 10
A simple probabilistic phrase structure grammar.

Rule                Probability

S   → NP VP         .7
S   → VP            .2
S   → V NP VP       .1

VP  → V VP          .3
VP  → ADV VP        .1
VP  → V             .1
VP  → V NP          .3
VP  → V NP NP       .2

NP  → Susan         .3
NP  → you           .4
NP  → Yves           .3

V   → might         .2
V   → believe       .3
V   → show          .3
V   → stay          .2

ADV → not           .5
ADV → always        .5

The probability of this tree, which we can indicate as $\tau$, can be calculated as in equation (6):

P(\tau) = \prod_{\xi \in R(\tau)} P(\xi)    (6)

        = P(S \to NP\ VP) \times P(VP \to V\ VP) \times P(VP \to ADV\ VP)
          \times P(VP \to V\ NP) \times P(NP \to Susan) \times P(V \to might)
          \times P(ADV \to not) \times P(V \to believe) \times P(NP \to you)    (7)

        = .7 \times .3 \times .1 \times .3 \times .3 \times .2 \times .5 \times .3 \times .4 = 2.268 \times 10^{-5}    (8)
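
As a sanity check on this arithmetic, the product in equations (6)-(8) is easy to reproduce programmatically. The following sketch hard-codes the toy grammar of Table 10 and the rule instances of the tree in Figure 5; it is not part of the system described here, just an illustration of how a PCFG tree probability is computed:

# Toy PCFG from Table 10 (only the rules needed for the tree in Figure 5).
RULE_PROBS = {
    ("S", ("NP", "VP")): 0.7,
    ("VP", ("V", "VP")): 0.3,
    ("VP", ("ADV", "VP")): 0.1,
    ("VP", ("V", "NP")): 0.3,
    ("NP", ("Susan",)): 0.3,
    ("V", ("might",)): 0.2,
    ("ADV", ("not",)): 0.5,
    ("V", ("believe",)): 0.3,
    ("NP", ("you",)): 0.4,
}

# Rule instances used in the tree of Figure 5, listed top-down.
TREE_RULES = [
    ("S", ("NP", "VP")), ("NP", ("Susan",)), ("VP", ("V", "VP")),
    ("V", ("might",)), ("VP", ("ADV", "VP")), ("ADV", ("not",)),
    ("VP", ("V", "NP")), ("V", ("believe",)), ("NP", ("you",)),
]

def tree_probability(rules, rule_probs):
    """Multiply the probabilities of all rule instances in a tree."""
    p = 1.0
    for rule in rules:
        p *= rule_probs[rule]
    return p

print(tree_probability(TREE_RULES, RULE_PROBS))  # 2.268e-05, up to float rounding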

The actual grammar rules and associated probabilities that we use in defining our syntactic module are derived from the WSJ corpus of the Penn Treebank by maximum-likelihood estimation. That is, for each rule $N \to \phi$ used in the treebank, we add the rule to the grammar and set its probability to

    \frac{C(N \to \phi)}{\sum_{\psi} C(N \to \psi)},

where $C(\cdot)$ denotes the "count" of a rule (i.e., the number of times it is used in the corpus). A grammar composed in this manner is referred to as a treebank grammar, because its rules are directly derived from those in a treebank corpus.

We used sections 00-20 of the WSJ corpus of the Penn Treebank for collecting the rules and associated probabilities of our PCFG, which is implemented as a bottom-up chart parser. Before constructing the grammar, the treebank was preprocessed using known procedures (cf. Krotov et al. [1998]; Belz [2001]) to facilitate the construction of a rule list. Functional and anaphoric annotations (basically anything following a "-" in a node label; cf. Santorini [1990]; Bies et al. [1995]) were removed from nonterminal labels. Nodes that dominate only "empty categories," such as traces, were removed. In addition, unary-branching constructions were removed by replacing the mother category in such a structure with the daughter node. (For example, given an instance of the rule $X \to Y\ Z$, if the daughter category $Y$ were expanded by the unary rule $Y \to W$, our algorithm would induce the single rule $X \to W\ Z$.) Finally, we discarded all rules that had more than 10 symbols on the right-hand side (an arbitrary limit of our parser implementation). This resulted in a total of 18,820 rules, of which 11,156 were discarded as hapax legomena, leaving 7,664 rules in our treebank grammar. Table 11 shows some of the rules in our grammar with the highest and lowest corpus counts.
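
To make the rule-extraction recipe concrete, here is a small sketch of the kind of maximum-likelihood estimation described above. It assumes trees are given as nested (label, children) pairs with terminal words as plain strings, and it takes the tree preprocessing (stripping function tags, removing empty categories, collapsing unary branches) as already done; the helper names and the ordering of the filtering steps are our own choices, not a specification of the original implementation:

from collections import Counter, defaultdict

def rule_instances(tree):
    """Yield (parent, (child labels...)) for every internal node of a tree.

    A tree is a (label, children) pair; terminals are plain strings.
    """
    label, children = tree
    yield (label, tuple(c[0] if isinstance(c, tuple) else c for c in children))
    for child in children:
        if isinstance(child, tuple):
            yield from rule_instances(child)

def treebank_grammar(trees, max_rhs=10, min_count=2):
    """Relative-frequency PCFG, dropping over-long and hapax rules."""
    counts = Counter(r for t in trees for r in rule_instances(t)
                     if len(r[1]) <= max_rhs)
    counts = Counter({r: c for r, c in counts.items() if c >= min_count})
    lhs_totals = defaultdict(int)
    for (lhs, _), c in counts.items():
        lhs_totals[lhs] += c
    # P(N -> phi) = C(N -> phi) / sum over psi of C(N -> psi)
    return {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}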

5.3 Unlabeled Scope Determination
In this section we describe an experiment designed to assess the performance of parallel and serial approaches to combining grammatical modules, focusing on the task of unlabeled scope determination. This task involves predicting the most likely Q-structure representation for a sentence, basically the same task we addressed in Section 4 in comparing the performance levels of each type of classifier. The experiment of this section differs, however, from the task presented in Section 4 in that instead of providing a syntactic tree from the Penn Treebank as input to the classifier, we provide the model only with a string of words (a sentence). Our dual-component model will search for the optimal syntactic and scopal structures for the sentence (the pairing $(\tau^*, \chi^*)$) and will be evaluated based on its success in identifying the correct scope reading $\chi^*$.

Our concern in this section will be to determine whether it is necessary to search the space of possible pairings $(\tau, \chi)$ of syntactic and scopal structures, or whether it is sufficient to use our PCFG first to fix the syntactic tree $\tau$ and then to choose a scope reading to maximize the probability of the pairing.

Table 11
Rules derived from sections 00-20 of the Penn Treebank WSJ corpus. "TOP" is a special "start" symbol that may expand to any of the symbols found at the root of a tree in the corpus.

Rule                            Corpus count

PP → IN NP                      59,053
TOP → S                         34,614
NP → DT NN                      28,074
NP → NP PP                      25,192
S → NP VP                       14,032
S → NP VP                       12,901
VP → TO VP                      11,598

S → CC PP NNP NNP VP            2
NP → DT `` NN NN NN ''          2
NP → NP PP PP PP PP PP          2
INTJ → UH UH                    2
NP → DT `` NN NNS               2
SBARQ → `` WP VP ''             2
S → PP NP VP ''                 2

That is, are syntax and quantifier scope mutually dependent components of grammar, or can scope relations be "read off of" syntax? The serial model suggests that the optimal syntactic structure $\tau^*$ should be chosen on the basis of the syntactic module only, as in equation (9), and the optimal quantifier scope structure $\chi^*$ then chosen on the basis of $\tau^*$, as in equation (10). The parallel model, on the other hand, suggests that the most likely pairing of structures must be chosen in the joint probability space of both components, as in equation (11).

\tau^* = \arg\max_{\tau \in \mathcal{S}} P_S(\tau \mid w_{1-n})    (9)

\chi^* = \arg\max_{\chi \in \mathcal{Q}} P_Q(\chi \mid \tau^*, w_{1-n})    (10)

\tau^* = \{\tau \mid (\tau, \chi) = \arg\max_{\tau \in \mathcal{S},\, \chi \in \mathcal{Q}} P_S(\tau \mid w_{1-n}) \, P_Q(\chi \mid \tau, w_{1-n})\}    (11)

5.3.1 Experimental Design. For this experiment, we implement the scoping component as a single-layer feed-forward network, because the single-layer perceptron classifier had the best prediction rate among the three classifiers tested in Section 4. The softmax activation function we use for the output nodes of the classifier guarantees that the activations of all of the output nodes sum to one and can be interpreted as class probabilities. The syntactic component, of course, is determined by the treebank PCFG grammar described above.
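
The softmax step itself is standard, but for concreteness, here is a minimal sketch of how a single-layer network turns feature values into a distribution over the three scope classes. The feature vector, weights, and biases are placeholders; only the shape of the computation is meant to reflect the classifier described here:

import math

def softmax(activations):
    """Exponentiate and renormalize so the outputs sum to one."""
    m = max(activations)                      # subtract max for numerical stability
    exps = [math.exp(a - m) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

def scope_distribution(features, weights, biases):
    """P(class | features) for the three scope readings (0, 1, 2).

    weights: one row of per-feature weights per class; biases: one per class.
    """
    activations = [sum(w * f for w, f in zip(row, features)) + b
                   for row, b in zip(weights, biases)]
    return softmax(activations)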

Given these two models, which respectively define $P_Q(\chi \mid \tau, w_{1-n})$ and $P_S(\tau \mid w_{1-n})$ from equation (11), it remains only to specify how to search the space of pairings $(\tau, \chi)$ in performing this optimization to find $\chi^*$.

Table 12
Performance of models on the unlabeled scope prediction task, summed over all 10 test runs.

Condition                 Correct   Incorrect   Percentage correct

Parallel model
First has wide scope        168        113       167/281 = 59.4
Second has wide scope        26         38        26/64  = 40.6
No scope interaction        467         78       467/545 = 85.7
Total                       661        229       661/890 = 74.3

Serial model
First has wide scope        163        118       163/281 = 58.0
Second has wide scope        27         37        27/64  = 42.2
No scope interaction        461         84       461/545 = 84.6
Total                       651        239       651/890 = 73.1

Unfortunately, it is not feasible to examine all values $\tau \in \mathcal{S}$, since our PCFG will generally admit a huge number of trees for a sentence (especially given a mean sentence length of over 20 words in the WSJ corpus).12 Our solution to this search problem is to make the simplifying assumption that the syntactic tree that is used in the optimal set of structures $(\tau^*, \chi^*)$ will always be among the top few trees $\tau$ for which $P_S(\tau \mid w_{1-n})$ is the greatest. That is, although we suppose that quantifier scope information is relevant to parsing, we do not suppose that it is so strong a determinant as to completely override syntactic factors. In practice, this means that our parser will return the top 10 parses for each sentence, along with the probabilities assigned to them, and these are the only parses that are considered in looking for the optimal set of linguistic structures.

We again used 10-fold cross-validation in evaluating the competing models, dividing the scope-tagged corpus into 10 test sections of 89 sentences each, and we used the same version of the treebank grammar for our PCFG. The first model retrieved the top 10 syntactic parses $(\tau_0, \ldots, \tau_9)$ for each sentence and computed the probability $P(\tau, \chi)$ for each $\tau \in \{\tau_0, \ldots, \tau_9\}$, $\chi \in \{0, 1, 2\}$, choosing that scopal representation $\chi$ that was found in the maximum-probability pairing. We call this the parallel model, because the properties of each probabilistic model may influence the optimal structure chosen by the other. The second model retrieved only the Viterbi parse $\tau_0$ from the PCFG and chose the scopal representation $\chi$ for which the pairing $(\tau_0, \chi)$ took on the highest probability. We call this the serial model, because it represents syntactic phrase structure as independent of other components of grammar (in this case, quantifier scope), though other components are dependent upon it.
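
The difference between the two decision rules is small enough to state in a few lines of code. The sketch below is a schematic rendering of equations (9)-(11) under the top-10 restriction; it assumes a parser returning the n-best (tree, probability) pairs in descending order of probability and a scope classifier returning $P_Q(\chi \mid \tau, w)$, both of which stand in for the components described above:

SCOPE_READINGS = (0, 1, 2)   # the three Q-structure class labels

def parallel_choice(nbest_parses, scope_classifier, words):
    """Pick the scope reading from the best joint (tree, reading) pairing (eq. 11)."""
    best = None
    for tree, p_syntax in nbest_parses:                  # top-10 trees
        p_scope = scope_classifier(tree, words)          # P_Q(chi | tau, w)
        for chi in SCOPE_READINGS:
            joint = p_syntax * p_scope[chi]
            if best is None or joint > best[0]:
                best = (joint, chi)
    return best[1]

def serial_choice(nbest_parses, scope_classifier, words):
    """Fix the Viterbi parse first (eq. 9), then pick the reading (eq. 10)."""
    viterbi_tree, _ = nbest_parses[0]
    p_scope = scope_classifier(viterbi_tree, words)
    return max(SCOPE_READINGS, key=lambda chi: p_scope[chi])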

5.3.2 Results. There was an appreciable difference in performance between these two models on the quantifier scope test sets. As shown in Table 12, the parallel model narrowly outperformed the serial model, by 1.2%. A 10-fold paired t-test on the test sections of the scope-tagged corpus shows that the parallel model is significantly better (p < .05).
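
The significance test referred to here is the usual paired comparison over matched folds. As a rough sketch (taking per-fold accuracies as the paired observations, which is our assumption about the exact quantity compared, and using SciPy's paired t-test), it could be run as follows:

from scipy import stats

def paired_fold_test(parallel_acc, serial_acc):
    """Two-sided paired t-test over matched cross-validation folds.

    parallel_acc, serial_acc: per-fold accuracies for the same 10 folds, in the same order.
    """
    t_statistic, p_value = stats.ttest_rel(parallel_acc, serial_acc)
    return t_statistic, p_value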

12 Since we are allowing $\chi$ to range only over the three scope readings (0, 1, 2), however, it is possible to enumerate all values of $\chi$ to be paired with a given syntactic tree $\tau$.

This result suggests that in determining the syntactic structure of a sentence, we must take aspects of structure into account that are not purely syntactic (such as quantifier scope). Searching both dimensions of the hypothesis space for our dual-component model allowed the composite model to handle the interdependencies between different aspects of grammatical structure, whereas fixing a phrase structure tree purely on the basis of syntactic considerations led to suboptimal performance in using that structure as a basis for determining quantifier scope.

6. Conclusion

In this article we have taken a statistical, corpus-based approach to the modeling of quantifier scope preferences, a subject that has previously been addressed only with systems of ad hoc rules derived from linguists' intuitive judgments. Our model takes its theoretical inspiration from Kuno, Takami, and Wu (1999), who suggest an "expert system" approach to scope preferences, and follows many other projects in the machine learning of natural language that combine information from multiple sources in solving linguistic problems.

Our results are generally supportive of the design that Kuno, Takami, and Wu propose for the quantifier scope component of grammar, and some of the features induced by our models find clear parallels in the factors that Kuno, Takami, and Wu claim to be relevant to scoping. In addition, our final experiment, in which we combine our quantifier scope module with a PCFG model of syntactic phrase structure, provides evidence of a grammatical architecture in which different aspects of structure mutually constrain one another. This result casts doubt on approaches in which syntactic processing is completed prior to the determination of other grammatical properties of a sentence, such as quantifier scope relations.

Appendix. Selected Codes Used to Annotate Syntactic Categories in the Penn Treebank, from Marcus et al. (1993) and Bies et al. (1995)

Part-of-speech tags

Tag    Meaning                    Tag    Meaning

CC     Conjunction                RB     Adverb
CD     Cardinal number            RBR    Comparative adverb
DT     Determiner                 RBS    Superlative adverb
IN     Preposition                TO     "to"
JJ     Adjective                  UH     Interjection
JJR    Comparative adjective      VB     Verb in base form
JJS    Superlative adjective      VBD    Past-tense verb
NN     Singular or mass noun      VBG    Gerundive verb
NNS    Plural noun                VBN    Past participial verb
NNP    Singular proper noun       VBP    Non-3sg present-tense verb
NNPS   Plural proper noun         VBZ    3sg present-tense verb
PDT    Predeterminer              WP     WH pronoun
PRP    Personal pronoun           WP$    Possessive WH pronoun
PRP$   Possessive pronoun

Phrasal categories

Code    Meaning

ADJP    Adjective phrase
ADVP    Adverb phrase
INTJ    Interjection
NP      Noun phrase
PP      Prepositional phrase
QP      Quantifier phrase (i.e., measure/amount phrase)
S       Declarative clause
SBAR    Clause introduced by a subordinating conjunction
SBARQ   Clause introduced by a WH phrase
SINV    Inverted declarative sentence
SQ      Inverted yes/no question following the WH phrase in SBARQ
VP      Verb phrase

Acknowledgments
The authors are grateful for an Academic Technology Innovation Grant from the University of Chicago, which helped to make this work possible, and to John Goldsmith, Terry Regier, Anne Pycha, and Bob Moore, whose advice and collaboration have considerably aided the research reported in this article. Any remaining errors are, of course, our own.

References

Aoun, Joseph and Yen-hui Audrey Li. 1993. The Syntax of Scope. MIT Press, Cambridge.
Aoun, Joseph and Yen-hui Audrey Li. 2000. Scope, structure, and expert systems: A reply to Kuno et al. Language, 76(1):133-155.
Belz, Anja. 2001. Optimisation of corpus-derived probabilistic grammars. In Proceedings of Corpus Linguistics 2001, pages 46-57.
Berger, Adam L., Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71.
Bies, Ann, Mark Ferguson, Karen Katz, and Robert MacIntyre. 1995. Bracketing guidelines for Treebank II style. Technical report, Penn Treebank Project, University of Pennsylvania.
Bishop, Christopher M. 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bridle, John S. 1990. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fougelman-Soulie and J. Herault, editors, Neurocomputing—Algorithms, Architectures and Applications. Springer-Verlag, Berlin, pages 227-236.
Brill, Eric. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543-565.
Carden, Guy. 1976. English Quantifiers: Logical Structure and Linguistic Variation. Academic Press, New York.
Carletta, Jean. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249-254.
Charniak, Eugene. 1993. Statistical Language Learning. MIT Press, Cambridge.
Charniak, Eugene. 1996. Tree-bank grammars. In AAAI/IAAI, vol. 2, pages 1031-1036.
Cohen, Jacob. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37-46.
Cooper, Robin. 1983. Quantification and Syntactic Theory. Reidel, Dordrecht.
Fleiss, Joseph L. 1981. Statistical Methods for Rates and Proportions. John Wiley & Sons, New York.
Hobbs, Jerry R. and Stuart M. Shieber. 1987. An algorithm for generating quantifier scopings. Computational Linguistics, 13:47-63.
Hornstein, Norbert. 1995. Logical Form: From GB to Minimalism. Blackwell, Oxford and Cambridge.
Jelinek, Frederick. 1998. Statistical Methods for Speech Recognition. MIT Press, Cambridge.
Jurafsky, Daniel and James H. Martin. 2000. Speech and Language Processing. Prentice Hall, Upper Saddle River, New Jersey.
Krippendorff, Klaus. 1980. Content Analysis: An Introduction to Its Methodology. Sage Publications, Beverly Hills, California.
Krotov, Alexander, Mark Hepple, Robert J. Gaizauskas, and Yorick Wilks. 1998. Compacting the Penn treebank grammar. In COLING-ACL, pages 699-703.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 1999. Quantifier scope in English, Chinese, and Japanese. Language, 75(1):63-111.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 2001. Response to Aoun and Li. Language, 77(1):134-143.
Manning, Christopher D. and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge.
Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330.
Martin, Paul, Douglas Appelt, and Fernando Pereira. 1986. Transportability and generality in a natural-language interface system. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 585-593.
May, Robert. 1985. Logical Form: Its Structure and Derivation. MIT Press, Cambridge.
McCawley, James D. 1998. The Syntactic Phenomena of English. University of Chicago Press, Chicago, second edition.
Mitchell, Tom M. 1996. Machine Learning. McGraw Hill, New York.
Montague, Richard. 1973. The proper treatment of quantification in ordinary English. In J. Hintikka et al., editors, Approaches to Natural Language. Reidel, Dordrecht, pages 221-242.
Moran, Douglas B. 1988. Quantifier scoping in the SRI core language engine. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics (ACL'88), pages 33-40.
Moran, Douglas B. and Fernando C. N. Pereira. 1992. Quantifier scoping. In Hiyan Alshawi, editor, The Core Language Engine. MIT Press, Cambridge, pages 149-172.
Nerbonne, John. 1993. A feature-based syntax/semantics interface. In A. Manaster-Ramer and W. Zadrozsny, editors, Annals of Mathematics and Artificial Intelligence (Special Issue on Mathematics of Language), 8(1-2):107-132. Also published as DFKI Research Report RR-92-42.
Park, Jong C. 1995. Quantifier scope and constituency. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL'95), pages 205-212.
Pedersen, Ted. 2000. A simple approach to building ensembles of naïve Bayesian classifiers for word sense disambiguation. In Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2000), pages 63-69.
Pereira, Fernando. 1990. Categorial semantics and scoping. Computational Linguistics, 16(1):1-10.
Pollard, Carl. 1989. The syntax-semantics interface in a unification-based phrase structure grammar. In S. Busemann, C. Hauenschild, and C. Umbach, editors, Views of the Syntax/Semantics Interface, KIT-FAST Report 74, Technical University of Berlin, pages 167-185.
Pollard, Carl and Eun Jung Yoo. 1998. A unified theory of scope for quantifiers and WH-phrases. Journal of Linguistics, 34(2):415-446.
Ratnaparkhi, Adwait. 1997. A simple introduction to maximum entropy models for natural language processing. Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania.
Santorini, Beatrice. 1990. Part-of-speech tagging guidelines for the Penn Treebank project. Technical Report MS-CIS-90-47, Department of Computer and Information Science, University of Pennsylvania.
Soon, Wee Meng, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521-544.
van Halteren, Hans, Jakub Zavrel, and Walter Daelemans. 2001. Improving accuracy in word class tagging through the combination of machine learning systems. Computational Linguistics, 27(2):199-229.
VanLehn, Kurt A. 1978. Determining the scope of English quantifiers. Technical Report AITR-483, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Cambridge.
Woods, William A. 1986. Semantics and quantification in natural language question answering. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 205-248.

Page 17: A Machine Learning Approach to Modeling Scope Preferences · 2003-11-13 · ®c2003 Association for Computational Linguistics A Machine Learning Approach to Modeling Scope Preferences

89

Higgins and Sadock Modeling Scope Preferences

of syntactically provided features such as the number of nodes of a certain type thatintervene between two quantiers in a phrase structure tree

Thus the combined language model that we dene in this article assigns probabil-ities according to the pairs of structures that may be assigned to a sentence by the Q-structure and phrase structure syntax modules The probability of a word string w1iexclnis therefore dened as in equation (4) where Q ranges over all possible Q-structuresin the set Q and S ranges over all possible syntactic structures in the set S

P(w1iexcln) =X

S2 SQ 2 QP(S Q j w1iexcln) (4)

=X

S2 SQ 2 QP(S j w1iexcln)P(Q j S w1iexcln) (5)

Equation (5) shows how we can use the denition of conditional probability tobreak our calculation of the language model probability into two parts The rst ofthese parts P(S j w1iexcln) which we may abbreviate as simply P(S) is the probabilityof a particular syntactic tree structurersquos being assigned to a particular word string Wemodel this probability using a probabilistic phrase structure grammar (cf Charniak[1993 1996]) The second distribution on the right side of equation (5) is the conditionalprobability of a particular quantier scope structurersquos being assigned to a particularword string given the syntactic structure of that string This probability is written asP(Q j S w1iexcln) or simply P(Q j S) and represents the quantity we estimated abovein constructing classiers to predict the scopal representation of a sentence based onaspects of its syntactic structure

Thus given a PCFG model of syntactic structure and a probabilistically denedclassier of the sort introduced in Section 4 it is simple to determine the probabilityof any pairing of two particular structures from each domain for a given sentenceWe simply multiply the values of P(S) and P(Q j S) to obtain the joint probabilityP(Q S) In the current section we examine two different models of combination forthese components one in which scope determination is applied to the optimal syn-tactic structure (the Viterbi parse) and one in which optimization is performed in thespace of both modules to nd the optimal pairing of syntactic and quantier scopestructures

52 The Syntactic ModuleBefore turning to the application of our multimodular approach to the problem ofscope determination in Section 53 we present here a short overview of the phrasestructure syntactic component used in these projects As noted above we model syn-tax as a probabilistic phrase structure grammar (PCFG) and in particular we use atreebank grammar (Charniak 1996) trained on the Penn Treebank

A PCFG denes the probability of a string of words as the sum of the probabilitiesof all admissible phrase structure parses (trees) for that string The probability of agiven tree is the product of the probability of all of the rule instances used in theconstruction of that tree where rules take the form N iquest with N a nonterminalsymbol and iquest a nite sequence of one or more terminals or nonterminals

To take an example Figure 5 illustrates a phrase structure tree for the sentence Su-san might not believe you which is admissible according to the grammar in Table 10 (Allof the minimal subtrees in Figure 5 are instances of one of our rules) The probability

90

Computational Linguistics Volume 29 Number 1

Figure 5A simple phrase structure tree

Table 10A simple probabilistic phrase structure grammar

Rule Probability

S NP VP 7S VP 2S V NP VP 1

VP V VP 3VP ADV VP 1VP V 1VP V NP 3VP V NP NP 2

NP Susan 3NP you 4NP Yves 3

V might 2V believe 3V show 3V stay 2

ADV not 5ADV always 5

91

Higgins and Sadock Modeling Scope Preferences

of this tree which we can indicate as frac12 can be calculated as in equation (6)

P( frac12 ) =Y

raquo 2 ( frac12 )

P( raquo ) (6)

= P(S NP VP) pound P(VP V VP) pound P(VP ADV VP)

pound P(VP V NP) pound P(NP Susan) pound P(V might)

pound P(ADV not) pound P(V believe) pound P(NP you) (7)

= 7 pound 3 pound 1 pound 3 pound 3 pound 2 pound 5 pound 3 pound 4 = 2268 pound 10iexcl5 (8)

The actual grammar rules and associated probabilities that we use in dening oursyntactic module are derived from the WSJ corpus of the Penn Treebank by maximum-likelihood estimation That is for each rule N iquest used in the treebank we add therule to the grammar and set its probability to C(N iquest )P

AacuteC(N Aacute)

where C( ) denotes the

ldquocountrdquo or a rule (ie the number of times it is used in the corpus) A grammarcomposed in this manner is referred to as a treebank grammar because its rules aredirectly derived from those in a treebank corpus

We used sections 00ndash20 of the WSJ corpus of the Penn Treebank for collecting therules and associated probabilities of our PCFG which is implemented as a bottom-upchart parser Before constructing the grammar the treebank was preprocessed usingknown procedures (cf Krotov et al [1998] Belz [2001]) to facilitate the construction ofa rule list Functional and anaphoric annotations (basically anything following a ldquo-rdquoin a node label cf Santorini [1990] Bies et al [1995]) were removed from nonterminallabels Nodes that dominate only ldquoempty categoriesrdquo such as traces were removedIn addition unary-branching constructions were removed by replacing the mothercategory in such a structure with the daughter node (For example given an instanceof the rule X YZ if the daughter category Y were expanded by the unary ruleY W our algorithm would induce the single rule X WZ) Finally we discardedall rules that had more than 10 symbols on the right-hand side (an arbitrary limit ofour parser implementation) This resulted in a total of 18820 rules of which 11156were discarded as hapax legomena leaving 7664 rules in our treebank grammarTable 11 shows some of the rules in our grammar with the highest and lowest corpuscounts

53 Unlabeled Scope DeterminationIn this section we describe an experiment designed to assess the performance ofparallel and serial approaches to combining grammatical modules focusing on thetask of unlabeled scope determination This task involves predicting the most likelyQ-structure representation for a sentence basically the same task we addressed in Sec-tion 4 in comparing the performance levels of each type of classier The experimentof this section differs however from the task presented in Section 4 in that instead ofproviding a syntactic tree from the Penn Treebank as input to the classier we providethe model only with a string of words (a sentence) Our dual-component model willsearch for the optimal syntactic and scopal structures for the sentence (the pairing( frac12 curren Agrave curren )) and will be evaluated based on its success in identifying the correct scopereading Agrave curren

Our concern in this section will be to determine whether it is necessary to searchthe space of possible pairings ( frac12 Agrave ) of syntactic and scopal structures or whetherit is sufcient to use our PCFG rst to x the syntactic tree frac12 and then to choose

92

Computational Linguistics Volume 29 Number 1

Table 11Rules derived from sections 00ndash20 of the Penn Treebank WSJ corpus ldquoTOPrdquo is a special ldquostartrdquosymbol that may expand to any of the symbols found at the root of a tree in the corpus

Rule Corpus count

PP IN NP 59053TOP S 34614NP DT NN 28074NP NP PP 25192S NP VP 14032S NP VP 12901VP TO VP 11598

S CC PP NNP NNP VP 2NP DT ldquo NN NN NN 0 0 2NP NP PP PP PP PP PP 2INTJ UH UH 2NP DT ldquo NN NNS 2SBARQ ldquo WP VP 0 0 2S PP NP VP 0 0 2

a scope reading to maximize the probability of the pairing That is are syntax andquantier scope mutually dependent components of grammar or can scope relationsbe ldquoread off ofrdquo syntax The serial model suggests that the optimal syntactic structurefrac12 curren should be chosen on the basis of the syntactic module only as in equation (9)and the optimal quantier scope structure Agrave curren then chosen on the basis of frac12 curren as inequation (10) The parallel model on the other hand suggests that the most likelypairing of structures must be chosen in the joint probability space of both componentsas in equation (11)

frac12 curren = arg maxfrac12 2 S

PS( frac12 j w1iexcln) (9)

Agrave curren = arg maxAgrave 2 Q

PQ( Agrave j frac12 curren w1iexcln) (10)

frac12 curren = f frac12 j ( frac12 Agrave ) = arg maxfrac12 2 S Agrave 2 Q

PS( frac12 j w1iexcln)PQ( Agrave j frac12 w1iexcln) g (11)

531 Experimental Design For this experiment we implement the scoping compo-nent as a single-layer feed-forward network because the single-layer perceptron clas-sier had the best prediction rate among the three classiers tested in Section 4 Thesoftmax activation function we use for the output nodes of the classier guaranteesthat the activations of all of the output nodes sum to one and can be interpreted asclass probabilities The syntactic component of course is determined by the treebankPCFG grammar described above

Given these two models which respectively dene PQ( Agrave j frac12 w1iexcln) and PS( frac12 j w1iexcln)from equation (11) it remains only to specify how to search the space of pairings( frac12 Agrave ) in performing this optimization to nd Agrave curren Unfortunately it is not feasible toexamine all values frac12 2 S since our PCFG will generally admit a huge number of

93

Higgins and Sadock Modeling Scope Preferences

Table 12Performance of models on the unlabeled scope prediction task summed over all 10 test runs

Condition Correct Incorrect Percentage correct

Parallel modelFirst has wide scope 168 113 167281 = 594Second has wide scope 26 38 2664 = 406No scope interaction 467 78 467545 = 857Total 661 229 661890 = 743

Serial modelFirst has wide scope 163 118 163281 = 580Second has wide scope 27 37 2764 = 422No scope interaction 461 84 461545 = 846Total 651 239 651890 = 731

trees for a sentence (especially given a mean sentence length of over 20 words inthe WSJ corpus)12 Our solution to this search problem is to make the simplifying as-sumption that the syntactic tree that is used in the optimal set of structures ( frac12 curren Agrave curren )will always be among the top few trees frac12 for which PS( frac12 j w1iexcln) is the greatestThat is although we suppose that quantier scope information is relevant to pars-ing we do not suppose that it is so strong a determinant as to completely over-ride syntactic factors In practice this means that our parser will return the top 10parses for each sentence along with the probabilities assigned to them and theseare the only parses that are considered in looking for the optimal set of linguisticstructures

We again used 10-fold cross-validation in evaluating the competing models di-viding the scope-tagged corpus into 10 test sections of 89 sentences each and weused the same version of the treebank grammar for our PCFG The rst model re-trieved the top 10 syntactic parses ( frac12 0 frac12 9) for each sentence and computed theprobability P( frac12 Agrave ) for each frac12 2 frac12 0 frac12 9 Agrave 2 0 1 2 choosing that scopal represen-tation Agrave that was found in the maximum-probability pairing We call this the par-allel model because the properties of each probabilistic model may inuence theoptimal structure chosen by the other The second model retrieved only the Viterbiparse frac12 0 from the PCFG and chose the scopal representation Agrave for which the pair-ing ( frac12 0 Agrave ) took on the highest probability We call this the serial model because itrepresents syntactic phrase structure as independent of other components of gram-mar (in this case quantier scope) though other components are dependentupon it

532 Results There was an appreciable difference in performance between these twomodels on the quantier scope test sets As shown in Table 12 the parallel modelnarrowly outperformed the serial model by 12 A 10-fold paired t-test on the testsections of the scope-tagged corpus shows that the parallel model is signicantly better(p lt 05)

12 Since we are allowing Acirc to range only over the three scope readings (0 1 2) however it is possible toenumerate all values of Acirc to be paired with a given syntactic tree iquest

94

Computational Linguistics Volume 29 Number 1

This result suggests that in determining the syntactic structure of a sentence wemust take aspects of structure into account that are not purely syntactic (such as quanti-er scope) Searching both dimensions of the hypothesis space for our dual-componentmodel allowed the composite model to handle the interdependencies between differ-ent aspects of grammatical structure whereas xing a phrase structure tree purelyon the basis of syntactic considerations led to suboptimal performance in using thatstructure as a basis for determining quantier scope

6 Conclusion

In this article we have taken a statistical corpus-based approach to the modelingof quantier scope preferences a subject that has previously been addressed onlywith systems of ad hoc rules derived from linguistsrsquo intuitive judgments Our modeltakes its theoretical inspiration from Kuno Takami and Wu (1999) who suggest anldquoexpert systemrdquo approach to scope preferences and follows many other projects in themachine learning of natural language that combine information from multiple sourcesin solving linguistic problems

0ur results are generally supportive of the design that Kuno Takami and Wu pro-pose for the quantier scope component of grammar and some of the features inducedby our models nd clear parallels in the factors that Kuno Takami and Wu claim tobe relevant to scoping In addition our nal experiment in which we combine ourquantier scope module with a PCFG model of syntactic phrase structure providesevidence of a grammatical architecture in which different aspects of structure mutu-ally constrain one another This result casts doubt on approaches in which syntacticprocessing is completed prior to the determination of other grammatical properties ofa sentence such as quantier scope relations

Appendix Selected Codes Used to Annotate Syntactic Categories in the Penn Tree-bank from Marcus et al (1993) and Bies et al (1995)

Part-of-speech tags

Tag Meaning Tag Meaning

CC Conjunction RB AdverbCD Cardinal number RBR Comparative adverbDT Determiner RBS Superlative adverbIN Preposition TO ldquotordquoJJ Adjective UH InterjectionJJR Comparative adjective VB Verb in base formJJS Superlative adjective VBD Past-tense verbNN Singular or mass noun VBG Gerundive verbNNS Plural noun VBN Past participial verbNNP Singular proper noun VBP Non-3sg present-NNPS Plural proper noun tense verbPDT Predeterminer VBZ 3sg present-tense

verbPRP Personal pronoun WP WH pronounPRP$ Possessive pronoun WP$ Possessive WH pronoun

95

Higgins and Sadock Modeling Scope Preferences

Phrasal categories

Code Meaning Code Meaning

ADJP Adjective phrase SBAR Clause introduced byADVP Adverb phrase a subordinatingINTJ Interjection conjunctionNP Noun phrase SBARQ Clause introduced byPP Prepositional phrase a WH phraseQP Quantier phrase (ie SINV Inverted declarative

measure=amount sentencephrase) SQ Inverted yes=no

S Declarative clause question following theWH phrase in SBARQ

VP Verb phrase

AcknowledgmentsThe authors are grateful for an AcademicTechnology Innovation Grant from theUniversity of Chicago which helped tomake this work possible and to JohnGoldsmith Terry Regier Anne Pycha andBob Moore whose advice and collaborationhave considerably aided the researchreported in this article Any remainingerrors are of course our own

ReferencesAoun Joseph and Yen-hui Audrey Li 1993

The Syntax of Scope MIT Press CambridgeAoun Joseph and Yen-hui Audrey Li 2000

Scope structure and expert systems Areply to Kuno et al Language76(1)133ndash155

Belz Anja 2001 Optimisation ofcorpus-derived probabilistic grammars InProceedings of Corpus Linguistics 2001pages 46ndash57

Berger Adam L Stephen A Della Pietraand Vincent J Della Pietra 1996 Amaximum entropy approach to naturallanguage processing ComputationalLinguistics 22(1)39ndash71

Bies Ann Mark Ferguson Karen Katz andRobert MacIntyre 1995 Bracketingguidelines for Treebank II style Technicalreport Penn Treebank Project Universityof Pennsylvania

Bishop Christopher M 1995 NeuralNetworks for Pattern Recognition OxfordUniversity Press Oxford

Bridle John S 1990 Probabilisticinterpretation of feedforwardclassication network outputs withrelationships to statistical patternrecognition In F Fougelman-Soulie and

J Herault editors NeurocomputingmdashAlgorithms Architectures and ApplicationsSpringer-Verlag Berlin pages 227ndash236

Brill Eric 1995 Transformation-basederror-driven learning and naturallanguage processing A case study inpart-of-speech tagging ComputationalLinguistics 21(4)543ndash565

Carden Guy 1976 English QuantiersLogical Structure and Linguistic VariationAcademic Press New York

Carletta Jean 1996 Assessing agreement onclassication tasks The kappa statisticComputational Linguistics 22(2)249ndash254

Charniak Eugene 1993 Statistical languagelearning MIT Press Cambridge

Charniak Eugene 1996 Tree-bankgrammars In AAAI IAAI vol 2pages 1031ndash1036

Cohen Jacob 1960 A coefcient ofagreement for nominal scales Educationaland Psychological Measurement 2037ndash46

Cooper Robin 1983 Quantication andSyntactic Theory Reidel Dordrecht

Fleiss Joseph L 1981 Statistical Methods forRates and Proportions John Wiley amp SonsNew York

Hobbs Jerry R and Stuart M Shieber 1987An algorithm for generating quantierscopings Computational Linguistics1347ndash63

Hornstein Norbert 1995 Logical Form FromGB to Minimalism Blackwell Oxford andCambridge

Jelinek Frederick 1998 Statistical Methods forSpeech Recognition MIT Press Cambridge

Jurafsky Daniel and James H Martin 2000Speech and Language Processing PrenticeHall Upper Saddle River New Jersey

Krippendorff Klaus 1980 Content AnalysisAn Introduction to Its Methodology Sage

96

Computational Linguistics Volume 29 Number 1

Publications Beverly Hills CaliforniaKrotov Alexander Mark Hepple Robert J

Gaizauskas and Yorick Wilks 1998Compacting the Penn treebank grammarIn COLING-ACL pages 699ndash703

Kuno Susumu Ken-Ichi Takami and YuruWu 1999 Quantier scope in EnglishChinese and Japanese Language75(1)63ndash111

Kuno Susumu Ken-Ichi Takami and YuruWu 2001 Response to Aoun and LiLanguage 77(1)134ndash143

Manning Christopher D and HinrichSchutze 1999 Foundations of StatisticalNatural Language Processing MIT PressCambridge

Marcus Mitchell P Beatrice Santorini andMary Ann Marcinkiewicz 1993 Buildinga large annotated corpus of English ThePenn Treebank Computational Linguistics19(2)313ndash330

Martin Paul Douglas Appelt andFernando Pereira 1986 Transportabilityand generality in a natural-languageinterface system In B J Grosz K SparckJones and B L Webber editors NaturalLanguage Processing Kaufmann Los AltosCalifornia pages 585ndash593

May Robert 1985 Logical Form Its Structureand Derivation MIT Press Cambridge

McCawley James D 1998 The SyntacticPhenomena of English University ofChicago Press Chicago second edition

Mitchell Tom M 1996 Machine LearningMcGraw Hill New York

Montague Richard 1973 The propertreatment of quantication in ordinaryEnglish In J Hintikka et al editorsApproaches to Natural Language ReidelDordrecht pages 221ndash242

Moran Douglas B 1988 Quantier scopingin the SRI core language engine InProceedings of the 26th Annual Meeting of theAssociation for Computational Linguistics(ACLrsquo88) pages 33ndash40

Moran Douglas B and Fernando C NPereira 1992 Quantier scoping InHiyan Alshawi editor The Core LanguageEngine MIT Press Cambridgepages 149ndash172

Nerbonne John 1993 A feature-basedsyntaxsemantics interface InA Manaster-Ramer and W Zadrozsnyeditors Annals of Mathematics and ArticialIntelligence (Special Issue on Mathematics ofLanguage) 8(1ndash2)107ndash132 Also publishedas DFKI Research Report RR-92-42

Park Jong C 1995 Quantier scope and

constituency In Proceedings of the 33rdAnnual Meeting of the Association forComputational Linguistics (ACLrsquo95)pages 205ndash212

Pedersen Ted 2000 A simple approach tobuilding ensembles of naotilde ve Bayesianclassiers for word sense disambiguationIn Proceedings of the First Meeting of theNorth American Chapter of the Association forComputational Linguistics (NAACL 2000)pages 63ndash69

Pereira Fernando 1990 Categorialsemantics and scoping ComputationalLinguistics 16(1)1ndash10

Pollard Carl 1989 The syntax-semanticsinterface in a unication-based phrasestructure grammar In S BusemannC Hauenschild and C Umbach editorsViews of the Syntax Semantics InterfaceKIT-FAST Report 74 Technical Universityof Berlin pages 167ndash185

Pollard Carl and Eun Jung Yoo 1998 Aunied theory of scope for quantiersand WH-phrases Journal of Linguistics34(2)415ndash446

Ratnaparkhi Adwait 1997 A simpleintroduction to maximum entropy modelsfor natural language processing TechnicalReport 97-08 Institute for Research inCognitive Science University ofPennsylvania

Santorini Beatrice 1990 Part-of-speechtagging guidelines for the Penn Treebankproject Technical Report MS-CIS-90-47Department of Computer and InformationScience University of Pennsylvania

Soon Wee Meng Hwee Tou Ng and DanielChung Yong Lim 2001 A machinelearning approach to coreferenceresolution of noun phrases ComputationalLinguistics 27(4)521ndash544

van Halteren Hans Jakub Zavrel andWalter Daelemans 2001 Improvingaccuracy in word class tagging throughthe combination of machine learningsystems Computational Linguistics27(2)199ndash229

VanLehn Kurt A 1978 Determining thescope of English quantiers TechnicalReport AITR-483 Massachusetts Instituteof Technology Articial IntelligenceLaboratory Cambridge

Woods William A 1986 Semantics andquantication in natural languagequestion answering In B J GroszK Sparck Jones and B L Webber editorsNatural Language Processing KaufmannLos Altos California pages 205ndash248

Page 18: A Machine Learning Approach to Modeling Scope Preferences · 2003-11-13 · ®c2003 Association for Computational Linguistics A Machine Learning Approach to Modeling Scope Preferences

90

Computational Linguistics Volume 29 Number 1

Figure 5A simple phrase structure tree

Table 10A simple probabilistic phrase structure grammar

Rule Probability

S NP VP 7S VP 2S V NP VP 1

VP V VP 3VP ADV VP 1VP V 1VP V NP 3VP V NP NP 2

NP Susan 3NP you 4NP Yves 3

V might 2V believe 3V show 3V stay 2

ADV not 5ADV always 5

91

Higgins and Sadock Modeling Scope Preferences

of this tree which we can indicate as frac12 can be calculated as in equation (6)

P( frac12 ) =Y

raquo 2 ( frac12 )

P( raquo ) (6)

= P(S NP VP) pound P(VP V VP) pound P(VP ADV VP)

pound P(VP V NP) pound P(NP Susan) pound P(V might)

pound P(ADV not) pound P(V believe) pound P(NP you) (7)

= 7 pound 3 pound 1 pound 3 pound 3 pound 2 pound 5 pound 3 pound 4 = 2268 pound 10iexcl5 (8)

The actual grammar rules and associated probabilities that we use in dening oursyntactic module are derived from the WSJ corpus of the Penn Treebank by maximum-likelihood estimation That is for each rule N iquest used in the treebank we add therule to the grammar and set its probability to C(N iquest )P

AacuteC(N Aacute)

where C( ) denotes the

ldquocountrdquo or a rule (ie the number of times it is used in the corpus) A grammarcomposed in this manner is referred to as a treebank grammar because its rules aredirectly derived from those in a treebank corpus

We used sections 00ndash20 of the WSJ corpus of the Penn Treebank for collecting therules and associated probabilities of our PCFG which is implemented as a bottom-upchart parser Before constructing the grammar the treebank was preprocessed usingknown procedures (cf Krotov et al [1998] Belz [2001]) to facilitate the construction ofa rule list Functional and anaphoric annotations (basically anything following a ldquo-rdquoin a node label cf Santorini [1990] Bies et al [1995]) were removed from nonterminallabels Nodes that dominate only ldquoempty categoriesrdquo such as traces were removedIn addition unary-branching constructions were removed by replacing the mothercategory in such a structure with the daughter node (For example given an instanceof the rule X YZ if the daughter category Y were expanded by the unary ruleY W our algorithm would induce the single rule X WZ) Finally we discardedall rules that had more than 10 symbols on the right-hand side (an arbitrary limit ofour parser implementation) This resulted in a total of 18820 rules of which 11156were discarded as hapax legomena leaving 7664 rules in our treebank grammarTable 11 shows some of the rules in our grammar with the highest and lowest corpuscounts

53 Unlabeled Scope DeterminationIn this section we describe an experiment designed to assess the performance ofparallel and serial approaches to combining grammatical modules focusing on thetask of unlabeled scope determination This task involves predicting the most likelyQ-structure representation for a sentence basically the same task we addressed in Sec-tion 4 in comparing the performance levels of each type of classier The experimentof this section differs however from the task presented in Section 4 in that instead ofproviding a syntactic tree from the Penn Treebank as input to the classier we providethe model only with a string of words (a sentence) Our dual-component model willsearch for the optimal syntactic and scopal structures for the sentence (the pairing( frac12 curren Agrave curren )) and will be evaluated based on its success in identifying the correct scopereading Agrave curren

Our concern in this section will be to determine whether it is necessary to searchthe space of possible pairings ( frac12 Agrave ) of syntactic and scopal structures or whetherit is sufcient to use our PCFG rst to x the syntactic tree frac12 and then to choose

92

Computational Linguistics Volume 29 Number 1

Table 11Rules derived from sections 00ndash20 of the Penn Treebank WSJ corpus ldquoTOPrdquo is a special ldquostartrdquosymbol that may expand to any of the symbols found at the root of a tree in the corpus

Rule Corpus count

PP IN NP 59053TOP S 34614NP DT NN 28074NP NP PP 25192S NP VP 14032S NP VP 12901VP TO VP 11598

S CC PP NNP NNP VP 2NP DT ldquo NN NN NN 0 0 2NP NP PP PP PP PP PP 2INTJ UH UH 2NP DT ldquo NN NNS 2SBARQ ldquo WP VP 0 0 2S PP NP VP 0 0 2

a scope reading to maximize the probability of the pairing That is are syntax andquantier scope mutually dependent components of grammar or can scope relationsbe ldquoread off ofrdquo syntax The serial model suggests that the optimal syntactic structurefrac12 curren should be chosen on the basis of the syntactic module only as in equation (9)and the optimal quantier scope structure Agrave curren then chosen on the basis of frac12 curren as inequation (10) The parallel model on the other hand suggests that the most likelypairing of structures must be chosen in the joint probability space of both componentsas in equation (11)

frac12 curren = arg maxfrac12 2 S

PS( frac12 j w1iexcln) (9)

Agrave curren = arg maxAgrave 2 Q

PQ( Agrave j frac12 curren w1iexcln) (10)

frac12 curren = f frac12 j ( frac12 Agrave ) = arg maxfrac12 2 S Agrave 2 Q

PS( frac12 j w1iexcln)PQ( Agrave j frac12 w1iexcln) g (11)

531 Experimental Design For this experiment we implement the scoping compo-nent as a single-layer feed-forward network because the single-layer perceptron clas-sier had the best prediction rate among the three classiers tested in Section 4 Thesoftmax activation function we use for the output nodes of the classier guaranteesthat the activations of all of the output nodes sum to one and can be interpreted asclass probabilities The syntactic component of course is determined by the treebankPCFG grammar described above

Given these two models which respectively dene PQ( Agrave j frac12 w1iexcln) and PS( frac12 j w1iexcln)from equation (11) it remains only to specify how to search the space of pairings( frac12 Agrave ) in performing this optimization to nd Agrave curren Unfortunately it is not feasible toexamine all values frac12 2 S since our PCFG will generally admit a huge number of

93

Higgins and Sadock Modeling Scope Preferences

Table 12Performance of models on the unlabeled scope prediction task summed over all 10 test runs

Condition Correct Incorrect Percentage correct

Parallel modelFirst has wide scope 168 113 167281 = 594Second has wide scope 26 38 2664 = 406No scope interaction 467 78 467545 = 857Total 661 229 661890 = 743

Serial modelFirst has wide scope 163 118 163281 = 580Second has wide scope 27 37 2764 = 422No scope interaction 461 84 461545 = 846Total 651 239 651890 = 731

trees for a sentence (especially given a mean sentence length of over 20 words inthe WSJ corpus)12 Our solution to this search problem is to make the simplifying as-sumption that the syntactic tree that is used in the optimal set of structures ( frac12 curren Agrave curren )will always be among the top few trees frac12 for which PS( frac12 j w1iexcln) is the greatestThat is although we suppose that quantier scope information is relevant to pars-ing we do not suppose that it is so strong a determinant as to completely over-ride syntactic factors In practice this means that our parser will return the top 10parses for each sentence along with the probabilities assigned to them and theseare the only parses that are considered in looking for the optimal set of linguisticstructures

We again used 10-fold cross-validation in evaluating the competing models di-viding the scope-tagged corpus into 10 test sections of 89 sentences each and weused the same version of the treebank grammar for our PCFG The rst model re-trieved the top 10 syntactic parses ( frac12 0 frac12 9) for each sentence and computed theprobability P( frac12 Agrave ) for each frac12 2 frac12 0 frac12 9 Agrave 2 0 1 2 choosing that scopal represen-tation Agrave that was found in the maximum-probability pairing We call this the par-allel model because the properties of each probabilistic model may inuence theoptimal structure chosen by the other The second model retrieved only the Viterbiparse frac12 0 from the PCFG and chose the scopal representation Agrave for which the pair-ing ( frac12 0 Agrave ) took on the highest probability We call this the serial model because itrepresents syntactic phrase structure as independent of other components of gram-mar (in this case quantier scope) though other components are dependentupon it

532 Results There was an appreciable difference in performance between these twomodels on the quantier scope test sets As shown in Table 12 the parallel modelnarrowly outperformed the serial model by 12 A 10-fold paired t-test on the testsections of the scope-tagged corpus shows that the parallel model is signicantly better(p lt 05)

12 Since we are allowing Acirc to range only over the three scope readings (0 1 2) however it is possible toenumerate all values of Acirc to be paired with a given syntactic tree iquest

94

Computational Linguistics Volume 29 Number 1

This result suggests that in determining the syntactic structure of a sentence wemust take aspects of structure into account that are not purely syntactic (such as quanti-er scope) Searching both dimensions of the hypothesis space for our dual-componentmodel allowed the composite model to handle the interdependencies between differ-ent aspects of grammatical structure whereas xing a phrase structure tree purelyon the basis of syntactic considerations led to suboptimal performance in using thatstructure as a basis for determining quantier scope

6 Conclusion

In this article we have taken a statistical corpus-based approach to the modelingof quantier scope preferences a subject that has previously been addressed onlywith systems of ad hoc rules derived from linguistsrsquo intuitive judgments Our modeltakes its theoretical inspiration from Kuno Takami and Wu (1999) who suggest anldquoexpert systemrdquo approach to scope preferences and follows many other projects in themachine learning of natural language that combine information from multiple sourcesin solving linguistic problems

0ur results are generally supportive of the design that Kuno Takami and Wu pro-pose for the quantier scope component of grammar and some of the features inducedby our models nd clear parallels in the factors that Kuno Takami and Wu claim tobe relevant to scoping In addition our nal experiment in which we combine ourquantier scope module with a PCFG model of syntactic phrase structure providesevidence of a grammatical architecture in which different aspects of structure mutu-ally constrain one another This result casts doubt on approaches in which syntacticprocessing is completed prior to the determination of other grammatical properties ofa sentence such as quantier scope relations

Appendix Selected Codes Used to Annotate Syntactic Categories in the Penn Tree-bank from Marcus et al (1993) and Bies et al (1995)

Part-of-speech tags

Tag Meaning Tag Meaning

CC Conjunction RB AdverbCD Cardinal number RBR Comparative adverbDT Determiner RBS Superlative adverbIN Preposition TO ldquotordquoJJ Adjective UH InterjectionJJR Comparative adjective VB Verb in base formJJS Superlative adjective VBD Past-tense verbNN Singular or mass noun VBG Gerundive verbNNS Plural noun VBN Past participial verbNNP Singular proper noun VBP Non-3sg present-NNPS Plural proper noun tense verbPDT Predeterminer VBZ 3sg present-tense

verbPRP Personal pronoun WP WH pronounPRP$ Possessive pronoun WP$ Possessive WH pronoun

95

Higgins and Sadock Modeling Scope Preferences

Phrasal categories

Code Meaning Code Meaning

ADJP Adjective phrase SBAR Clause introduced byADVP Adverb phrase a subordinatingINTJ Interjection conjunctionNP Noun phrase SBARQ Clause introduced byPP Prepositional phrase a WH phraseQP Quantier phrase (ie SINV Inverted declarative

measure=amount sentencephrase) SQ Inverted yes=no

S Declarative clause question following theWH phrase in SBARQ

VP Verb phrase

AcknowledgmentsThe authors are grateful for an AcademicTechnology Innovation Grant from theUniversity of Chicago which helped tomake this work possible and to JohnGoldsmith Terry Regier Anne Pycha andBob Moore whose advice and collaborationhave considerably aided the researchreported in this article Any remainingerrors are of course our own

References

Aoun, Joseph and Yen-hui Audrey Li. 1993. The Syntax of Scope. MIT Press, Cambridge.
Aoun, Joseph and Yen-hui Audrey Li. 2000. Scope, structure, and expert systems: A reply to Kuno et al. Language, 76(1):133–155.
Belz, Anja. 2001. Optimisation of corpus-derived probabilistic grammars. In Proceedings of Corpus Linguistics 2001, pages 46–57.
Berger, Adam L., Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.
Bies, Ann, Mark Ferguson, Karen Katz, and Robert MacIntyre. 1995. Bracketing guidelines for Treebank II style. Technical report, Penn Treebank Project, University of Pennsylvania.
Bishop, Christopher M. 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bridle, John S. 1990. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fougelman-Soulie and J. Herault, editors, Neurocomputing—Algorithms, Architectures and Applications. Springer-Verlag, Berlin, pages 227–236.
Brill, Eric. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565.
Carden, Guy. 1976. English Quantifiers: Logical Structure and Linguistic Variation. Academic Press, New York.
Carletta, Jean. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249–254.
Charniak, Eugene. 1993. Statistical Language Learning. MIT Press, Cambridge.
Charniak, Eugene. 1996. Tree-bank grammars. In AAAI/IAAI, vol. 2, pages 1031–1036.
Cohen, Jacob. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46.
Cooper, Robin. 1983. Quantification and Syntactic Theory. Reidel, Dordrecht.
Fleiss, Joseph L. 1981. Statistical Methods for Rates and Proportions. John Wiley & Sons, New York.
Hobbs, Jerry R. and Stuart M. Shieber. 1987. An algorithm for generating quantifier scopings. Computational Linguistics, 13:47–63.
Hornstein, Norbert. 1995. Logical Form: From GB to Minimalism. Blackwell, Oxford and Cambridge.
Jelinek, Frederick. 1998. Statistical Methods for Speech Recognition. MIT Press, Cambridge.
Jurafsky, Daniel and James H. Martin. 2000. Speech and Language Processing. Prentice Hall, Upper Saddle River, New Jersey.
Krippendorff, Klaus. 1980. Content Analysis: An Introduction to Its Methodology. Sage Publications, Beverly Hills, California.
Krotov, Alexander, Mark Hepple, Robert J. Gaizauskas, and Yorick Wilks. 1998. Compacting the Penn treebank grammar. In COLING-ACL, pages 699–703.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 1999. Quantifier scope in English, Chinese, and Japanese. Language, 75(1):63–111.
Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 2001. Response to Aoun and Li. Language, 77(1):134–143.
Manning, Christopher D. and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge.
Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
Martin, Paul, Douglas Appelt, and Fernando Pereira. 1986. Transportability and generality in a natural-language interface system. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 585–593.
May, Robert. 1985. Logical Form: Its Structure and Derivation. MIT Press, Cambridge.
McCawley, James D. 1998. The Syntactic Phenomena of English. University of Chicago Press, Chicago, second edition.
Mitchell, Tom M. 1996. Machine Learning. McGraw Hill, New York.
Montague, Richard. 1973. The proper treatment of quantification in ordinary English. In J. Hintikka et al., editors, Approaches to Natural Language. Reidel, Dordrecht, pages 221–242.
Moran, Douglas B. 1988. Quantifier scoping in the SRI core language engine. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics (ACL'88), pages 33–40.
Moran, Douglas B. and Fernando C. N. Pereira. 1992. Quantifier scoping. In Hiyan Alshawi, editor, The Core Language Engine. MIT Press, Cambridge, pages 149–172.
Nerbonne, John. 1993. A feature-based syntax/semantics interface. In A. Manaster-Ramer and W. Zadrozsny, editors, Annals of Mathematics and Artificial Intelligence (Special Issue on Mathematics of Language), 8(1–2):107–132. Also published as DFKI Research Report RR-92-42.
Park, Jong C. 1995. Quantifier scope and constituency. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL'95), pages 205–212.
Pedersen, Ted. 2000. A simple approach to building ensembles of naïve Bayesian classifiers for word sense disambiguation. In Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2000), pages 63–69.
Pereira, Fernando. 1990. Categorial semantics and scoping. Computational Linguistics, 16(1):1–10.
Pollard, Carl. 1989. The syntax-semantics interface in a unification-based phrase structure grammar. In S. Busemann, C. Hauenschild, and C. Umbach, editors, Views of the Syntax/Semantics Interface, KIT-FAST Report 74, Technical University of Berlin, pages 167–185.
Pollard, Carl and Eun Jung Yoo. 1998. A unified theory of scope for quantifiers and WH-phrases. Journal of Linguistics, 34(2):415–446.
Ratnaparkhi, Adwait. 1997. A simple introduction to maximum entropy models for natural language processing. Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania.
Santorini, Beatrice. 1990. Part-of-speech tagging guidelines for the Penn Treebank project. Technical Report MS-CIS-90-47, Department of Computer and Information Science, University of Pennsylvania.
Soon, Wee Meng, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544.
van Halteren, Hans, Jakub Zavrel, and Walter Daelemans. 2001. Improving accuracy in word class tagging through the combination of machine learning systems. Computational Linguistics, 27(2):199–229.
VanLehn, Kurt A. 1978. Determining the scope of English quantifiers. Technical Report AITR-483, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Cambridge.
Woods, William A. 1986. Semantics and quantification in natural language question answering. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 205–248.

Page 22: A Machine Learning Approach to Modeling Scope Preferences · 2003-11-13 · ®c2003 Association for Computational Linguistics A Machine Learning Approach to Modeling Scope Preferences

94

Computational Linguistics Volume 29 Number 1

This result suggests that in determining the syntactic structure of a sentence wemust take aspects of structure into account that are not purely syntactic (such as quanti-er scope) Searching both dimensions of the hypothesis space for our dual-componentmodel allowed the composite model to handle the interdependencies between differ-ent aspects of grammatical structure whereas xing a phrase structure tree purelyon the basis of syntactic considerations led to suboptimal performance in using thatstructure as a basis for determining quantier scope

6 Conclusion

In this article we have taken a statistical corpus-based approach to the modelingof quantier scope preferences a subject that has previously been addressed onlywith systems of ad hoc rules derived from linguistsrsquo intuitive judgments Our modeltakes its theoretical inspiration from Kuno Takami and Wu (1999) who suggest anldquoexpert systemrdquo approach to scope preferences and follows many other projects in themachine learning of natural language that combine information from multiple sourcesin solving linguistic problems

0ur results are generally supportive of the design that Kuno Takami and Wu pro-pose for the quantier scope component of grammar and some of the features inducedby our models nd clear parallels in the factors that Kuno Takami and Wu claim tobe relevant to scoping In addition our nal experiment in which we combine ourquantier scope module with a PCFG model of syntactic phrase structure providesevidence of a grammatical architecture in which different aspects of structure mutu-ally constrain one another This result casts doubt on approaches in which syntacticprocessing is completed prior to the determination of other grammatical properties ofa sentence such as quantier scope relations

Appendix Selected Codes Used to Annotate Syntactic Categories in the Penn Tree-bank from Marcus et al (1993) and Bies et al (1995)

Part-of-speech tags

Tag Meaning Tag Meaning

CC Conjunction RB AdverbCD Cardinal number RBR Comparative adverbDT Determiner RBS Superlative adverbIN Preposition TO ldquotordquoJJ Adjective UH InterjectionJJR Comparative adjective VB Verb in base formJJS Superlative adjective VBD Past-tense verbNN Singular or mass noun VBG Gerundive verbNNS Plural noun VBN Past participial verbNNP Singular proper noun VBP Non-3sg present-NNPS Plural proper noun tense verbPDT Predeterminer VBZ 3sg present-tense

verbPRP Personal pronoun WP WH pronounPRP$ Possessive pronoun WP$ Possessive WH pronoun

95

Higgins and Sadock Modeling Scope Preferences

Phrasal categories

Code Meaning Code Meaning

ADJP Adjective phrase SBAR Clause introduced byADVP Adverb phrase a subordinatingINTJ Interjection conjunctionNP Noun phrase SBARQ Clause introduced byPP Prepositional phrase a WH phraseQP Quantier phrase (ie SINV Inverted declarative

measure=amount sentencephrase) SQ Inverted yes=no

S Declarative clause question following theWH phrase in SBARQ

VP Verb phrase

AcknowledgmentsThe authors are grateful for an AcademicTechnology Innovation Grant from theUniversity of Chicago which helped tomake this work possible and to JohnGoldsmith Terry Regier Anne Pycha andBob Moore whose advice and collaborationhave considerably aided the researchreported in this article Any remainingerrors are of course our own

References

Aoun, Joseph and Yen-hui Audrey Li. 1993. The Syntax of Scope. MIT Press, Cambridge.

Aoun, Joseph and Yen-hui Audrey Li. 2000. Scope, structure, and expert systems: A reply to Kuno et al. Language, 76(1):133–155.

Belz, Anja. 2001. Optimisation of corpus-derived probabilistic grammars. In Proceedings of Corpus Linguistics 2001, pages 46–57.

Berger, Adam L., Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.

Bies, Ann, Mark Ferguson, Karen Katz, and Robert MacIntyre. 1995. Bracketing guidelines for Treebank II style. Technical report, Penn Treebank Project, University of Pennsylvania.

Bishop, Christopher M. 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.

Bridle, John S. 1990. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fougelman-Soulie and J. Herault, editors, Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag, Berlin, pages 227–236.

Brill, Eric. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565.

Carden, Guy. 1976. English Quantifiers: Logical Structure and Linguistic Variation. Academic Press, New York.

Carletta, Jean. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249–254.

Charniak, Eugene. 1993. Statistical Language Learning. MIT Press, Cambridge.

Charniak, Eugene. 1996. Tree-bank grammars. In AAAI/IAAI, volume 2, pages 1031–1036.

Cohen, Jacob. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46.

Cooper, Robin. 1983. Quantification and Syntactic Theory. Reidel, Dordrecht.

Fleiss, Joseph L. 1981. Statistical Methods for Rates and Proportions. John Wiley & Sons, New York.

Hobbs, Jerry R. and Stuart M. Shieber. 1987. An algorithm for generating quantifier scopings. Computational Linguistics, 13:47–63.

Hornstein, Norbert. 1995. Logical Form: From GB to Minimalism. Blackwell, Oxford and Cambridge.

Jelinek, Frederick. 1998. Statistical Methods for Speech Recognition. MIT Press, Cambridge.

Jurafsky, Daniel and James H. Martin. 2000. Speech and Language Processing. Prentice Hall, Upper Saddle River, New Jersey.

Krippendorff, Klaus. 1980. Content Analysis: An Introduction to Its Methodology. Sage Publications, Beverly Hills, California.

Krotov, Alexander, Mark Hepple, Robert J. Gaizauskas, and Yorick Wilks. 1998. Compacting the Penn treebank grammar. In COLING-ACL, pages 699–703.

Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 1999. Quantifier scope in English, Chinese, and Japanese. Language, 75(1):63–111.

Kuno, Susumu, Ken-Ichi Takami, and Yuru Wu. 2001. Response to Aoun and Li. Language, 77(1):134–143.

Manning, Christopher D. and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge.

Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Martin, Paul, Douglas Appelt, and Fernando Pereira. 1986. Transportability and generality in a natural-language interface system. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 585–593.

May, Robert. 1985. Logical Form: Its Structure and Derivation. MIT Press, Cambridge.

McCawley, James D. 1998. The Syntactic Phenomena of English. University of Chicago Press, Chicago, second edition.

Mitchell, Tom M. 1996. Machine Learning. McGraw Hill, New York.

Montague, Richard. 1973. The proper treatment of quantification in ordinary English. In J. Hintikka et al., editors, Approaches to Natural Language. Reidel, Dordrecht, pages 221–242.

Moran, Douglas B. 1988. Quantifier scoping in the SRI core language engine. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics (ACL '88), pages 33–40.

Moran, Douglas B. and Fernando C. N. Pereira. 1992. Quantifier scoping. In Hiyan Alshawi, editor, The Core Language Engine. MIT Press, Cambridge, pages 149–172.

Nerbonne, John. 1993. A feature-based syntax/semantics interface. In A. Manaster-Ramer and W. Zadrozsny, editors, Annals of Mathematics and Artificial Intelligence (Special Issue on Mathematics of Language), 8(1–2):107–132. Also published as DFKI Research Report RR-92-42.

Park, Jong C. 1995. Quantifier scope and constituency. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL '95), pages 205–212.

Pedersen, Ted. 2000. A simple approach to building ensembles of naïve Bayesian classifiers for word sense disambiguation. In Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2000), pages 63–69.

Pereira, Fernando. 1990. Categorial semantics and scoping. Computational Linguistics, 16(1):1–10.

Pollard, Carl. 1989. The syntax-semantics interface in a unification-based phrase structure grammar. In S. Busemann, C. Hauenschild, and C. Umbach, editors, Views of the Syntax Semantics Interface, KIT-FAST Report 74, Technical University of Berlin, pages 167–185.

Pollard, Carl and Eun Jung Yoo. 1998. A unified theory of scope for quantifiers and WH-phrases. Journal of Linguistics, 34(2):415–446.

Ratnaparkhi, Adwait. 1997. A simple introduction to maximum entropy models for natural language processing. Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania.

Santorini, Beatrice. 1990. Part-of-speech tagging guidelines for the Penn Treebank project. Technical Report MS-CIS-90-47, Department of Computer and Information Science, University of Pennsylvania.

Soon, Wee Meng, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544.

van Halteren, Hans, Jakub Zavrel, and Walter Daelemans. 2001. Improving accuracy in word class tagging through the combination of machine learning systems. Computational Linguistics, 27(2):199–229.

VanLehn, Kurt A. 1978. Determining the scope of English quantifiers. Technical Report AITR-483, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Cambridge.

Woods, William A. 1986. Semantics and quantification in natural language question answering. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, California, pages 205–248.
