decomposability and frequency in the hindi/urdu number system · 2017-07-25 · decomposability and...

Decomposability and Frequency in the Hindi/Urdu Number SystemChundra Aroor Cathcart ([email protected])

Department of Linguistics and PhoneticsLund University

Abstract

Hindi/Urdu (HU) numbers 10–99 are highly irregular, unlikethe transparent systems of most languages. I investigate themorphological decomposability of HU numbers using a seriesof computational models. While these models classify mostforms accurately, problems are encountered in high-frequencyforms of low cardinality, suggesting that some HU numbers aremore transparent (i.e., morphologically decomposable) thanothers. These results are compatible with a dual-route accessmodel proposed for the processing of numeral forms.Keywords: numerals; computational modeling; Bayesianlearning; Hindi/Urdu; phonology; morphology

IntroductionHindi/Urdu (HU, officially considered separate languages butdiffering in little other than orthography and high-register vo-cabulary) and most other modern Indic languages are unusualin that the numbers through 99 are highly opaque and irreg-ular, undoubtedly posing difficulties in production and pro-cessing for language users. This phenomenon is understud-ied, and raises interesting questions regarding the design prin-ciples of cross-linguistic number systems, as well as the pro-cessing of complex morphological forms by language users.

In this paper, I investigate the mechanisms by which HUusers extract structure and meaning from HU number terms.I address this issue using a series of unsupervised computa-tional models designed to approximate HU users’ process-ing of the numerals 10–99. While it is generally agreed thatnumber terms are acquired as individual lexical items, thereis good reason to hypothesize that many numbers above a cer-tain threshold of frequency are not accessed as individual lex-emes during processing, but rather via their component parts,in line with a dual-route model of lexical storage and access(cf. Baayen, 1993). I expect that despite the irregularity of theHU number system, users can find regular patterns in variouscues to numerical identity in the input, particularly in low-frequency numbers, thus facilitating easier comprehension.

My methodology investigates the extent to which HUnumbers can be morphologically decomposed. Brysbaert(2005) tentatively proposes a dual-route access model for En-glish numeral storage, hypothesizing that frequent, opaqueitems like twelve are accessed directly, while morphologicallytransparent numbers of lower frequency (e.g., eighty-nine) areprocessed through decomposition. In line with this view, Ipredict that less frequent HU numbers can be segmented andlabeled more accurately by a computational model, indicatinggreater morphological transparency.

I find that a model using n-grams as phonological featuressuccessfully assigns most HU numeral forms to the properTENS/DIGITS cohort, but that, rather unsurprisingly, somehighly opaque forms are misclassified. Major errors occur

among numbers of lower magnitude. Since these forms arehighly frequent, this state of affairs is compatible with a dual-route account of processing. I find that in general, the modelfaces difficulties in capturing relationships between simplex(i.e., monomorphemic) forms (e.g., /@ssi/ ‘80’) and theircomplex counterparts (e.g., /cOrAsi/ ‘84’), where a more so-phisticated model of phonology might succeed. These resultsprovide an important baseline for future investigations intomental representations of HU numerals.

BackgroundThe full list of numerals (taken from Comrie, n.d.) is given inTable 1. When encountering a datum like /b@jAlis/ ‘42’, lis-teners must infer the value of the TENS and DIGITS place withthe aid of cues in the input, and must be able to contend withhighly noisy allomorphy: TENS40 and DIGITS2 havemultiple surface realizations. In some cases, this allomor-phy is suppletive (i.e., variants bear no phonological resem-blance to each other). Listeners may possess the knowledgethat HU is head-final, and that higher-order numerical infor-mation generally occurs closer to the root (Hurford, 1987),i.e., to the right. For some numerals, it seems plausible thathigh frequency facilitates access; for instance, HU /sola/ ‘16’is quite unlike other numerals with the feature DIGITS6, allof which are /ch/-initial. This is a diachronic artifact; /sola/faithfully continues Sanskrit s. od. asa-, while other forms withDIGITS6 contain reflexes of an unattested dialectal variant*ks.(v)at.- of attested Sanskrit s. as. - ‘6’ (Turner, 1962–1966).It is also the only member of the teens which shows /l/ inits allomorph of /d@s/ ‘ten’. All the same, it may be usedfrequently enough that this twofold suppletion does not poseproblems to speakers and listeners.

A major attempt to explore synchronic regularities amongHU numbers is that of Bright (1969), who concludes thatdespite a lack of economy, implicit rules governing the sys-tem are available to language users. Berger (1992) outlinesthe complex historical development of HU numbers; spo-radic phonological reduction, analogy, and language contact,among other phenomena, have resulted in a highly irregu-lar and opaque system compared to the relatively transpar-ent numbers of Sanskrit, HU’s ancestor. These works aside,many aspects of the HU numeral system remain untreated.

Representational issuesAbstract representation of HU numeralsAbove, I adopt the canonical abstract numerical representa-tion found in much of the literature, where each surface formcomprises two underlying factors corresponding to the TENSand DIGITS place. I make the assumption that DIGITS0

1733

Table 1: HU numbers 1–99; rows represent the tens place, columns the digits place0 1 2 3 4 5 6 7 8 9

0 — ek do tin cAr pAc chE sAt Aúh nO10 d@s gjAr@ bAr@ ter@ cOd@ p@ndr@ sol@ s@tr@ @úhAr@ Unnis20 bis Ikkis bAis teis cObis p@ccis ch@bbis s@ttAis @úúAis Untis30 tis Ik@ttis b@ttis tætis cOtis pætis ch@ttis sætis @ótis UntAlis40 cAlis IktAlis b@jAlis tætAlis c@VAlis pætAlis chIjAlis sætAlis @ótAlis UncAs50 p@cAs IkjAV@n bAV@n tIrp@n c@uV@n p@cp@n ch@pp@n s@ttAV@n @úúhAV@n Uns@úh

60 sAúh Iks@úh bAs@úh tIrs@úh cOs@úh pæs@úh chIjAs@úh s@rs@úh @ós@úh Unh@tt@r70 s@tt@r Ikh@tt@r b@h@tt@r tIh@tt@r cOh@tt@r p@ch@tt@r chIh@tt@r s@th@tt@r @úhh@tt@r UnjAsi80 @ssi IkjAsi b@jAsi tIrAsi cOrAsi p@cAsi chIjAsi s@ttAsi @úúhAsi n@VAsi90 n@Ve IkjAnVe bAnVe tIrAnVe cOrAnVe p@cAnVe chIjAnVe s@ttAnVe @úúhAnVe nInjAnVe

does not map to any overt phonological information. Ad-ditionally, for forms such as /UntAlis/, there is a mismatchbetween the abstract representation TENS3 DIGITS9 andthe phonological form, since the morpheme representing thetens place closely resembles /cAlis/ ‘40’, not /tis/ ‘30’; thissuggests an intermediate calculation TENS4 DIGITS−1.I assume that the representation DIGITS−1 is an integralpart of HU numerical computation and is reflected explicitlyin the morphology.

Surface representation of HU numeralsBrysbaert hesitates to draw a categorical distinction be-tween transparent and opaque English numerals, citing semi-transparent forms like thirteen. Along these lines, I seek tosituate HU numerals along a cline between mild and extremeopacity. I quantify a number’s transparency or decomposabil-ity via the performance of a computational model designed tosegment and label HU numbers, both in terms of (1) accuracyof the labeling and (2) low posterior uncertainty.

At the outset, I lack a principled means of separating sup-pletive and non-suppletive allomorphy found in the system.Numbers 11–18 exhibit three allomorphs for TENS1, /-d@/,/-r@/ and /-l@/, all from the diachronic source -dasa-, thoughsynchronic /d/ ∼ /r/ ∼ /l/ alternations are not well knownin HU. Numbers 49–58 show multiple bases for TENS5,all descended from Sanskrit pancasat- ‘50’ but formally verydissimilar. I make no a priori assumptions about the status ofsuppletive allomorphy in the morphological system, and al-low the model to simply group together forms according tothe configuration it infers. I do, however, treat DIGITS−1as a separate morpheme, given its systematic occurrence.Nonparametric models (which assume an unbounded numberof underlying morphological labels) may alleviate some ofthe problems that result from forcing suppletive allomorphsto be classified together, which I set aside for future work.

A model of HU numerical processing should character-ize the morphological structure of the data encountered. HUnumbers are highly fusional, exhibiting the effects of millen-nia of phonological and morphological change. As Brightreports, no economical set of rules helps to derive the surfacerepresentations from their morphological bases. There is of-ten unpredictable allomorphy between simplex and complexforms of a given decade: for instance, /s/ alternates with /h/

in complex forms of /s@tt@r/ ‘70’ < Sanskrit saptatı-, but notin complex forms of /sAúh/ ‘60’ < Sanskrit s. as. t.i-; however,the latter decade’s complex forms contain a reduced vowel/@/, alternating with /A/. The short vowel and geminate con-sonant found in /@ssi/ ‘80’ alternate with a long vowel andsingleton consonant in derived /-Asi/, but the short vowelfound in /n@Ve/ does not appear in derived forms.

Despite these challenges, listeners should be able to forma probability distribution over possible morphemes containedin a complex input datum. HU’s highly fusional phonologynotwithstanding, listeners should be able to approximate thelocation of morpheme boundaries. This question is a keypart of this paper’s computational inference, and should be ofbroad interest to phonological theory, as it has the potential toincorporate a number of strategies for morphological bound-ary detection. The models introduced in this paper draw mor-pheme boundaries on the basis of what is most likely underthe current parameters of the model, and are dependent ondistributional information found in other numerals. This taskis easier than that of many types of unsupervised segmenta-tion in that at most one boundary must be located per inputdatum; however, the model must contend with a wider distri-bution of allomorphs which must be unified. This model doesnot use external distributional information for the purpose ofsegmentation (as do Harris, 1955; Saffran et al., 1996).

A question relevant to this paper concerns the types of mor-phological segmentation that should be permitted. Cross-linguistically, a morphological segmentation of the type[b][@jAlis] might be permissible, but non-inflectional mor-phemes in HU tend to consist minimally of a unit withprosodic weight. As such, I restrict the proposal distribu-tion for segmentations of HU numbers to exclude morphemeboundaries following the first and penultimate segments; thisadditionally speeds up inference and ensures that short formslike /bis/ ‘20’ will be treated as monomorphemic.

Phonological featuresThe methodology developed in this paper must capture al-lomorphy in the HU numeral system, inferring that differ-ent surface strings such as /-jAlis/ and /cAlis/ correspondto the same underlying morpheme, TENS4. I cluster al-lomorphs together on the basis of phonological features us-ing essentially the same likelihood formula used in unsuper-

1734

vised Naıve Bayes/Dirichlet-Multinomial classifiers, popularin bag-of-words models of document classification. This is amodel of convenience which I find fairly effective, though itis admittedly crude; it is insensitive to positional informationand alternations, and does not strongly penalize the absenceof potentially crucial morpheme-level information.

The model described in this paper lends itself to the useof different types of phonological features, and provides op-portunities to investigate model performance under featureswith differing degrees of abstraction. In this paper, I limitmyself to domain-general string-based features, namely n-grams. In many contexts, I expect segmental bigrams tofare well in assigning cohort membership. However, thereare some cases where I fear that bigrams may fail to capturealternations caused by (among other things) deletion or in-sertion, as between the base /n@Ve/ ‘90’ and /-nVe/, foundin complex forms. A unigram model will be sensitive to theco-occurrence of /n/ and /V/, whereas a representation con-sisting solely of bigrams will not (on the use of separate au-tosegmental tiers for consonants and vowels, see Goldsmith& Riggle, 2012). I attempt to circumvent this problem with aphonological representation that uses both unigrams and bi-grams (though this technically violates the independence as-sumption of Naıve Bayes). This allows the model to capturesome similarities between paradigmatically related forms thatwould otherwise be lost in a strict bigram model.

ModelHere, I introduce the core model employed in this paper, de-signed to approximate a HU speaker’s recognition of numbers10–99 (I assume that 1–9 are primitives). When encounteringa numerical form, the listener must determine whether it issimplex or complex. If simplex, the value of the TENS placemust be inferred; if complex, the DIGITS place must be aswell. The model assumes that a complex form is generatedby independent draws from two mixtures, a DIGITS mixture(the labels of which correspond to the values −1,1, ...,9)and a TENS mixture (the labels of which correspond to thevalues 1, ...,9). Because HU morphology is generally con-catenative, I make the simplifying assumption that phonolog-ical elements generated by a given mixture are adjacent toone another — i.e., that a morpheme boundary can be lo-cated somewhere in a complex form, however approximately.I make the assumption that the lefthand morpheme is gen-erated by the digits mixture and the righthand morpheme isgenerated by the tens mixture; this convention essentially in-corporates Hurford’s insight that higher numerical elementsoccur closer to the root, which in turn can be interpreted asprior knowledge of a morphosyntactic headedness parameter.This system of numerical classification is schematized in Fig-ure 1.

InferenceThis paper’s basic model of numeral classification assignseach form to one or two mixtures, given a 10× F matrixΩ

D and a 9×F (where F is the number of feature types in

DIGITS2

b@j|AlisTENS4

Figure 1: Schema of a proposed morphological segmentation,tens classification, and digits classification for form /b@jAlis/

the input) matrix ΩT (specifying a prior over feature distri-

butions associated with each label of the DIGITS and TENSplace, respectively), as well as a word-level vector µ repre-senting a prior over morpheme boundary locations. I initial-ize these matrices with symmetric concentration parametersαT αD,αµ, set to .1 in order to encourage sparseness, suchthat unshared features from unrelated labels are not clumpedtogether. The generative model draws probability simplicesφ

Dj ∼Dirichlet(ωD

j ),φTi ∼Dirichlet(ωT

i ) representing the fea-ture distributions associated with levels j and i of the DIGITSand TENS place, and assumes that for every word w,

ς ∼ Dirichlet(µ) (a simplex of morpheme boundary proba-bilities is drawn, including the probability p(m = /0), i.e.,the probability that there is no morpheme boundary)

m∼Categorical(ς) (a morpheme boundary is drawn from ς)If m = /0,

zDj = 0

for each feature f ∈ wf ∼ Categorical(φT

i ), i ∈ 1, ...,9If m 6= /0,

For each feature f ∈ w1,...,m (through index m)f ∼ Categorical(φD

j ), j ∈ −1,1, ...,9For each feature f ∈wm+1,...,|w| (from index m+1 throughthe end of the word)

f ∼ Categorical(φTi ), i ∈ 1, ...,9

I marginalize out the parameters ς,φTi ,φ

Dj to ob-

tain collapsed Dirichlet-Categorical updates forp(m|µ), p(zT |ΩT ), p(zD|ΩD). For a given word, this yieldsthe following conditional probability if m = /0 (adopted fromYin & Wang, 2014):

P(m = /0,zTi ,z

D = 0|zT−i,z

D−0,Ω

T ,ΩD,µ) ∝

∏ f∈w ∏c( f )wn=1 c( f )−w

zTj+αT +n−1

∏|w|k=1 c(·)−w

zTj+FαT + k−1

(1)

If m 6= /0:

P(m,zTi ,z

Dj |zT−i,z

D− j,Ω

T ,ΩD,µ) ∝

∏ f∈λlm

∏

c( f )λl

mn=1 c( f )−w

zDj+αD +n−1

∏mk=1 c(·)−w

zDj+FαD + k−1

·∏ f∈λrm ∏

c( f )λrm

n=1 c( f )−wzTi+αT +n−1

∏|w|−mk=1 c(·)−w

zTi+FαT + k−1

(2)

1735

Above, c( f )−wzTi

denotes the number of instances of f currently

associated with label zTi , and c(·)−w

zTi

the number of instances

of any item currently associated with label zTi (both terms

exclude any instances contributed by w); c( f )σ signifies thenumber of instances of f in element σ. For simplicity, I writeλl

m for w1,...,m and λrm for wm+1,...,|w|. Once new values of

m,zTi ,z

Dj are chosen for word w, counts for the features in λl

m

(if m 6= /0) and λrm can be allocated to zD

j and zTi , respectively.

Priors on morphological segmentationFor this paper’s most basic inference procedures, the priorover morpheme boundaries is symmetric, with equal proba-bility allocated to all possible segmentations of word w. Incertain inference regimes, I employ one of two priors on seg-mentation incorporating the principle of Minimum Descrip-tion Length, popular in unsupervised morphological segmen-tation (Goldsmith, 2001; Creutz & Lagus, 2007); these priorsfavor the insertion of morpheme boundaries which minimizethe length of the code that generates the data. There are anumber of ways to interpret this principle. Most intuitively,the “code” can be construed as the list of morph types, or al-ternatively, the sum of the lengths of morph types. Hence, anMDL or exponential prior on morphological segmentationsdisfavors analyses that add to the list, or the sum of (string)lengths of types in the list.

The first prior (MDL1), designed to keep the list of an-alyzed morphs short, assigns probability to a morphologi-cal segmentation for word w proportional to the inverse ofthe number of morph types as currently analyzed, includingthe proposed segmentation for w; under the second approach(MDL2), the prior probability is inversely proportional to thesum of lengths of current morph types. I have employed thesepriors due to the importance of MDL in the literature on unsu-pervised segmentation, but remain somewhat skeptical as towhether HU numeral morphology can be rendered compactin the same manner as the morphology traditionally analyzedwith MDL priors (e.g., of English, Finnish, Turkish, etc.),given the noisy allomorphy seen.

Priors on cluster membershipReaders may note that the above formulae depart from tradi-tional Dirichlet-Multinomial mixture models in that the Chi-nese Restaurant Process prior (a rich-get-richer scheme) overcluster membership is excluded. This prior, which makes itmore likely for an item to be assigned to a cluster that al-ready has many data points, seems inappropriate for this pa-per’s model, which iterates over one token of each number,and should learn classes of roughly equal size. In one sam-pling regime, I place an exponential prior on TENS and DIG-ITS label membership, inversely proportional to the numberof items currently assigned to the label in question (plus aconcentration parameter). The intention here is to introducea pressure toward clusters of uniform size.

Inference procedureInference is carried out via Markov chain Monte Carlo. I rundifferent versions of the model on three chains for 10000 iter-ations, discarding the first half of samples as burn-in. Eachchain is initialized by randomly segmenting and assigningeach item to a TENS and DIGITS label. Parameters are up-dated via Gibbs Sampling; for each number in 10–99, a mor-phological segmentation m, a TENS label zT and if relevant,a DIGITS label zD are drawn conditional on the labels cur-rently assigned to all other data points (see eqq. 1–2). I use asimulated annealing procedure, raising each vector of updateprobabilities to the power of a constant 1

γ, with γ decreasing

from 10 to 1 over the course of the burn-in. Code can befound at github.com/chundrac/HUnumerals.

I carry out an inference procedure using only bigrams as aphonological feature representation (2g); this is followed bya regime using unigrams and bigrams (1+2g). I modify the1+2g procedure to incorporate an MDL prior sensitive to thelength of the current list of morph types (MDL1), followed byan MDL prior sensitive to the sum of their lengths (MDL2).I attempted to see how the MDL1 prior (which showed betterperformance) affected the bigram model. Additionally, I rana simulation which augmented the 1+2g/MDL1 model with aprior over component membership designed to keep clustersuniform (denoted by U).1

ResultsI use the overall F-measure (Fung et al., 2003) and the V-measure (Rosenberg & Hirschberg, 2007), two evaluationmetrics designed to quantify the similarity between two clas-sifications, in order to monitor convergence and measureoverall accuracy (convergence was also assessed via chainlog-likelihoods). I compute pairwise F- and V-measures be-tween the maximum a posteriori (MAP) configuration of eachchain to assess the degree to which chains return the sameclassification, interpreting values greater than .9 as a tokenof convergence between two chains. I evaluate each chain’saccuracy by computing the F- and V-measures between thechain’s MAP configuration and the true classification of thenumbers. These values are found in Table 2. In general, MDLpriors do not appear to improve inference for bigrams, and donot significantly improve inference for 1+2grams.

Table 3 displays the MAP configuration for the topchain (2) in the regime with highest overall accuracy(1+2g/MDL1/U). To measure the ACCURACY with whichthis regime decomposes individual numbers, I calculate theF-scores for each number’s MAP TENS and DIGITS classifica-tions with respect to its true TENS and DIGITS classifications,averaging these values. The resulting values are then aver-aged across chains. I calculate POSTERIOR UNCERTAINTY

1I also experimented with a procedure that excluded anyTENS/DIGITS pairs from the proposal distribution for a given formthat were assigned to any previous forms within a window of arbi-trarily chosen size. However, this exacerbated the label-switchingproblem (a trivial issue); less trivially, it was difficult to motivate awindow size which plausibly paralleled working memory.

1736

by averaging the entropy of the posterior sample (comprisingblocked draws of m,zT ,zD) for each chain.

I extract numeral frequencies from the EMILLE HindiWebnews corpus (Baker et al., 2002). For each number, ac-curacy and posterior uncertainty are plotted according to fre-quency in Figure 2, along with correlation coefficients andp-values. Both correlations are significant (albeit noisy), pro-viding support for the idea that the HU numbers can be pro-cessed via a dual-route model. As seen in the lefthand plot,the majority of HU numbers occupy a quasi-Pareto frontier,indicating an efficient trade-off between decomposability andfrequency. Several numbers in the teens (seen in the upperrighthand corner of the plot) are both highly frequent and de-composable. These outliers in no way contradict the dual-route model, since a form’s decomposability does not pre-clude the possibility that it is stored whole. However, a hand-ful of numbers are found beneath the frontier (near the lowerlefthand corner), meaning that they are both relatively infre-quent and difficult to parse. These items can be viewed asvulnerable points in the grammar of HU numbers, and maybe prone to “leakage” or analogical restructuring.

Table 2: F-/V-measures for different inference regimesTEN DIG TEN DIG TEN DIG

convergence chain 1–2 chain 1–3 chain 2–32g .88/.87 .77/.74 .86/.86 .81/.78 .92/.91 .95/.932g,/MDL1 .77/.80 .86/.82 .78/.81 .85/.84 .87/.84 .90/.881+2g .88/.86 .89/.88 .90/.88 .90/.89 .99/.98 .99/.991+2g/MDL1 .93/.89 .95/.95 .94/.91 .95/.95 .99/.98 1/11+2g/MDL2 .94/.92 .95/.94 .94/.92 .95/.94 1/1 1/11+2g/MDL1/U .88/.87 .92/.92 .89/.89 .92/.92 .97/.96 1/1over. accuracy chain 1 chain 2 chain 32g .81/.81 .76/.74 .82/.84 .86/.84 .87/.88 .92/.892g/MDL1 .78/.79 .82/.80 .80/.82 .89/.88 .80/.81 .85/.831+2g .87/.86 .89/.87 .91/.91 .90/.88 .91/.91 .92/.891+2g/MDL1 .90/.89 .88/.86 .91/.91 .9/.88 .91/.91 .9/.881+2g/MDL2 .88/.87 .90/.88 .91/.91 .9/.87 .91/.91 .9/.871+2g/MDL1/U .91/.89 .9/.9 .93/.91 .92/.89 .91/.9 .92/.89

Table 3: MAP configuration for 1+2g/MDL1/U, chain 2.Rows represent tens classification; columns represent digitsclassification. Numbers are represented by cardinality forreadability. Asterisks (∗) mark numbers where the numericalrepresentation TENSi, DIGITS9 maps to the representa-tion TENSi+1, DIGITS−1

35 34 31 29∗ 32 36 38 3075 77,

7074 71 73 69∗ 72 76 78

15 17,16

14 13 12 18 11 10

65 67 64 61 63 59∗ 62 66 68 6044,45

47,27

40 41 43 39∗,49∗

42 46 28,48

25 37 24 21 33,23

19∗ 22 26 20

95 97 94 91 93 92 96 98 90,99

55 57 54 51 53 52 56 5850,85

87 84 81 83 79∗ 82 86 88,80

89

DiscussionThe models presented in this paper show that although HUnumerals 10–99 are morphologically irregular, a large num-ber can be classified according to their component parts.However, quite a few forms are difficult to decompose, mostof them of low magnitude and high frequency. In general, themodels handled some types of allomorphy well, and others

4.0 5.0 6.0 7.0

0.0

0.2

0.4

0.6

0.8

1.0

log frequency

accuracy

10

11

121314 15

16

1718

19

20

21222324

25

26

2728

29

30

3132

33

34 3536

37

3839

40

414243

44

4546 4748

49

50

51525354555657 58

59

60

61 626364 6566

67

6869

70

7172 7374 7576777879

80

81828384 858687 88

89

90

91 929394 95 969798

99

4.0 5.0 6.0 7.0

0.0

0.2

0.4

0.6

0.8

1.0

log frequency

post

erio

r ent

ropy

10

11

1213

14

1516

17

18

19

20

212223

24

25

26

27

28

29 30

31

32

33

34 35

36373839

40

4142

43

44

4546

47

48

49

50

51525354

5556

5758

59 6061 626364 65

6667

6869

70

7172 7374 75767778

7980

81

82

8384 85

86

87 8889 9091 929394 9596

9798

99

Figure 2: Log frequency as a predictor of model accuracy(ρ=−.55, p< .001) and post. uncertainty (ρ= .38, p< .001)

poorly. Forms containing the TENS5 allomorphs /-p@n/ ∼/-V@n/ are grouped together, due to their agreement in two outof three segments. Surprisingly, /sol@/ ‘16’ was recognizedas a member of the teens, despite the unique allomorph /-l@/;however, the model failed to properly classify it accordingto its digits place. Other forms with highly suppletive allo-morphy (e.g., /UncAs/ ‘49’) were misclassified. Additionally,many simplex forms were not analyzed as monomorphemic,unless only a monomorphemic analysis was permitted underthe proposal distribution.

As stated above, my results show that the HU numeral sys-tem’s design is largely compatible with a dual-route model ofaccess. In general, high-frequency items were more difficultfor a computational model to decompose, indicating greateropacity. (Berger shows that many of these numbers were his-torically subject to erosion and evidently resistant to analog-ical changes that would otherwise make them more transpar-ent and perceptually distinct.) At the same time, there are ex-ceptions to this generalization: certain high-frequency itemsin the teens showed high accuracy, though this does not ruleout the possibility that they are stored whole. Additionally,some problematic items are more opaque than would be ex-pected, given their low frequency. It is likely that such vul-nerable forms cause problems in planning and production.

The EMILLE Spoken Hindi corpus contains intrigu-ing numeral variants (e.g., /iúhjAnVe/ ‘91’ by speakerehinsp041, /sIntIjAnVe/ ‘97’ by ehinsp035, /UnAnVe/ ‘89’ byehinsp044), though the data are too sparse to serve as thebasis of a rigorous quantitative study. Many numbers aremissing in the corpus; furthermore, the variation observedmay stem from sources other than production difficulty, in-cluding transcriber error, multilingualism (with another In-dic language; for example, speaker ehinsp017 utters the form/bAvis/ ‘22’, standard in Marathi but not HU), and stylis-tic factors. Studies of variability in the production of HUnumerals — either in experimental contexts or naturalisticspeech — will serve as a valuable research direction, par-ticularly with an eye to whether vulnerable forms (i.e., sub-

1737

optimal forms with higher opacity than expected relative tofrequency) are subject to greater instability.

ConclusionIn this paper, I have employed a simple and somewhat crudemodel of allomorphy, inspired in part by bag-of-words mod-els used in document classification and intended to serve as abaseline for future work. A goal of this study was to test thelimits of a simple mixture model in a HU numerical recogni-tion task. A more sophisticated model of phonological pro-cesses may relate potential allomorphs to each other in termsof edits, as has been done in some MDL approaches (Virpiojaet al., 2010). However, while such models can contend withor recover relatively regular allomorphy, no model has beendesigned, to my knowledge, to capture the highly noisy allo-morphy found in the HU numeral system.

A true test of any computational model’s value is in howwell it agrees with human performance. A future direction forthis work will involve carrying out experimental research tosee how HU speakers process and produce numerical forms.It will serve us well to see how model inaccuracy fares asa predictor of greater response latency in psycholinguistictasks. A joint approach which considers limitations in bothexperimental performance and computational simulation willhelp us identify weak points in this and other complex mor-phological systems that can potentially (though not obligato-rily) undergo analogical change.

I have shown that frequency may facilitate the processingof more opaque HU numbers, but the question remains asto why most Indic number systems are on average more ir-regular than exact number systems found in other languages.Sociocultural factors may be partially responsible,2 and theirrole in shaping cross-lingustic number systems should betaken into account along with that of functional need (cf. Xu& Regier, 2014).

AcknowledgementsI am grateful to Stephan Meylan, Terry Regier, and threeanonymous reviewers for helpful comments. All errors andinfelicities are my own responsibility.

ReferencesBaayen, R. H. (1993). On frequency, transparency and pro-

ductivity. In G. Booij & J. van Marle (Eds.), Yearbook ofmorphology 1992 (p. 181-208). Dordrecht: Kluwer.

Baker, P., Hardie, A., McEnery, T., Cunningham, H., &Gaizauskas, R. (2002). EMILLE, a 67-million word corpusof Indic languages: Data collection, mark-up and harmon-isation. In Proc. LREC 2002 (p. 819-825).

2It is well known that South Asia is home to rigid societal hierar-chies, and historically, exact number systems may have been the pre-serve of elite groups, while marginal groups relied on an alternativesystem (prior to language standardization). While this theory is bla-tant speculation, it is worth briefly noting that Sinhala, an Indic lan-guage whose speakers chiefly practice Buddhism (which preaches adoctrine of egalitarianism), developed a transparent number system.

Berger, H. (1992). Modern Indo-Aryan. In J. Gvoz-danovic (Ed.), Indo-European numerals (Vol. 57, p. 243-287). Berlin: Walter de Gruyter.

Bright, W. (1969). Hindi numerals. Working Papers in Lin-guistics (University of Hawaii), 9, 29-47.

Brysbaert, M. (2005). Number recognition in different for-mats. In J. I. Campbell (Ed.), Handbook of mathematicalcognition (p. 23-42). New York, Hove: Psychology Press.

Comrie, B. (n.d.). Typology of numeral systems. Avail-able at https://mpi-lingweb.shh.mpg.de/numeral/TypNumCuhk 11ho.pdf. Accessed 1 October 2016.

Creutz, M., & Lagus, K. (2007). Unsupervised models formorpheme segmentation and morphology learning. ACMTransactions on Speech and Language Processing, 4, 1-34.

Fung, B. C. M., Wang, K., & Ester, M. (2003). Hierarchicaldocument clustering using frequent itemsets. In Proceed-ings of the SIAM International Conference on Data Min-ing.

Goldsmith, J. (2001). Unsupervised learning of the mor-phology of a natural language. Computational Linguistics,27(2), 153-198.

Goldsmith, J., & Riggle, J. (2012). Information theoreticapproaches to phonological structure: the case of Finnishvowel harmony. Natural Language and Linguistic Theory,30, 859-896.

Harris, Z. S. (1955). From phoneme to morpheme. Language,31(2), 190-222.

Hurford, J. R. (1987). Language and number: The emergenceof a cognitive system. Oxford: Blackwell.

Rosenberg, A., & Hirschberg, J. (2007). V-measure: A con-ditional entropy-based external cluster evaluation measure.In Proceedings of the 2007 Joint Conference on Empiri-cal Methods in Natural Language Processing and Compu-tational Natural Language Learning (p. 410-20). Prague:Association for Computational Linguistics.

Saffran, J. R., Newport, E. L., & Aslin, R. N. (1996). Wordsegmentation: The role of distributional cues. Journal ofMemory and Language, 35, 606-21.

Turner, R. L. (1962–1966). A comparative dictionary of theIndo-Aryan languages. London: Oxford University Press.

Virpioja, S., Kohonen, O., & Lagus, K. (2010). Unsupervisedmorpheme analysis with Allomorfessor. In C. Peters et al.(Ed.), CLEF 2009 Workshop, Part I (Vol. 6241, p. 609-616). Heidelberg: Springer.

Xu, Y., & Regier, T. (2014). Numeral systems across lan-guages support efficient communication: From approxi-mate numerosity to recursion. In P. Bello, M. Guarini,M. McShane, & B. Scassellati (Eds.), Proceedings of the36th annual meeting of the Cognitive Science Society.

Yin, J., & Wang, J. (2014). A Dirichlet Multinomial Mix-ture Model-based approach for short text clustering. InKDD ’14: Proceedings of the 20th ACM SIGKDD Interna-tional Conference on Knowledge Discovery and Data Min-ing (p. 233-242). New York: ACM.

1738

decomposability and frequency in the hindi/urdu number system · 2017-07-25 · decomposability and...

Documents