
Southern African Linguistics and Applied Language Studies 2012, 30(2): 203–218
Printed in South Africa. All rights reserved

Southern African Linguistics and Applied Language Studies is co-published by NISC (Pty) Ltd and Routledge, Taylor & Francis Group

Copyright © NISC (Pty) Ltd

SOUTHERN AFRICAN LINGUISTICS AND APPLIED LANGUAGE STUDIES

ISSN 1607-3614 EISSN 1727-9461
http://dx.doi.org/10.2989/16073614.2012.737598

Multiple object agreement morphemes in Setswana: A computational approach

Rigardt Pretorius1, Ansu Berg1 and Laurette Pretorius2*

1 Department of Setswana, School of Languages, North-West University, Potchefstroom Campus, Private Bag X6001, Potchefstroom 2520

2 School of Interdisciplinary Research and Graduate Studies, College of Graduate Studies, University of South Africa, PO Box 392, UNISA, Pretoria 0003

*Corresponding author, e-mail: [email protected]

Abstract: Setswana is an agglutinative language where prefixes and suffixes are extensively used in the formation of words. Words such as verbs, pronouns, adjectives and so on, which have a grammatical relationship with nouns in sentences, demonstrate agreement with such nouns by means of agreement morphemes. In certain instances verbs in Setswana sentences may take two objects. Both of these objects may be represented in the verb by object agreement morphemes. The result is that two object agreement morphemes may be prefixed to the verb. While the morphemes of the verb are presented systematically in Setswana grammars, the occurrence of multiple object agreement morphemes has received limited attention in the literature on Setswana linguistics. Similarly, this phenomenon has not yet been investigated from a computational morphological point of view.

This article reports on (i) an example-based investigation towards a better and more complete understanding of the phenomenon of multiple object agreement morphemes as they appear in Setswana verbs, (ii) the modelling of these morphemes in an existing finite state tokeniser and computational morphological analyser for Setswana, and (iii) the novel role that a morphological analyser and its guesser variant can play in a corpus-based investigation of the phenomenon under discussion.

Introduction

Setswana is one of the 11 official languages of South Africa. It is also the national language of Botswana, where English is the official business language (Central Statistic Office, 2009: 339). It is a Bantu language in the South Eastern zone and is one of the Sotho-Tswana languages, the others being Sesotho sa Leboa (Northern Sotho) and Sesotho (Southern Sotho) (Cole, 1955: xv–xix; Cole, 1961: 80–96). It is well known that the Sotho languages have SVO basic word order, are predominantly head-marking, and have articulated noun class systems and complex verbal morphology (including a number of valency changing suffixes, sometimes called ‘extensions’), with surface word order often determined by discourse pragmatics and information structure (see, for example, Louwrens, 1994: 192; Marten et al., 2007: 255). Setswana is an agglutinating language (Nurse, 2008: 28). Indeed, in many Bantu languages, including Setswana, ‘some of the ‘formative elements’ (prefixes and suffixes) can no longer be used separately, and sometimes we even find internal changes in a word, comparable to those by which in English we form the plural of a noun like foot or the past of the verb like run’ (Doke, 1955: 1).

Setswana is considered a lesser-studied and under-resourced language, both from a linguistic and a computational point of view. The challenge therefore remains to continue investigating Setswana, and the Bantu languages in general, linguistically as well as computationally, in terms of their radically different structure when compared to Indo-European languages.

It should be acknowledged that systems of grammatical analysis and terminology, as applied to Indo-European languages, are not always suitable for the analysis of Bantu languages and may sometimes even be inadequate (Cole, 1955: xxix). Two examples of such differences that are relevant for this article are Setswana’s rather idiosyncratic orthography and its implications for tokenisation, particularly with respect to verbs, and the (morphological) phenomenon of subject and object agreement morphemes in verb constructions, as they occur in the broader syntactic context of Setswana nominal classification and concordial agreement.

Disjunctive orthography and tokenisation

The disjunctive orthography employed in Setswana verbs, where verbal prefixes are usually written disjunctively while verbal suffixes are written conjunctively, is illustrated in Examples (1a) to (1c). This phenomenon, among others, influences the tokenisation of verbs in the language, as orthographic words are not necessarily linguistic words (Pretorius R et al., 2009: 66). The distinction between orthographic and linguistic word is treated in more detail by, among others, Louwrens (1994: 211–212), Louwrens and Poulos (2006: 389–401) and Kosch (2006: 4). Kosch (2006) defines an orthographic word as ‘a sound or sequence of sounds separated from other sounds or sequences of sounds by means of spaces in writing’ and a linguistic word as ‘a unit which has its own independent meaning. Structurally it contains at least a root. In addition to the root it may contain another root (or roots) and/or affixes’.

The English sentences in Examples (1a), (1b) and (1c) consist of six, four and four linguistic words, respectively, while the Setswana equivalents contain three, three and two linguistic words, respectively. We use the notational convention 1.woman to indicate that the Setswana word mosadi ‘woman’ belongs to Noun Class 1 (see also Marten et al., 2007). A complete list of the tags used in the examples, with their meanings, is provided in the appendix.

(1a) ‘The woman will read the book.’
     The / woman / will / read / the / book.
     Mosadi o tla bala buka.
     Mosadi / o tla bala / buka.
     1.woman/ AgrSubj-Cl1 FUTtense read VerbEnd/ 9.book
(1b) ‘She read the book.’
     She / read / the / book.
     Ene o badile buka.
     Ene / o badile / buka.
     she/ AgrSubj-3P-Sg read Perf VerbEnd/ 9.book
(1c) ‘She will help him.’
     She / will / help / him.
     Ene o tla mo thusa.
     Ene / o tla mo thusa.
     she/ AgrSubj-3P-Sg FUTtense AgrObj-3P-Sg help VerbEnd

In Example (1c) it is important to note that Setswana generally uses the object agreement morpheme (mo-) where English uses the third person singular personal pronoun (him/her). We return to the issue of tokenisation in the main section on Setswana tokenisation and computer verb morphology.
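The word counts for Examples (1a) to (1c) can be checked with a small sketch. It is only an illustration of the orthographic/linguistic distinction, not the finite state tokeniser discussed in this article: the function names are hypothetical, and the linguistic segmentation is simply read off the slash notation used in the examples rather than computed.

```python
# Illustrative sketch only: function names are hypothetical, and the
# linguistic-word boundaries are taken from the slash notation in the
# examples; they are not derived by any tokenisation algorithm.

def orthographic_words(sentence: str) -> list[str]:
    """Units separated by spaces in writing (final full stop dropped)."""
    return sentence.rstrip(".").split()


def linguistic_words(segmented: str) -> list[str]:
    """Units separated by '/' in the notation of Examples (1a) to (1c)."""
    parts = segmented.rstrip(".").split("/")
    return [part.strip() for part in parts if part.strip()]


# (plain orthography, slash-segmented form) for each Setswana example
EXAMPLES = {
    "1a": ("Mosadi o tla bala buka.", "Mosadi / o tla bala / buka."),
    "1b": ("Ene o badile buka.", "Ene / o badile / buka."),
    "1c": ("Ene o tla mo thusa.", "Ene / o tla mo thusa."),
}

for label, (plain, segmented) in EXAMPLES.items():
    print(label,
          len(orthographic_words(plain)), "orthographic,",
          len(linguistic_words(segmented)), "linguistic")
```

In (1c), for instance, five orthographic words collapse into two linguistic words, because the whole verb group o tla mo thusa counts as a single linguistic word.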

Nominal classification and morpheme agreement

Words such as pronouns, adjectives and so on have a grammatical relationship with nouns as they show agreement with noun class prefixes. In Setswana verbs this agreement manifests as subject and object agreement morphemes (Cole, 1955; Krüger, 2006). While a detailed exposition of Setswana morphology falls outside the scope of this article, the interested reader is referred to Cole (1955) and Krüger (2006), in which, inter alia, the complexities of verb morphology are presented systematically. The phenomenon of multiple object agreement morphemes has, however, even in these standard references, received limited attention.


Computational challenges

For well-resourced languages it has become customary, and even best practice, for investigations into linguistic phenomena to be enhanced and extended by the resources, technologies and tools offered by corpus linguistics. Such resources include machine-readable lexicons and large electronically available language corpora, both raw/unannotated and annotated. Corpora can in principle provide three different kinds of data, viz. empirical support, frequency information and meta-information, including author information, text origin information, text genre, and so on (Lüdeling & Kytö, 2008: ix). However, in order to make optimal use of electronic corpora, suitable technologies and tools should be available to explore and exploit the wealth of different kinds of corpus data mentioned above. Basic enabling technologies and tools include, for example, tokenisers, morphological analysers, chunkers and tools for shallow parsing, computational grammars and syntactic parsers. For a detailed exposition of the resources and technologies that are considered essential for the technological development of a language, the interested reader is referred to Maegaard et al. (2006). It is well known that lesser-studied, under-resourced languages mostly do not have a full complement of these resources, technologies and tools, the development of which is in itself resource intensive (in terms of human, financial and language resources).

A corpus-based approach to the present study would require at least a suitable corpus (in terms of quality, size and focus), a tokeniser and a morphological analyser for the language under investigation. For Setswana a tokeniser and a morphological analyser have been developed and are discussed further in subsequent sections. Various efforts are currently underway to develop general corpora for the official Bantu languages of South Africa. However, it is our contention that the study of multiple object agreement morphemes may require a special purpose corpus, since the phenomenon is not common and its occurrence may be limited to, or more prevalent in, specific genres. In this study, the focus is on the role that the tokeniser and morphological analyser can play in investigating multiple object agreement morphemes in Setswana. An investigation into the validity of these contentions and the development of a suitable corpus fall outside the scope of this article and form part of future work.

This article, therefore, aims to address the topic of multiple object agreement morphemes in Setswana by focusing on the following three aspects: (i) an example-based investigation towards a better and more complete understanding of the phenomenon of multiple object agreement morphemes as they appear in Setswana verbs; (ii) the modelling of these morphemes in an existing tokeniser and finite state computational morphological analyser for Setswana; and (iii) the novel role that a morphological analyser and its guesser variant may play in a corpus-based investigation of the phenomenon under discussion.

The second section, titled ‘Multiple object agreement morphemes’, starts with a brief overview of the relevant scientific literature and then discusses the salient aspects of the (multiple) object agreement morpheme(s) in Setswana verb constructions by means of a number of pertinent examples. The third section focuses on how the phenomenon under consideration is computationally modelled by means of finite state tokenisation, with specific reference to the disjunctively written verb constructions. This is followed by an overview of the core features and capabilities of an existing finite state morphological analyser for Setswana that are relevant for investigating the occurrence of multiple object agreement morphemes in Setswana. The application of the tokeniser and morphological analyser is demonstrated by applying them to the test suite of examples provided in the second section. The novel role that the morphological analyser and its guesser variant can play in a corpus-based investigation of the phenomenon under discussion is the topic of the fourth section. A proof-of-concept demonstration of this approach is also provided by applying it to the test suite of examples in the first two sections of the paper, together with selected examples from the results. The article is concluded in a final section, which also alludes to possible future work, including a small/medium scale corpus application, for which a suitable Setswana corpus still needs to be developed.


Multiple object agreement morphemes

In dealing with the aspect of object agreement in Setswana we follow a descriptive approach in which a systematic exposition of instances of occurrence of multiple objects, as they may appear in the language, is presented. The basis for the description is the structural/taxonomic approach as employed by various Bantu theorists such as Doke (1955), Cole (1955, 1961), Van Wyk, Posthumus and Krüger (2001, 2006). The description is directly informed by the work of Marten et al. (2007), in which they attempted to provide a framework within which major topics in micro-variation in Bantu may be systematically described and studied.

Morphosyntactic parameters for object agreement morphemes in Setswana

Marten et al. (2007) developed a set of 19 parameters to serve as a basis for cross-linguistic comparison of 10 Bantu languages from the South Eastern zone. Setswana (Guthrie classification number S31) is one of the languages that they include in their study. These parameters are ‘concerned with many major topics in Bantu grammar such as object relations, double objects, and agreement’ (Marten et al., 2007: 257). The 14 parameters are grouped into six topics, two of which are further divided into sub-parameters, resulting in 19 parameters. The first two topics deal with object markers and double objects. The parameters under these two topics are applied to Setswana by means of pertinent examples.

Topic one: Object markers

Parameter 1: OM – obj NP: Can the object marker and the lexical object NP co-occur?

In Setswana the object agreement morpheme and the object usually do not occur together within the same phrase. However, there are discourse circumstances as presented in parameter 2 where the co-occurrence of an object marker and a dislocated object is possible.

(2a) Ke bone basimane kwa toropong. (‘I saw the boys in town.’)
     Ke bone / basimane / kwa / toropong.
     AgrSubj-P1-Sg see Perf VerbEnd/ 2.boys/ in/ 9.town
(2b) Ke ba bone / kwa / toropong. (‘I saw them in town.’)
     AgrSubj-P1-Sg AgrObj-Cl2 see Perf VerbEnd/ in/ 9.town

Parameter 2: OM obligatory: Is co-occurrence obligatory in some contexts?

The object agreement morpheme does not usually occur with the object in Setswana sentences; however, co-occurrence is possible. When the object agreement morpheme and the object co-occur, the object is seen as an appositional member in the structural approach to syntax. An appositional member is a functional class that re-defines or describes a head member in more detail. It therefore adds some additional information to the content of the head member (Krüger, 2001: 47). Within the information structure it is acknowledged that these objects might be added for discourse-pragmatic reasons (Zerbian, 2006: 57).

(3a) Ke bone mosimane yo kwa toropong. (‘I saw this boy in town.’)
     Ke bone / mosimane / yo / kwa / toropong.
     AgrSubj-P1-Sg see Perf VerbEnd/ 1.boy/ this/ in/ 9.town
(3b) Ke mmone / kwa / toropong. (‘I saw him in town.’)
     AgrSubj-P1-Sg AgrObj-Cl1 see Perf VerbEnd/ in/ 9.town
(3c) Ke mmone / kwa / toropong /, mosimane / yo. (‘I saw him in town, this boy.’)
     AgrSubj-P1-Sg AgrObj-Cl1 see Perf VerbEnd/ in/ 9.town/ 1.boy/ this

In Example (3c) the object agreement morpheme occurs with the object noun phrase. The full object NP mosimane yo is adjoined to the core clause. Within this noun phrase the demonstrative pronoun yo qualifies the noun mosimane.

Parameter 3: OM loc: Are there locative object markers?

Object agreement morphemes of the locative class are used in Setswana.


(4a) Ke itse motse wa rona. (‘I know our village.’)
     Ke itse / motse / wa / rona.
     AgrSubj-P1-Sg know VerbEnd/ 3.village/ of/ us
(4b) Ke a go itse. (‘I know it (the place).’)
     AgrSubj-P1-Sg PREStense AgrObj-Cl17-Sg know VerbEnd

Parameter 4a: One OM: Is object marking restricted to one object marker per verb?

There may be more than one object agreement morpheme per verb in Setswana, as illustrated in Examples (6) to (22).

Parameter 4b: Restr 2 OM: Are two object markers possible in restricted contexts?

In instances (see the subsection ‘Linguistic instances …’ below) where two object agreement morphemes are possible they may be used in Setswana, as illustrated in Examples (6) to (22).

Parameter 4c: Mult OM: Are two or more object markers freely available?

Two object agreement morphemes may be used in Setswana in certain linguistic instances. These instances include a number of verbs which allow indirect objects based on their meaning, certain verbs with the causative suffixes, certain possessive constructions, and some verbs with the applied suffixes (see the subsection ‘Verbs allowing indirect objects …’). The use of two object agreement morphemes appears infrequently in Setswana (Cole, 1955). Indeed, instances where more than two object agreement morphemes occur are not known in Setswana. The Setswana equivalent of the Chaga Example (21) with three object agreement morphemes, presented in Marten et al. (2007: 266), would be as follows:

(5) Kgosi e mo rometse teng ka sone. (‘The chief sent him there with it.’)
    Kgosi / e mo rometse / teng / ka / sone
    9.chief/ AgrSubj-Cl9 AgrObj-Cl1 send Perf VerbEnd/ there/ with/ it

In (5) there is only one object agreement morpheme, mo-. The absolute pronoun teng, which belongs to the locative noun class, is used as a locative descriptive, while the instrumental particle ka and its complement, in this case the absolute pronoun of Class 7, appear as a descriptive of instrument.
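To see why finding such cases in running text is not a matter of simple string matching, consider a deliberately naive scan for two or more adjacent tokens that look like object agreement morphemes. This is a hypothetical sketch, not the morphological analyser used in this article; the set of forms is limited to those attested in the examples of this article, and the function name is an assumption.

```python
# Naive surface scan, for illustration only. OM_FORMS lists the object
# agreement morpheme forms attested in this article's examples; the scan
# knows nothing about morphology, so it over-generates.
OM_FORMS = {"mo", "ba", "le", "se", "di", "e", "bo", "go"}


def candidate_runs(sentence: str) -> list[list[str]]:
    """Return every run of two or more adjacent tokens found in OM_FORMS."""
    runs, current = [], []
    for token in sentence.rstrip(".").split():
        if token.lower() in OM_FORMS:
            current.append(token)
        else:
            if len(current) >= 2:
                runs.append(current)
            current = []
    if len(current) >= 2:  # a run may end the sentence
        runs.append(current)
    return runs


print(candidate_runs("Mogwebi o mo bo file."))             # Example (7c)
print(candidate_runs("Kgosi e mo rometse teng ka sone."))  # Example (5)
```

The scan correctly flags mo bo in Example (7c), but it also flags e mo in Example (5), where e is the Class 9 subject agreement morpheme and only mo- is an object agreement morpheme. Separating such homographs requires morphological analysis of the whole verb group rather than a surface scan.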

Parameter 4d: Free order: Is the order of multiple object markers structurally free?

Marten et al. (2007: 267) indicate that the order of multiple object agreement morphemes in Setswana is free. Technically this is true, but it is our contention that there may be preferences based on discourse pragmatics. In examples where the order of multiple object agreement morphemes is free (refer to (c) in each of the examples below) an alternative order is also given. Where an alternative order is not given, it is semantically unacceptable. The reason for this phenomenon is not clear and more (corpus-based) research is necessary in this context.
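The structurally possible orders can be enumerated mechanically, as in the following sketch. It is an illustration under stated assumptions: the function name is hypothetical, and the code says nothing about which orders are semantically acceptable or pragmatically preferred, which is exactly the open question raised above.

```python
from itertools import permutations


def object_marker_orders(subj_agr: str, markers: list[str], verb: str) -> list[str]:
    """List the distinct verb groups obtained by reordering the object
    agreement morphemes between subject agreement morpheme and verb.
    Structural enumeration only; acceptability is not modelled."""
    groups = (" ".join([subj_agr, *order, verb]) for order in permutations(markers))
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(groups))


print(object_marker_orders("o", ["mo", "bo"], "file"))      # cf. Example (7c)
print(object_marker_orders("o", ["di", "di"], "tshwaile"))  # cf. Example (20c)
```

For (7c) this yields both o mo bo file and o bo mo file. For (20c), where both objects belong to Class 10, the two orders coincide on the surface as o di di tshwaile, so the order cannot be read off the written form.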

Topic two: Double objects

Parameter 5: Sym word-order: Can either object be adjacent to the verb?

Marten et al. (2007: 269, Example 27) indicate that either object may follow the verb in Setswana. Without any scientific proof to the contrary, this seems possible to us, but rather irregular. Object agreement morphemes may be used in any order (our Example 22), but it is doubtful whether this is possible with the objects themselves. Since the focus in this article is on object agreement morphemes, the matter of the order of objects falls outside its scope.

Parameter 6: Sym passive: Can either object become subject under passivisation?

Either object can be the subject of the sentence in the passive.


Parameter 7: Sym OM: Can either object be expressed by an object agreement morpheme?

Either object can be expressed by an object agreement morpheme.

Linguistic instances where multiple object agreement morphemes may appear in Setswana

Under certain circumstances, indirect objects occur in the Bantu languages (Hyman & Duranti, 1982: 218). In Setswana, specifically, linguistic instances in which the verb may be followed by two objects (a direct and an indirect object) include a number of verbs allowing indirect objects based on their meaning; some verbs with the causative suffix; some possessive constructions; and some verbs with the applied suffix. Both these objects (direct and indirect) can be represented in the verb by object agreement morphemes, which serves as confirmation that a verb could have two object agreement morphemes. The examples in the four subsections below (‘Verbs allowing indirect objects based on their meaning’, ‘Objects following verbs containing the causative suffix’, ‘Possessive group in object position’ and ‘Objects following verbs containing the applied suffix’) follow a similar pattern of systematically replacing the objects in the sentences by object agreement morphemes.
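The replacement pattern followed in the (a), (b) and (c) variants of each example can be sketched as a simple lookup from noun class to object agreement morpheme. This is a minimal sketch under explicit assumptions: the table contains only the forms attested in this article's examples, the function name is hypothetical, and the code ignores the present tense morpheme a seen in examples such as (6c) as well as sound changes such as mo + bone giving mmone in (3b).

```python
# Object agreement morphemes by noun class, restricted to the forms that
# occur in this article's examples (not a complete paradigm).
OBJECT_AGREEMENT = {
    "Cl1": "mo", "Cl2": "ba", "Cl5": "le", "Cl7": "se", "Cl8": "di",
    "Cl9": "e", "Cl10": "di", "Cl14": "bo", "Cl17": "go",
}


def replace_objects(subject, subj_agr, verb, objects, n_replaced):
    """Replace the first n_replaced object nouns by object agreement
    morphemes placed before the verb. `objects` is a list of
    (noun, noun_class) pairs. Tense morphemes and sound changes are
    deliberately not handled in this sketch."""
    morphemes = [OBJECT_AGREEMENT[cls] for _, cls in objects[:n_replaced]]
    remaining = [noun for noun, _ in objects[n_replaced:]]
    return " ".join([subject, subj_agr, *morphemes, verb, *remaining]) + "."


objects = [("moreki", "Cl1"), ("borokgwe", "Cl14")]
for n in range(3):
    print(replace_objects("Mogwebi", "o", "file", objects, n))
```

With the data of Example (7) this reproduces the (a), (b) and first (c) variants: Mogwebi o file moreki borokgwe, Mogwebi o mo file borokgwe, and Mogwebi o mo bo file.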

Verbs allowing indirect objects based on their meaning

The meaning of a number of Setswana verbs allows them to take two objects without being inflected with a suffix. The following examples are cases in point:

go fa (‘to give’)
(6a) Morutabana o fa ngwana dibuka. (‘The teacher gives the child books.’)
     Morutabana / o fa / ngwana / dibuka.
     1.teacher/ AgrSubj-Cl1 give VerbEnd/ 1.child/ 10.books
(6b) Morutabana / o mo fa / dibuka. (‘The teacher gives him/her books.’)
     1.teacher/ AgrSubj-Cl1 AgrObj-Cl1 give VerbEnd/ 10.books
(6c) Morutabana / o a mo di fa. (‘The teacher gives it to him/her.’)
     1.teacher/ AgrSubj-Cl1 PREStense AgrObj-Cl1 AgrObj-Cl10 give VerbEnd
     or
     Morutabana / o a di mo fa. (‘The teacher gives it to him/her.’)
     1.teacher/ AgrSubj-Cl1 PREStense AgrObj-Cl10 AgrObj-Cl1 give VerbEnd

(7a) Mogwebi o file moreki borokgwe. (‘The merchant gave the buyer a pair of trousers.’)
     Mogwebi / o file / moreki / borokgwe.
     1.merchant/ AgrSubj-Cl1 give Perf VerbEnd/ 1.buyer/ 14.trousers
(7b) Mogwebi / o mo file / borokgwe. (‘The merchant gave him/her a pair of trousers.’)
     1.merchant/ AgrSubj-Cl1 AgrObj-Cl1 give Perf VerbEnd/ 14.trousers
(7c) Mogwebi / o mo bo file. (‘The merchant gave it to him/her.’)
     1.merchant/ AgrSubj-Cl1 AgrObj-Cl1 AgrObj-Cl14 give Perf VerbEnd
     or
     Mogwebi / o bo mo file. (‘The merchant gave it to him/her.’)
     1.merchant/ AgrSubj-Cl1 AgrObj-Cl14 AgrObj-Cl1 give Perf VerbEnd

(8a) Ke file wena dinotlolo. (‘I gave you the keys.’)
     Ke file / wena / dinotlolo.
     AgrSubj-P1-Sg give Perf VerbEnd/ PersPron-P2-Sg/ 8.keys
(8b) Ke go file / dinotlolo. (‘I gave you the keys.’)
     AgrSubj-P1-Sg AgrObj-P2-Sg give Perf VerbEnd/ 8.keys
(8c) Ke go di file. (‘I gave it to you / you it.’)
     AgrSubj-P1-Sg AgrObj-P2-Sg AgrObj-Cl8 give Perf VerbEnd
     or
     Ke di go file. (‘I gave it to you / you it.’)
     AgrSubj-P1-Sg AgrObj-Cl8 AgrObj-P2-Sg give Perf VerbEnd


go tshasa (‘to spread / put on’)
(9a) Mme o tshasitse borotho botoro. (‘Mother spread the bread (with) butter.’)
     Mme / o tshasitse / borotho / botoro.
     1a.mother/ AgrSubj-Cl1 spread Perf VerbEnd/ 14.bread/ 9.butter
(9b) Mme / o bo tshasitse / botoro. (‘Mother spread it (with) butter.’)
     1a.mother/ AgrSubj-Cl1 AgrObj-Cl14 spread Perf VerbEnd/ 9.butter
(9c) Mme / o bo e tshasitse. (‘Mother spread it (with) it.’)
     1a.mother/ AgrSubj-Cl1 AgrObj-Cl14 AgrObj-Cl9 spread Perf VerbEnd

go kopa (‘to ask’)
(10a) Karabo / o kopa / monna / tsela. (‘Karabo asks the man the way/road.’)
      1a.Karabo/ AgrSubj-Cl1 ask VerbEnd/ 1.man/ 9.road
(10b) Karabo / o mo kopa / tsela. (‘Karabo asks him the way/road.’)
      1a.Karabo/ AgrSubj-Cl1 AgrObj-Cl1 ask VerbEnd/ 9.road
(10c) Karabo / o a mo e kopa. (‘Karabo asks him it.’)
      1a.Karabo/ AgrSubj-Cl1 PREStense AgrObj-Cl1 AgrObj-Cl9 ask VerbEnd
      or
      Karabo / o a e mo kopa. (‘Karabo asks him it.’)
      1a.Karabo/ AgrSubj-Cl1 PREStense AgrObj-Cl9 AgrObj-Cl1 ask VerbEnd

go ruta (‘to teach’)
(11a) Morutabana o ruta bana dipalo. (‘The teacher teaches the children maths.’)
      Morutabana / o ruta / bana / dipalo.
      1.teacher/ AgrSubj-Cl1 teach VerbEnd/ 2.children/ 10.maths
(11b) Morutabana / o ba ruta / dipalo. (‘The teacher teaches them maths.’)
      1.teacher/ AgrSubj-Cl1 AgrObj-Cl2 teach VerbEnd/ 10.maths
(11c) Morutabana / o a ba di ruta. (‘The teacher teaches them it.’)
      1.teacher/ AgrSubj-Cl1 PREStense AgrObj-Cl2 AgrObj-Cl10 teach VerbEnd
      or
      Morutabana / o a di ba ruta. (‘The teacher teaches them it.’)
      1.teacher/ AgrSubj-Cl1 PREStense AgrObj-Cl10 AgrObj-Cl2 teach VerbEnd

Objects following verbs containing the causative suffix

The causative suffix conveys the meaning of ‘to let …’. The causative suffix is a valency changing suffix and introduces a new object apart from the direct object, for example:

go kwala (‘to write’)
(12a) Morutabana o kwadisa bana teko. (‘The teacher lets the children write a test.’)
      Morutabana / o kwadisa / bana / teko.
      1.teacher/ AgrSubj-Cl1 write Caus VerbEnd/ 2.children/ 9.test
(12b) Morutabana / o ba kwadisa / teko. (‘The teacher lets them write a test.’)
      1.teacher/ AgrSubj-Cl1 AgrObj-Cl2 write Caus VerbEnd/ 9.test
(12c) Morutabana / o a ba e / kwadisa. (‘The teacher lets them write it.’)
      1.teacher/ AgrSubj-Cl1 PREStense AgrObj-Cl2 AgrObj-Cl9 write Caus VerbEnd
      or
      Morutabana / o a e ba / kwadisa. (‘The teacher lets them write it.’)
      1.teacher/ AgrSubj-Cl1 PREStense AgrObj-Cl9 AgrObj-Cl2 write Caus VerbEnd

go raga (‘to kick’)
(13a) Mokatisi o ragisa batshameki bolo. (‘The trainer lets the players kick the ball.’)
      Mokatisi / o ragisa / batshameki / bolo.
      1.trainer/ AgrSubj-Cl1 kick Caus VerbEnd/ 2.players/ 9.ball
(13b) Mokatisi / o ba ragisa / bolo. (‘The trainer lets them kick the ball.’)
      1.trainer/ AgrSubj-Cl1 AgrObj-Cl2 kick Caus VerbEnd/ 9.ball
(13c) Mokatisi / o a ba e ragisa. (‘The trainer lets them kick it.’)
      1.trainer/ AgrSubj-Cl1 PREStense AgrObj-Cl2 AgrObj-Cl9 kick Caus VerbEnd

go pagama (‘to climb/get on’)
(14a) Mokgweetsi o pagamisa baeti bese. (‘The driver lets the passengers climb on the bus.’)
      Mokgweetsi / o pagamisa / baeti / bese.
      1.driver/ AgrSubj-Cl1 climb on Caus VerbEnd/ 2.passengers/ 9.bus
(14b) Mokgweetsi / o ba pagamisa / bese. (‘The driver lets them climb on the bus.’)
      1.driver/ AgrSubj-Cl1 AgrObj-Cl2 climb on Caus VerbEnd/ 9.bus
(14c) Mokgweetsi / o a ba e pagamisa. (‘The driver lets them climb on it.’)
      1.driver/ AgrSubj-Cl1 PREStense AgrObj-Cl2 AgrObj-Cl9 climb on Caus VerbEnd

go ja (‘to eat’)
(15a) Mosadi o jesa bana bogobe. (‘The woman lets the children eat porridge.’)
      Mosadi / o jesa / bana / bogobe.
      1.woman/ AgrSubj-Cl1 eat Caus VerbEnd/ 2.children/ 14.porridge
(15b) Mosadi / o ba jesa / bogobe. (‘The woman lets them eat porridge.’)
      1.woman/ AgrSubj-Cl1 AgrObj-Cl2 eat Caus VerbEnd/ 14.porridge
(15c) Mosadi / o a ba bo jesa. (‘The woman lets them eat it.’)
      1.woman/ AgrSubj-Cl1 PREStense AgrObj-Cl2 AgrObj-Cl14 eat Caus VerbEnd

Possessive group in object position

This section addresses instances where sequences of nouns appear in object position and the first noun represents a possessor affected by the action of the verb. Such constructions are typically referred to as examples of inalienable possession. The order of the members of a possessive group in Setswana is the inverse of that in the English possessive (‘the woman’s arm’). In possessive constructions in Setswana the possession occurs in initial position while the possessor follows the possessive particle, for example:

(16) letsogo la mosadi (‘the arm of the woman / the woman’s arm’)
     letsogo / la / mosadi
     5.arm/ PosPart-Cl5/ 1.woman

The following are examples where possessive constructions appear as multiple objects:

go kgaola (‘to cut off/amputate’)
(17a) Ngaka e kgaotse mosadi letsogo. (‘The doctor amputated the woman’s arm.’)
      Ngaka / e kgaotse / mosadi / letsogo.
      9.doctor/ AgrSubj-Cl9 amputate Perf VerbEnd/ 1.woman/ 5.arm
(17b) Ngaka / e mo kgaotse / letsogo. (‘The doctor amputated her arm.’)
      9.doctor/ AgrSubj-Cl9 AgrObj-Cl1 amputate Perf VerbEnd/ 5.arm
(17c) Ngaka / e mo le kgaotse. (‘The doctor amputated hers.’)
      9.doctor/ AgrSubj-Cl9 AgrObj-Cl1 AgrObj-Cl5 amputate Perf VerbEnd

(18a) Modiri o kgaotse setlhare dikala. (‘The worker cut off the tree’s branches.’)
Modiri / o kgaotse / setlhare / dikala.
1.worker / AgrSubj-Cl1 cut off Perf VerbEnd / 7.tree / 10.branches

(18b) Modiri / o se kgaotse / dikala. (‘The worker cut its branches off.’)
1.worker / AgrSubj-Cl1 AgrObj-Cl7 cut off Perf VerbEnd / 10.branches

(18c) Modiri / o se di kgaotse. (‘The worker cut its off.’)
1.worker / AgrSubj-Cl1 AgrObj-Cl7 AgrObj-Cl10 cut off Perf VerbEnd


go golafala (‘to get injured’)

(19a) Kotsi e golafaditse mosimane lenao. (‘The accident got the boy’s leg injured.’)
Kotsi / e golafaditse / mosimane / lenao.
9.accident / AgrSubj-Cl9 injure Caus Perf VerbEnd / 1.boy / 5.leg

(19b) Kotsi / e mo golafaditse / lenao. (‘The accident got his leg injured.’)
9.accident / AgrSubj-Cl9 AgrObj-Cl1 injure Caus Perf VerbEnd / 5.leg

(19c) Kotsi / e mo le golafaditse. (‘The accident got his injured.’)
9.accident / AgrSubj-Cl9 AgrObj-Cl1 AgrObj-Cl5 injure Caus Perf VerbEnd

go tshwaya (‘to mark’)

(20a) Moruadikgomo o tshwaile dinamane ditsebe. (‘The cattle farmer marked the calves’ ears.’)
1.cattle farmer / AgrSubj-Cl1 mark Perf VerbEnd / 10.calves / 10.ears

(20b) Moruadikgomo / o di tshwaile / ditsebe. (‘The cattle farmer marked their ears.’)
1.cattle farmer / AgrSubj-Cl1 AgrObj-Cl10 mark Perf VerbEnd / 10.ears

(20c) Moruadikgomo / o di di tshwaile. (‘The cattle farmer marked theirs.’)
1.cattle farmer / AgrSubj-Cl1 AgrObj-Cl10 AgrObj-Cl10 mark Perf VerbEnd
or
Moruadikgomo / o di di tshwaile. (‘The cattle farmer marked theirs.’)
1.cattle farmer / AgrSubj-Cl1 AgrObj-Cl10 AgrObj-Cl10 mark Perf VerbEnd

Objects following verbs containing the applied suffix

The applied suffix conveys the meaning of ‘to … for’. Like the causative suffix, the applied suffix is a valency changing suffix that introduces a new object in addition to the direct object, for example:

go apaya (‘to cook’)

(21a) Mme o apeetse bana dijo. (‘Mother cooked the children food./Mother cooked food for the children.’)
Mme / o apeetse / bana / dijo.
1.mother / AgrSubj-Cl1 cook Appl Perf VerbEnd / 2.children / 8.food

(21b) Mme / o ba apeetse / dijo. (‘Mother cooked them food./Mother cooked food for them.’)
1.mother / AgrSubj-Cl1 AgrObj-Cl2 cook Appl Perf VerbEnd / 8.food

(21c) Mme / o ba di apeetse. (‘Mother cooked them it./Mother cooked it for them.’)
1.mother / AgrSubj-Cl1 AgrObj-Cl2 AgrObj-Cl8 cook Appl Perf VerbEnd

go roma (‘to send’)

(22a) Ntate o rometse malome lekwalo. (‘Father sent uncle a letter.’)
Ntate / o rometse / malome / lekwalo.
1a.father / AgrSubj-Cl1 send Appl Perf VerbEnd / 1.uncle / 5.letter

(22b) Ntate / o mo rometse / lekwalo. (‘Father sent him a letter.’)
1a.father / AgrSubj-Cl1 AgrObj-Cl1 send Appl Perf VerbEnd / 5.letter

(22c) Ntate / o mo le rometse. (‘Father sent him it.’)
1a.father / AgrSubj-Cl1 AgrObj-Cl1 AgrObj-Cl5 send Appl Perf VerbEnd
or
Ntate / o le mo rometse. (‘Father sent him it.’)
1a.father / AgrSubj-Cl1 AgrObj-Cl5 AgrObj-Cl1 send Appl Perf VerbEnd

Setswana tokenisation and computational verb morphology for multiple object agreement morphemes

The identification of multiple object agreement morphemes in a corpus of running Setswana text requires correct tokenisation and subsequent morphological analysis in order to identify tokens that exhibit verb structure with multiple object agreement morphemes. The development of the tokeniser and morphological analyser for Setswana was done with the Xerox finite state toolkit (Beesley & Karttunen, 2003), as described in detail by Pretorius L et al. (2009) and Pretorius et al. (2005). This section provides a brief overview of the salient features of both the tokeniser and the morphological analyser.

Tokenisation

Tokenisation is the segmentation of running text, as usually found in an electronic corpus, first into sentences and then into tokens such as words, numbers, punctuation marks, parentheses, and similar entities. Tokenisation represents the first stage in the natural language processing pipeline that is usually applied in corpus linguistics. Subsequent stages are, for example, morphological analysis, word sense disambiguation, shallow parsing, deep syntactic analysis/parsing, and so on.

In most alphabetic languages (such as English, isiZulu, Afrikaans, Greek, Arabic, and so on) linguistic words are surrounded by whitespace. Therefore, a simple rule for tokenisation based on whitespace and punctuation is mostly quite accurate. An excellent exposition of the complexities of tokenisation may be found in Schmid (2008). In Setswana the tokenisation of running text represents an additional challenge due to the disjunctive writing style adopted in verb constructions, as briefly discussed in the first section and elaborated on by Pretorius R et al. (2009).

In essence, the tokenisation algorithm used for Setswana is based on three principles: (i) whitespace forms part of the ‘alphabet’ for writing verb constructions; (ii) a finite state grammar for verb constructions is combined with a longest right-to-left match strategy to constrain the occurrence of whitespace, allowed in (i), to verb construction internal use only; and (iii) the systematic subdivision of candidate verb constructions that are too long. This approach ensures that tokenisation on whitespace, outside of valid Setswana verb constructions, proceeds as is customary for alphabetic languages.

A fragment of the verb construction grammar, written in xfst, is shown in Figure 1. All the rules are self-explanatory and serve to identify valid candidate token patterns according to Setswana verb structure, except the last rule, which effects the actual tokenisation by inserting a new line symbol (NL) after an identified token. The modelling of multiple object agreement morphemes is done by the rule multiple_obj_morphemes -> obj_morpheme (obj_morpheme) which allows for at most two object agreement morphemes in Setswana.

Notation: [Char]+ denotes any sequence of one or more alphabetic characters, ‘(’ and ‘)’ indicate optionality, ‘|’ is the union operator, ‘ ’ is whitespace and NL represents a new line. Terminal symbols are shown in small letters. For example, the xfst instruction

define A [a|b|c(d)];

means that the non-terminal A can be rewritten as the terminal symbol strings a, b, c or cd.
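As a rough cross-check of this notation, the same four-string language can be written as a Python regular expression. This is only an analogy that assumes Python's `re` module in place of xfst; the helper name `accepts` is hypothetical and does not come from the article.

```python
import re

# Python analogue of the xfst definition:  define A [a|b|c(d)];
# The xfst optionality "(d)" corresponds to "(?:d)?" in Python regex syntax.
A = re.compile(r"a|b|c(?:d)?")


def accepts(s: str) -> bool:
    """Return True if the whole string s belongs to the language A."""
    return A.fullmatch(s) is not None


if __name__ == "__main__":
    for s in ["a", "b", "c", "cd", "d", "ab"]:
        print(s, accepts(s))
```

Only a, b, c and cd are accepted; strings such as d or ab fall outside the language.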

Two rules offered by the Xerox regular expression calculus deserve special attention, with A, B, C, L, R denoting regular languages in the formal language theory sense.

(i) Marking: [A -> B ... C] is used to bracket any instance of the language A with any instance of the language B to the left and any instance of the language C to the right. In this rule ‘...’ is a Xerox defined operator.

(ii) Directional conditional replacement: The rule A ->@ B || L _ R means that replacement strings are selected from right to left and if more than one candidate matching string begins at a given location, only the longest one is replaced. Moreover, this replacement only takes place if the left context of the replacement string is in the language L and the right context is in the language R.

The longest right-to-left match strategy is implemented by means of the combination of rules of types (i) and (ii): TOKEN ->@ ... NL || ‘ ’ _ ensures that from a given position in the input stream of characters the longest possible string from right to left, which corresponds to the TOKEN pattern, is matched, provided that it has whitespace to its left, and is then replaced by itself (denoted by ...) followed by a new line marker.

The application of the tokeniser is discussed in the next section.

We conclude the discussion of the tokeniser by illustrating the significance of the notion of longest match by means of an example. Consider the input stream o mo le rometse. From Figure 1 it should be clear that both the strings o mo le and o mo le rometse satisfy the TOKEN pattern, but, if used, the shorter token will cause rometse to be a separate, but not valid, token. Since the notion of keeping track of what input has been processed falls outside the theoretical framework of finite state methods, it is, in principle, not possible to backtrack, which rules out immediate recovery from this incorrect tokenisation. In cases where, on the other hand, the longest match is too long and does not yield a valid token, an iterative process of systematic subdivision may be followed to identify the shorter valid tokens. The intuition here is that by allowing shorter candidate tokens (o mo le / rometse) we relinquish ‘context’ that cannot be recovered easily, while in the case of possibly too long tokens we preserve ‘context’ for future exploitation. A detailed discussion of this algorithm falls outside the scope of this article and may be found in Pretorius R et al. (2009).
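The effect of the longest right-to-left match can be illustrated with a small Python sketch. It is not the authors' xfst tokeniser: it assumes Python's `re` module, covers only the indicative and infinitive prefix patterns of the grammar fragment, works on whitespace-separated words, and omits the subdivision step (iii). The function name `tokenise` and the constant names are hypothetical.

```python
import re

# Simplified subset of the verb construction grammar (assumption: only the
# indicative and infinitive prefixes are modelled, in Python regex syntax).
SUBJ = r"(?:ke|o|re|lo|le|a|ba|e|se|di|bo|go)"
OBJ = r"(?:go|re|lo|le|mo|ba|o|e|a|se|di|bo)"
MULTI_OBJ = rf"{OBJ}(?: {OBJ})?"          # at most two object morphemes
VERB_WORD = r"[a-z]+(?:a|e|ng)"           # string with a verb ending
IND_PREF = rf"(?:ga )?{SUBJ}(?: (?:a|sa))?(?: (?:a|ka|sa))?(?: tla)?(?: {MULTI_OBJ})?"
INF_PREF = rf"go(?: se)?(?: (?:ka|sa))?(?: {MULTI_OBJ})?"
VERB_CONSTR = re.compile(rf"(?:{IND_PREF}|{INF_PREF}) {VERB_WORD}")


def tokenise(sentence: str) -> list[str]:
    """Scan from right to left; at each end position keep the longest
    candidate that is a verb construction, else fall back to one word."""
    words = sentence.lower().split()
    tokens = []
    end = len(words)
    while end > 0:
        for start in range(end):          # earliest start = longest candidate
            candidate = " ".join(words[start:end])
            if start == end - 1 or VERB_CONSTR.fullmatch(candidate):
                break
        tokens.append(candidate)
        end = start
    tokens.reverse()
    return tokens


if __name__ == "__main__":
    print(tokenise("ntate o mo le rometse"))
    print(tokenise("ntate o rometse malome lekwalo"))
```

Because the longest candidate is tried first, o mo le rometse is kept as one token rather than being split into a shorter construction followed by a stray rometse.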

Morphological analysis

It is well known that the two central problems in morphology are word formation or morphotactics (governing the morpheme sequencing in words) and morphophonological alternation rules (governing the sound changes that, among others, often occur at morpheme boundaries). The finite state approach to computational morphology remains a state-of-the-art formalism by which morphotactics and alternation may be accurately and efficiently modelled, implemented as so-called finite state networks and composed to form a single so-called finite state transducer (fst), which then constitutes the morphological analyser. It is well known that fsts follow all paths and therefore produce all valid analyses of a given token, as illustrated for the token o a mo di fa in Figure 2. The disambiguation of these various analyses is context dependent and falls outside the scope of morphological analysis. Another important characteristic of fsts is their inherent bidirectionality by which they facilitate both analysis and generation, as shown in Figure 3.
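The following Python sketch illustrates, under strong simplifying assumptions, why a single token can receive several analyses and how analysis and generation mirror each other. It is a toy table lookup, not a finite state transducer: the readings of the prefixes o, a, mo and di are those reported for o a mo di fa in Figure 2, while the names `PREFIX_TAGS`, `analyse` and `generate`, and the rule that VerbEnd is realised as the final vowel a, are hypothetical simplifications.

```python
from itertools import product

# Toy table of prefix readings for "o a mo di fa" (taken from Figure 2).
PREFIX_TAGS = {
    "o": ["AgrSubj-2p-Sg", "AgrSubj-Cl1", "AgrSubj-Cl3"],
    "a": ["AspPre"],
    "mo": ["AgrObj-Cl1"],
    "di": ["AgrObj-Cl10", "AgrObj-Cl8"],
}
FEATURES = "Verb-INDmood-PRES-Pos"


def analyse(token: str) -> list[str]:
    """Return every combination of prefix readings for the token."""
    *prefixes, verb = token.split()
    root = verb[:-1]                       # assumption: strip the final vowel
    readings = [PREFIX_TAGS[p] for p in prefixes]
    return [
        "+".join([FEATURES, *combo, f"[{root}]", "VerbEnd"])
        for combo in product(*readings)
    ]


def generate(analysis: str) -> str:
    """Map an analysis string back to its surface form (inverse lookup)."""
    tag_to_form = {tag: form for form, tags in PREFIX_TAGS.items() for tag in tags}
    parts = analysis.split("+")[1:]        # drop the feature prefix
    words = [tag_to_form[tag] for tag in parts[:-2]]
    root = parts[-2].strip("[]")
    return " ".join(words + [root + "a"])  # assumption: VerbEnd surfaces as -a


if __name__ == "__main__":
    for a in analyse("o a mo di fa"):
        print(a, "->", generate(a))
```

Three readings of o combined with two readings of di give the six analyses; each of them generates the same surface form again, which is the bidirectionality referred to above.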

The Xerox tools (Beesley & Karttunen, 2003) and related open source initiatives such as Foma (Hulden, 2009) and HFST (Lindén et al., 2011) offer comprehensive and mature support for the development of finite state morphological analysers. The morphotactics are specified by means of lexc, a high-level declarative language for natural language lexicons, including word root lexicons. This means that only words the roots of which are present in the word root lexicons can be successfully analysed and generated. The alternations are modelled by means of the xfst tool, which supports an extended regular expression calculus, including sophisticated replace rules (see also the discussion of the tokeniser above). A detailed discussion of how lexc and xfst are used to develop the Setswana morphological analyser falls outside the scope of this article, but is covered in some detail in Pretorius et al. (2005).

Figure 1: Fragment of verb construction grammar for tokenisation

..
define WORD [Char]+ ‘ ’;
define STRINGwithVERBEnding [Char]+ [a|e|n g] ‘ ’;
define SUBJ_MORPHEME [k e|o|r e|l o|l e|a|b a|e|s e|d i|b o|g o] ‘ ’;
define OBJ_MORPHEME [g o|r e|l o|l e|m o|b a|o|e|a|s e|d i|b o] ‘ ’;
define MULTIPLE_OBJ_MORPHEMES OBJ_MORPHEME (OBJ_MORPHEME);
..
define IMP_MOOD_PREF [(SUBJ_MORPHEME) (s e ‘ ’) (MULTIPLE_OBJ_MORPHEMES)];
define IND_MOOD_PREF [(g a ‘ ’) SUBJ_MORPHEME ([a|s a] ‘ ’) ([a|k a|s a] ‘ ’) (t l a ‘ ’) (MULTIPLE_OBJ_MORPHEMES)];
define INF_MOOD_PREF [g o ‘ ’ (s e ‘ ’) ([k a|s a] ‘ ’) (MULTIPLE_OBJ_MORPHEMES)];
..
define VERB_PREFIXES [ ..| IMP_MOOD_PREF | IND_MOOD_PREF | INF_MOOD_PREF |..];
define VERB_CONSTR [VERB_PREFIXES STRINGwithVERBEnding];
define TOKEN [..| VERB_CONSTR | WORD |..];
..
/* longest right-to-left match -- insert a newline after each token */
define TOK1 [TOKEN ->@ ... NL || ‘ ’ _];
..

The finite state morphological analyser for Setswana was designed to accommodate verb constructions with internal whitespace to allow for their disjunctive writing style and to make provision for multiple object agreement morphemes as they occur in Setswana. It covers the complete morphology of Setswana, as described in the standard references (Cole, 1955; Krüger, 2006). However, in terms of word root lexicons the coverage of the morphological analyser is limited by the word root information that is currently available in Setswana paper dictionaries. Increasing the coverage of the morphological analyser remains a challenge and will require continued updating and enhancement over time of the word root lexicons as new words are added to the language or discovered, for example, in electronically available language corpora.

A particularly useful feature of the Xerox finite state tools is the possibility to use a so-called guesser variant of the morphological analyser (Beesley & Karttunen, 2003: 444–451). Unlike a normal morphological analyser, wherein the set of known roots is explicitly enumerated, a guesser is designed to analyse words that are based on any phonologically possible root. The set of phonologically possible roots is defined more or less precisely, using regular expressions. The xfst script that is used in the present version of the guesser is shown in Figure 4. It is based on consonant and vowel patterns observed in the word root lexicons of the current version of the morphological analyser for Setswana. The main difference between the phonologically possible noun roots and verb roots is that noun roots require a root final vowel. The script also makes provision for a special tag, [Guess], to distinguish between the analyses provided by the morphological analyser and the guesser, respectively. The strategy is to apply the normal morphological analyser to all the tokens and to submit to the guesser for analysis only those tokens that could not be successfully analysed by the normal analyser. By subsequently subjecting the ‘guessed’ candidate roots to human elicitation, new roots may be identified and added to the word root lexicons of the analyser. In this article the guesser is proposed as a tool for identifying occurrences of multiple object agreement morphemes in a corpus even in cases where the verb root is not present in the verb root lexicon of the analyser. A proof-of-concept application is discussed in the next section.
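The consonant and vowel patterns of Figure 4 can be approximated in Python to check which strings count as phonologically possible roots. This is a sketch that assumes Python's `re` module and omits the [Guess] tag; the noun root pattern is read here as (V) C (C (C)) V followed by one or more C (C (C)) V groups, which is an interpretation of the figure. The function names are hypothetical.

```python
import re

V = "[aeiou]"
C = "[bcdfghjklmnpqrstvwxyz]"
CLUSTER = f"{C}(?:{C}{C}?)?"               # C (C (C)): one to three consonants

# Verb roots: one or more (V) C (C (C)) V (V) groups, then a final cluster.
VERB_ROOT = re.compile(f"(?:{V}?{CLUSTER}{V}{V}?)+{CLUSTER}")
# Noun roots (assumed reading): must end in a vowel.
NOUN_ROOT = re.compile(f"{V}?{CLUSTER}{V}(?:{CLUSTER}{V})+")


def is_possible_verb_root(root: str) -> bool:
    """True if root fits the guesser's verb root pattern."""
    return VERB_ROOT.fullmatch(root) is not None


def is_possible_noun_root(root: str) -> bool:
    """True if root fits the guesser's noun root pattern."""
    return NOUN_ROOT.fullmatch(root) is not None


if __name__ == "__main__":
    for root in ["gap", "gapets", "rom", "ntate", "malome"]:
        print(root, is_possible_verb_root(root), is_possible_noun_root(root))
```

Under these patterns both gap and gapets are possible verb roots, which is why a guesser built this way can return more than one candidate root for a single token.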

A corpus-based approach to multiple object agreement morpheme extraction

In this section we illustrate the use of the tokeniser, the morphological analyser and its guesser variant in mining an electronic corpus for morphological information – in this case the occurrence of verb constructions with two object agreement morphemes. In this proof-of-concept experiment the test suite of examples in the first two sections of the paper is used, together with one verb construction based on a root that is not yet in the word root lexicon of the analyser, viz. gap (‘take by force’).

Figure 2: Valid morphological analyses of the token o a mo di fa

Verb-INDmood-PRES-Pos+AgrSubj-2p-Sg+AspPre+AgrObj-Cl1+AgrObj-Cl10+[f]+VerbEnd
Verb-INDmood-PRES-Pos+AgrSubj-2p-Sg+AspPre+AgrObj-Cl1+AgrObj-Cl8+[f]+VerbEnd
Verb-INDmood-PRES-Pos+AgrSubj-Cl1+AspPre+AgrObj-Cl1+AgrObj-Cl10+[f]+VerbEnd
Verb-INDmood-PRES-Pos+AgrSubj-Cl1+AspPre+AgrObj-Cl1+AgrObj-Cl8+[f]+VerbEnd
Verb-INDmood-PRES-Pos+AgrSubj-Cl3+AspPre+AgrObj-Cl1+AgrObj-Cl10+[f]+VerbEnd
Verb-INDmood-PRES-Pos+AgrSubj-Cl3+AspPre+AgrObj-Cl1+AgrObj-Cl8+[f]+VerbEnd


The tokenisation procedure, as described in the third section, yielded 198 tokens. On subjecting these tokens to a morphological analysis, two tokens, viz. wa rona and ka sone, were found to be incorrect since both the morphological analyser and the guesser failed to produce any analysis. Therefore, these two tokens are typical candidates for further subdivision, the detailed discussion of which is outside the scope of this study. All the other tokens, except o mo le gapetse, were successfully analysed. Selected examples are shown below. All occurrences of multiple object agreement morphemes were identified. In the case of o mo le gapetse, the guesser spotted the multiple object agreement morphemes in the verb construction in spite of the fact that the word root was not known to the analyser, and yielded a number of different analyses based on two phonologically possible roots, viz. gapets[Guess] and gap[Guess], also shown below. Human elicitation subsequently rendered the verb root gap suitable for addition to the verb root lexicon of the analyser. (A list of the tags used in the morphological analyses below, with their meanings, is provided in the appendix.)

Test suite (Example (22)): < go roma. ntate o rometse malome lekwalo. ntate o mo rometse lekwalo. ntate o mo le rometse. ntate o le mo rometse. ntate o mo le gapetse. >

Output of the tokeniser: go roma / ntate / o rometse / malome / lekwalo / ntate / o mo rometse / lekwalo / ntate / o mo le rometse / ntate / o le mo rometse / ntate / o mo le gapetse

Figure 3: Schematic illustration of the bidirectionality of the Setswana morphological analyser/generator (see Example (6c), with verb root [f])

Surface form: o a mo di fa
Tswana morphological analyser/generator fst
Analysis: Verb-INDmood-PRES-Pos+AgrSubj-Cl1+AspPre+AgrObj-Cl1+AgrObj-Cl10+[f]+VerbEnd

Figure 4: Phonologically possible roots for the Setswana guesser

define V [a|e|o|i|u];
define C [b|c|d|f|g|h|j|k|l|m|n|p|q|r|s|t|v|w|x|y|z];
define VerbRoot [[(V) C (C (C)) V (V)]+ C (C (C)) ‘[Guess]’:0];
define NounRoot [(V) C (C (C)) V [C (C (C)) V]+ ‘[Guess]’:0];

Output of the morphological analyser:

In the morphological analyses of verbs we follow the notational convention that the features are given at the beginning of the analysis. These features are separated by a ‘-’. The ‘+’ separates the tags that are associated with the actual morphemes. For example, go roma is a verb in the infinitive mood, present tense and positive (Verb-INFmood-PRES-Pos). Moreover, go is tagged as InfPre, the root rom is tagged as [rom] and the verbal ending is tagged as VerbEnd.

go roma
Verb-INFmood-PRES-Pos+InfPre+[rom]+VerbEnd

ntate
NPre1a+[ntate]

o rometse
Verb-INDmood-PERF-Pos+AgrSubj-Cl1+[rom]+Appl+Perf+VerbEnd

malome
NPre1a+[malome]

lekwalo
NPre5+[kwal]+DeverbSuf

o mo rometse
Verb-INDmood-PERF-Pos+AgrSubj-Cl1+AgrObj-Cl1+[rom]+Appl+Perf+VerbEnd

o mo le rometse
Verb-INDmood-PERF-Pos+AgrSubj-Cl1+AgrObj-Cl1+AgrObj-Cl5+[rom]+Appl+Perf+VerbEnd

o le mo rometse
Verb-INDmood-PERF-Pos+AgrSubj-Cl1+AgrObj-Cl5+AgrObj-Cl1+[rom]+Appl+Perf+VerbEnd

Selected output of the guesser:

o mo le gapetse
Verb-INDmood-PERF-Pos+AgrSubj-Cl1+AgrObj-Cl1+AgrObj-Cl5+[gap[Guess]]+Appl+Perf+VerbEnd
Verb-INDmood-POT-Neg+AgrSubj-Cl1+AgrObj-Cl1+AgrObj-Cl5+[gapets[Guess]]+VerbEnd
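Given analysis strings in this notation, spotting verb constructions with two object agreement morphemes reduces to counting AgrObj tags. The Python sketch below shows one possible way to do this; it is not part of the authors' toolchain, and the function names `parse_analysis`, `object_agreement`, `has_multiple_objects` and `is_guess` are hypothetical.

```python
def parse_analysis(analysis: str) -> tuple[list[str], list[str]]:
    """Split a verb analysis into its features ('-' separated, before the
    first '+') and its morpheme tags ('+' separated)."""
    features, *tags = analysis.split("+")
    return features.split("-"), tags


def object_agreement(analysis: str) -> list[str]:
    """Return the object agreement tags in the order in which they occur."""
    _, tags = parse_analysis(analysis)
    return [tag for tag in tags if tag.startswith("AgrObj-")]


def has_multiple_objects(analysis: str) -> bool:
    """True if the analysis contains two or more object agreement tags."""
    return len(object_agreement(analysis)) >= 2


def is_guess(analysis: str) -> bool:
    """True if the root in the analysis was supplied by the guesser."""
    return "[Guess]" in analysis


if __name__ == "__main__":
    outputs = {
        "o rometse": "Verb-INDmood-PERF-Pos+AgrSubj-Cl1+[rom]+Appl+Perf+VerbEnd",
        "o mo le rometse": "Verb-INDmood-PERF-Pos+AgrSubj-Cl1+AgrObj-Cl1+AgrObj-Cl5+[rom]+Appl+Perf+VerbEnd",
        "o mo le gapetse": "Verb-INDmood-PERF-Pos+AgrSubj-Cl1+AgrObj-Cl1+AgrObj-Cl5+[gap[Guess]]+Appl+Perf+VerbEnd",
    }
    for token, analysis in outputs.items():
        print(token, object_agreement(analysis), is_guess(analysis))
```

Keeping the guessed analyses flagged separately makes it easy to route them to human elicitation before any root is added to the lexicon.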

The procedure, outlined and illustrated above, scales well to corpora of increasing size and allows the corpus-based investigation of a variety of morphological phenomena of interest. It therefore constitutes a useful and novel approach to corpus-based morphological studies in Setswana.

Conclusion and future work

In summary, the contribution of this article is twofold. Firstly, the work of Marten et al. (2007) was discussed and extended with accurate Setswana examples that were validated by a linguist who is a mother tongue speaker. Secondly, a computational corpus-based approach to investigating the occurrence of object agreement morphemes in Setswana verb constructions was proposed and demonstrated by applying it to the test suite of examples in the article. An existing finite state tokeniser and morphological analyser were used for this purpose. The point was made that the approach scales well and may be extended to other topics of investigation in Setswana morphology. We also alluded to the importance of the development of a suitable corpus towards a more extensive study of the mentioned phenomenon. Since corpus development is a specialised and resource intensive enterprise, this will form part of our future work in this field.


References

Beesley KR & Karttunen L. 2003. Finite state morphology. Stanford, CA: CSLI Publications.
Central Statistics Office. 2009. Demographic survey 2006. Gaberone: Government Printer.
Cole DT. 1955. An introduction to Tswana grammar. Cape Town: Longman.
Cole DT. 1961. Doke’s classification of Bantu languages. In Doke CM & Cole DT (eds) Contributions to the history of Bantu linguistics. Johannesburg: Witwatersrand University Press, pp 80–96.
Doke CM. 1955. The Southern Bantu languages. London: Oxford University Press.
Hulden M. 2009. Foma: A finite-state compiler and library. Proceedings of the EACL 2009 Demonstrations Session, pp 29–32, Athens, Greece, 3 April 2009.
Hyman LM & Duranti A. 1982. On the object relation in Bantu. In Hopper P & Thompson S (eds) Studies in transitivity (Syntax and Semantics 15). New York: Academic Press, pp 217–239.
Kosch IM. 2006. Topics in morphology in the African language context. Pretoria: Unisa Press.
Krüger CJH. 2001. Tswana: Sintaksis. Study guide – Honours. PU for CHE: Potchefstroom.
Krüger CJH. 2006. Introduction to the morphology of Tswana. München: Lincom.
Lindén K, Silfverberg M, Axelson E, Hardwick S & Pirinen TA. 2011. HFST: Framework for compiling and applying morphologies. In Mahlow C & Pietrowski M (eds) Systems and frameworks for computational morphology. Communications in Computer and Information Science, vol. 100. Springer, pp 67–85.
Louwrens LJ. 1994. Dictionary of Northern Sotho grammatical terms. Pretoria: Via Afrika.
Louwrens LJ & Poulus G. 2006. The status of the word in selected conventional writing systems: The case of disjunctive writing. Southern African Linguistics and Applied Language Studies 24(3): 389–401.
Lüdeling A & Kytö M. 2008. Corpus linguistics: An international handbook, volume 1. Berlin: Walter de Gruyter.
Maegaard B, Krauwer S, Choukri K & Damsgaard Jørgensen L. 2006. The BLARK concept and BLARK for Arabic. Proceedings of LREC, Genoa, Italy, pp 773–778.
Marten L, Kula NC & Thwala N. 2007. Parameters of morphosyntactic variation in Bantu. Transactions of the Philological Society 105(3): 253–338.
Nurse D. 2008. Tense and aspect in Bantu. Oxford: Oxford University Press.
Posthumus LC. 1994. Word-based versus root-based morphology in the African Languages. South African Journal of African Languages 14(1): 1–40.
Pretorius L, Viljoen B, Pretorius R & Berg A. 2008. Towards a computational morphological analysis of Tswana compounds. Literator 29(1): 1–20.
Pretorius L, Viljoen B, Pretorius R & Berg A. 2009. A finite state approach to Tswana verb morphology. FSMNLP 2009 Pre-proceedings, Eighth International Workshop on Finite-State Methods and Natural Language Processing, University of Pretoria, Pretoria, pp 131–138.
Pretorius RS, Pretorius L & Viljoen B. 2005. A finite state morphological analysis of Tswana nouns. South African Journal of African Languages 25(1): 37–47.
Pretorius R, Berg A, Pretorius L & Viljoen B. 2009. Tswana tokenisation and computational verb morphology: Facing the challenge of a disjunctive orthography. AfLaT2009 Proceedings, EACL 2009 Workshop on Language Technologies for African Languages. Athens, Greece, pp 66–73.
Schmid H. 2008. Tokenising and part-of-speech tagging. In Lüdeling A & Kytö M (eds) Corpus linguistics: An international handbook, volume 1. Berlin: Walter de Gruyter, pp 527–551.
Van Wyk EB. 1987. Theory and grammatical description: The case of the verb categories in Northern Sotho. African Studies 46(2): 275–286.
Zerbian Z. 2006. Expression of information structure in the Bantu Language Northern Sotho. PhD thesis, Zentrum für Allgemeine Sprachwissenschaft, Berlin.


Appendix: Morphological tags in the text

Tag               Meaning
AgrObj-P2-Sg      Object agreement morpheme of second person singular
AgrObj-3P-Sg      Object agreement morpheme of third person singular
AgrObj-Cl1        Object agreement morpheme of Class 1
AgrObj-Cl2        Object agreement morpheme of Class 2
AgrObj-Cl5        Object agreement morpheme of Class 5
AgrObj-Cl7        Object agreement morpheme of Class 7
AgrObj-Cl8        Object agreement morpheme of Class 8
AgrObj-Cl9        Object agreement morpheme of Class 9
AgrObj-Cl10       Object agreement morpheme of Class 10
AgrObj-Cl14       Object agreement morpheme of Class 14
AgrObj-Cl17       Object agreement morpheme of Class 17
AgrSubj-3P-Sg     Subject agreement morpheme of third person singular
AgrSubj-P1-Sg     Subject agreement morpheme of first person singular
AgrSubj-Cl1       Subject agreement morpheme of Class 1
AgrSubj-Cl9       Subject agreement morpheme of Class 9
Appl              Applicative verbal suffix
Caus              Causative verbal suffix
DeverbSuf         Deverbative suffix
FUT               Future tense
INDmood           Indicative mood
INFmood           Infinitive mood
InfPre            Infinitive prefix
Neg               Negative
NPre1a            Noun class prefix of Class 1a
NPre5             Noun class prefix of Class 5
PERF              Perfect tense
Perf              Perfect tense verbal suffix
PersPron-P2-Sg    Personal pronoun of second person singular
Pos               Positive
PosPart-Cl5       Possessive particle of Class 5
POT               Potential
PRES              Present tense
Verb              Part-of-speech
VerbEnd           Verb ending (verb final vowel)
[gap[Guess]]      Guessed verb root gap
[gapets[Guess]]   Guessed verb root gapets
[ntate]           [noun root], here ntate (‘father’)
[rom]             [verb root], here rom (‘send’)