Instructions for Text Analysis Exercises These exercises will teach you how to use Toolbox to analyze and interlinearize texts. To do these exercises, you should have this file open in Microsoft Word. Set Word to normal view (choose View, Normal). Set options to “Wrap to Window”. (Choose Tools, Options, View, and turn on the “Wrap to Window” check box.) Position this instructions window so that it covers about the right third of the screen.

Exercise 1: Interlinearization This exercise will begin the process of interlinearizing text. Open the exercise project. Toolbox should show two windows, one titled "Frog Meets Fish.txt", and one titled “Amadup.dic”.

Introduction Text is interlinearized to show its morphological structure. Each word is broken into morphemes, and each morpheme is given a gloss and a part of speech. In the process of interlinearizing a text sentence, each new root or affix is entered into the dictionary. In this set of exercises, we will be analyzing a text in a language named Amadup. It is not a real language, but was created for illustrative purposes. Its orthography uses approximately the English pronunciation of the letters, not IPA. For example, “ch” is the voiceless equivalent of “j”. Scroll down the text window and read the free translation of the text. Scroll back up to the top of the window.

Interlinearizing The first thing we do with a new sentence of text is to interlinearize it. This puts interlinear lines under the text line. It parses and glosses the text using information from the lexicon. To interlinearize, we first put the cursor at the front of the sentence we want to interlinearize. We interlinearize the sentence by choosing Tools, Interlinearize, or we can use the shortcut Alt+I, or we can use the Interlinearize button, which is the button to the left of the browse button. Interlinearize sentence 1 (Alt+I or the Interlinearize button). You will see three lines added below the \tx line. They are \mb (Morpheme Breaks), \ge (Gloss English), and \ps (Part of Speech). The morpheme breaks line shows the same words as the text line, except that each word has a star (asterisk) in front of it. The star means that the word did not parse. In this example, none of the words can be parsed because there are no words in the lexicon yet. The \ge and \ps lines each contain 3 stars under each word. This means that the words were not found in the lexicon.

Beginning Analysis One way to begin analysis is to look for nouns or verbs that occur multiple times in the text, and to see what possible equivalents occur in the corresponding free translations. A good way to do this is to compare the references in a pair of word lists, using one word list from the text and another word list from the free translation.

Making Word Lists Choose Tools, Word List. Notice that it defaults to making a word list of a text corpus named “Amadup Texts”. Click "Create" to create the word list. You will see a wordlist showing all the Amadup words in the Frog Meets Fish file. To make a word list of the free translations, we use a text corpus that looks at the same files, but looks at the free translation field instead of the text field. Choose Tools, Wordlist. Select the “Amadup Free Translation” text corpus. Click Create. You will see a wordlist of all the English words from the free translations of the sentences.

Using Right Click to Jump to a Word List A powerful feature of Toolbox is that a right click on a word can make other windows “jump” to show that word. Right click on the word “lyfch” in the \tx line of the text. The Amadup word list jumps to show “lyfch”. You can see that it occurs 4 times, one time each in sentences 1 through 4. Next we will look at the references of the words in the free translation to see which is the best match. Right click on “Long” in the \ft line. The free translation word list jumps to show “long”. You can see that it occurs only once, in sentence 1. Right click on each word in the \ft line to see which one is the best match for “lyfch”. You will see that “lily” and “pad” each occur in exactly the same sentences as “lyfch”. Right click on each word in the \tx line to see if any others occur in those same 4 sentences. You will see that none of them do. So we conclude that "lyfch" probably means "lily pad". There are of course many situations in which this type of reference correspondence does not work this easily. Many words have multiple meanings, so that they have various equivalents in various contexts. Also, many words have different sets of affixes in different languages, which causes them to have a variety of equivalents.
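The reference comparison done by hand above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not something Toolbox runs; the function name and the dictionaries are hypothetical, and the sentence references are limited to the ones quoted in this exercise.

```python
# Sentence references as quoted in the exercise (not the complete word lists).
text_refs = {"lyfch": {1, 2, 3, 4}}
translation_refs = {
    "long": {1},
    "lily": {1, 2, 3, 4},
    "pad": {1, 2, 3, 4},
}

def exact_matches(word, source_refs, target_refs):
    # Return the target words that occur in exactly the same sentences as `word`.
    wanted = source_refs[word]
    return sorted(t for t, refs in target_refs.items() if refs == wanted)

print(exact_matches("lyfch", text_refs, translation_refs))  # ['lily', 'pad']
```

An exact match of reference sets is the easy case. As noted above, multiple meanings and affixes often prevent the sets from lining up this neatly.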

Entering a New Root in the Dictionary Next we will insert "lyfch" into the dictionary. Note that a right click in the \mb line is different from a right click in the \tx line. The \mb line jumps to the dictionary instead of the wordlist. Right click on “*lyfch” in the \mb line. You will see a box titled “No Matches” that says it can’t find “lyfch”. This says that “lyfch” is not yet in the dictionary. One of the options in this box is to insert the word into the dictionary. Choose Insert. You will see the dictionary window in the lower left show a new entry for “lyfch”. Type "n" (abbreviation for noun) into the "\ps" field. Press tab (or down arrow) to move down to the "\ge" field and type "lily.pad". Notice "lily.pad" has a period instead of a space between the words. In interlinear text, a period is used in a multi-word gloss to show that it is a single unit. Now we will return to the interlinear text and see what changes. We will use the "Return from Jump" command to do this. Choose Edit, Return from Jump. You will see the interlinear text updated to show “lily.pad” and “n” under “lyfch”.

Using Word Lists to Look for Equivalents Next we will analyze the word "todn". Right click on "todn" in the \tx line. You will see the wordlist window jump to show it. The references show that it occurs 4 times, in sentences 1, 4, 6, and 7. (Note that if you clicked in the \mb line instead of the text line, it says "No matches".) Next we jump the English wordlist to look for a possible equivalent to "todn" in the free translation. Right click on "long" in the free translation. It occurs only in sentence 1, so it isn't a likely possibility. Right click on "frog" in the free translation. It occurs 5 times, in sentences 1, 4, 5, 6, and 7. To see all 5 references, double click the “frog” line in the word list. This changes the word list window out of browse view into normal view. Change the word list window back to browse view by clicking the browse button. The word “frog” looks like a good possibility for “todn”, except that it has an extra occurrence. It occurs in sentence 5, but "todn" does not. However, we don't let that discourage us because we suspect that Amadup has more affixes than English. So we assume that the equivalent of "frog" may occur with various affixes. In fact, if you look at "todn" in the wordlist, you can see that immediately above it is "todj", which occurs only in sentence 5. If "todj" also means "frog", then the combination of "todn" and "todj" cover all the references of "frog". So we will assume that "tod" is a root that means "frog". Notice that this combination has given us another important piece of information. It appears that "-n" and "-j" are suffixes that can occur on nouns. We don't yet know what they mean, but we at least have a good idea that they probably mean something.
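The step from "todn" and "todj" to the root "tod" can be illustrated with a short Python sketch. It is a hypothetical aid to the reasoning, not a Toolbox feature: it pools the references of the two forms, compares them with the references of "frog", and takes the shared leading letters as the candidate root. The helper name is an assumption; the sentence numbers are the ones given above.

```python
import os

# References quoted in the exercise.
text_refs = {"todn": {1, 4, 6, 7}, "todj": {5}}
frog_refs = {1, 4, 5, 6, 7}

def combined_refs(forms, refs):
    # Pool the sentence references of several word forms.
    combined = set()
    for form in forms:
        combined |= refs[form]
    return combined

forms = ["todn", "todj"]
stem = os.path.commonprefix(forms)            # shared leading letters
residues = [form[len(stem):] for form in forms]  # what is left over on each form

print(stem, residues, combined_refs(forms, text_refs) == frog_refs)
# tod ['n', 'j'] True
```

The leftover pieces "n" and "j" are the candidate suffixes mentioned above. A shared prefix is only a hint, so the hypothesis still needs to be confirmed against more data.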

Entering another Root in the Dictionary Let's enter "tod" into the dictionary. Note that even though “Todn” starts with an upper case letter in the text, you must enter it in lower case in the dictionary. Only proper nouns should be entered in upper case in the dictionary. All other words must be lower case. Right click on "Todn" in the \mb line. You will see the "No matches" dialog box. Click on "Todn" in the dialog box to edit it, and change it to "tod". Click Insert. You will see a new dictionary entry for "tod". Enter part of speech "n" and gloss "frog". Now we will return to the interlinear text and see what changes. Press Ctrl+R to return from jump and update interlinear. You will see the focus return to the interlinear text window, and you will see "Todn" interlinearized again. Its analysis shows "tod" glossed as "frog" followed by "*n". The "*n" says that "n" is the leftovers that cannot be accounted for in the parse. The asterisk says that this is a failed piece that is not in the dictionary. Exit Toolbox.

Exercise 2: Affixes This exercise will show you how to do affixes in interlinear text. Open the exercise project. Toolbox should show four windows, one titled "Frog Meets Fish.txt", one titled “Amadup.dic”, and two word list windows.

Comparing Similar Sentences Sentences that are similar to each other can be very useful for analysis of a language. For example, sentence 5 is "The fish looked up at the frog." and sentence 6 is "The frog looked down at the fish." These two sentences both have the verb "look", and they both have the fish and the frog, but the fish and frog change places. This should be very fruitful in telling us more information. These two sentences are displayed in the text window.

Interlinearize sentence 5 (Alt+I or the Interlinearize button). Interlinearize again to do sentence 6. Comparing the two sentences gives some very useful information. It looks like the verb is at the end of the sentence. The verb suffix changes between the two, so it appears to show some kind of agreement with the subject of the sentence.

Using a Word List Sorted from the End to Identify Roots To identify roots, it is very helpful to have a word list sorted from the end. Click on the title bar of the Amadup wordlist window (wordlist.db) to put the focus on it. Choose Window, Duplicate to duplicate the wordlist window. Resize the second wordlist window to about the same size as the first, and put it under the first wordlist window. Notice that positioning it this way will make it completely cover the English wordlist window. That is all right. The English wordlist window will reappear whenever we jump to it. Change the sorting of the second wordlist window to sort the words from the end. (Choose Database, Sorting. Turn on "Sort first field from end". Choose OK.) Notice that the browse view shows the list with the ends of the words aligned to the right margin to show that they are sorted from the end. The two word lists can both jump from a single right click. Right click on "bzervegonsew" in the \tx line of sentence 6. The two word lists both jump to it. Comparing the two word lists gives information about how much of the word is root and how much is affixes. Other words have "gonsew" as a suffix, so we will assume "bzerve" is the root. Select "bzerve" in the \mb line and do a right click to insert it into the lexicon. Give it part of speech "v" and gloss "look". Press Ctrl+R to return to the text and update the interlinearization. It shows “bzerve” as a root, and then shows “*gonsew” as the rest of the word. The asterisk before it shows that this is not a successful parse, and that “gonsew” is the part that cannot be parsed. Put the cursor in "bzervegonso" in sentence 5 and interlinearize it. It shows “bzerve” as a root.
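To see why sorting from the end groups words by suffix, here is a minimal Python sketch. It is not how Toolbox implements the option; it simply sorts on the reversed spelling of each word. The words are forms quoted in these exercises, and the function name is hypothetical.

```python
words = [
    "bzervegonsew", "bzervegonso", "sezgonsew", "sezgonso",
    "plapegonsew", "velgow", "joygow",
]

def sort_from_end(word_list):
    # Compare words by their final letters first by sorting on the reversed string.
    return sorted(word_list, key=lambda w: w[::-1])

for w in sort_from_end(words):
    # Right-align so that the word endings line up, as in the browse view.
    print(w.rjust(14))
```

In the output, the words ending in "gonso" sit together, followed by those ending in "gonsew" and then those ending in "gow", which makes the shared suffix material easy to spot.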

Studying Content Words with Multiple Occurrences To identify new words, it is useful to look at content words that occur multiple times in the translation. Content words are words like nouns and verbs. Such words usually have their meaning represented in the text. Let's look at "fish".

Right click on "fish" in the free translation of sentence 5. Note that "fish" occurs four times, in sentences 4, 5, 6, and 8. Let's look for something that will reasonably cover those references. Right click on "Don" in the text line. It occurs in sentences 5 and 8. That is partial coverage of the four occurrences of “fish”. The previous entry in the wordlist is "doj". These two together cover three of the four occurrences of "fish". Right click on "jlyn" in the text line. Like "don", it occurs in sentences 5 and 8. That is partial coverage. Directly above it in the upper word list is "jlyj", which occurs in sentences 4 and 6. The two of those give complete coverage of the occurrences of "fish", so we will propose "jly" as the equivalent of "fish". Enter "jly" in the dictionary with part of speech "n" and gloss "fish". (Remember that to enter a word in the dictionary, you do a right click on it in the \mb line, not in the \tx line.) Press Ctrl+R to return to the text and reinterlinearize. You will see "jlyn" parsed as “jly” (glossed "fish") with an unknown possible suffix of "*n". Interlinearize "jlyj" in sentence 6. It shows “jly” as a root.

Identifying Affixes Now we can do some analysis of affixes. Notice that in sentence 5 fish has a suffix of "-n" and frog has "-j". In sentence 6 the situation is reversed. Frog has "-n" and fish has "-j". What does fish in sentence 5 have in common with frog in sentence 6? They are each the actor, the subject of the translated sentence. So it looks like a good possibility that the "-n" suffix marks the actor or subject. Tentatively, we will call it nominative case, and see if further data will confirm that hypothesis. (Further analysis might show that this language uses an ergative-absolutive case system rather than nominative-objective. If so, we can change the cases to that quite easily.) Do a right click on "*n" in sentence 6 to insert it into the dictionary. Add a hyphen to make it "-n" and give it part of speech "case" and gloss "Nom". Press Ctrl+R to return to the text and reinterlinearize. You will see "todn" parse as "frog -Nom". But notice that the "-n" on "jlyn" in sentence 5 has not changed. This illustrates an important point. The return from jump does not reinterlinearize all occurrences of the word or affix in the text, only the one at the place the jump started. Interlinearize "jlyn" in sentence 5. It parses as "fish -Nom".

Using the same logic, we conclude that "-j" is probably objective case. Follow the steps above to insert "-j" in the dictionary with part of speech "case" and gloss "Obj". Be sure to include the hyphen in “-j”. If you forget the hyphen, it will not show as a suffix. Interlinearize "todj" in sentence 5 and "jlyj" in sentence 6. Now that we have nominative and objective case suffixes, we notice that the same suffixes appear on the four words that start with "d".

Identifying a Determiner Interlinearize "don" and "dwj" in sentence 5 and "dwn" and "doj" in sentence 6. They all appear to have the case endings. Each case ending appears to agree with the case ending of the following noun. Based on this evidence, we will conclude that each of these is a noun phrase consisting of a word meaning "the" and a noun. Looking at the four words that mean "the", we see that the only thing they have in common is "d", so we will enter it into the dictionary as "the". Enter "d" into the dictionary with part of speech "det" (determiner) and gloss "the". Interlinearize all four of the "d" words in sentences 5 and 6. The pattern begins to appear. Each noun phrase has "the" with a case suffix followed by a noun with the same case suffix. In each determiner, there is one letter still unknown (marked with an asterisk). We will wait to study that until we have more examples to work with.

Rewrapping Lines in Interlinear Notice that during interlinearization sentences 5 and 6 have gotten longer than the width of the window. We can use the Tools, Reshape command to fix that. Put the cursor at the front of the \tx line of sentence 5 and choose Tools, Reshape. The last word is moved down to another line. Reshape sentence 6. Its last word moves down. Exit Toolbox.

Exercise 3: Interlinear Check This exercise will show you how to use interlinear verify to update interlinearization of a text. Open the exercise project. Toolbox should show four windows, one titled "Frog Meets Fish.txt", one titled “Amadup.dic”, and two word list windows.

Analyzing all the Rest of the Verbs Sentences 7 and 8 have similarities that make it useful to compare them. Interlinearize sentences 7 and 8. Notice that sentence 7 starts with "The frog said" and sentence 8 starts with "The fish said". It looks like the word that starts with "sezgons" probably means "said". Right click on "sezgonso" and "sezgonsew" and look at the wordlists to verify that this is reasonable. Make a hypothesis about how much of "sezgonso" and "sezgonsew" is the root. There are other words that end with "gonso" and "gonsew", so we will hypothesize that "sez" is the root for "say". Insert "sez" into the dictionary with part of speech "v" and gloss "say". Interlinearize "sezgonso" and "sezgonsew". We are making progress on the verbs, so we will do some more of them. It appears from what we have done so far that this is a verb-final language, so we will work on that hypothesis. Scroll the text window to the top. Interlinearize sentence 2. Right click on “joygow” in the \tx line of sentence 2. In the wordlist sorted from the end you can see it and just above it "velgow", which is the last word of sentence 1. They both end with the same suffix. So it looks likely that they are the verbs of the sentences. Enter "joy" into the dictionary with part of speech "v" and gloss "like". Enter "vel" into the dictionary with part of speech "v" and gloss "live". For both of them, you should have used Return from Jump (Ctrl+R) to reinterlinearize. If you did not, please get into the habit of always doing that, even if I don't mention it. Scroll the text window to sentence 3. Interlinearize sentence 3. Looking at the last word in sentence 3, you can see that it looks like it has a verb suffix. Also, looking at sentence 2 you can see the same verb "plap" without any suffixes. It looks like "plap" is a verb that means "sit". Enter "plap" into the dictionary with part of speech "v" and gloss "sit". Interlinearize "plap" in sentence 2 and "plapegonsew" in sentence 3.
Move to sentence 4 and interlinearize it. Notice that like sentence 2, it appears to have two verbs. If it is like sentence 2, then "wiglo" is a verb meaning "swim". Right click on "wiglo" in the text line to look at the word lists. Notice that other verbs that we have already seen end with "o". Based on this, we will assume that "-o" is a suffix. Enter "wigl" into the dictionary with part of speech "v" and gloss "swim".

At this point we have most of the nouns and verbs in this text analyzed. Next we will look at some other words. Looking at sentences 2, 3 and 4, you can see that "on the lily pad" appears to be "lyfch tap" and "under the lily pad" appears to be "lyfch blo". This looks like "tap" is "on" and "blo" is "under". They appear to be postpositions. Enter "tap" from sentence 3 into the dictionary with part of speech "pp" and gloss "on". Enter "blo" from sentence 4 into the dictionary with part of speech "pp" and gloss "under". We now have the root of every word in sentences 2 and 4. Scroll the window to sentence 1. Looking at it, you can see that "nyr" appears to be a postposition meaning "by". Enter "nyr" into the dictionary with part of speech "pp" and gloss "by".

Interlinear Check and Update This is a good time to do interlinear check to bring the interlinear text up to date with the dictionary. Check and update of interlinear text is done by running Spelling and Interlinear Check, starting at the top of the file. Click on the title bar of the text window to make sure the focus is on it. Click on the "First Record" button to move the cursor to the top of the record. Choose Tools, Spelling and Interlinear Check. It stops immediately at "todn" in the first sentence. This means that it has determined that it needs to be updated. In this case, the reason it needs to be updated is that it shows a failure, but in fact it can now be parsed successfully. Press Alt+I to interlinearize "todn". It now parses successfully. Press Alt+C to continue interlinear check. It stops at "tap" in sentence 2. Press Alt+I to interlinearize "tap". It now parses successfully. Press Alt+C to continue check, and Alt+I to parse "tap" in sentence 5. Press Alt+C to continue check, and Alt+I to parse "blo" in sentence 6. Press Alt+C to continue check, and Alt+I to parse "blo" in sentence 7. Press Alt+C to continue check, and Alt+I to parse "joyme" in sentence 8. Press Alt+C to continue check. It stops at "velgow" in sentence 1. Notice that it has returned to the top of the file. This is because interlinear verify does multiple passes. On the first pass it skips over all failed words that still fail. On the second pass it stops at failures that still fail, but may improve their parse guess. On the third pass it stops at all failures. The first two passes help you update interlinear as you improve your lexicon. The third pass is to verify that a finished interlinear file fully matches the lexicon. If you put the cursor at the very top of the file (at the front of the first field of the first record) before you press Alt+C, then interlinear verify starts pass one. If the cursor is anywhere else in the file, it continues forward with the current pass. When it reaches the bottom of the file, it automatically goes to the top of the file and begins the next pass. You can tell which pass it is in by noticing the kind of places it stops and what happens when you interlinearize again. For example, if it stops at a failure, you know it has finished pass one and is in pass two or three. During your early phases of analysis, you will often stop when pass one is finished. Press Alt+I to interlinearize "velgow". It still fails to parse. Press Alt+C to continue verify. It again stops at "velgow" in sentence 1. This indicates that verify is now in its third pass and has done all the updating it can do at this time. Exit Toolbox.

Exercise 4: Morphology This exercise will show you how to handle morphology and morphophonemics. Open the exercise project. Toolbox should show four windows, one titled "Frog Meets Fish.txt", one titled “Amadup.dic”, and two word list windows.

A Phonological Process In a case system like the one we see here, we can assume all nouns have case. They may not all have explicit case markings, but usually they do. Notice that "lyfch" in sentence 1 is the object of the postposition "nyr", but it does not have an objective case suffix parsed off of it. The normal objective case marker is "-j", and "ch" is the devoiced equivalent of "j". So we hypothesize that "lyfch" is underlying "lyf -j", and that the case marker devoices after an unvoiced consonant. To tell Toolbox that, we add "-ch" as a form of the objective case marker that occurs after an unvoiced consonant. This is done with the "\a" and “\u” markers. Edit the dictionary entry for "-j". You can find it by doing a search in the dictionary window. At the bottom of the entry, add a "\a" field and in it put "-fch". After the “\a” field, add a “\u” field and in it put “f+j”. This is a powerful notation that says when a suffix string of the form “fch” is found, it might have come from underlying “f+j”. It is somewhat like a generative rule backwards, with the surface form above the underlying form. Toolbox does not yet have a feature notation to say “unvoiced consonant”, so we will have to make separate “\a” and “\u” pairs for every unvoiced consonant. We will not make them all right now, but we could. Edit the dictionary entry for "lyfch" and change it to "lyf". Reinterlinearize "lyfch" in sentence 1. You will see that it analyzes as underlying "lyf -j". This seems like a good analysis. Use interlinear verify to update all occurrences of “lyfch”. (Go to the top of the file. Press Alt+C for interlinear check. Alternate Alt+I to update interlinear and Alt+C to move to the next place.) When it stops at “velgow” in sentence 1, it will refuse to go on, and you will know that the first pass is finished, and you can stop.
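The idea behind a surface and underlying pair can be sketched in Python as a rule applied backwards. This is an illustration of the notation only, under the assumption that a rule is a plain string pair; it is not Toolbox's parsing algorithm, the function name is hypothetical, and the hyphen that the dictionary field carries is left out of the surface string for simplicity.

```python
def underlying_forms(surface_word, rules):
    # rules: list of (surface, underlying) string pairs, playing the role of
    # the \a and \u fields. The "+" in the underlying string marks the boundary
    # between the end of the root and the suffix.
    results = []
    for surface, underlying in rules:
        if surface_word.endswith(surface):
            root_end, suffix = underlying.split("+")
            root = surface_word[: len(surface_word) - len(surface)] + root_end
            results.append((root, "-" + suffix))
    return results

rules = [("fch", "f+j")]  # surface "fch" may come from underlying "f+j"
print(underlying_forms("lyfch", rules))  # [('lyf', '-j')]
```

Because there is no feature notation for "unvoiced consonant", covering the other unvoiced consonants would mean adding one more pair to the list for each of them, just as described above for the dictionary entry.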

Determiners We now have all the nouns parsed. Next we will look some more at the determiners. Sentences 5 and 6 are very good contrastive sentences for this purpose. Scroll the text window to show sentence 5. Look at the two determiners in sentence 5 and the two in sentence 6 to see if you can see the pattern of the "-o" and "-w" suffixes. You will see that "-o" is used with "fish" and "-w" is used with "frog". What linguistic phenomenon might cause that?

Gender Class Agreement It appears to be some sort of gender class agreement. In a gender class system, the gender is not necessarily shown in an affix on the noun, but rather is inherent in the noun. Gender is marked on other words that agree with the noun in gender. So we will hypothesize that the determiner agrees with the noun in gender. What do we call these gender classes? We have no information on which to call them masculine or feminine. Gender is used in linguistics to refer to any system in which nouns are divided into classes which control agreement. Some languages have many gender classes that have no relationship to masculine and feminine. So for a start, we will just call the frog gender class 1 and the fish gender class 2. Later when we have hundreds of nouns of each gender class in our lexicon, it will be an interesting study to see if we can find common characteristics of each gender class. Having decided that the frog is gender class 1, we will say that the "-w" suffix is an agreement marker for gender class 1. Make a dictionary entry for "-w". Give it part of speech "g" and gloss "1". (This is the digit one.) Since the fish is gender class 2, we will say that the "-o" suffix is an agreement marker for gender class 2. Make a dictionary entry for "-o". Give it part of speech "g" and gloss "2". Interlinearize all the determiners in sentences 5 and 6.

Gender Classes on Nouns in the Dictionary Now that we know that the frog and fish have gender classes, we will want to note their classes in the dictionary. This is important lexical information because it is not predictable from the form or meaning of the noun, but it is needed in order to correctly use the noun. Once a language is known to have a gender class system, then every noun in the dictionary needs to have its gender class noted. There are two ways to note gender class in a Toolbox dictionary. One is to add an extra field for gender class. The other is to add the gender class to the part of speech. The advantage of adding it to the part of speech is that it will show up in the interlinear text. This will help make the basis for gender agreement visible in the text. So, we will add the gender class to the part of speech. Edit the entry for "tod" (frog) and change its part of speech to "n1". Edit the entry for "jly" (fish) and change its part of speech to "n2". Reinterlinearize all the frogs and fishes in sentences 5 and 6. Notice how clearly you can see the gender agreement in the noun phrases.
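The agreement pattern in the four noun phrases of sentences 5 and 6 can be checked with a small Python sketch. It is a deliberately simplified illustration, not a Toolbox feature: it assumes each determiner is exactly "d" plus one gender letter plus one case letter, and each noun is a root plus one case letter. The names are hypothetical; the gender classes are the ones recorded as "n1" and "n2" above.

```python
noun_class = {"tod": 1, "jly": 2}      # from part of speech n1 (frog), n2 (fish)
gender_marker = {"w": 1, "o": 2}       # agreement suffixes on the determiner

def np_agrees(determiner, noun):
    # Simplifying assumption: fixed positions for the gender and case letters.
    det_gender, det_case = determiner[1], determiner[2]
    root, noun_case = noun[:-1], noun[-1]
    same_gender = gender_marker[det_gender] == noun_class[root]
    same_case = det_case == noun_case
    return same_gender and same_case

phrases = [("don", "jlyn"), ("dwj", "todj"), ("dwn", "todn"), ("doj", "jlyj")]
print([np_agrees(det, noun) for det, noun in phrases])  # [True, True, True, True]
print(np_agrees("dwn", "jlyn"))  # False: gender 1 determiner with a gender 2 noun
```

The check only works because the gender class is stored with the noun, which is the reason for noting it in the dictionary.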

Gender Class Agreement on Verbs You may have already noticed that in these two sentences that the last suffix on the verb seems to be the same as the gender class suffix on the determiner of the subject. This looks like quite strong evidence that the verb is marked to agree with the gender class of the subject. Reinterlinearize the verbs in sentences 5 and 6. The verbs now show gender class agreement with the subject. Use interlinear verify to update the text. (Be sure to put the cursor at the top of the file before you start. This tells verify to start pass 1 again.) Exit Toolbox.

Exercise 5: Morphophonemics This exercise will show you more about handling morphology and morphophonemics. Open the exercise project. Toolbox should show four windows, one titled "Frog Meets Fish.txt", one titled “Amadup.dic”, and two word list windows.

Verb Morphology In this exercise we will deal with verb morphology. In sentence 1 the verb has both gender 1 and gender 2 markers cut off of it. But there is no noun of gender 2 in this sentence, so the gender 2 marker must be a false guess by Toolbox. We conclude that "go" is a suffix. (It may be a combination of two, but we will start with the hypothesis that it is one suffix.) Since the translation is in past tense, we will assume the suffix means past. To account for the fact that the translation says "long ago", we will assume that it means distant past.

Since “go” is split apart in the \mb line, you can’t select it and insert it as you have been doing. In a case like that, you can right click on “g” and add the “o” either in the insert box or in the dictionary entry. Or you can delete the space and hyphen between the “g” and the “o” in the \mb line to put “go” back together. This will be only a temporary change because the next interlinearize of “velgow” will overwrite whatever is in the \mb line. Insert "-go" into the dictionary with part of speech "tns" and gloss "DPst". Be sure to put the hyphen onto “-go”. The verb “velgow” now parses as "live -DPst -1". We look down through the other sentences to see where else "-go" appears. Almost all the verbs have "go" or "ego" right after the root. But they also have "ns" or "nse" after "go". By contrasting the first two sentences with the rest, we hypothesize that the difference is that in the first two sentences the action of the verb is stretched over time, whereas in the rest it is punctiliar aspect. In sentence 8 the verb "sezgonso" has both "go" and "ns" clearly present. Move to sentence 8. Interlinearize “sezgonso”. Nothing changes from before. We have “-go” accounted for, so “ns” is all that is left. Insert "-ns" into the dictionary with part of speech "asp" and gloss "Pnct". The verb “sezgonso” now parses as "say -DPst -Pnct -2". Sentence 8 now goes past the right margin. Choose Tools, Reshape (or Shift+F5) to reshape sentence 8. Now it fits within the width of the window.

Morphophonemics on Verbs Move to sentence 7, interlinearize "sezgonsew". It does not succeed. The problem is the "e" between "ns" and "w". We hypothesize that it is phonological rather than another morpheme, on the basis that there is no other morpheme present in "sezgonso", which appears to differ only in gender agreement. We hypothesize that the "e" is epenthesized. (It might instead be present in the underlying form and deleted in other contexts. Other data will prove or disprove this hypothesis.) In Toolbox we usually account for morphophonemics on the outer morpheme instead of the inner one. So in this case we will put the alternate form in the entry for "-w". Earlier, we used both an “\a” field and a “\u” field to do this. But it is possible to use an “\a” field alone without a corresponding “\u” field. If an “\a” field is not followed by a “\u” field, then it uses the “\lx” field as an underlying form. This does not constrain the environment of the alternate form, which may cause extra ambiguities in parsing later. In general it is better to use a “\u” after each “\a”, because it constrains the environment and reduces the number of ambiguities. But if you do not yet know the conditioning environment, you can use an “\a” without a “\u”. Later when you have analyzed the environment, you can adjust the “\a” and add a “\u” to constrain it better. Edit the entry for "-w". Add an alternate (\a) field with "-ew".

The verb “sezgonsew” now parses as "say -DPst -Pnct -1". Use interlinear verify to update the text. (Go to the top and press Alt+C.) After a number of updates, the third pass stops at “plapegonsew” in sentence 3. It contains "*e" as a failed piece. We will hypothesize that as another example of epenthesis. Edit the entry for "-go". Add an alternate (\a) field with "-ego". The verb “plapegonsew” now parses as “sit -DPst -Pnct -1”.
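The effect of the dictionary entries made so far can be imitated with a small suffix-stripping sketch in Python. It is a rough model for illustration, not Toolbox's parser: the table names and functions are hypothetical, only three of the roots are listed, and alternates such as "ego" and "ew" are treated as unconstrained spellings of the same suffix, which corresponds to using an alternate field without an underlying field.

```python
ROOTS = {"sez": "say", "plap": "sit", "vel": "live"}
# Surface spelling -> gloss. Alternates map to the same gloss as the main form.
SUFFIXES = {"go": "DPst", "ego": "DPst", "ns": "Pnct", "w": "1", "ew": "1", "o": "2"}

def parse_suffixes(rest):
    # Return the list of suffix glosses if `rest` splits entirely into known
    # suffixes, or None if some piece is left unaccounted for.
    if rest == "":
        return []
    for form, gloss in SUFFIXES.items():
        if rest.startswith(form):
            tail = parse_suffixes(rest[len(form):])
            if tail is not None:
                return ["-" + gloss] + tail
    return None

def parse(word):
    for root, gloss in ROOTS.items():
        if word.startswith(root):
            suffixes = parse_suffixes(word[len(root):])
            if suffixes is not None:
                return " ".join([gloss] + suffixes)
    return "*" + word  # a star marks a word that did not parse

for w in ["sezgonso", "sezgonsew", "plapegonsew", "velgow"]:
    print(w, "=", parse(w))
```

Run on the four verbs, the sketch gives "say -DPst -Pnct -2", "say -DPst -Pnct -1", "sit -DPst -Pnct -1" and "live -DPst -1", matching the parses described above. Because nothing constrains where the alternates may occur, a model like this can produce extra ambiguities on a larger lexicon.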

A Third Gender Use interlinear verify to update the text. The first place it stops is at “detn” in sentence 7. That means that at this point we have every word in sentences 1 through 6 fully analyzed! This is great progress. Look at sentence 7. Note that there is a literal translation that helps explain the meanings of the words. From this it appears that “klime” means “weather”. Next we observe that “detn” appears to mean “the”. It has the nominative suffix, but it doesn’t have the “w” or the “o” gender suffix. It has “et” in the gender position, so we assume it must be showing a third gender class. We will look a bit more before we enter it into the dictionary. Looking around a bit more, it appears that "doshusten" means "good". It appears to be a predicate adjective. Since determiners agree with their head noun in case and gender, we will hypothesize that adjectives do also. We can see the case agreement suffix "-n". The gender agreement suffix on "detn" is "-et". On "doshusten" we see the "t", but not the "e". But since we have already hypothesized two cases of "e" epenthesis, we will analyze these the same way. Therefore, we will give gender 3 an underlying form of "-t". Insert "-t" into the dictionary with part of speech "g" and gloss "3". Include an alternate (\a) of "-et" for “detn”. The word “detn” parses as “the -3 -Nom”. Assuming that the "-t" on "doshusten" is a gender class marker, then the root is "doshus". Insert "doshus" into the dictionary with part of speech “adj" and gloss "good". It parses except that it shows "*te" as a failed piece. To account for this we will put the epenthesized "e" on the outer morpheme, the "-n" nominative case marker. Edit the entry for "-n" (nominative case) and after the \ge line add an alternate (\a) of "-en". The word "doshusten" now parses as "good -3 -Nom".

Finishing the Analysis of the Text Look at “klime” and see if you can tell what gender it is. Yes, you can see from the determiner in front of it that it is gender 3. Insert “klime” into the dictionary with part of speech “n3” and gloss “weather”. The only word unparsed in sentence 7 is "eh". We will hypothesize that it is a question marker. We will give it a part of speech of "part" for "particle". Insert "eh" into the dictionary with part of speech "part" and gloss "Ques". Use interlinear verify to update the text. The only failures left are in sentence 8. Move to sentence 8. First we will handle the "me" on the end of "joyme". We will hypothesize that it is an affix that means first person singular. Insert an entry for "-me" and give it part of speech "pers" and gloss "1s". The word “joyme” now parses as "like -1s". The only two failures left are “dukyt” and “dukytch”, which look like they come from the same root. We will assume they come from a root that means “wet”. Insert an entry for "duky" and give it part of speech "adj" and gloss "wet". The word “dukyt” parses as "wet -3". Interlinearize "Dukytch". It parses as "wet -3 -Obj". It appears that a more literal translation of what the fish said is, “Wet! I like it wet!” Use interlinear verify to update the text. It doesn’t stop anywhere in the text. It just says “Spell Check is complete.” This means that there are no failures, and that everything in the text matches the dictionary.

Grammatical and Morphological Summary We have now completely parsed this text! We have learned a great deal about the morphology and syntax of the Amadup language from just one short text. Here is a possible summary of our results: This is a head-final language with postpositions. It seems to be SOV. Nouns have gender and are marked with case. There are at least 3 genders. Determiners and adjectives agree with their head noun in gender and case. The main verb of a clause agrees with the subject in gender, and may have a person suffix. Verbs may have tense and aspect suffixes.

Partial paradigms:
Case: Nom -n, Obj -j
Gender: 1 -w, 2 -o, 3 -t
Tense: DPst -go
Aspect: Pnct -ns
Person: 1s -me

I made the paradigms by browsing a view of the dictionary that is sorted by part of speech (and gloss) and filtered for entries that contain a hyphen. Each paradigm is formed by listing all the entries that have the corresponding part of speech. To see that view, duplicate the dictionary window. Sort the new window by \ps and \ge. Make a filter named “Affixes” that says the \lx field must contain a hyphen. (To make this work, you must set "Match characters" at the bottom of the filter window to "normally Ignored".) Turn on the filter. Set browse fields to \ps, \ge, and \lx. You will see a summary of all the affixes in the paradigm above. Exit Toolbox.
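For readers who like to see the logic spelled out, the filtered and sorted view can be imitated outside Toolbox with a short Python sketch. It assumes the dictionary is plain text with one \lx, \ps, and \ge field per record and a blank line between records. The function names are made up for this illustration, and the sample holds only a few of the entries from the exercises.

```python
from itertools import groupby

# A small sample in standard format: one record per blank-line-separated block.
DICTIONARY = r"""
\lx -n
\ps case
\ge Nom

\lx doshus
\ps adj
\ge good

\lx -j
\ps case
\ge Obj

\lx -t
\ps g
\ge 3

\lx -w
\ps g
\ge 1

\lx -o
\ps g
\ge 2

\lx -me
\ps pers
\ge 1s
"""


def parse_records(text):
    """Turn standard format text into a list of {marker: value} records."""
    records = []
    for block in text.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            marker, _, value = line.partition(" ")
            record[marker.lstrip("\\")] = value.strip()
        records.append(record)
    return records


def affix_paradigms(records):
    """Keep entries whose \\lx contains a hyphen, sorted by \\ps then \\ge."""
    affixes = [r for r in records if "-" in r["lx"]]
    affixes.sort(key=lambda r: (r["ps"], r["ge"]))
    return {
        ps: [(r["ge"], r["lx"]) for r in group]
        for ps, group in groupby(affixes, key=lambda r: r["ps"])
    }


paradigms = affix_paradigms(parse_records(DICTIONARY))
for ps, members in paradigms.items():
    print(ps + ":", ", ".join(f"{ge} {lx}" for ge, lx in members))
```

The hyphen test plays the role of the “Affixes” filter, and the sort key plays the role of sorting the window by \ps and \ge.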

Exercise 6: Adding a New Text This exercise will show you how to bring in and set up a new text. Open the exercise project. Toolbox should show four windows, one titled "Frog Meets Fish.txt", one titled “Amadup.dic”, and two word list windows.

Setting up a New Text The file “Frog Races Fish.txt” is another text in the Amadup language. The easiest way to do a new text is to make it another record in the general texts file, rather than opening a new file. Often you will type new text directly into Toolbox. But you can also use cut and paste to bring in text that has been typed into another program. For this example, we will bring in the text from “Frog Races Fish.txt”. Before we bring in the text, we need to make a new record for it. Be sure the focus is on the text window. Choose Database, Insert Record. You will see a box titled “Insert Record”. Type the name of the text “Frog Races Fish”. Choose OK. The text window will change to show an \id line with “Frog Races Fish” and a \ref and a \tx line.

Bringing in a New Text Next we will bring in the text. This will require opening the text file outside of Toolbox. Open Windows Explorer and navigate to “Toolbox Training\2 Text Analysis\6 Adding a New Text”. Double click on the file named “Frog Races Fish.txt” to open it in Notepad. You will see a text that has no backslash markers and no free translation. Select the entire text. You can use Edit, Select All for this if you want to.

Choose Edit, Copy. This copies the text to the clipboard. Close Notepad. Put the focus back on Toolbox. The focus should be on the text window. Put the cursor after the \tx marker. Choose Edit, Paste. The text is copied into the \tx field.

Numbering a New Text The first thing we want to do with a new text is to give it reference numbers. Put the cursor at the beginning of the \tx field. Choose Tools, Break/Number Text. You will see a dialog box with information about how the text breaking and numbering will be done. Notice that it says it will only do the current record, and that it will break text at period, question mark and exclamation mark. It says it will process only \tx fields. It says it will start with number 001. One thing we need to change is that we don’t want it to use the full name of the text for references. References work much better if they are kept short. So we will tell it to use a short name instead of the full contents of the \id field. Click on “Use this Name”. A box opens for the name. Put the cursor into the box for the name and type “Race”. Everything else is OK, so we are ready to do the numbering. Click OK. You will see the text broken into sentences with a reference number above each sentence.
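The breaking and numbering step can be pictured with a small Python sketch. It is a simplified stand-in for what the dialog box describes, not Toolbox's own code. The function name is made up, the sample sentences are placeholders rather than real Amadup text, and the sketch only breaks after a period, question mark, or exclamation mark that is followed by white space.

```python
import re


def break_and_number(text, name, start=1):
    """Split text into sentences and put a \\ref line above each one."""
    # Break after ., ? or ! when white space follows.
    sentences = re.split(r"(?<=[.?!])\s+", text.strip())
    lines = []
    for number, sentence in enumerate(sentences, start):
        lines.append(f"\\ref {name}.{number:03d}")
        lines.append(f"\\tx {sentence}")
    return lines


# Placeholder text, not real Amadup.
numbered = break_and_number("Aaa bbb. Ccc ddd? Eee fff!", "Race")
print("\n".join(numbered))
```

Using the short name “Race” keeps each reference compact, in the form Race.001, Race.002, and so on.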

Renumbering a Text There are a few places where we don’t want quoted text broken apart, so we will delete a few references to keep short quotes together. To delete a reference, you can do a left click on the marker to select the whole field, and then press Delete to delete the field. Delete \ref Race.007. Delete references 9 and 13. After we delete or add references, we adjust the numbering of the text. Choose Tools, Renumber Text. The default values are correct. Choose OK. Note that if you renumber text after you have made a word list, you should rebuild the word list to adjust for the changed references.
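Renumbering can be sketched the same way. The idea, under the assumption that references are simply reassigned in order, is that every remaining \ref line gets the next number in sequence. The function name and the placeholder lines are made up for this illustration.

```python
def renumber(lines, name, start=1):
    """Give the remaining \\ref lines consecutive numbers."""
    renumbered = []
    number = start
    for line in lines:
        if line.startswith("\\ref "):
            renumbered.append(f"\\ref {name}.{number:03d}")
            number += 1
        else:
            renumbered.append(line)
    return renumbered


# Placeholder lines after the reference Race.002 has been deleted,
# so two \tx lines now sit under Race.001.
after_delete = [
    "\\ref Race.001", "\\tx Aaa bbb.", "\\tx Ccc ddd?",
    "\\ref Race.003", "\\tx Eee fff!",
]
fixed = renumber(after_delete, "Race")
```

After the call, the gap left by the deleted reference is closed, so the old Race.003 becomes Race.002.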

Adding Free Translation You will normally want a free translation of a text before you try to analyze it. You will probably get that from the person who gave you the text. You will usually type it into the text after you have broken the text into units. It will be easiest if you type each new text directly into Toolbox, because then you can type the free translation of each sentence at the same time. But for this example, I am illustrating how you can bring in a text that was already typed in another program. To show how it works to add a free translation to an existing text, we will add the first one. Place the cursor at the end of the \tx line of Race.001. Press Enter. You will see a \ft marker appear. Type “Do you remember the frog and the fish?” Each free translation can be entered this way. Place the cursor at the end of the \tx line of Race.002. Press Enter. Again, you will see a \ft marker appear. To save you from typing all the free translations, the next exercise will begin with them all typed in. Exit Toolbox.

Exercise 7: Analyzing a New Text This exercise will show you how to analyze a new text. Open the exercise project. Toolbox should show four windows, one titled "Frog Meets Fish.txt", one titled “Amadup.dic”, and two word list windows. The text window shows the text titled “Frog Races Fish”. Scroll down the text window and read the free translation of the text. Scroll back up to the top of the window.

Interlinearizing the Text Interlinearize sentence 1. Most of the words parse. You can see that the word "en" appears to mean "and". Insert "en" into the dictionary with part of speech "conj" and gloss "and". (It is split apart in the \mb line, but you can delete the hyphen and space to put it together so you can jump to it for the insert.) The root "kel" appears to mean "remember". Insert "kel" into the dictionary with part of speech "v" and gloss "remember". Look carefully at the analysis and think about what we know of the verb morphology of this language. You will realize that verbs cannot take case markers, so this is a wrong parse of this verb. I won't give the details here, but it would be possible to hypothesize from the data in this text that "-j" on a verb is the marker for second person singular. It is what causes this sentence to refer to "you". So we want to insert a dictionary entry for "-j" as "2s".

Inserting a New Entry of the Same Form as an Existing One Right click on "-j". The dictionary jumps to the existing entry for "-j". It didn't give us a chance to insert a new one. We will have to force it to let us do that by telling it to insert a new record. Choose Database, Insert Record. It should default to inserting "-j" because the cursor was on that. Choose OK. You will see the Multiple Matches dialog box. It is saying that there already is a "-j". Choose Insert. You will see a new entry for "-j". Enter part of speech "pers" and gloss "2s". Return from jump (Ctrl+R) to interlinearize "kelj".

Ambiguity Selection You will see a box titled Ambiguity Selection. It shows "Obj{case}" and "2s{pers}". This is telling you that the affix "-j" is ambiguous between these two meanings. You can see that the "-j" in the text is highlighted. (You may need to move the ambiguity selection box down so it does not cover the highlighted text.) You can also see that it is on the verb "remember". This tells you that it is a person marker, not a case marker. In the ambiguity box, double click on "2s{pers}". You will see the box close and the text interlinearized. The first sentence is now completely parsed. Interlinearize sentence 2. The sentence will be parsed except for two words. The word "O" appears to be the sentence introducer translated as "well". Enter "o" in the dictionary with part of speech "part" and gloss "well".
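The reason the box appears can be shown with a small Python sketch. It assumes a lexicon that allows more than one entry under the same form. The function names are made up for this illustration, and the choices are printed in the same gloss{part of speech} style as the ambiguity box.

```python
def build_lexicon(entries):
    """Group (form, part of speech, gloss) entries by their form."""
    lexicon = {}
    for form, ps, gloss in entries:
        lexicon.setdefault(form, []).append((gloss, ps))
    return lexicon


def ambiguity_choices(lexicon, form):
    """List every entry for a form as gloss{part of speech}."""
    return [f"{gloss}{{{ps}}}" for gloss, ps in lexicon.get(form, [])]


lexicon = build_lexicon([
    ("-j", "case", "Obj"),
    ("kel", "v", "remember"),
    ("-j", "pers", "2s"),  # second entry with the same form
])

print(ambiguity_choices(lexicon, "-j"))
print(ambiguity_choices(lexicon, "kel"))
```

A form with one entry gives one choice and needs no question. A form with two entries gives two choices, and something (you, or a word formula) has to pick between them.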

A Fourth Gender The other word that fails is "defch". Notice that this word is a determiner on "lyf" (lily pad), so we will now get to find out the gender of "lyf". Sadly, it appears that "lyf" is not one of the three genders we already had, so we will have to make it gender number 4. It appears that its gender marking is "-f". (Since the "-t" gender epenthesized an "e" in the determiner, we will hypothesize that "-f" does the same.) Insert "-f" into the dictionary with part of speech "g" and gloss "4". Give it an alternate form of "-ef".

Return from jump (Ctrl+R) to interlinearize "defch". Choose "Obj{case}" in the ambiguity box. Now that we know "lyf" is gender 4, we can put that information into its part of speech. Edit the entry for "lyf" and change its part of speech to "n4". Return from jump (Ctrl+R) to interlinearize. Choose "Obj{case}" in the ambiguity box. Sentence 2 is now completely parsed. Interlinearize sentence 3. Choose "Obj{case}" for every ambiguity. Everything parses. This is an example of how parsing can recognize combinations of roots and affixes that have not been seen before. Interlinearize sentence 4. Choose "Obj{case}" for every ambiguity. Everything parses except for “fo” and "Zumtaj". Notice that "fo" has a parse guess of two gender suffixes "-4 -2". This is clearly wrong. It appears that "fo" is really a postposition that means "to". Insert "fo" into the dictionary with part of speech "pp" and gloss "to". It analyzes correctly. Because the whole word is found in the dictionary, Toolbox doesn't try to cut it into pieces. The word "zumtaj" doesn't parse because we don't have the verb yet. From other data in this text, we could hypothesize that "-ta" is a future tense marker. So that means that "zum" is the root. Insert "-ta" into the dictionary with part of speech "tns" and gloss "Fut". Insert "zum" into the dictionary with part of speech "v" and gloss "race". Return from jump (Ctrl+R) to interlinearize. Choose "2s{pers}" in the ambiguity box. Sentence 4 is completely parsed. Exit Toolbox.

Exercise 8: Word Formulas This exercise will show you how to use word formulas to reduce ambiguities. Open the exercise project. Toolbox should show four windows, one titled "Frog Meets Fish.txt", one titled “Amadup.dic”, and two word list windows.

Word Formulas By now you must be getting tired of always having to choose the ambiguity between the "-j" objective case marker and the "-j" second person marker. Toolbox can use word formulas to reduce this type of ambiguity. Word formulas are descriptions of the morphology of words in the language. They can be used to say that verb affixes occur only on verbs and noun affixes occur only on nouns. To make a new set of word formulas, we go to the interlinear page of the database properties. Choose Database, Properties, Interlinear. Choose Modify. You will see a box titled Parse. At the lower left is a check box titled "Enable word formulas". Turn on "Enable word formulas". You will see that the "Formulas" button at the lower right becomes available. Click Formulas. You will see a box titled "Word Formulas". Click Modify. You will see a box that says Symbol is "Word". It has an open box titled "Patterns (one pattern per line)". In that box we will put a formula for each kind of word. The formulas are made up of parts of speech. Type "det g case". This says that a word may consist of a determiner followed by a gender marker and a case marker. Press Enter and type "adj g case". This says that another possible word is an adjective with a gender marker and a case marker. We will put in a formula for every type of word we have seen. If a part of speech is in parentheses, that means it is optional. On the verbs, we have two alternatives, one with a gender agreement marker, and one with a person marker. Type the following set of formulas: (The first two you already did.)

det g case
adj g case
Noun case
v (tns) (asp) g
v (tns) (asp) pers
part
conj
pp

Notice that "Noun" is in upper case, and that it is not a part of speech. That is because it is an intermediate symbol that we are using to stand for any of the 4 genders of nouns. We use a capital letter to distinguish it from an actual part of speech. We will define "Noun" as any of the 4 genders of nouns (n1, n2, n3, n4). Click OK. You will see the box titled "Word Formulas". We will add a formula for "Noun". Click Add. Enter "Noun" as the Symbol. Under Patterns, enter the following:

n1
n2
n3
n4

This says that a Noun can be any of the 4 noun genders. Click OK. You will see the box titled "Word Formulas". You will see that we now have symbols for "Noun" and for "Word". The symbol "Word" has a "P" to its left, which indicates that it is the primary symbol, the highest one. This means it is the symbol that Toolbox starts with when it tries to compare the structure of a word to the word formulas. We have now entered our set of word formulas. The formulas cover all the parts of speech that we have entered into the dictionary so far. Click OK three times to return to the text. Now we will test our word formulas on some words. If they work right, we should no longer see the ambiguity box. Interlinearize "zumtaj". It should show the same annotations it had before. Interlinearize "doj", "jly", and "sezgonsew". They should all show the same annotations as before. The ambiguity box should not appear. If the ambiguity box appears, or the annotations change, go back and check your word formulas carefully to be sure they are exactly as given above. Note that upper and lower case are significant. (If you can't get the formulas to work, then go on to the next exercise, which starts from here with the formulas working. Look at the formulas in that exercise to see what is different from what you did here.) You can see that word formulas can greatly reduce the number of ambiguities you have to choose by hand. They can't get rid of all ambiguities, but they can get rid of a lot of them.
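To see why the formulas remove the "-j" ambiguity, here is a minimal Python sketch of the idea. It is not how Toolbox is implemented. The function names are made up for this illustration, and the sketch simply lists every sequence of parts of speech that the formulas allow and then checks each candidate parse against that list.

```python
from itertools import product

# The same formulas as entered in the Word Formulas box.
FORMULAS = {
    "Word": [
        "det g case",
        "adj g case",
        "Noun case",
        "v (tns) (asp) g",
        "v (tns) (asp) pers",
        "part",
        "conj",
        "pp",
    ],
    "Noun": ["n1", "n2", "n3", "n4"],
}


def expand(symbol, formulas):
    """Return every part-of-speech sequence that a symbol can stand for."""
    sequences = set()
    for pattern in formulas[symbol]:
        slots = []
        for item in pattern.split():
            optional = item.startswith("(") and item.endswith(")")
            name = item.strip("()")
            # A name with its own formula is an intermediate symbol.
            options = expand(name, formulas) if name in formulas else {(name,)}
            if optional:
                options = options | {()}  # the item may be left out
            slots.append(options)
        for combo in product(*slots):
            sequences.add(tuple(ps for part in combo for ps in part))
    return sequences


def passes(ps_sequence, formulas=FORMULAS, primary="Word"):
    """True if the sequence matches a pattern of the primary symbol."""
    return tuple(ps_sequence) in expand(primary, formulas)


# The two candidate parses of "zumtaj": -j as case marker or as person marker.
candidates = {
    "race -Fut -Obj": ["v", "tns", "case"],
    "race -Fut -2s": ["v", "tns", "pers"],
}
surviving = [gloss for gloss, ps in candidates.items() if passes(ps)]
print(surviving)
```

Only the parse with the person marker fits a formula, so only one choice is left and there is nothing to ask the user. A word made of parts that fit no formula, such as a verb with two person markers, is rejected in the same way.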

Analyzing the Rest of the Text Interlinearize sentence 5. Looking at the sentence, we can see that “watr” has a gender 4 determiner, so it must be a gender 4 noun. The other failed words are quite easy to gloss. Insert “watr” with part of speech “n4” and gloss “pond”. Insert “ovr” with part of speech “pp” and gloss “across”. Insert “je” with part of speech “pron” and gloss “I”. Interlinearize sentence 6. Insert “zumy” with part of speech “adv” and gloss “fast”. Insert “abel” with part of speech “v” and gloss “can”. Interlinearize sentence 7. Insert “yep” with part of speech “part” and gloss “OK”. Insert “ren” with part of speech “v” and gloss “go”.

Interlinearize sentence 8. The verb doesn’t parse. It appears to have a suffix we have not seen before. We will look at another sentence before we decide what to do with it. Interlinearize sentence 9. The verb “bzervegomo” looks like it has the same extra suffix as the verb of sentence 8. We will analyze it as iterative aspect. Insert “-m” with part of speech “asp” and gloss “Iter”. Insert “evwer” with part of speech “adv” and gloss “around”. Interlinearize sentence 10. Insert “ya” with part of speech “part” and gloss “ha”. Insert “em” with part of speech “pron” and gloss “he”. Insert “noprob” with part of speech “adv” and gloss “easily”. Insert “mash” with part of speech “v” and gloss “beat”. Interlinearize sentence 11. Insert “hevr” with part of speech “conj” and gloss “but”. Insert “af” with part of speech “pp” and gloss “from”. Insert “shor” with part of speech “n4” and gloss “bank”. Insert “lyp” with part of speech “v” and gloss “jump”. Interlinearize sentence 12. Insert “hol” with part of speech “adj” and gloss “whole”. Interlinearize sentence 13. Insert “gin” with part of speech “conj” and gloss “when”. Insert “far” with part of speech “adj” and gloss “far”. Insert “arav” with part of speech “v” and gloss “arrive”. Insert “b” with part of speech “v” and gloss “be”. Interlinearize sentence 14. The verb has an affix not seen before. We will analyze it as recent past. Insert “-ty” with part of speech “tns” and gloss “RPst”. Interlinearize sentence 15. Insert “amung” with part of speech “pp” and gloss “through”. Interlinearize sentence 16. It has no failures.

Correcting a False Analysis Interlinearize sentence 17. It shows an ambiguity box that asks you to choose between two parses. But neither of them is correct. Notice that both have an asterisk to the left, which says that they did not pass the word formulas. One has a case marker on a verb, and the other has two person markers. Choose the analysis with two person markers, just temporarily until we fix it. Looking at “wiglmej” it appears that it should parse as “swim -Iter -2s”. Iterative aspect is “m”. But there is an extra “e” between the “m” and the “j” of “-2s”. To get the right parse we have to account for that “e”. As usual, we will put it on the outer affix, so it goes in the entry for “-j”. Do a right click on “-j”. You will see the Multiple Matches dialog box. Select the “-j” that has a gloss of “2s”. At the end of the entry add “\a -ej”. Return from jump (Ctrl+R) to interlinearize. It now parses correctly as “swim -Iter -2s”. We have now completely parsed this second text! Exit Toolbox.

Exercise 9: Morphological Summary This exercise will show you how to update your grammatical and morphological summary. It also gives a fully parsed version of the second text in case you had any trouble in the previous exercise. Open the exercise project. Toolbox should show four windows, one titled "Frog Meets Fish.txt", one titled “Amadup.dic”, and two word list windows.

Updating the Grammatical and Morphological Summary Now that we have parsed a second text, we can upgrade our summary of the morphology and syntax of Amadup based on the new information from this text.

Seeing Dictionary Data Organized for the Summary Look at the expanded paradigm by duplicating the Amadup.dic window, sorting it by “ps”, “ge” and “lx”, and browsing by the same 3 fields. Use the “Affixes” filter. This gives us a summary of all of the affixes. Also scroll through the text and look at the word order in phrases and clauses.

Writing the Summary Here is a possible updated summary: Amadup is a head-final language with postpositions. It seems to be SOV. Nouns have gender and are marked with case. There are at least 4 genders. Determiners and adjectives agree with their head noun in gender and case. The main verb of a clause agrees with the subject in gender, or has a person suffix. Verbs may have tense and aspect suffixes.

Partial paradigms:
Case: Nom -n, Obj -j
Gender: 1 -w, 2 -o, 3 -t, 4 -f
Tense: DPst -go, RPst -ty, Fut -ta
Aspect: Pnct -ns, Iter -m
Person: 1s -me, 2s -j

We could write a more thorough summary with examples from this text if we wanted to. Exit Toolbox. As I am sure you have realized, I am the only speaker of the Amadup language. (Its name stands for “A Made-Up Language”.) I had great fun making these exercises. I hope you have found them interesting, fun, and enlightening. I hope you will enjoy using Toolbox and be very successful with it! Alan Buseman
