The candidate confirms that the work submitted is their own and the appropriate credit has been

given where reference has been made to the work of others.

I understand that failure to attribute material which is obtained from another source may be

considered as plagiarism.

(Signature of student)_________________________________

Unsupervised Machine Learning

software for Morphology

Challenge

Atif Akhtar

Computer Science

Session 2008 / 2009

Summary

This project is based on the Morphology Challenge, in particular the 2005 challenge. It aims to

develop an Unsupervised statistical machine learning software that will be able to segment an

input list of words into morphemes.

There are 2 phases to the implementation and within each there are 7 algorithms, each is a

refinement of the previous one. At the end the scores from the evaluations on the algorithms are

compared and a conclusion made.

Contents Page

1. Introduction

1.1 Aim

1.2 Relevance to Degree

1.3 Morphology Challenge

1.4 Minimum Requirements

1.5 Methodology

1.6 Schedule

1.7 Objectives

2. Background Research

2.1 Introduction

2.2 Unsupervised Morphemes Segmentation

2.3 SUMA-Simple Unsupervised Morphology Analysis Algorithm

2.4 A Simpler, Intuitive Approach to Morpheme Induction

2.5 Starting Point

3. Implementation/Evaluation

3.1 Phase I

3.1.1 X1

3.1.2 X2

3.1.3 X3

3.1.4 X4

3.1.5 X5

3.1.6 X6

3.1.7 X7

3.2 Summary of Evaluations

3.21 Issues Encountered

3.3 Phase II

3.31 Summary of Development

3.32 XX1

3.33 XX2

3.34 XX3

3.35 XX4

3.36 XX5

3.37 XX6

3.38 XX7

4. Summary of Evaluations

4.1 Extensions/Issues Encountered

4.2 Conclusion

Appendix

A Personal Reflection

B Plan of Schedule

B2 Actual Plan

C Shows the comparison of the 4 major algorithms.

D Diagram shows flow of algorithm for the main program

E Example Tables

F Diagram shows the running of the XX7 algorithm.

G Diagram shows the word aftercare’s

H Shows the evaluation results of all the algorithms

1 Evaluation List D1

2 Sample Result List of X2

3 Sample Results X1

4 Sample Result List of X2

5 Sample Result of X3

6a Sample of Prefixes X4

6b Sample of Results from running X4 on Evaluation Set D1:

6c Sample of Results from running X4 on Evaluation Set D2:

7a Sample of Prefixes X5

7b Sample of Results from running X5 on Evaluation Set D1

7c Sample of Results from running X5 on Evaluation Set D2:

8a Sample of Prefixes X6

8b Sample of Results from running X6 on Evaluation Set D1

8c Appendix 8c, Reversed Sample of Results of X6 on D2:

8d Re-Reversed Sample of Results of X6 on D2:

9a Sample of Prefixes X7

9b Sample of Results from running X7 on Evaluation Set D1

9c Sample of Results from running X7 on Evaluation Set D2

10 Results of algorithm XX1

11 Results of XX2

12 Results list for XX3

13 Results list for XX4

14a Results for XX5

14b Remainder List

15a Results for XX6

15b More Remainder List

16 Results for XX7

17 Prefix List used in Algorithms XX1-XX7

17b Suffix List used in Algorithms XX1-XX7

1.1 Aim

The aim is to develop unsupervised statistical machine learning software that can segment the words of a language into their smallest meaningful segments, i.e. morphemes. Morphemes are commonly regarded as basic vocabulary units that can be used for tasks such as text understanding, machine translation, statistical language modeling and information retrieval.

1.2 Relevance to degree:

This problem relates to the Artificial Intelligence division of my degree. The modules involved

include: AI22; Fundamentals of Artificial Intelligence, AI32; Natural Language Processing and

SE20; Object Oriented Software Engineering.

This problem is taken from the International Morphology Challenge competition that has been

running for the last 4 years.

1.3 Morphology Challenge

The Morphology Challenge sets the task described above, with variations from year to year, and follows a fixed methodology. Participants are given a list of words, along with their frequencies, collected from corpora of the relevant language. They then use this information to construct an algorithm that, in the simplest version of the challenge (the 2005 challenge), segments the words into morphemes as accurately as possible. The accuracy of the software is measured by an evaluation script provided for each challenge, which compares the output of the software with a gold standard provided by the organizers of the Morphology Challenge.

So for example if the word list for the 2005 Morphology Challenge contained:

Reading

Read

Reads

Red

Rope

Then the output expected from the program would be:

Read ing

Read

Read s

Red

Rope

The challenge increased in complexity over the years: further languages were added and further analysis of the word list was required, i.e. segmenting words into morphemes was no longer the only goal. In Morphology Challenge 2007, the words also had to be classified into a category. For example, if the words in the list were:

Boots

Boot

Foot

Feet

The output expected would be:

Boot +plural

Boot

Foot

Foot +plural

1.4 Minimum Requirements

• To derive a suggested affix list from the English dataset of words.

• To produce a piece of software that can segment the words in the English dataset based on the affix list, i.e. to follow the Morphology Challenge 2005 task.

• To evaluate the output of the software using the evaluation script provided by the Morphology Challenge organizers. This is vital for judging the accuracy of the results.

• To comply with the rules of the Morphology Challenge, i.e. the software must be unsupervised and language independent, so code relating to English grammar cannot be used. It must be a generic solution.

1.5 Methodology

The first step will be to look at the proceedings of the 2005 Morphology Challenge. This gives an insight into how the participants went about constructing a solution and will help to generate ideas on how to start designing one. A rapid prototyping approach will be taken, as opposed to an incremental staged approach (the Waterfall method). After the basic algorithm has been implemented, it will be refined numerous times in order to get better results. There will be 3 phases in each iteration:

1. Design - This will involve setting goals for what the algorithm is expected to do.

2. Implementation - This involves actually implementing the algorithm in Java.

3. Evaluation - This will be similar to a reflection: what was gained from implementing the algorithm, and was it a success? This will also indicate how to progress to the next prototype.

Thus the evaluation and implementation will run concurrently. This is felt to be a good approach because the project is quite practical: reading large amounts of theory would not be very productive, whereas a practical approach gives quicker results and reflections on the algorithm, from which improvements can be made. It also highlights what can actually be implemented, rather than doing heavy design work and reaching a stage where the algorithm cannot be implemented.

1.6 Schedule

Appendix B displays the planned schedule at the start of the Project and Appendix C displays the

actual plan. The actual compared to the plan looks quite different but the implementation was

split into two parts overall, so Figure 1 in Appendix C serves as the first iteration described in the

plan in Appendix B and Figure 2 in Appendix C serves as the second iteration. It is to be noted

that in total there were 14 iterations, 7 in the first part and a further 7 in the second.

List of Milestones

Task                    Start        Finish
Mid-Term Report         15/11/2008   19/12/2008
Completion of Phase 1   14/01/2009   10/03/2009
Project Presentation    18/03/2009   18/03/2009
Completion of Phase 2   10/03/2009   12/04/2009
Final Write-Up          14/04/2009   29/04/2009

1.7 Objectives

As the project is based on the Morphology Challenge, the Morphology Challenge website provides plentiful, if not sufficient, resources for what needs to be achieved.

The first step of research will be to investigate the website and to read through the proceedings of

the participants over the number of years. This will help to stimulate ideas of how to go about

structuring an initial algorithm.

Once the background research phase is completed, the objective is to come up with a basic, perhaps rough, algorithm that can be used to explore how a dataset of words can be segmented, and then to implement it in the Java programming language.

Java was chosen firstly because of familiarity with it, and secondly because it is object oriented, which provides greater flexibility and reusability. It also has a simple and clean structure, which should make the program easier to write.

2. Background Literature Review

2.1 Introduction

The term ‘morphology’ was first coined early in the nineteenth century, in a biological context, by the German poet, novelist, playwright and philosopher Johann Wolfgang von Goethe. Its etymology is Greek: morph- means ‘shape, form’, and morphology is the study of form or forms. In linguistics, “morphology refers to the mental system involved in word formation or to the branch of linguistics that deals with words, their internal structure, and how they are formed” (Aronoff, M. and Fudeman, K. (2005), pp. 1-2).

Though words are generally considered as being the smallest units of syntax, it is clear that in

most languages, words can be related to other words through rules. For example, English

speakers understand that the words dog, dogs, and dog catcher are closely related. English

speakers recognize these relations from their tacit knowledge of the rules of word formation in

English. The rules understood by the speaker reflect specific patterns in the way words are

created from smaller units and how those smaller units work together in speech. In this way,

morphology is the branch of linguistics that studies patterns of word formation within and across

languages, and attempts to devise rules that model the knowledge of the speakers of those

languages.

Every language has its own unique structures. From the sound system to meaning (semantics), they form the foundation of a language. Acquiring a language implies acquiring all those structures.

“Morphology is an area that studies structures, forms and categorizations of words.”

(Jalaluddin. N. H. (2008) Pg.109)

Affixes.

Malay has prefixes, suffixes, circumfixes and infixes, while in English prefixes and suffixes are more prominent. One difference between Malay and English affixes is that English affixes can indicate or produce negative meanings, for example im-, dis-, mal- and ir-. These affixes transform positive meanings into negative ones, as in possible and impossible, or obedient and disobedient. This phenomenon does not exist in Malay.

Prepositions exist in both Malay and English. However, their usage may sometimes be influenced by culture.

Plural inflections- ‘s’ and ‘es’.

Inflections are affixes added to a root word to indicate a grammatical meaning. In English, -s is added to book to give books, indicating plurality, and -ed is added as in walked or talked to indicate past tense. Inflection, however, does not exist in Austronesian languages, including Malay. In English there are three markers to indicate plurality: -s, -es and -ies. Plural inflection becomes more complicated when it is influenced by phonological rules. For words ending with the consonant /h/, the plural form is inflected with -es, for example ostrich, ostriches. However, this does not occur with words that end with /t/, as in accident, accidents.

In Malay, by contrast, plurality is indicated by cardinal or ordinal words. Some examples of Malay cardinal words are semua (all), sebahagian (some) and tiap (every), while ordinal words are kedua (second), ketiga (third), keempat (fourth) and many others (Asmah 1986). Plurality can also be indicated by adding the prefix ber- to words of measurement, which then undergo a reduplication process, for example berjam-jam (hours), berhari-hari (day after day), berbulan-bulan (month after month) and many others. It is clear that Malay and English have different forms to indicate plurality.

Adverbs are easily identified in English with the –ly marker as the clue.

Syntax

Syntax is one of the main areas of linguistics in which sentence structures and patterns are

analyzed. Although English and Malay share the basic structure that is ‘subject-verb-object’,

there are numerous other differences between the two languages such as the usage of copula ‘be’,

subject-verb agreement, articles, determiner and relative pronouns.

Copula ‘be’.

Within the English grammatical system, the copula ‘be’ is vital within a sentence as it links the subject of a sentence with a predicate. There are three present tense forms of the copula ‘be’: ‘am’ for the first person singular, ‘is’ for third person singular subjects and ‘are’ for plural subjects as well as ‘you’. In the past tense, ‘was’ is used for singular subjects (I, he, she, it) while ‘were’ is used for plural subjects (you, we, they), including ‘you’ in the second person singular.

Determiners.

In the grammatical system of English, the indefinite articles (a, an) and the definite article (the) are two of the types of determiner that are used to premodify a head noun in a noun phrase.

Relative pronouns

In English, relative pronouns are used to connect one clause to another. Relative pronouns refer to nouns that have been mentioned previously in the clause or sentence. The following are the 5 relative pronouns in the English language: that, which, who, whom and whose. ‘Who’, ‘whose’ and ‘whom’ are used when referring to people. ‘Which’ is used to refer to things, places or ideas, and ‘that’ can be used to refer to people or things.

(Jalaluddin. N. H. (2008) pp.109-115.)

The main focus of the background reading was the proceedings of the Morphology Challenge, in particular those of 2005.

A few previous approaches have been described in (Goldsmith, 2001), but there are generally 4 ways of attempting a solution:

1. Identification of morpheme boundaries using transitional probabilities

2. Identification of morpheme-internal bigrams or trigrams

3. Discovery of relationships between pairs of words

4. An information-theoretic approach to minimize the number of letters in the morphemes of the language

2.2 Unsupervised Morphemes Segmentation

The first paper that was examined was (Rehman and Hussain 2005). The approach laid out in this

report was of two stages:

• Learning Model

• Segmentation

The learning model stage builds a model from a corpus, from which a list of morphemes can be derived; the segmentation stage then uses these morphemes to segment the words. The paper describes two basic types of morpheme: roots and affixes. The root is the main part of the word, and the affixes are prefixes and suffixes.

An important point in the paper is that limits have been set on the length of morphemes. For part of a word to qualify as a prefix or suffix it can be at most 3 characters long, while the length of a root morpheme is limited to 13 characters.

This is something to keep in mind. The purpose of setting limits is to prevent complications and to exclude anomalies, so that it is better understood exactly what is being analysed; it focuses the sample set between known thresholds. Though this could aid understanding and help with initial debugging, words beyond the thresholds would not be analysed, which would affect the reliability of the solution.

The learning model was implemented in Microsoft Access and took the most frequent words from a given corpus as possible morphemes. The learning model does not, however, extract words that have a frequency of less than 7 in the corpus; this can be seen as another control mechanism. Such mechanisms could be important because they refine the solution and give better results: misspelt words would presumably have quite a low frequency, so filtering out some of the infrequent words could get rid of some noise. In the proceedings, however, the purpose of this measure was to speed up processing.

This leads to another issue, running time: how long would the software take to extract morphemes from a given dataset?

Another factor to consider is what effect adding limits on the morphemes would have on the final F-measure.

An example is given in the paper as to how the algorithm operates, given a set of words:

1. Ab

2. Abacus

3. About

4. Abreast

5. Again

6. Bargain

The algorithm starts with the first letter of the first word and checks the occurrence of that letter within the list. ‘A’ occurs at the beginning of a word 5 times in the list. The program then searches for ‘A’ as a separate word in the list. If it is found, ‘A’ is accepted as a possible segmentation point; otherwise it is ignored.

The algorithm then moves on to the second letter and looks for occurrences of the string ‘ab’ in the list. It finds it 4 times, and it also finds ‘ab’ as a word on its own, so it adds ‘ab’ to the initially empty list of valid segmentations, which it can use in the segmentation phase. In this way the algorithm makes its way through all the entries of the word list, either adding substrings to the list of valid segmentations or ignoring them.
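The learning step described above can be sketched in Java as follows. This is only an illustration and not the code from the paper (which used Microsoft Access): the class and method names are hypothetical, words are lower-cased before comparison, and the requirement that a candidate be shared by at least two words is an assumption added so that a word does not qualify merely by matching itself.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the learning model: names and thresholds are assumptions.
final class ValidSegmentSketch {

    /** Counts the words in the list that start with the given substring. */
    static int countStartingWith(List<String> words, String prefix) {
        int count = 0;
        for (String w : words) {
            if (w.startsWith(prefix)) {
                count++;
            }
        }
        return count;
    }

    /**
     * Collects every leading substring that is shared by at least two words
     * (an assumed threshold) and also appears in the list as a word on its own.
     */
    static Set<String> validSegments(List<String> rawWords) {
        Set<String> words = new LinkedHashSet<>();
        for (String w : rawWords) {
            words.add(w.toLowerCase(Locale.ROOT));
        }
        List<String> wordList = new ArrayList<>(words);
        Set<String> valid = new LinkedHashSet<>();
        for (String word : wordList) {
            for (int end = 1; end <= word.length(); end++) {
                String candidate = word.substring(0, end);
                boolean isWord = words.contains(candidate);
                boolean shared = countStartingWith(wordList, candidate) > 1;
                if (isWord && shared) {
                    valid.add(candidate);
                }
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        List<String> words = List.of("Ab", "Abacus", "About", "Abreast", "Again", "Bargain");
        System.out.println(validSegments(words));
    }
}
```

On the six example words only ‘ab’ is accepted, since ‘a’ begins five words but never appears as a word on its own.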

Segmentation Phase

The segmentation phase is implemented in Visual Basic. The segmentation model aims to separate the prefix, suffix and root morpheme of each word. First, the algorithm checks each character of a word from the first to the last. If any part of it is found in the valid segments list and the remaining characters are also found in the list, the separated segment is treated as a possible prefix. The rest of the string is then passed on to the suffix assessing part, which carries out a similar role, except that it starts from the end of the word and works its way back up to a maximum of 3 trailing characters, using the list to determine suffixes in the same way as it does for prefixes.

Points Gathered From this report

• This report gave an initial understanding of how one might go about working on a solution

• Identified two possible stages in a solution, Affix Gathering & Segmentation.

• Limits set on character lengths to simplify analysis of words, however this could affect the

total reliability of the solution but perhaps a compromise has to be sought between filtering

out noise and reliability.

• Filtering out helped speed up processing.

• Dictionary sort used on the valid segmented list, the importance not yet fully understood.

2.3 SUMA-Simple Unsupervised Morphology Analysis Algorithm

(Dang, M and Choudri, S. 2005)

Key Terms:

• Successor Variety - The number of different letters that follow a given substring of a word, counted over the words in the word list; used for prefix gathering.

• Predecessor Variety - The number of different letters that precede a given substring of a word, counted over the words in the word list; used for suffix gathering.

• Peaks - The substring at which the maximum successor or predecessor variety value occurs.

This report used a slightly different method from the one explored previously. It focused heavily on recognising patterns and structure in the language. The main strategy is to record successor and predecessor varieties. However, like the previous solution, there were two stages:

• Affix Gathering

• Segmentation

The varieties can be illustrated with an example, if the word list consists of the words:

1. Reading 2. Reads 3. Red 4. Rope 5. Ripe 6. Read

Like the previous algorithm, this one starts with the first letter of the first word, so it examines the ‘R’ in Reading. It then counts the different letters that follow ‘R’ in the word list, so in this case the algorithm records a value of 3 (e, o, i). The algorithm then moves on to the next letter, searches for the different letters that follow ‘RE’ and records that value. In this way the algorithm goes through the whole word, recording successor varieties, and then moves on to the next word.

Once the successor varieties have been recorded, the algorithm searches for ‘peaks’, i.e. the substrings with the highest successor variety count.

The table below illustrates this for the word ‘READABLE’, assuming it is added to the list above:

Substring    Successor Variety    Successor Letters
R            3                    E, O, I
RE           2                    A, D
REA          1                    D
READ         3                    A, I, S
READA        1                    B
READAB       1                    L
READABL      1                    E
READABLE     1                    (blank)

Here it can be seen that two peaks occur, one at the beginning of the word and one at ‘READ’.

The report indicates that adding the two substrings above to the valid segment list and using that list to segment the words did not yield great accuracy on evaluation.

The reason is simple: so many words start with ‘R’ that its successor variety will be very high, but if ‘R’ is taken into the valid segment list, every word beginning with ‘R’ would be segmented after its first letter, which would give an incorrect segmentation for most words.

The solution was then refined. Predecessor variety values were collected in the same manner as the successor variety values, and peaks were found in the same way; the results seemed to improve slightly. This was explained by the words in the word list being more heavily suffixed than prefixed. The end product was a hybrid of the two, which produced the best results of the three methods.
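The successor variety count can be sketched in Java as follows. This is an illustrative sketch rather than the SUMA authors' code: the class and method names are hypothetical, the word ‘readable’ is added to the example list as an assumption, and the end of a word is not counted as a successor, so the full word scores 0 here rather than the 1 (blank) a convention that counts the word end would give.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of successor variety counting; not taken from the SUMA report.
final class SuccessorVarietySketch {

    /** Returns the distinct letters that follow the given prefix anywhere in the word list. */
    static Set<Character> successors(List<String> words, String prefix) {
        Set<Character> next = new TreeSet<>();
        for (String w : words) {
            // Only words strictly longer than the prefix contribute a following letter.
            if (w.length() > prefix.length() && w.startsWith(prefix)) {
                next.add(w.charAt(prefix.length()));
            }
        }
        return next;
    }

    /** Successor variety of every leading substring of the word, shortest first. */
    static int[] successorVarieties(List<String> words, String word) {
        int[] counts = new int[word.length()];
        for (int end = 1; end <= word.length(); end++) {
            counts[end - 1] = successors(words, word.substring(0, end)).size();
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words =
                List.of("reading", "reads", "red", "rope", "ripe", "read", "readable");
        String word = "readable";
        int[] varieties = successorVarieties(words, word);
        for (int i = 0; i < varieties.length; i++) {
            System.out.println(word.substring(0, i + 1) + " " + varieties[i]);
        }
    }
}
```

Predecessor varieties can be obtained from the same methods by reversing every word first, which matches the observation that suffix gathering is a reversal of prefix gathering.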

Points Realised from Report

• Again the solution is structured in two stages; this is beginning to seem like a fundamental structure.

• The problem is not limited to words that begin with ‘R’. Every word, whatever its initial letter, would be segmented after that letter, as the highest successor count will almost always be for the initial letter of a word, since so many combinations are possible after it.

• A practical distinction is made between gathering prefixes and suffixes; in the previous report no such distinction was made, and ‘affixes’ were gathered more generally.

• To gather suffixes, a simple reversal is needed of the prefix gathering technique. This can

hold true for any technique used to gather the suffixes or vice versa.

• A hybrid approach gave the most accurate result in the evaluation.

• Not a lot of limits seem to have been set in this solution as opposed to the previous report.

2.4 A Simpler, Intuitive Approach to Morpheme Induction (Pitler, E and Keshava,S 2005)

This approach is more suited to Indo-European languages due to their concatenative morphology. There are two bases for this approach:

1. Finding words that appear as substrings of other words

2. Detecting changes in transitional probabilities. This was originally proposed by (Harris, 1955), who stated that peaks would be present when, given a word, one checks what other words in the corpus share the same starting string. Based on this approach, (Hafer and Weiss, 1974) were able to develop an algorithm that achieved 91% precision.

The four stages in this solution are:

1. Building Lexicographic Trees

2. Scoring Morphemes

3. Filtering

4. Segmenting

The first three stages can be thought of as the affix gathering stage.

Building Trees

The algorithm starts off by creating two trees:

• Forward Tree

• Backward Tree -Mirror of the forward tree.

If the alphabet contains b letters and the longest word in the corpus is d letters long, then each tree has a branching factor of at most b and a depth of at most d.

Each node of the tree represents a letter, so any path from the root to a node represents a substring of a word: in the forward tree this is a substring at the start of a word, and in the backward tree it is a substring at the end of a word. Each node contains the frequency of the corresponding substring.

The purpose of these trees is to calculate conditional probabilities: given a substring, they predict the next part of it. For example, the forward tree can be used to gather conditional probabilities of suffixes, i.e. Pr(s | book), which is calculated by dividing the frequency of words starting with ‘books’ by the frequency of words starting with ‘book’. Similarly, the backward tree can be used to calculate Pr(o | oks), which is done by dividing the frequency of words ending with ‘ooks’ by the frequency of words ending with ‘oks’.
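The two probabilities above can be sketched in Java as follows. This is a simplification rather than the authors' implementation: a hash map of leading-substring counts stands in for the lexicographic trees, each distinct word is counted once instead of being weighted by its corpus frequency, and all class and method names, as well as the example word list, are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a map of substring counts replaces the forward and backward trees.
final class TransitionProbabilitySketch {

    /** Counts, for every leading substring, how many words in the list start with it. */
    static Map<String, Integer> prefixCounts(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            for (int end = 1; end <= w.length(); end++) {
                counts.merge(w.substring(0, end), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Estimates Pr(next | prefix) as count(prefix + next) / count(prefix). */
    static double forward(List<String> words, String prefix, char next) {
        Map<String, Integer> counts = prefixCounts(words);
        int base = counts.getOrDefault(prefix, 0);
        return base == 0 ? 0.0 : counts.getOrDefault(prefix + next, 0) / (double) base;
    }

    /** Estimates Pr(previous | suffix) by applying the forward estimate to reversed strings. */
    static double backward(List<String> words, String suffix, char previous) {
        List<String> reversed = new ArrayList<>();
        for (String w : words) {
            reversed.add(new StringBuilder(w).reverse().toString());
        }
        String reversedSuffix = new StringBuilder(suffix).reverse().toString();
        return forward(reversed, reversedSuffix, previous);
    }

    public static void main(String[] args) {
        List<String> words = List.of("book", "books", "booking", "look", "looks", "cook");
        System.out.println(forward(words, "book", 's'));  // Pr(s | book)
        System.out.println(backward(words, "oks", 'o'));  // Pr(o | oks)
    }
}
```

With this word list, three words start with ‘book’ and one with ‘books’, so Pr(s | book) is one third; both words ending in ‘oks’ also end in ‘ooks’, so Pr(o | oks) is 1.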

Scoring

After building the trees, two lists of morphemes are formed: a prefix list and a suffix list. The suffix list is filled by scanning each word in the list in order of increasing substring length, starting from the end of the word. For a substring to be accepted into the suffix list it must fulfil 3 conditions. If the substring ‘xy’ is being considered as a suffix in the word ‘vwxy’, then:

1. ‘vw’ must also be a word in the corpus

2. Pr(w | v) = 1

3. Pr(x | vw) < 1

The first condition is quite logical: suffixes are commonly added onto the ends of words, so for a suffix to be reliable the remainder of the string must exist as a word. For example, if the word being examined is ‘books’ and the suffix being considered is ‘s’, it makes sense to require ‘book’ as a separate word in the list in order to give credibility to the suffix. The second condition implies that the stem (‘vw’) has only one parent, thereby identifying it as a true stem, and the third condition states that there can be multiple children, i.e. different suffixes applied after the stem.

In the same way there are 3 conditions for a candidate prefix to gain entry into the prefix list. They are simply the reverse of the suffix conditions.

If an affix passes the 3 conditions it is given 19 points, and if it fails 1 point is deducted. When every word has been considered, the strings with a positive final score are granted entry into either the prefix or suffix list. The reason for these numbers is that a string only has a positive final score if it passes the tests more than 5 percent, i.e. 1/(1+19), of the times it appears.
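The arithmetic behind the 5 percent threshold can be checked with a minimal Java sketch. The class, constants and method names below are hypothetical; only the reward of 19 and the penalty of 1 come from the description above.

```java
// Hypothetical sketch of the scoring rule: +19 for each pass, -1 for each failure.
final class AffixScoreSketch {
    static final int REWARD = 19;
    static final int PENALTY = 1;

    /** Final score of a candidate affix after all its appearances have been tested. */
    static int score(int passes, int failures) {
        return REWARD * passes - PENALTY * failures;
    }

    /** A candidate enters the affix list only if its final score is positive. */
    static boolean accepted(int passes, int failures) {
        return score(passes, failures) > 0;
    }

    public static void main(String[] args) {
        // 6 passes in 100 appearances (6%): 19 * 6 - 94 = 20, so the affix is accepted.
        System.out.println(accepted(6, 94));
        // 5 passes in 100 appearances (exactly 5%): 19 * 5 - 95 = 0, so it is rejected.
        System.out.println(accepted(5, 95));
    }
}
```

A candidate that passes p times out of n appearances scores 19p - (n - p) = 20p - n, which is positive exactly when p/n exceeds 1/20.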

Filtering

Sometimes, an affix in a list would be a combination of 2 affixes that are also present in the list,

this could give rise to incorrect segmentation later on so the filtering process checks for such

affixes and discards them.
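The filtering step can be sketched in Java as follows. This is an illustrative sketch under the assumption that a composite affix is one that splits into exactly two parts which are both in the same list; the class and method names, and the example suffixes, are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the filtering stage; the example affixes are invented.
final class AffixFilterSketch {

    /** True if the affix can be split into two parts that are both affixes in the list. */
    static boolean isComposite(String affix, Set<String> affixes) {
        for (int split = 1; split < affix.length(); split++) {
            if (affixes.contains(affix.substring(0, split))
                    && affixes.contains(affix.substring(split))) {
                return true;
            }
        }
        return false;
    }

    /** Returns the affixes that are not a combination of two other affixes in the list. */
    static Set<String> filter(Set<String> affixes) {
        Set<String> kept = new LinkedHashSet<>();
        for (String affix : affixes) {
            if (!isComposite(affix, affixes)) {
                kept.add(affix);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> suffixes =
                new LinkedHashSet<>(List.of("s", "ing", "ness", "less", "ings", "lessness"));
        System.out.println(filter(suffixes));
    }
}
```

Here ‘ings’ (‘ing’ + ‘s’) and ‘lessness’ (‘less’ + ‘ness’) are discarded, while the four simple suffixes are kept.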

Segmentation

This phase segments the words using the affix lists. However, an issue of how to segment a word may arise if more than one affix could be applied: for the word ‘quietness’, the question is whether to segment it as ‘quietnes s’ or ‘quiet ness’. This is solved by recalling the conditional probabilities from the tree construction phase and choosing the segmentation with the highest probability.

Points Gained from report:

• Quite a thorough approach; the reliability measures that were put in place, such as the 3 conditions, greatly reduce inaccuracy.

• There is more than one way to segment a word, and one may be more accurate than another.

• Even after affixes have been entered into the appropriate lists there may still be some discrepancies, as described in the Filtering stage.

• Conditional probabilities are perhaps more advanced than successor varieties; could they give a better solution?

2.5 Starting Point

Having reviewed the background literature, I feel a rough prototype should be made as soon as possible. Though the literature has provided a few solutions, I deem them at this stage to be too complex to implement straight away. I feel I need to find my own feet first, though I will be looking closely at the SUMA report as in theory it seems to be simpler than the conditional probability approach. The programming aspect will be quite important, because if I am not able to build the solution then reading all the literature will have been useless. If I start with a rough program that can be used for experimenting, then I can build up from that and implement more complex algorithms.

3. Implementation & Evaluation

3.1 Phase I

The rapid prototyping methodology was applied, so a series of prototypes of increasing sophistication was developed. The details of the implementation and evaluation of each successive stage are presented below, rather than all of the evaluation being placed in a separate chapter.

It is to be noted that the implementation is split into two parts, each containing a set of algorithms. The algorithms in the second set are analyzed in greater depth, as by that point there is more information available for analysis. As the first set of experiments was the starting point, little analysis of it is possible.

The first issue that needs to be looked at is the datasets used. A variety of datasets were used in this implementation. Most are described in the sections of the appropriate algorithms, but it is important to give an outline of them here as it will help in understanding the algorithms.

Implementation and evaluation are presented together because of the rapid prototyping approach that was taken: there was not just one implementation and one evaluation but a constant implementation-evaluation cycle, so it is felt that this gives a better account of the procedure followed.

A brief description of the evaluation is also needed for clarity.

Evaluation Description

Evaluation was done via the Evaluation.perl script obtained from the Morphology Challenge 2005 website. It compares the Results List output by an algorithm to the gold standard file, which is also provided on the website. The gold standard file contains the segmentation that is being sought and against which accuracy results are given.

After comparing, it gives 3 results relating to the suggested file:

Precision: the number of hits divided by the sum of the number of hits and insertions.

A hit is when a boundary placed in a word in the suggested file matches a boundary placed in the same word in the gold standard file, i.e. it is correctly placed.

An insertion is when an incorrect boundary has been placed in a word, i.e. the boundary placed in the suggested file is not present in the gold standard file word.

Recall: the number of hits divided by the sum of the number of hits and deletions. A deletion is when a boundary should be in a word but is not there.

F-Measure: the harmonic mean of precision and recall.
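The three definitions can be sketched in Python as follows. This is not the Evaluation.perl script itself: it is a simplified illustration that assumes each word has exactly one gold standard segmentation and that morphemes are separated by single spaces, and all function names are hypothetical.

```python
def boundaries(segmented):
    """Return the set of boundary offsets in a space-segmented word."""
    offsets, position = set(), 0
    for morph in segmented.split()[:-1]:
        position += len(morph)
        offsets.add(position)
    return offsets


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(suggested, gold):
    """Count hits, insertions and deletions, then derive the three scores."""
    hits = insertions = deletions = 0
    for sug, ref in zip(suggested, gold):
        s, g = boundaries(sug), boundaries(ref)
        hits += len(s & g)          # boundary present in both files
        insertions += len(s - g)    # boundary only in the suggested file
        deletions += len(g - s)     # boundary only in the gold standard
    precision = hits / (hits + insertions) if hits + insertions else 0.0
    recall = hits / (hits + deletions) if hits + deletions else 0.0
    return precision, recall, f_measure(precision, recall)


# One hit ('read ing'), one insertion and one deletion ('re ads' vs 'read s').
print(evaluate(["read ing", "re ads"], ["read ing", "read s"]))  # (0.5, 0.5, 0.5)
# The harmonic mean reproduces the F-measure reported for X4 later in this chapter.
print(round(f_measure(18.73, 36.85), 2))  # 24.84
```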

There are two methods of evaluating the output, Precision and Recall; the F-Measure is the harmonic mean of the two. To keep the evaluation simple and easy to follow, Precision is the result that will be analyzed, although the Morphology Challenge website states that the contestant with the highest F-Measure will be the winner and that, if a tie occurs, the contestant with the highest Precision will win. For this reason the F-Measure is also included in the results but is not analyzed in depth; it serves to illustrate a combination of the two evaluation methods. A point to note is that the evaluation script was not run on the initial algorithms, as it was thought there were not enough grounds to give an accurate result. This may have been due to the lack of prefixes calculated or to the dataset being too small.

Datasets

Definition of Data List: the list of words given on the Morphology Challenge 2005 webpage. The original list is alphabetical and contains a frequency before each word. The size and entries of this list are subject to modification, i.e. samples can be taken from this list and are still referred to as the Data List. This list is used to extract affixes and to check for the presence of words (Implementation Part II).

Definition of Evaluation List in the context of this implementation: the set of words that the segmenting module segments.

The 4 major algorithms: X4, X5, X6, X7.

There were two major Evaluation Lists in this implementation, the first being D1 and the second D2. It is to be noted that segmentation was not done on these Evaluation Lists until mid-way through development, and in particular not for algorithms X1-X3; they were introduced once it was realised that there was no common evaluation list against which to compare the output.

D1

This served as the first comparison of the results obtained from the 4 major algorithms developed. It consisted of the 200 most frequently occurring words, and it was this list that was used at first to see the differences between the outputs of the major 4. (See Appendix 1, Evaluation List D1)
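As an illustration of how such a list could be drawn from the Data List, the Python sketch below assumes each line has the form "frequency word" and returns the most frequent words in alphabetical order. The function name and the toy input are hypothetical; this is not necessarily how D1 was actually produced.

```python
def build_d1(lines, size=200):
    """Pick the `size` most frequent words and return them alphabetically."""
    entries = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit():
            entries.append((int(parts[0]), parts[1]))
    # Highest frequency first, then keep only the top `size` entries.
    entries.sort(key=lambda entry: entry[0], reverse=True)
    return sorted(word for _, word in entries[:size])


sample = ["5 the", "9 of", "1 zeta", "7 and"]
print(build_d1(sample, size=3))  # ['and', 'of', 'the']
```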

D2

Having seen that the evaluation on the D1 dataset was not highly productive (details given further on), this dataset was introduced. It consisted of exactly the words that the evaluation script checks for and scores according to their segmentation. From this it was thought that much more reliable results could be had, though not necessarily more accurate ones. See Appendix 2, Evaluation List D2.

3.1.1 X1

Aim of algorithm: test Segmentation and File I/O procedures.

Sample of Results: Appendix 3, Sample Results X1

Size of Data List: 1000

Size actually used: 1000

Evaluation List: Same as Data List

Pseudo code:

Read in the first 1000 words from the word list and store in array.
Sort array in shortest length to longest order.
For k = 0 : size of array - 1
    String A = array[k]
    For j = k+1 : size of array - 1
        String B = array[j]
        Check if B contains A
        If B contains A:
            Segment B in position of A.length
        Else:
            Do nothing.
Output the segmented array.
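A runnable reading of this pseudo code is sketched below in Python. Because the pseudo code segments B at position A.length, the sketch interprets "B contains A" as "B starts with A"; that interpretation, the function name and the lower-case sample words are assumptions made for illustration.

```python
def segment_x1(words):
    """Split each longer word after any shorter word that it starts with."""
    ordered = sorted(words, key=len)        # shortest to longest
    cuts = {w: set() for w in ordered}      # boundary offsets per word
    for k, a in enumerate(ordered):
        for b in ordered[k + 1:]:
            if len(b) > len(a) and b.startswith(a):
                cuts[b].add(len(a))
    result = []
    for w in ordered:
        pieces, last = [], 0
        for pos in sorted(cuts[w]):
            pieces.append(w[last:pos])
            last = pos
        pieces.append(w[last:])
        result.append(" ".join(pieces))
    return result


print(segment_x1(["reading", "read", "reads"]))
# ['read', 'read s', 'read ing']
```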

The first algorithm that was implemented is known as X1. It was a simple implementation whose basis was to check all words in the Data List against each other. At this initial stage the Data List was set to 1000 words, irrespective of numerical frequency or alphabetical order. The first 1000 words were read into an array, and that array was sorted from shortest to longest word. This was done because the algorithm assumes that a shorter word will be part of a longer word, e.g. if the word list was:

Read

Reads

Reading

Then it makes sense to compare Read with Reads and Reading and to segment after Read in each word, and not to compare Reading with Reads or Read. Arranging the list in this way simplified matters, as the program can work through the list systematically once.

This prototype was also made to test the segmentation procedure, to get an idea of how the words could be segmented. In my opinion the rest of the algorithms, or at least their mechanics, are based on this first prototype to quite an extent, even if only implicitly. It was the first stepping stone. An important point is that in algorithms X1-X4 the Evaluation List (the one that is segmented) is the same as the sample taken from the Data List, so in the case of X1 the words that are segmented are the first 1000 (see Appendix 3, Sample Results X1). The sample of the results shows that there is a lot of segmenting, quite randomly spread out, but the point of running this algorithm was to get an initial idea of how things would run.

Evaluation: too early

3.1.2 X2

Aim of Algorithm: Test SUMA algorithm.

Sample Results: Appendix 4, Sample Results X2

Size of initial Data List: 1000

Size actually used: 1000

Evaluation List: Same as Data List.

Prefixes found: 3

The second algorithm that was looked at was the one described in the SUMA report (Dang, M and Choudri, S. 2005). This algorithm is based predominantly on successor varieties. Each substring of a word is considered in increasing order of length, the substrings are compared and the differences in the following letter are taken into account, so for example:

Read

Reads

Reading

The system would first compare the substring ‘R’ of Read with the substring ‘R’ of Reads. For the purpose of reliability, substrings of fewer than 2 letters do not count, so once the substring has grown to ‘Re’ in both words the system moves on to compare the next letter of the two words; if they are different then the system increases the successor count for that particular substring ‘Re’. Once it has checked that substring against ‘Reads’ it moves on to the next word in the list to compare the same substring, ‘Re’. This continues until all entries have been checked. It then comes back to the first word and extends the substring so that it becomes ‘Rea’, and repeats the process, counting the differences and excluding the similarities. Once it has finished with the whole of the first word it continues on to the next word, and so on.

After this the program calculates the maximum successor value for each word and adds the corresponding substring to the ‘prefix’ list. This list is then run against the Evaluation List, which is segmented at every occurrence of a prefix entry.

So to review, X2 was applied to 1000 words irrespective of order or word frequency. The output (Appendix 4, Result List of X2) was interesting to see as it was the first application of the SUMA algorithm. It listed the prefixes detected, ‘aa, ab, ac’, and then simply segmented the 1000 words wherever these strings occurred. This gave an indication that the algorithm was indeed working, but the prefixes detected did not seem very prefix-like and only 3 were detected. The reason for this was observed to be the sample Data List: 1000 words were far too few to give a reliable calculation of prefixes, so for the next run of the algorithm the sample set was increased.

The pseudo code for this algorithm is shown below; the differences between the versions are highlighted in bold:

Pseudo Code:

Read in first 1000 words from word list, save in array.
Remove the frequency in front of words, leave just the word on each line.
Sort the array from shortest to longest length order.
Run the comparison module, storing successor varieties for each part of a word.
Calculate maximum successor variety for each word.
Add to Affix array.
Run segmentation module based on Affix array, segmenting where a word from the Affix array appears.
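The prefix gathering part of these steps can be sketched in Python as below. The sketch uses the usual definition of successor variety (the number of distinct letters seen after a word-initial substring), which may differ in detail from the pairwise comparison described in the text; the minimum substring length of 2, the function names and the sample words are assumptions.

```python
from collections import defaultdict


def successor_varieties(words, min_len=2):
    """Map each word-initial substring to the set of letters seen after it."""
    followers = defaultdict(set)
    for word in words:
        for i in range(min_len, len(word)):
            followers[word[:i]].add(word[i])
    return followers


def gather_prefixes(words, min_len=2):
    """For each word keep the substring with the highest successor variety."""
    followers = successor_varieties(words, min_len)
    prefixes = set()
    for word in words:
        candidates = [word[:i] for i in range(min_len, len(word))]
        if candidates:
            best = max(candidates, key=lambda s: len(followers[s]))
            prefixes.add(best)
    return prefixes


words = ["read", "reads", "reading", "reader", "real"]
# 'read' is followed by s, i and e; 'rea' by d and l; 're' only by a.
print(sorted(gather_prefixes(words)))  # ['rea', 'read']
```

Even on this toy list a non-morpheme such as 'rea' is picked up, which hints at the kind of noise discussed in the evaluation.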

Evaluation: Not enough grounds to run script; dataset far too small, larger datasets need to be considered to give a greater variety of prefixes.

3.1.3 X3

Aim of Algorithm: Increase Data List to achieve improved accuracy in segmentation.

Sample Results: Appendix 5, Sample Results X3.

Size of initial Data List: 100,000

Size actually used: 100,000

Number of Prefixes found: -

Evaluation List: Same as Data List

Pseudo code:

Read in first 100,000 words from word list, save in array.
Remove the frequency, leaving just the word on each line.
Sort array from shortest to longest order.
Run comparison module, store successor varieties for each part of a word.
Calculate maximum successor variety for each word.
Add to Affix array.
Run segmentation module based on Affix array, segmenting where a word from the Affix array appears.

This algorithm was identical to X2 except that the dataset was increased to 100,000 words. In theory this would give the prefix gathering module a much wider selection to work with and hence a more reliable calculation of prefixes, which would lead to more accurate segmentation. The segmenting module was kept the same as in X2.

The algorithm was run on the first 100,000 words from the Data List, which once read in were left unsorted apart from being ordered from shortest to longest word, and the results were observed (Appendix 5, Sample Result X3). Quite a lot of prefixes were generated, in fact thousands. Some prefixes were really strange, but what was even stranger were the words: there were clear discrepancies in the actual words, i.e. they were not proper words (Appendices 4 & 5). At this point it came to light that the corpus from which this word list was drawn was itself drawn from a number of sources. This was not known before, and it was established that the word list did indeed contain words that could be considered typos, which resulted in unnecessary noise and made comparisons unclear and difficult.

So at this point it was decided to try to filter out some of the ‘noise’ caused by inaccurate entries in the word list so that a better comparison could be made.

Evaluation: Not enough grounds: words segmented all over the place, no way of comparing,

unsorted, very messy, needs tidying up. Data sample good but too much noise!

3.1.4 X4

Aim of Algorithm: Reduction of noise words in the Data List.

Sample of Results: Appendices 6a, 6b, 6c

Size of initial dataset: 100,000

Size of data actually used, i.e. frequency >1: 62,261

Number of Prefixes found: 2476

Evaluation List: D1 & D2

Pseudo code:

Read in first 100,000 words from Data List, save in array.
Extract the frequency.
Compare frequency with the ASCII value 49, i.e. 1.
If frequency > 49 (1):
    Store corresponding word in new array
Else:
    Ignore the word, do not store in new array
Use new array for prefix gathering.
Sort new array from shortest to longest order.
Run comparison module, store successor varieties for each part of a word.
Calculate maximum successor variety for each word.
Add to Affix array.
Run segmentation module based on Affix array, segmenting where a word from the Affix array appears.
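The thresholding step at the start of this pseudo code could look like the Python sketch below. Note that it parses the frequency as an integer rather than comparing ASCII values as the pseudo code does, so it is an alternative formulation and not the original implementation; the "frequency word" line format, the function name and the sample lines are assumptions.

```python
def filter_by_frequency(lines, threshold=1):
    """Keep the words whose corpus frequency is strictly above the threshold."""
    kept = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        frequency, word = int(parts[0]), parts[1]
        if frequency > threshold:
            kept.append(word)
    return kept


sample = ["12 reading", "1 raeding", "3 reads", "11 read"]
print(filter_by_frequency(sample, threshold=1))   # ['reading', 'reads', 'read']
print(filter_by_frequency(sample, threshold=10))  # ['reading', 'read']
```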

This algorithm is similar to X3, but it is the first run in which a threshold was introduced, as it was observed in the previous run that there were too many inaccurate words. In previous algorithms the word frequency was ignored, but the threshold relies on the word frequency, hence the algorithm had to be modified to allow for this change. The sample set was kept at 100,000 words.

Running this significantly decreased the number of prefixes detected, which can be taken as a positive effect: the fewer the prefixes, perhaps the greater the reliability, but a compromise has to be sought.

However, from a sample of 100,000 words there were still well over 2000 prefixes detected (Appendix 6a, Sample of Prefixes X4). It was hard to make any comparison with the output from X3, though, because the words were not in any kind of order. It was decided that, to make comparisons between the prefixes detected and the segmentation of the words easier, some sort of order would have to be imposed, the obvious choice being alphabetical order.

Evaluation on D1: Not done. D1 is not large enough to run the script on: it has only 200 words, most of them single morphemes, so running the evaluation script on it would not be constructive. The list was segmented, however; see Appendix 6b.

Evaluation on D2 (Appendix 6c):

Number of words in gold standard: 532 (type count)
Number of words in data set: 532 (type count)
Number of words evaluated: 531 (99.81% of all words in data set)
Morpheme boundary detections statistics:
F-measure: 24.84%
Precision: 18.73%
Recall: 36.85%

This was the output from running the evaluation Perl script. It shows a very low F-measure and Precision. This can be attributed to the fact that so many prefixes were detected and also that the algorithm segments each word at every prefix that is detected within it.

3.1.5 X5

Aim of Algorithm: Further reduction in noise

Sample of Results: Appendices 7a, 7b, 7c

Size of initial dataset: 100,000

Size of data actually used, i.e. frequency >10: 25,781

Number of Prefixes found: 199

Evaluation List: D1 & D2

Pseudo code:

Read in first 100,000 words from word list, save in array.
Extract the frequency.
Compare frequency with the ASCII value 97, i.e. 10.
If frequency > 97 (10):
    Store corresponding word in new array
Else:
    Ignore the word, do not store in new array
Use new array for prefix gathering.
Sort new array from shortest to longest order.
Run comparison module, store successor varieties for each part of a word.
Calculate maximum successor variety for each word.
Add to Affix array.
Run segmentation module based on Affix array, segmenting where a word from the Affix array appears.

This algorithm was developed and run in parallel with X4. The difference between X4 and X5 is that for X5 the threshold was increased to >10, i.e. only words that occurred more than 10 times were used for prefix extraction. This was done to see how different the results of the two sets would be; if there was not a great difference then X5 could be used as the new standard to save computational effort. X5 detected 199 prefixes, and a look over the results (Appendix 7) shows that there was not a lot of difference in the actual segmentation. Again, though, comparison was extremely difficult, as the prefix list was not sorted at this stage and the datasets were not the same, i.e. the words segmented in X4 were those that occurred more than once, whereas the words segmented in X5 were only those that occurred more than 10 times. Because of this discrepancy a common dataset was needed to make a good comparison possible, along with a sorted list of prefixes, as there were so many.

Evaluation on D1: Not done, as it would be unproductive, although the list was segmented; see Appendix 7b.

Evaluation on D2 (Appendix 7c):

Number of words in gold standard: 532 (type count)

Number of words in data set: 532 (type count)

Number of words evaluated: 531 (99.81% of all words in data set)

Morpheme boundary detections statistics:

F-measure: 24.51%

Precision: 18.33%

Recall: 36.98%

There does not seem to be much difference between the results of X4 and X5, therefore X5 will be used as the standard for any further development. This saves a lot of computational effort, as the Data List for X5 (25,781 words) is roughly 59% smaller than the Data List for X4 (62,261 words).

D1

The above descriptions of the Evaluation and Data Lists may have seemed confusing. Although they are included above, the runs on the Evaluation Lists D1 & D2 were done after the algorithms had been developed. Algorithms X1-X4 originally segmented an Evaluation List that was the same as the Data List used to gather prefixes.

The reason for creating D1 was that initially the algorithms segmented the same input set that they took in, and by the time algorithm X4 had been developed this amounted to a good 100,000 words or, even with the threshold, tens of thousands. Also, the sample taken in by X4 would not necessarily be the same as the sample segmented by X5, so comparison between them was very difficult.

It was clear that there needed to be a common dataset, and D1 was introduced (Appendix 1, Evaluation List D1).

It would make comparison a lot easier, as the dataset would firstly be in alphabetical order and secondly would not span several pages as the previous Evaluation Lists did, but could fit onto a single sheet each.

The evaluation script, however, was not run on the results from D1. This was because it was seen from the results that D1 was not a suitable sample for the evaluation script and would not give a fair account of the accuracy of the algorithms.

This is due to the D1 dataset comprising only 200 words, the majority of which are single morphemes themselves, so segmenting them would be unproductive and would not really shed light on the capacity of the algorithm.

This is what prompted the introduction of D2. As D2 contains exactly the same words as the gold standard file, the evaluation script was run on it.

3.1.6 X6

Aim of Algorithm: Gather Suffixes

Sample of Results: Appendices 8a, 8b, 8c, 8d

Size of initial dataset: 100,000

Size of data actually used, i.e. frequency >1: 62,261

Number of Suffixes found: 3338

Evaluation Lists: D1 & D2

Pseudo Code:

Read in first 100,000 words from word list, save in array.
Extract the frequency.
Compare frequency with the ASCII value 49, i.e. 1.
If frequency > 49 (1):
    Store corresponding word in new array
Else:
    Ignore the word, do not store in new array
Reverse each word in the array.
Use Reversed array for prefix gathering.
Sort new array from shortest to longest order.
Run comparison module, store successor varieties for each part of a word.
Calculate maximum successor variety for each word.
Add to Affix array.
Run segmentation module based on Affix array, segmenting where a word from the Affix array appears.
Re-reverse the output dataset so evaluation script can be run.

This algorithm followed the same steps as algorithm X4, but it served as a means of gathering predecessor variety values rather than successor variety values, i.e. of determining suffixes (Appendix 8a). Given this goal, it was the reversal of the X4 algorithm, and a threshold was again set so that suffixes were gathered from words occurring more than once in a Data List of 100,000 words.

This was applied to the D1 Evaluation List, i.e. the sorted 200 most frequently occurring words, and also to D2. However, as mentioned above, the evaluation script was only run on the output from D2 (Appendix 8d).

A key point is that the same affix gathering technique was used for the predecessor varieties as for the successor varieties. This was made possible by reversing each word in the array and then gathering the successor varieties, which in effect were the predecessor varieties. This is the reason why ‘successor variety’ has not been changed to ‘predecessor variety’ in the pseudo code. The reversed output of the algorithm is also available; see Appendix 8c.
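The reversal trick can be illustrated with the self-contained Python sketch below. It pairs a compact successor variety gatherer with a wrapper that reverses the words before gathering and reverses the results afterwards; the function names, the minimum length of 2 and the sample words are assumptions, and the gatherer is a simplified stand-in for the affix gathering module.

```python
from collections import defaultdict


def max_variety_prefixes(words, min_len=2):
    """Per word, keep the initial substring followed by the most distinct letters."""
    followers = defaultdict(set)
    for w in words:
        for i in range(min_len, len(w)):
            followers[w[:i]].add(w[i])
    best = set()
    for w in words:
        candidates = [w[:i] for i in range(min_len, len(w))]
        if candidates:
            best.add(max(candidates, key=lambda s: len(followers[s])))
    return best


def gather_suffixes(words, min_len=2):
    """Reverse every word, gather 'prefixes', then reverse them back into suffixes."""
    reversed_words = [w[::-1] for w in words]
    return {p[::-1] for p in max_variety_prefixes(reversed_words, min_len)}


# Reversed, every word starts with 'gni', which is followed by k, d and s.
print(sorted(gather_suffixes(["walking", "talking", "reading", "sing"])))  # ['ing']
```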

D1 evaluation: not done, results of segmentation (Appendix 8b).

D2 evaluation (Appendix 8d):

Number of words in gold standard: 532 (type count)

Number of words in data set: 532 (type count)

Number of words evaluated: 532 (100.00% of all words in data set)

Morpheme boundary detections statistics:

F-measure: 28.53%

Precision: 18.93%

Recall: 57.92%

There is a slight improvement in the F-Measure in this evaluation (28.53% against 24.84% for X4), which comes from the higher recall (57.92% against 36.85%); the precision remains similar. There is no drastic change.

3.1.7 X7

Aim of Algorithm: Gather reliable suffixes.

Sample of results: Appendices 9a, 9b, 9c

Size of initial dataset: 100,000

Size of data actually used, i.e. frequency >10: 25,781

Number of Suffixes found: 373 (Appendix 9a)

Pseudo code:

Read in first 100,000 words from word list, save in array.
Extract the frequency.
Compare frequency with the ASCII value 97, i.e. 10.
If frequency > 97 (10):
    Store corresponding word in new array
Else:
    Ignore the word, do not store in new array
Reverse each word in the array.
Use Reversed array for prefix gathering.
Sort new array from shortest to longest order.
Run comparison module, store successor varieties for each part of a word.
Calculate maximum successor variety for each word.
Add to Affix array.
Run segmentation module based on Affix array, segmenting where a word from the Affix array appears.
Re-reverse the output dataset so evaluation script can be run.

This algorithm was the same as X6 but with a threshold of >10, i.e. only words that occurred more than 10 times were inspected. It can be said that X6 and X7 had the same purpose as X4 and X5, the only difference being that they were to gather suffixes and not prefixes.

This was applied initially to the D1 dataset for comparison, but the evaluation script was only run on the D2 dataset.

D1 evaluation: not done. Sample of Results (Appendix 9b)

D2 evaluation (Appendix 9c):

Number of words in gold standard: 532 (type count)

Number of words in data set: 532 (type count)

Number of words evaluated: 532 (100.00% of all words in data set)

Morpheme boundary detections statistics:

F-measure: 28.13%

Precision: 18.70%

Recall: 56.75%

Again the F-measure hardly changes when the threshold is raised (28.13% for X7 against 28.53% for X6), in a similar way to the prefix algorithms (X4 & X5), and there is the same behaviour in terms of affix reduction: a lot fewer suffixes are detected when the threshold is increased to 10 (remember that the threshold is set and increased to filter out noise and inaccurate words from the initial dataset). As the evaluation results show that the accuracy remains largely unchanged, any further development will be done using the prefix and suffix lists generated from X5 and X7 (Appendices 7 & 9), since they reduce the computational effort and produce similar results. There would be no point in using algorithms X4 and X6 as it would be time consuming.

3.2 Summary of Evaluations

All 4 algorithms were compared on the dataset D1. It was seen that X4 and X6 gathered far more prefixes and suffixes than X5 and X7, which is directly linked to the frequency threshold. There was not a lot of difference between the segmentation of X4 and X5, and similarly between the segmentation of X6 and X7.

An interesting observation is that in algorithms X4 and X6 the affixes gathered went up to 3 letters, whereas in X5 and X7, with the greater filtering threshold, the affixes went up to 2 letters. It would be interesting to investigate whether that could be the maximum number of letters required to segment to a reasonably accurate extent.

Also, in all algorithms a check was set to disallow any affix with a length of less than 2, which serves as a filtering mechanism for noise. However, this means that the apostrophe character ‘'’, which is an obvious separator, is not taken into the affix array. Perhaps a compromise has to be sought between allowing affixes of fewer than 2 characters, but only if they occur with a very high frequency, and allowing a greater chance of noise.

To make a better evaluation, the evaluation script from the Morphology Challenge website was run, and the results gave an indication of the accuracy of each method.

The comparison made on the D1 dataset was clearer than previous comparisons, but there was not a lot of breadth, i.e. the 200 most frequently occurring words were mostly single morphemes themselves, so segmenting them did not give a useful result: any segmentation would be incorrect, as the words were already in their smallest morphemic form.

So to improve this analysis the same 4 algorithms were applied to dataset 2, D2.

D2

This dataset consisted of the same words that are used in the evaluation program. It was thought a good idea to use them as there would be diversity in the type of words, i.e. there would be not just small words but also larger words that would illustrate the proper segmenting limits of the algorithms. Also, from this dataset a more reliable evaluation result could be had. This is because the evaluation dataset consists of 532 words, and the script looks through the dataset being compared to it for those 532 words and gives accuracy results on how those words have been segmented. In initial algorithms like X1 and X2 the result of such an evaluation would be quite low because the dataset was only 1000 words, but it would definitely improve with X3 as the dataset was increased to 100,000. For algorithms X4 and X5 the improvement in the evaluation result would depend on the thresholds used, as the sample is also 100,000 words. The dataset D1 would not be much help for the evaluation as it has few words in common with the evaluation dataset, but the dataset D2 is as precise as possible since all the words match.

Appendix C shows the summary of the 4 major algorithms.

Only the 4 major algorithms are included in the summary table as they were seen to be the fittest of all.

The percentage of affixes gathered does seem to be slightly higher for the suffixes; perhaps that explains the slightly higher F-Measure found for them.

The values within each prefix and suffix pair are quite similar, hence the decision to base any improvements on X5 and X7. The reason can be illustrated by looking at the size of the dataset that is used for affix extraction in the >1 sets. It is more than double that of the >10 sets, so using the latter would cut down computation by at least half.

Appendix D shows the different components of the program and the general flow.

The diagram is a general one to illustrate the workings of the algorithms implemented so far. The Input class is called for all file input, so the Evaluation List and the Data List are read in through it. They are then filtered if need be (algorithms X4-X7); when gathering suffixes, the array can also be passed on to the Reversal class, which simply reverses a given array list.

Then the AffixGather class gathers either prefixes or suffixes depending on which algorithm is being run. The affix list is then trimmed to get rid of any trailing or leading spaces, which makes sure that no extra spaces have been added to the affixes. The Segmentation class then takes in the trimmed affix list and also the Evaluation List. It carries out the segmentation and can pass the result back for trimming, which double checks the entries for any whitespace. The two arrays are then passed on to the Output class, which writes them to a file. The diagram does not exhaustively include the variables, but it gives an outline of how the algorithm is implemented. Putting everything into different classes allows the re-use of code, which makes programming easier. The diagram does not imply that the Reversal class can only be called after filtering: many times in implementing the above algorithms I have had to pass files to the Reversal class without filtering. Every class is accessible independently; the diagram shows only the general flow of the algorithms.

3.2.1 Issues Encountered

A number of issues have been encountered so far during this implementation. Apart from the many programming bugs that had to be solved to give accurate results, one significant issue was getting the evaluation script to work.

When first run, the evaluation script would skip some words in the suggested file. After the ‘-trace’ argument was added on the command line, it showed the output for each word in terms of hits, deletions and insertions. It kept giving an error to the effect of:

Error (evaluation .perl): Mismatch in string Comparison. Char t vs. .

This caused a lot of problems until it was realised that the cause was extra spaces left by segmentation at the end of words, so all the word lists had to be trimmed to remove trailing and leading spaces. After this the evaluation script ran fine. This point should perhaps be made on the Morphology Challenge website, if it is not there already.
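The trimming step described above can be sketched as follows. This is a minimal Python illustration rather than the project's own code; the function name `trim_entries` and the decision to drop blank lines are assumptions made for the example.

```python
def trim_entries(lines):
    """Strip leading/trailing whitespace from each entry and drop empty entries."""
    trimmed = []
    for line in lines:
        entry = line.strip()  # removes spaces, tabs and newlines at both ends
        if entry:             # assumption: a blank line carries no word, so skip it
            trimmed.append(entry)
    return trimmed


# A segmented word with a stray trailing space, as segmentation could produce.
print(trim_entries(["action 's ", " unit y\n", ""]))
```

Only whitespace at the ends is removed, so the space that marks a segmentation inside a word is left untouched.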

Another confusing problem arose when X6 was reached. To extract the suffixes, the Evaluation List and the Data List had to be reversed. I did not realise that the suffixes that were output were themselves in reverse order, and I then used these suffixes to segment the Evaluation List the right way around. To correct this the procedure had to be repeated: all three lists had to be reversed, and the segmentation also had to be carried out with the Evaluation List and Suffix List still reversed. Only at the end could the output Result List be reversed to show the results.
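The corrected procedure can be sketched in Python as below. This is an illustration under stated assumptions, not the project's Reversal class: the names `reverse_each` and `split_suffix` are hypothetical, and the matching is simplified to a plain "ends with this suffix" test.

```python
def reverse_each(items):
    """Reverse every string in a list (the order of the list is unchanged)."""
    return [item[::-1] for item in items]


def split_suffix(word, suffixes):
    """Split one suffix off `word`, working on reversed strings throughout."""
    rev_word = word[::-1]
    for rev_suffix in reverse_each(suffixes):
        # A suffix of the word is a prefix of the reversed word.
        if rev_suffix and rev_word.startswith(rev_suffix) and len(rev_word) > len(rev_suffix):
            rev_segmented = rev_suffix + " " + rev_word[len(rev_suffix):]
            return rev_segmented[::-1]  # reverse back only at the very end
    return word


print(split_suffix("weighing", ["ing"]))
```

Mixing a reversed suffix list with a non-reversed word list, which is the mistake described above, would make the `startswith` test compare strings written in opposite directions.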

There was also the issue of the time the algorithm took to filter the Data List, gather the affix lists and then segment. At first everything was in one class and a run took a very long time, on the order of hours. To solve this problem the different parts of the program were called separately, and once the Data List had been filtered and saved to a file it could be re-used, so there was no need to run the whole program again. For example, the Data List for X5 was the same as the Data List for X7, so there was no need to gather it again.
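The idea of filtering once and re-using the saved file can be sketched as follows. This is a hedged Python illustration only: the function name `load_or_build_data_list`, the file paths and the assumed input format (one `frequency word` pair per line) are assumptions for the example, not details taken from the project code.

```python
from pathlib import Path


def load_or_build_data_list(raw_path, cache_path, threshold=10):
    """Return words with frequency > threshold, re-using a saved file if present."""
    cache = Path(cache_path)
    if cache.exists():
        # The filtered list was saved by an earlier run, so skip the slow step.
        return cache.read_text(encoding="utf-8").split()

    words = []
    for line in Path(raw_path).read_text(encoding="utf-8").splitlines():
        parts = line.split()
        # Assumed line format: "<frequency> <word>"; malformed lines are ignored.
        if len(parts) == 2 and parts[0].isdigit() and int(parts[0]) > threshold:
            words.append(parts[1])

    cache.write_text("\n".join(words) + "\n", encoding="utf-8")
    return words
```

A second call with the same `cache_path` returns the saved list without reading the raw word list again, which is the saving described above.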

3.3 Implementation/Evaluation Phase II

From this point onwards it is important to note the following assumptions for the remaining algorithms:

1. Three word lists are input into each algorithm.

a. The Evaluation List. This is the same as the D2 dataset: the 532 words that the evaluation script looks for in the results list. This is the list to be segmented, and the success of an algorithm is judged by how well it segments these words.

b. The Data List. This is the collection of words from the original Data List obtained from the Morphology Challenge website, restricted strictly to words with a frequency greater than 10.

c. The Affix List. This is either the Prefix List or the Suffix List, depending on which algorithm is being run.

2. Previously, the algorithms X5 and X7 were used not only for affix extraction but also for segmentation. This is no longer the case: they have been adapted to create only the Data List and the Affix Lists. They do not need to be run more than once, as running either one produces the Data List and running both produces the Affix Lists, which can then be re-used. This saves a lot of time and computational effort.

3. The upcoming algorithms isolate the segmentation module and work on constructing a good knowledge base, which is why the segmentation section has been removed from the adapted code for X5 and X7. Previously there was no knowledge base that could differentiate between a good segmentation and a bad one; the algorithm simply added a segmentation wherever an affix was detected, using every affix present, which resulted in many segmentations on a single word.

4. The Prefix and Suffix Lists used for the upcoming algorithms are not the ones generated by X5 and X7. A serious error was noticed in those lists, which is explained in the Summary of Evaluations section (4.1). The lists can be viewed in Appendix 17 for prefixes and Appendix 17b for suffixes.

The aim of this section is to develop a knowledge base that can differentiate between candidate segmentations, so that more accurate segmentations can be attained.

As the algorithms X5 and X7 had the highest results, the prefixes and suffixes they gathered are used as the basis. In effect, the algorithms described below re-use code from X5 and X7 but have a more comprehensive segmentation module.

Diagram to illustrate this:

Each algorithm takes 3 files as input: the evaluation word list (the words that will be segmented), the data list (all words that occurred more than 10 times in the corpus, obtained by running algorithm X5) and an affix list.

So for example:

Appendix E.A shows the running of the XX1 algorithm, with its input and output lists.

This is a typical example of the first algorithm, XX1. It takes in the Prefix List and the Data List, which was derived using X5, together with the Evaluation List, and its role is segmentation. The remaining algorithms differ only in whether the Suffix List is input rather than the Prefix List, and in the process of segmentation. The evaluation script is run on each Evaluation Result List and the scores are recorded for analysis.

A very important point is that when an algorithm such as XX1 is referred to, it should be thought of as a whole algorithm in its own right, even though it has adapted code from X5 or X7. It is important not to be misled by the diagram into assuming that XX1 is just the segmentation part; the whole diagram is XX1, although the main reason for its creation is to enhance segmentation. The purpose of the diagram is to make the input and output clear.

This clarity matters because when the algorithms X4-X7 were evaluated in Table 1 they were treated not as segmentation modules but as complete algorithms. The upcoming algorithms should be thought of in the same manner: they are combinations of adapted old algorithms and new ones.

3.3.1 Summary of Development

Appendix E shows the incremental stages of each algorithm and how each one relies on the others.

This diagram shows the process of developing the algorithms. They are all based on one another; each is a fine-tuning of the one before. The final algorithm, XX7, is a combination of XX5 and XX6.

The algorithms are explained in a similar structure to before, but the results are examined more closely and any Global and Local Issues noticed are stated. Global Issues are those thought to affect all of the algorithms, and Local Issues are those thought to be specific to the algorithm in question. A comment is also made on whether the algorithm was an improvement.

3.3.2 XX1

Result List: Appendix 10, Results of XX1

Pseudo Code:

Read in Data list
Read in Evaluation list
Read in Prefix list

For int i = 0; i < words.length; i++
    For int j = 0; j < prefixes.length; j++
        If words[i].contains(prefixes[j])
            Int prefixStart, prefixEnd
            Suggested segment = words[i].substring(0, prefixStart)
                + words[i].substring(prefixStart, prefixEnd) + " "
                + words[i].substring(prefixEnd)
            String beforePref
            If datalist.contains(beforePref)
                Count += 1
            Else
                Count -= 1
            Store prefix score
            Store suggested segment
        Else
            Do nothing
    End loop

    Find Max prefix score
    If Max > 0
        Add corresponding segment to outputArray
    Else
        Add original word to outputArray
End loop

This pseudo code explains the function of algorithm XX1. The 3 lists are read in, and the algorithm then iterates over the Evaluation List, checking whether any of the entries from the Prefix List are present in each word. For every prefix that occurs, it stores the prefix's starting and ending positions in the word and from these creates a suggested segment string. It then divides the suggested string into 2 substrings: the substring before the segmentation and the substring after it.

N.B. The term segmentation refers to the location where the “ ” string, i.e. a space, is added.

The substring before the segmentation is known as beforePref and is checked for presence in the Data List. Depending on whether it is found, the count value is either increased or decreased by 1. At the end of the loop all the suggested segmentations are stored with their corresponding count values. The count array is then checked for the maximum count value, which indicates the best prefix for that word. The prefix is only accepted if this maximum is greater than 0, which ensures that the beforePref substring is present in the Data List and has therefore been given +1. If no count value greater than 0 is found, the original non-segmented word is added to the result.

Otherwise the suggested segment corresponding to that prefix is used as the final segmentation. After this the algorithm moves on to the next word in the Evaluation List.

The purpose of these enhancements was to increase the integrity of the segmentation process. The different prefixes that applied to a word could be seen, and a count value was used to measure the accuracy of each. This made the segmentation more reliable than simply segmenting at every prefix detected in a word, as was done in the previous algorithms.
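The XX1 scoring described above can be sketched in Python as follows. This is a minimal illustration, not the project's implementation: the function name `segment_xx1`, the per-suggestion score starting at 0, the tie-breaking rule (the first best suggestion wins) and the final `strip()` are all assumptions made for the example.

```python
def segment_xx1(word, prefixes, data_words):
    """Return the best single split of `word`, or `word` unchanged if none scores > 0."""
    best_score, best_segment = 0, word
    for prefix in prefixes:
        start = word.find(prefix)      # first occurrence anywhere in the word
        if start == -1:
            continue                   # prefix not present: do nothing
        end = start + len(prefix)
        before_pref = word[:end]       # everything to the left of the proposed space
        score = 1 if before_pref in data_words else -1
        if score > best_score:         # assumption: the first best suggestion wins ties
            best_score = score
            # strip() guards against a trailing space when the split falls at the word end
            best_segment = (word[:end] + " " + word[end:]).strip()
    return best_segment


print(segment_xx1("aeroplanes", ["ae", "aero"], {"aero"}))
```

Adding a non-word such as "ae" to `data_words` changes the result to "ae roplanes", because only the left-hand substring is checked.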

Appendix E1 shows the algorithm XX1 running on a word:

The prefix column shows all the prefixes found in the word, and the suggestion column shows what the suggested segment would be if the corresponding prefix were used. The count is based on whether the substring to the left of the segmentation (beforePref) is in the Data List.

The program selects the segment ‘Aero planes’. Although this is not perfect compared to the gold standard, as there is a space missing after the ‘e’, there is a correct segmentation between ‘Aero’ and ‘planes’. The second required segmentation, after ‘Aeroplane’, has also been detected and given a count score of 1, which implies that the correct segments are being detected. However, the first suggestion picks up an incorrect segmentation, because ‘Ae’ is found in the Data List even though it is not a valid word. This adds noise to the results, and the problem is not isolated to this word.

To enhance this algorithm, a further test should check both parts of a word, which would add to the authenticity of a segmentation. For example, ‘Ae’ may be picked up as a word, but it is highly unlikely that ‘roplanes’ will be, so the count value should decrease and the suggestion would in effect not be picked as a possible segmentation.

The same goes for most of the words segmented after the first two letters (Appendix 10, Results of XX1), such as words with the prefixes ‘se’ and ‘so’. In fact, most of the words in this output are segmented after the first two letters simply because those two letters have been detected as words in the Data List.

Appendix E2 highlights another problem that was experienced:

This table shows the example of the word ‘agreeably’, for which, according to the semantics of the algorithm, the correct segmentation should have been chosen. However, although the correct segmentation was detected, it was given a score of -1 because the word ‘agree’ is not present in the Data List. As a reminder, the Data List comprises all words that had a frequency greater than 10 in the corpus. This frequency threshold should have been a good means of ensuring the integrity of the Data List, but this example and the previous one begin to suggest that the dataset is perhaps not as reliable as would have been expected.

The results gained from running the evaluation script on the output were:

Number of words in gold standard: 532 (type count)
Number of words in data set: 532 (type count)
Number of words evaluated: 531 (99.81% of all words in data set)
Morpheme boundary detections statistics:
F-measure: 21.74%
Precision: 28.54%
Recall: 17.56%

The results are an improvement on the previous algorithm runs, which had lower precision (Table 1). This can be attributed to the added complexity of the segmentation stage.
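The three figures reported by the evaluation script are related: the F-measure is the harmonic mean of precision and recall, F = 2PR / (P + R). The short Python check below reproduces the XX1 figure from its precision and recall; the function name `f_measure` is chosen for the example, and small differences in the last digit are possible elsewhere because the reported precision and recall are already rounded.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# XX1: precision 28.54%, recall 17.56%
print(round(f_measure(28.54, 17.56), 2))
```

Because the harmonic mean is pulled towards the smaller of the two values, a high precision cannot compensate for a low recall.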

Improvements gained from XX1:

1. Considers different suggested segmentations and gives each a score based on the presence of the first part of the word, so the algorithm is based on a more comprehensive evaluation of the segmentations than before.

2. Improvement in precision.

Global Issues highlighted from XX1:

1. Dataset contains entries that are not real words.

2. Dataset is missing perfectly good English words, which hinders this algorithm and any other that checks for the presence of words in the Data List. A larger sample of reliable words would definitely help.

Local Issues highlighted from XX1:

1. Only considers the first part of the word. This part may be found in the Data List without being an actual word, in which case it still receives a count value of +1 and the segmentation is accepted.

2. Does not consider more than one segmentation at a time.

3. Does not consider the substring after the segmentation (afterPref).

3.3.3 XX2

Result List: Appendix 11, Results of XX2

Pseudo Code:

Read in Data list
Read in word list
Read in Prefix list

For int i = 0; i < words.length; i++
    For int j = 0; j < prefixes.length; j++
        If words[i].contains(prefixes[j])
            Int prefixStart, prefixEnd
            Suggested segment = words[i].substring(0, prefixStart)
                + words[i].substring(prefixStart, prefixEnd) + " "
                + words[i].substring(prefixEnd)
            String beforePref, afterPref
            If datalist.contains(beforePref)
                Count += 1
            Else
                Count -= 1
            If datalist.contains(afterPref)
                Count += 1
            Else
                Count -= 1
            Store prefix score
            Store suggested segment
        Else
            Do nothing
    End loop

    Find Max prefix score
    If Max > 0
        Add corresponding segment to outputArray
    Else
        Add original word to outputArray
End loop

From the pseudo code it can be seen that the substring after the segmentation (afterPref) is now also checked for in the Data List. Scores of +1 and -1 are still allocated depending on success. A point to note is that, because of this allocation of scores, a segmentation is only accepted if both beforePref and afterPref are present in the Data List. This is a stricter measure than before, when only the beforePref substring had to be found.

Because of this measure, the majority of words from XX1 that were incorrectly split right at the start have been corrected, e.g. ‘unit y’ as opposed to ‘un ity’ and ‘weigh ing’ as opposed to ‘we ighing’ (Appendix 11, Results of XX2).
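The stricter XX2 rule can be sketched in Python as below. As before this is an illustration, not the project's code: `segment_xx2` is a hypothetical name, each suggestion's score is assumed to start at 0, and the small Data List in the usage line is made up for the example.

```python
def segment_xx2(word, prefixes, data_words):
    """Accept a split only if both sides of it are found in the data list."""
    best_score, best_segment = 0, word
    for prefix in prefixes:
        start = word.find(prefix)
        if start == -1:
            continue
        end = start + len(prefix)
        before_pref, after_pref = word[:end], word[end:]
        score = 1 if before_pref in data_words else -1
        score += 1 if after_pref in data_words else -1
        # The score is +2 only when both parts are present; 0 or -2 is rejected.
        if score > best_score:
            best_score = score
            best_segment = before_pref + " " + after_pref
    return best_segment


data = {"un", "unit", "y", "we", "weigh", "ing"}  # made-up sample entries
print(segment_xx2("unity", ["un", "unit"], data))
```

With this sample data the split after "un" scores 0 because "ity" is missing, so the split after "unit" is the one accepted.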

Appendix E3 shows a problem found in XX2:

Although the XX1 result, ‘action ’s’, is closer to the gold standard than the result of XX2, it is interesting that both segmentations that could lead to the gold standard are present here. For suggestion 2 the substring ‘ion’s’ is not found in the dataset, and for suggestion 5 there is no ’s in the dataset. If these substrings were present, +1 would have been given for both.

This is similar to the problem shown in Table E2. The Data List cannot be expected to contain ‘ion’s’, as it is not a word on its own; then again, the Data List contains all sorts of entries, from single letters to strings that are not words at all, and this seems to be an ongoing problem. Because this algorithm checks that both parts of a word are present in the Data List it may not split every word, but the splits it does make can be seen as more reliable because of the tighter constraint.

Although ’s is not in the Data List, if the algorithm were run with the Suffix List the words would have to be reversed, and s’ is in fact an entry in the Data List, so it is quite possible that a correct segmentation would be applied.

To back this point up further, hardly any of the words ending in ’s are segmented in the results, which again shows that this algorithm is limited by the integrity of the Data List.

The results gained from running the evaluation script on the output were:

F-measure: 23.64%
Precision: 60.43%
Recall: 14.69%

The precision has increased drastically from the results of XX1. The main reason, as explained above, is that fewer words are being segmented incorrectly at the start of the word. However, a reasonable F-Measure still needs to be attained. Solving the problem highlighted in Table E3 could help.

Improvements gained from XX2:

1. An even more comprehensive means of evaluating the count variable, which considers both parts of a word.

2. Improvement in precision.

Global Issues highlighted from XX2:

1. Dataset entries still causing a problem.

Local Issues highlighted from XX2:

1. Substrings such as ‘ion’s’ and ’s are not included in the Data List; the second part of a word is generally not in the dataset.

2. Does not consider more than one segmentation at a time.

3.3.4 XX3

The only differences between this algorithm and XX2 are that the lists of words are reversed after being read in and that the Suffix List is used to derive the suggested segmentations instead of the Prefix List.

Results List: Appendix 12, Results of XX3

Pseudo code:

Read in Data list
Read in Evaluation list
Read in Suffix list

Reverse Data List, Evaluation List, Suffix List

For int i = 0; i < words.length; i++
    For int j = 0; j < suffixes.length; j++
        If words[i].contains(suffixes[j])
            Int suffixStart, suffixEnd
            Suggested segment = words[i].substring(0, suffixStart)
                + words[i].substring(suffixStart, suffixEnd) + " "
                + words[i].substring(suffixEnd)
            String beforeSuffix, afterSuffix
            If datalist.contains(beforeSuffix)
                Count += 1
            Else
                Count -= 1
            If datalist.contains(afterSuffix)
                Count += 1
            Else
                Count -= 1
            Store suffix score
            Store suggested segment

The results gained from running the evaluation script on the output were:

Morpheme boundary detections statistics:
F-measure: 19.53%
Precision: 55.83%
Recall: 11.83%

It can be seen that the precision is again quite high. The results set (Appendix 12, Results for XX3) is quite similar to that of XX2. With this algorithm the words ending in ’s would have been expected to be segmented, but it does not seem to have segmented any of them.

Appendix E4 highlights the problem relative to XX3:

This example is similar to E3, in that the suggestions are quite accurate but there is trouble allocating the score. The first suggestion did not get through because ‘ings’ does not exist as a word on its own. For the second suggestion, ’s is not a word in the Data List, so the point was not given.

This algorithm has not been as fruitful, as the results show a reduction in both Precision and F-Measure, hence XX2 will be looked at further to investigate fine tuning.

3.3.5 XX4

Results List: Appendix 13, Results for XX4

Pseudo Code:

This algorithm builds on XX2. The pseudo code is the same as for XX2, with a slight difference in the scoring section:

If datalist.contains(beforePref)
    Count += beforePref.length()
Else
    Count -= beforePref.length()

If datalist.contains(afterPref)
    Count += 1
Else
    Count -= 1

Store prefix score
Store suggested segment

By the time this algorithm was developed, the previous evaluations had made it clear that the Data List was not as reliable as first thought for the algorithms required, and it was unclear whether obtaining a further set of word lists from another source was permitted. However, it is also true that parts of words such as ‘ing’s’ are not proper words, so an ideal Data List should not contain them, and a different approach had to be devised to overcome this issue. This led to the question of what difference it would make if beforePref had to be a proper word but afterPref did not necessarily have to be a word in the Data List.

The results gained from running the evaluation script on the output were:

F-measure: 41.92%
Precision: 57.37%
Recall: 33.03%

This was the greatest score achieved so far, with quite high precision too. The score given was proportional to the length of the beforePref segment found in the dataset, so the longest matching beforePref scored highest. This was especially aimed at words that were not segmented at all in the XX2 results, e.g. footing’s, summer’s and surprises’. As there was no ’s or ing’s in the dataset, an alternative like this algorithm was needed to get some sort of segmentation out of them.

From the results list (Appendix 13, Results for XX4), it can be seen that this algorithm has been successful in splitting these types of words: summer’s has been split into summer ’s, surprises’ into surprise ’s and footing’s into footing ’s. As there are numerous such words in the results list, it can be assumed that they contributed highly to the result score.

However, there are still some words that are not segmented correctly.

Appendix E5 illustrates a general problem:

This table highlights the issue of counts with the same value. Although the correct segmentation was detected, it was not chosen, because the full word ‘reproached’ was also found in the Data List, making it the last entry in the array; when the maximum was calculated, the most recent entry was taken. There were a couple of occurrences of this type of problem in the Results List, but the results still show an improvement.

Another problem that was highlighted by the results list is illustrated in the next example.

Appendix E6 illustrates a problem specific to algorithm XX4.

This table highlights a flaw in the algorithm. The segmentation chosen is ‘tel ephotograph’. Although ‘tel’ is not a word, it appears in the Data List and so is given +3, while ‘ephotograph’ is far from being a word, yet only 1 is subtracted for it. This seems to be a side effect of the algorithm, as its main focus was the words that were not being segmented by the previous algorithms, e.g. summer’s. It had been decided that the beforePref substring had to be a word but the afterPref substring did not necessarily have to be one, and the suggestion chosen here forces this strategy to be reconsidered a little. Not many words are affected by this problem, which is why the precision is still at a good level, but for a good generic solution this strategy will have to be modified.
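The XX4 scoring and the two problems just described can be reproduced with a short Python sketch. It is an illustration under assumptions, not the project's code: the names `score_xx4` and `segment_xx4` are hypothetical, the rule that a later suggestion wins a tie is inferred from the remark that the most recent entry was taken, and the sample Data List entries and the tied score of 9 are made up for the example rather than copied from tables E5 and E6.

```python
def score_xx4(before_pref, after_pref, data_words):
    """Left side scores by its length; right side scores only +1 or -1."""
    score = len(before_pref) if before_pref in data_words else -len(before_pref)
    score += 1 if after_pref in data_words else -1
    return score


def segment_xx4(word, prefixes, data_words):
    best_score, best_segment = 0, word
    for prefix in prefixes:
        start = word.find(prefix)
        if start == -1:
            continue
        end = start + len(prefix)
        before_pref, after_pref = word[:end], word[end:]
        score = score_xx4(before_pref, after_pref, data_words)
        # Assumption: on equal scores the most recent suggestion replaces the earlier one.
        if score > 0 and score >= best_score:
            best_score = score
            best_segment = (before_pref + " " + after_pref).strip()
    return best_segment


# 'tel' is in the data list, so +3 outweighs the -1 for 'ephotograph'.
print(segment_xx4("telephotograph", ["tel"], {"tel"}))
# 'reproach ed' scores 8 + 1 = 9, but the whole word scores 10 - 1 = 9 and comes later.
print(segment_xx4("reproached", ["reproach", "ed"], {"reproach", "reproached", "ed"}))
```

In the second call the affix "ed" is found at the end of the word, so the left-hand substring is the whole word and the tied, later suggestion leaves the word unsegmented.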

Improvements from XX4:

1. Compared with XX3, the F-Measure has more than doubled (19.53% to 41.92%) and the Precision has also increased (55.83% to 57.37%), which suggests this is the beginning of a better approach.

Global Issues highlighted from XX4:

1. Dataset still causing problems, but methods to work around it will need to be implemented.

Local Issues highlighted from XX4:

1. If the full word appears in the Data List it is given the maximum count and hence is not segmented, e.g. E5 above.

2. If the afterPref substring is not detected in the Data List, the count value decreases by at most 1. This can give rise to quite obviously incorrect segmentations, e.g. tel ephotograph, av uncular.

3.3.6 XX5

Results List: Appendix 14, Results for XX5

Pseudo Code:

If datalist.contains(beforePref)
    Count += beforePref.length()
    If Count == words[i].length()
        Count = 0
Else
    Count -= beforePref.length()

If datalist.contains(afterPref)
    Count += afterPref.length()
Else
    Count -= afterPref.length()

Store prefix score
Store suggested segment

If word unchanged, store in Remainder array

To improve algorithm XX4 further, and in effect to address the local issues drawn from its results, the following modifications were proposed:

1. If the length of the beforePref segment is equal to the length of the word, the count is set to 0. This prevents the word from being left unsegmented, e.g. example E5.

2. To solve the 2nd Local Issue in XX4, the count value is made proportional to the length of the substrings on both sides of the segmentation.

3. As an extra measure, words that are not segmented at all are stored in a separate file to be analyzed further. These words are known as the ‘Remainder’ words.
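The three modifications can be sketched in Python as below. This is a hedged illustration, not the project's code: the names `score_xx5` and `run_xx5` are hypothetical, each suggestion's score is assumed to start at 0, and the sample lists are made up for the example.

```python
def score_xx5(word, before_pref, after_pref, data_words):
    if before_pref in data_words:
        # Modification 1: a "split" that keeps the whole word on the left scores 0.
        score = 0 if before_pref == word else len(before_pref)
    else:
        score = -len(before_pref)
    # Modification 2: the right-hand side is now weighted by its length as well.
    score += len(after_pref) if after_pref in data_words else -len(after_pref)
    return score


def run_xx5(words, prefixes, data_words):
    """Return (segmented words, remainder words)."""
    segmented, remainder = [], []
    for word in words:
        best_score, best_segment = 0, word
        for prefix in prefixes:
            start = word.find(prefix)
            if start == -1:
                continue
            end = start + len(prefix)
            score = score_xx5(word, word[:end], word[end:], data_words)
            if score > best_score:
                best_score, best_segment = score, word[:end] + " " + word[end:]
        # Modification 3: words left unchanged go to the Remainder list.
        (remainder if best_segment == word else segmented).append(best_segment)
    return segmented, remainder


data = {"reproach", "reproached", "ed", "tel"}  # made-up sample entries
print(run_xx5(["reproached", "telephotograph"], ["reproach", "ed", "tel"], data))
```

With this sample data "reproach ed" scores 8 + 2 = 10 while the whole-word suggestion scores 0, and "tel ephotograph" scores 3 - 11 = -8, so the word is passed to the Remainder list instead of being split after "tel".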

The results gained from running the evaluation script on the output were:

Number of words in gold standard: 532 (type count)
Number of words in data set: 308 (type count)
Number of words evaluated: 308 (100.00% of all words in data set)
Morpheme boundary detections statistics:
F-measure: 59.62%
Precision: 71.43%
Recall: 51.16%

This is quite an improvement: both the Precision and the F-Measure have increased again. However, this result does not include the ‘Remainder’ set of words (Appendix 14b, Remainder List). Evaluating the remainder words together with this set gives:

Number of words in gold standard: 532 (type count)
Number of words in data set: 532 (type count)
Number of words evaluated: 532 (100.00% of all words in data set)
Morpheme boundary detections statistics:
F-measure: 40.85%
Precision: 71.43%
Recall: 28.61%

The precision is unchanged, but the F-Measure has dropped because of the reduction in Recall. As Recall is not the focus of the evaluation, it can be said that overall this algorithm has improved on the results of the previous algorithms.

It can be seen from the results set (Appendix 14a, Results for XX5) that words like reproached, responding and agreeably, which were not segmented in XX4, have been segmented correctly.

Appendix E7 shows an example of a problem.

This example shows the word broadcasting, which was output as ‘Broadcasting’ by XX4. This algorithm segments it as ‘Broad casting’, which is closer to the gold standard than before, so that is an improvement. The problem is that even though it detects both of the corresponding gold segmentations (suggestions 4 & 9), it does not carry out both of them. This has been a limitation throughout the results. If the algorithm were run again it could segment based on the other option too, but for now segmentations like this have improved the score while limiting progress towards greater accuracy.

The ‘Remainder’ words also contained many words that should be segmented (Appendix 14b, Remainder List).

Appendix E8 highlights another problem.

This example shows how algorithm XX5 did not detect even one correct segmentation in words like this, so the word was moved to the Remainder List. A few words in the Results List (Appendix 14a) suffer from the same problem.

Improvements from XX5:

1. Higher precision and F-Measure achieved.

Global Issues highlighted from XX5:

1. More than one segmentation not considered

Local Issues highlighted from XX5:

1. Does not detect correct segmentations on some words, indicating limitations of the prefix list.

3.3.7 XX6

Results List: Appendix 15, Results for XX6

Pseudo Code:

Read Remainder list
Read Data list
Read Suffix list

If datalist.contains(beforeSuff)
    Count += beforeSuff.length()
    If Count == words[i].length()
        Count = 0
Else
    Count -= beforeSuff.length()

If datalist.contains(afterSuff)
    Count += afterSuff.length()
Else
    Count -= afterSuff.length()

Store suffix score
Store suggested segment

If word unchanged, store in More Remainder list

This algorithm focused on the local issue in the XX5 algorithm. The main aim was to achieve segmentation of the words in the Remainder List, for which the prefixes had not detected any correct segmentations.

The algorithm is the same as XX5, the only difference being that the Suffix List, instead of the Prefix List, is used to generate the suggested segmentations.

Results:

Number of words in gold standard: 532 (type count)
Number of words in data set: 34 (type count)
Number of words evaluated: 33 (97.06% of all words in data set)
Morpheme boundary detections statistics:
F-measure: 63.41%
Precision: 78.79%
Recall: 53.06%

The evaluation script was run on the segmented words from the Remainder List. The F-Measure has risen along with the Precision, which illustrates greater accuracy. Although only 34 words were segmented, this still shows an improvement. The words that were not segmented were added to the ‘More Remainders’ List (Appendix 15b, More Remainder).

However, a few odd words have been segmented incorrectly after running this algorithm.

Table E9: incorrectly segmented word in XX6

It can be seen from this table that the word ‘believe’ is segmented incorrectly by the first suggestion, because the segment ‘belie’ is detected as a word in the Data List. This seems to be a recurring problem.


Global Issues highlighted from XX6:

1. Data List integrity

Local Issues highlighted from XX6:

1. None detected

Improvements from XX6:

1. Increase in precision, more words segmented.

3.3.8 XX7

Results List: Appendix 16, Final Result XX7

This algorithm combines XX5 and XX6. XX5 was run first, and XX6 was then run on the words that had been added to the ‘Remainder’ word list. The three result lists were then combined: the words segmented by XX5, the words segmented by XX6, and the non-segmented More Remainder List.

Appendix F illustrates the running of the final algorithm, XX7.

To obtain the final result, the outputs of XX6 (the segmented list and the non-segmented list) are combined with the output of XX5. This gives the overall result for all 532 words.
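The combination step can be sketched as follows. This is only an illustration under stated assumptions: `run_xx7` and the two toy segmenters are hypothetical stand-ins, and a word is treated as unsegmented when a segmenter returns it unchanged.

```python
def run_xx7(words, segment_xx5, segment_xx6):
    """Run the first segmenter, pass unsegmented words to the second, then combine."""
    result_1, remainder = [], []
    for word in words:
        segmented = segment_xx5(word)
        if segmented != word:
            result_1.append(segmented)
        else:
            remainder.append(word)          # Remainder List

    result_2, more_remainder = [], []
    for word in remainder:
        segmented = segment_xx6(word)
        if segmented != word:
            result_2.append(segmented)
        else:
            more_remainder.append(word)     # More Remainder List

    # Final Result List: every input word appears exactly once.
    return result_1 + result_2 + more_remainder


# Toy stand-ins for XX5 (prefix based) and XX6 (suffix based).
def toy_prefix_segmenter(word):
    if word.startswith("re") and len(word) > 2:
        return "re " + word[2:]
    return word


def toy_suffix_segmenter(word):
    if word.endswith("ing") and len(word) > 3:
        return word[:-3] + " ing"
    return word


print(run_xx7(["redo", "going", "cat"], toy_prefix_segmenter, toy_suffix_segmenter))
```

With the word counts reported for the evaluation set, the three lists would hold 308, 34 and 190 words, which together give the 532 words of the final result.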

Results:

Evaluation of segmentation in file "finaltest.txt" against gold standard segmentation in file "eval.txt":
Number of words in gold standard: 532 (type count)
Number of words in data set: 532 (type count)
Number of words evaluated: 532 (100.00% of all words in data set)
Morpheme boundary detections statistics:
F-measure: 44.32%
Precision: 72.14%
Recall: 31.99%

This is the highest result of all the algorithms run on the full evaluation set. It is lower than the score of XX6, but XX6 was applied to only 34 words. As a combination of XX5 and XX6 the result is quite good. It was logical to run XX6 on the words that XX5 had not segmented, since XX5 relies on prefixes and XX6 on suffixes. There are still 190 words in the More Remainder List. Some of them may already be single morphemes, but the More Remainder List (Appendix 15b, More Remainder List) shows that there are still words that could be segmented. These are discussed in the Extending section in the following chapter.


4. Summary of Evaluations/ Conclusion

Appendix H shows the results of all the major algorithms developed.

This table reviews the scores of all the algorithms developed. The results show a significant improvement from XX1 onwards. This is credited to the scoring system implemented in the segmentation module, which replaced the earlier approach of splitting on every affix found in a word (see Appendices X1 and 3-8).

Algorithms X5 and X7 were adapted to extract the prefixes and suffixes once the segmentation module had been made independent. The end algorithm certainly seems to work well with the English language.

To review, the strategy was to implement algorithms of increasing sophistication. After each algorithm was completed it was evaluated and its most noticeable issues were recorded. The next algorithm then focused on solving the problems found in the previous one.

4.1 Extending/Issues

The end algorithm XX7 gives a reasonably good score, but some issues within the overall algorithm were not solved because of time constraints.

1. The integrity of the Data List was an ongoing problem. At the start of the first implementation it was assumed that the supplied Data List consisted of complete and correct English words. As development progressed it became clear that this was not the case and that the Data List contained many incorrect words. Measures were taken to filter noise out of the Data List, such as in X5 and X7, which only include words with a frequency greater than 10. Further measures were added from the XX4 algorithm onwards to cope with the uncertainty about what the Data List contains. Even at this point it is not known exactly what is in the Data List. It would be more accurate to call it a collection of strings than a collection of words, as there are so many inconsistencies within it. With a more reliable Data List, I feel that a better solution could have been developed, or that the current solution would work better. However, the filtering procedure for words in the Data List is quite simple and could definitely be improved; a more sophisticated filtering approach could be devised to improve the solution.

2. The affix lists were gathered using algorithms X5 and X7. The sample set was 100,000 words, of which only 25781 were used for affix extraction. A problem was discovered when implementing the new segmentation module, which relies heavily on the already extracted affix lists: the words after the letter ‘n’ in the Data List had not been examined. The original Data List is alphabetically sorted, and the 100,000-word sample only went up to words starting with ‘n’. To overcome this, the sample set was increased to the whole Data List of 167,374 words. The number of words with a frequency greater than 10 was then 43329, about 1.7 times the sample used before, which shows how much coverage had been lost. The prefix total increased from 199 to 363 and the suffix total from 373 to 387. The results for the X5 and X7 algorithms were left as they were. There did not seem to be any point in re-running their segmentation because, by the time the problem was noticed, the separate segmentation module was being developed. The segmentation module (XX1-XX7) therefore uses the new Data List, Prefix list and Suffix list, which were extracted using the adapted code of X5 and X7. It would not have made much difference if X5 and X7 had used the larger lists, since the segmentation within them simply split on every affix detected, leaving many spaces.

3. The segmentation does not consider more than one correct suggestion at a time (see the example in Appendix E7). However, it would not take much more effort to have XX7 check for suggestions with equal scores and apply those segmentations too, in a recursive way.

4. The More Remainder List (Appendix 15b, More Remainder List) mostly contains words that are made up of other words with a suffix at the end, for example ‘aftercare’s’. If it were split into ‘after’ and ‘care’s’, the segment ‘care’s’ would not be found in the Data List, which is why the word has remained unsegmented up to this point. To extend the algorithm, it could first find the split where the first segment is detected, between ‘after’ and ‘care’s’. It could then check for a valid split within the second segment ‘care’s’ on its own and, if one exists, infer that the split between ‘after’ and ‘care’s’ is also valid. This can be extended to work as a recursive algorithm, as suggested by the following pseudo code:

Read in Prefix, Suffix, Evaluation, Data Lists

Run XX5

For each suggestion,

split into beforePref and afterPref.

If beforePref in Data List, give beforePref.length score

Run XX6 on afterPref,

If valid segment present, accept segmentation after beforePref and in afterPref

Appendix G shows the expected running of the algorithm described by the above pseudo code on ‘aftercare’s’. The overall score for dividing ‘care’s’ is greater than 0, so both segmentations would be accepted. This would be the solution for words of this type.

The pseudo code could also run XX5 on the beforePref segment in case there are possible segmentations there too. In all, XX5 would be run on the segment before the proposed split and XX6 on the segment after it. I think that would be a thorough approach.

5. Successor variety is the main basis of affix extraction. If more sophisticated methods were used, or combined with successor variety, for example conditional probabilities as described in the background reading, the score could improve considerably.

6. The algorithm could be extended to work on other languages. I had intended to test the program on Turkish and Finnish, although this was not a requirement of the Challenge. It was not possible to do so because of the sheer number of entries in their Data Lists: they kept causing my system to crash, and I was not even able to store them in a text file. The Turkish Data List contains more than 500,000 entries and the Finnish one more than 1 million. There was not enough time left to configure them. Even if I had managed to store them and run the algorithm, it would have taken a long time. Prefix gathering with X5 on the full English dataset took longer than 30 minutes, so I would expect it to take at least 3 times as long on these lists. It would have been good to test them. However, nothing in my algorithm is hard coded for the English language, so it is still a generic solution. Even though I have not been able to test it on another language, I am confident that it would give an acceptable result.

Other tasks on the Morphology Challenge website, not just segmentation, could also be considered.
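The extension proposed in point 4 can be sketched in Python. This is a simplified reading of the pseudo code rather than an implementation of it: the function names and the two-word toy Data List are hypothetical, and the second segment is split by trying every position and scoring the two parts, instead of running the full XX6 algorithm with its suffix list.

```python
def score_split(before, after, data_list):
    """Add a segment's length if it is in the Data List, otherwise subtract it."""
    count = len(before) if before in data_list else -len(before)
    count += len(after) if after in data_list else -len(after)
    return count


def best_split(word, data_list):
    """Return (score, before, after) for the best split, or None for short words."""
    candidates = [
        (score_split(word[:i], word[i:], data_list), word[:i], word[i:])
        for i in range(1, len(word))
    ]
    return max(candidates, key=lambda c: c[0]) if candidates else None


def segment_compound(word, data_list):
    """Accept a split after a known first segment if the rest splits validly."""
    for i in range(1, len(word)):
        before, after = word[:i], word[i:]
        if before not in data_list or after in data_list:
            continue  # only the case described in point 4 is handled here
        inner = best_split(after, data_list)
        if inner is not None and inner[0] > 0:
            return [before, inner[1], inner[2]]
    return [word]


# Toy Data List: 'after' and 'care' are present, "care's" is not.
print(segment_compound("aftercare's", {"after", "care"}))
```

For ‘aftercare’s’ the inner split of ‘care’s’ scores +4 for ‘care’ and -2 for ‘’s’, giving a total above 0, so both boundaries are accepted, in line with the running shown in Appendix G.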

4.2 Conclusion

Appendix I shows the scores from the Morphology Challenge website, with my score added at the bottom.

Looking at the scores, I feel I have done well to achieve this score using very simple techniques and intuition, given that most of the entries to the challenge came from teams of professional researchers. I feel that with more sophisticated methods I would have achieved an even higher score. The scores in bold in the table highlight the contestants whose scores mine was either greater than or close to. This comes to 6/12 for F-Measure and 11/12 for Precision, which seems to be very good.

I feel a good outcome has come from this project, although there is great room for improvement. The solution I have developed is generic and fulfils the unsupervised criteria. I feel I have exceeded my minimum requirements: the minimum requirement was to develop some sort of solution, and I have developed one that is actually quite good. I do feel that I have had to rush the ending, as the schedule I first stated in my mid-term report turned out to be quite ambitious. I did not realise how long the programming would take and how many problems would be encountered, especially in debugging. Also factors such as how many iterations would be required


Bibliography

Asmah Hj. Omar. (1986). Nahu Mutakhir Melayu. Kuala Lumpur: Dewan Bahasa dan Pustaka.

Aronoff, M. and Fudeman, K. (2005). What is Morphology? Wiley Blackwell. pp. 1-2.

Dang, M. and Choudri, S. (2005). Simple Unsupervised Morphology Analysis Algorithm (SUMAA). In Proceedings of Morphochallenge 2005. Accessed from http://www.cis.hut.fi/morphochallenge2005/P09_DangChoudri.pdf on 23/04/2009, 4:11pm.

Goldsmith, J. (2001). Unsupervised Learning of the Morphology of a Natural Language. Computational Linguistics, pp. 153-189. Accessed from http://acl.ldc.upenn.edu/J/J01/J01-2001.pdf on 15/04/2009, 2:00am.

Jalaluddin, N. H. (2008). European Journal of Social Sciences, pp. 109-115. http://www.eurojournals.com/ejss_7_2_09.pdf. Accessed 24/04/09, 3:00pm.

Jurafsky, D. and Martin, J. (2008). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall. p. 377.

Harris, Z. (1955). From Phoneme to Morpheme. Language, 31(2):190–222.

Hafer, M. A. and Weiss, S. F. (1974). Word Segmentation by Letter Successor Varieties. Information Storage and Retrieval, 10:371–385.

Rehman, K. and Hussain, I. (2005). Unsupervised Morphemes Segmentation. In Proceedings of Morphochallenge 2005. Accessed from http://www.cis.hut.fi/morphochallenge2005/P10_RehmanHussain.pdf on 20/02/2009.

Morphology Challenge 2005 homepage. http://www.cis.hut.fi/morphochallenge2005/ Last accessed on 24/04/2009.


Appendix A – Personal reflection

I have found this to be a very challenging project, and I think I underestimated the amount of work it required. It is important to set weekly goals and actually stick to them. I recall that I became quite laid back mid-way through and was also ill for a few days, which set my mid-term report behind schedule. The schedules (Appendices B and B2) show that I was quite naïve in my initial plan: I took a very high level approach whereas I should have taken a low level approach. Even now I am rushing through the last part of my report. However, I have learnt a lot in this project. I have become a much more confident programmer, and my organisation skills have improved even though I have had a few setbacks. The best advice I can give is that being organised is a must.

When it comes to actually implementing the algorithms, it is important to allow time for debugging, as some bugs can be very annoying and time consuming to solve. The final write-up in particular takes a lot longer than one might think. My supervisor advised me to start it earlier, but I was too involved in the programming. I think that with a better plan of action I would have been more successful.


Appendix B- Plan of Schedule


Appendix B2- Actual Plan

Phase I


Phase II


Appendix C- Shows the comparison of the 4 major algorithms.

Algorithm  Initial Dataset  Actual Sample Used  Threshold  Affixes Gathered  Affixes Gathered as % of Sample Used  F-Measure (%)
X4         100,000          62261               >1         2476              3.976807                              24.84
X5         100,000          25781               >10        199               0.771886                              24.51
X6         100,000          62261               >1         3338              5.361302                              28.53
X7         100,000          25781               >10        373               1.446802                              28.13
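The percentage column in this table is consistent with dividing the number of affixes gathered by the actual sample used. A short sketch of that calculation follows; the function name and the dictionary layout are illustrative only.

```python
def affix_percentage(sample_used, affixes_gathered):
    """Affixes gathered as a percentage of the words actually sampled."""
    return 100.0 * affixes_gathered / sample_used


# (actual sample used, affixes gathered) for each algorithm in the table.
appendix_c = {
    "X4": (62261, 2476),
    "X5": (25781, 199),
    "X6": (62261, 3338),
    "X7": (25781, 373),
}

for name, (sample, affixes) in appendix_c.items():
    print(name, round(affix_percentage(sample, affixes), 6))
```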

Appendix D- Diagram shows flow of algorithm for the main program


Appendix E- Figure Shows Development steps.

Input
    evalName: String
    dataName: String
    evalSize: int
    dataSize: int
    getEval(evalName, evalSize)
    getData(dataName, dataSize)

AffixGather
    dataList: String [ ]
    gAffix(dataList)

Filter
    frequency: int
    dataName: String [ ]
    filterDataList(frequency, dataName)

Segmentation
    affixList: String [ ]
    evalList: String [ ]
    segment(affixList, evalList)

Trim
    _array: String [ ]
    trimArray(_array)

Reversal
    Arrary: String[]
    rerverseArray(Array)

Output
    affixArray: String [ ]
    resultArray: String [ ]
    outputLists (affixArray,resultArray)


Appendix E.A Shows the running of the XX1 algorithm

[Figure: development steps. Boxes labelled XX1, XX2, XX3, XX4, XX5, XX6 and XX7 (Final Algorithm), connected by arrows labelled "Provides basis for".]

[Figure: Running of XX1. Labels: X5; XX1; Gathers Prefix list and filters Data List; Inputs Prefix and Data List; Segmentation Phase; Evaluation List; The D2 data set; Evaluation Result List.]


Appendix E1-Table shows the running of Algorithm XX1 on a word.

Word          Suggestions        Prefix  Count  Segment Chosen  Gold Standard
Aeroplanes’   1  Ae roplanes’    ae       1
              2  Aeroplan es’    an      -1
              3  Aer oplanes’    er      -1
              4  aeropla nes’    la      -1
              5  Aeroplane s’    ne       1
              6  Aerop lanes’    op      -1
              7  aeropl anes’    pl      -1
              8  Aero planes’    ro       1
              9  Aeroplanes’     S’      -1
                                                Aero planes’    Aero plane s’

Appendix E2- Table highlight on a problem discovered in algorithm XX1

Word        Suggestions   Prefix  Count Score  Segment Chosen  Gold Standard
agreeably   Agreeab ly    ab      -1
            Ag reeably    ag      -1
            Agreeabl y    bl      -1
            Agreea bly    ea      -1
            Agree ably    ee      -1
            Agr eeably    gr      -1
            agreeably     ly      -1
                                               Agreeably       Agree ab ly

Appendix E3-Highlights a problem found in algorithm XX2.


Word       Suggestions    Prefix  Count  Segment Chosen  Gold Standard
Action’s   1  Ac tion’s   ac       0
           2  Act ion’s   Ct       0
           3  Actio n’s   io      -2
           4  Action’ s   N’       0
           5  Action ’s   on       0
           6  Acti on’s   ti      -2
                                         Action’s        Act ion ’s

Appendix E4-Highlight the problem relative to algorithm XX3

Word        Suggestions     Prefix  Count  Chosen Segment  Gold Standard
Footing’s   1  s'gn itoof   gn      -2
            2  s'gnit oof   it       0
            3  s'gni toof   ni       0
            4  s'gnitoo f   oo       0
            5  s' gnitoof   S’       0
            6  s'gnito of   to       0
                                           Footing’s       Foot ing ’s


Appendix E5-illustrating a general problem.

Word         Suggestion        Prefix  Count      Chosen Segment  Gold Standard
Reproached   1. Re proached    re      2-1=1
             2. Rep roached    ep      3-1=2
             3. Repr oached    pr      -1-1=-2
             4. Repro ached    Ro      -1+1=0
             5. Reproa ched    oa      -1-1=-2
             6. Reproach ed    ch      8+2=10
             7. Reproache d    he      -9+1=-8
             8. Reproached     ed      +10
                                                  Reproached      Reproach ed

Appendix E6- illustrates a problem relative to algorithm XX4

Word             Suggestions            Prefix  Count        Chosen Segment    Gold Standard
Telephotograph   1. Te lephotograph     te      2-1=1
                 2. Tel ephotograph     el      3-1=2
                 3. Tele photograph     le      -4+1=-3
                 4. Telep hotograph     ep      -5-10=-15
                 5. Teleph otograph     ph      -6-9=-15
                 6. Telepho tograph     ho      -7-8=-15
                 7. Telephot ograph     ot      -8-7=-15
                 8. Telephoto graph     to      -9+1=-8
                 9. Telephotog raph     og      -10-5=-15
                 10. Telephotogr aph    gr      -11-4=-15
                                                             Tel ephotograph   Tele photo graph


Appendix E7-Example of a problem from XX5

Word           Suggestions          Prefix  Count        Chosen Segment  Gold Standard
Broadcasting   1. Br oadcasting     Br      -2-10=-12
               2. Bro adcasting     Ro      -3-9=-12
               3. Broa dcasting     oa      -4-8=-12
               4. Broad casting     ad      +5+7=+12
               5. Broadc asting     dc      +6-6=0
               6. Broadca sting     Ca      -7+5=-2
               7. Broadcas ting     as      -8+4=-4
               8. Broadcast ing     st      +9+3=+12
               9. Broadcasting      ng      +12 >> 0
                                                         Broad casting   Broad cast ing
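In this example two suggestions tie on +12 but only the first is applied. One possible way of keeping every top-scoring suggestion is sketched below. It is an assumption about how the tie could be handled, not the project's code: the helper names are hypothetical and the toy Data List is inferred from the counts in the table.

```python
def score_split(before, after, data_list):
    """Add a segment's length if it is in the Data List, otherwise subtract it."""
    count = len(before) if before in data_list else -len(before)
    count += len(after) if after in data_list else -len(after)
    return count


def top_split_positions(word, data_list):
    """Return every split position that shares the highest score."""
    scores = {i: score_split(word[:i], word[i:], data_list)
              for i in range(1, len(word))}
    if not scores:
        return []
    best = max(scores.values())
    return [i for i, s in scores.items() if s == best]


def apply_boundaries(word, positions):
    """Insert a space at each of the given split positions."""
    cuts = [0] + sorted(positions) + [len(word)]
    return " ".join(word[a:b] for a, b in zip(cuts, cuts[1:]))


# Hypothetical Data List entries, inferred from the counts in the table.
toy_data_list = {"broad", "casting", "broadc", "sting", "ting", "broadcast", "ing"}

positions = top_split_positions("broadcasting", toy_data_list)
print(positions, apply_boundaries("broadcasting", positions))
```

Applying both tied boundaries gives ‘broad cast ing’, which agrees with the gold standard for this word; whether this holds for other tied words has not been tested here.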

Appendix E8- Highlighting another Problem, XX5

Word      Suggestions   Prefix  Count      Gold Standard
Adult’s   Ad ult’s      Ad      2-5=-3
          Adu lt’s      Du      -3-4=-7
          Adult’ s      T’      -6+1=-5
          Adul t’s      Ul      -4-3=-7
                                           Adult ’s

Appendix E9-Example of an incorrectly segmented word from algorithm XX6

Word      Suggestions     Suffix  Count     Chosen Segment  Gold Standard
Believe   1. Belie ve     ev      +5+2=7
          2. Beli eve     ve      -4+3=-1
          3. Bel ieve     ei      3-4=-1
          4. Be lieve     il      2-5=-3
          5. B elieve     le      1-6=-5
                                            Belie ve        Believe

Appendix F-Diagram shows the running of the XX7 algorithm.


Appendix G-Diagram shows the running of the algorithm described in point 4 of the extensions on the word Aftercare’s.

Appendix H- Shows the evaluation results of all the algorithms.

Running of XX7 (Appendix F):

Evaluation List -> XX5 run -> Result 1 (308 words) and Remainder List (224 words; un-segmented words from running XX5)
Remainder List -> XX6 run -> Result 2 (34 words) and More Remainder List (190 words; un-segmented words from running XX6)
Result 1, Result 2 and More Remainder List -> Final Result List

Running on Aftercare’s (Appendix G):

Aftercare’s -> After (Count +5) and Care’s: After detected in list, Care’s not in list
Care’s -> Care (Count +4) and ’s (Count -2): Care in list, ’s not in list
Result: After care ’s


Algorithm  Size of Evaluation Set  F-Measure (%)  Precision (%)
X4         532                     24.84          18.73
X5         532                     24.51          18.33
X6         532                     28.53          18.93
X7         532                     28.13          18.70
XX1        532                     21.74          28.54
XX2        532                     23.64          60.43
XX3        532                     19.53          55.83
XX4        532                     41.92          57.37
XX5        532                     40.85          71.43
XX6        34                      63.41          78.79
XX7        532                     44.32          72.14


Appendix I-Shows the results of the contestants participating in Morphology Challenge 2005 on the

English Language.

Name            F-Measure  Precision
A1              49.8       44.7
A2a             66.6       67.7
A2b             62.4       55.2
A3              32.0       24.1
A4a             61.7       62.6
A4b             58.5       61.2
A5              53.8       50.6
A6              76.8       76.2
A7              48.0       47.1
A8              36.2       32.5
A9              28.5       22.9
A10             43.7       37.5
A11             45.7       57.1
A12a            55.7       57.6
Atif Score XX7  44.32      72.14

The table shows the F-Measure (%) and Precision (%) of the participants' entries to Morphology Challenge for the English language (adapted from Kurimo et al, 2005).


Appendix 1, Evaluation List D1

a about after all an and any are as at be been before but by can could did do first for from good great had

has have he her him his i if in into is it its know like little made man may me more most mr much must

my no not now of on one only or other our out over said see she should so some such than that the their them

then there these they this time to two up upon us very was we well were what when which who will with would you your

Appendix 2, Evaluation List, D2

aboutaccelerateaccurstaction'sadjadult'saeroplanes'aftercare'sagreeablyairlettersalexics'allowance'salumamharic'sanalects'sanglicanism'sannual'santhropophagousapatheticallyappellationsaprils'archeryarmletsartier

asphyxiationassortingatonalaudition'sautobiographicallyavuncularbactriabalkingbandoleerbarbarity'sbarracudabastionsbazaar'sbefitbelievebeneficence'sbestiaries'bibliophiles'blackleggedbleats'bludgeoningboliviabopper'sboulevardbrandying

breastbone'sbridecake'sbroadcastingbrowbeatingbuffbullheadedness'sburdocksbushbabies'buzzers'cacklescalenderingcamelliascandelabrumscantilevers'cargocarsickness'scastaways'catechizedcave'scentaurs'chablis'chancel'scharismaticcheat'schessman

chimericalchroniccincturecleaverscliquierclottingcoadjutor'scochineal'scoercioncolonizescomfitscommodescompensatescomprehensive'sconciergesconduit'sconfucius'sconnection'sconsolidations'consumercontoursconvenecopyholdercornscorse's


cottontailcounterpointcovencrabgrass'scravatscremationcrispestcruciallycubiccupbearers'curtailment'scyclamatesdaffodils'dandruff'sdeathlesslydeceivers'decordefectsdegrees'delta'sdemoralizedepilation'sderidedesolatingdethroneddewlap'sdictaphones'dildosdipper'sdisassociatingdiscountsdisgruntlesdismemberment'sdisputesdistastefulness'sdivergences'docketsdogwooddoorcasesdozeddrawingsdruggistdues'duress'eaglets'editionseggcupselderflower'selision'sembezzlementemperorsencodesengineenslavesentries'epsom's

eructatedestimators'evaluatingexactitudeexcretionexilesexpletive'sextenuationeyedropfactorizedfalsehood'sfarrier'sfaultilyfeignedfestivals'fieldmicefilmstripsfirthsflierflorists'flutes'foldfooting'sforegroundforkliftfossilizedfoxhuntingfreshersfrizzlesfryer'sfundamentals'fustygallinggapgaslightsgazinggent'sgharrygirdersglibglutengodparent'sgoods'grabs'granule'sgreatestgrimacesguardedgullies'gutturalshaghankeringsharmonic'shatfulsheadachierhearth's

heehaws'henbanesheterogeneouslyhipposhogsheadshominghootershorsewhips'housefatherhungryhydraulics'idiomilluminationimmunizationimpingements'improbableinbornincompetencyindecipherability'sindividualistsinexorablyinflow'sinitiateinquisitivelyinspectorships'intakeinterjections'interringintroversion'sinvoiceirrelevances'iviesjanitor'sjerkyjobberjoules'jumblekappas'kibbutznikkinkedknight'skris'laddieslamps'lapelslatheslaxations'learnlegalizesleprechaun'sliability'sliedlights'linearliquorslivid

lockup'slongbowlorlovelornlugubriousness'luxuriantlymadeirasmaids'malcontentedmandarin'smanometricmargin'smarrowsmasseusematinsmaypole'smediatingmelodiousness'smercer'smestizos'mezzosmidshipmen'sminelaying'smisadventure'smisfires'missusmocha'smolehills'monkeys'mooniermorphemicsmotormouthwashesmulberrymuscatelmutt'snakednessnationalizationnearsightedness'negressesneurosesnicernile'snocturne'snonviolentnotaries'nuggets'nymphoobliteration'soccidentals'offeringokayingontology'sopposingordinalorthodontics'


outclassoutrages'overgrowths'overstatesoystercatcherspailfulspalliation'spanjabiparablepardoningparsonagespassionflowers'pathfinder'spawkiness'speelers'peninsulaperambulators'periphrasispersecutor'spesos'pharisaism'sphonecalls'physiologists'pillars'pinpointedpitchersplainchant'splatitudinouspleatplumberpoetasterpolicewomanpolypiportcullises'postcardpotionspractisedprecepts'prefabricationprepositionpressgangingprickedprioress'procrastinateprognosis'spromptnesspropositionsprotesterprunerspullets'punnetpursuespyrex's

quarantine'squestionnaires'quixoteraconteurs'railing'sraneesrate'sreams'reccesreconcilement'sredbreastrefereedrefutation'sreifiesrelictremountsrepayments'reproachedreservistrespondingretardationrevaluerevolvers'riddledrills'river'srondorotaroutines'ruefulruntssacking'ssages'salonssandals'sarongssauternesscallopedscientologistscowls'scripscurf'sseance'ssecretions'seesseneschal'ssepulchralservitudesextonshamelessnesssheaths'shiver'sshowrooms'

sibilant'ssiege'ssilksittings'skidpansskydiving'sslavers'slimslothfulness'ssmallssneerersociables'solariums'somnambulismsortie'ssouvenirspectresspigotspivviestspools'squanderer'sstabilizedstalingstaresstayedstiffenersstockbreeders'stoolies'straightensstreptomycinstupefactionsublieutenants'subtenants'suffragan'ssummers'suntanssupplicantsurprise'sswanksswordfishsyndicaliststaborstales'tannin'starsitaxonomytec'stelephotographtenderfoottermite'sthai'sthermoplastic'sthistledown's

thrivesthwartstigress'ting'stontinestorments'toughnesstracksuittrammelledtransliteratingtravelogue'strendsetters'trilogytruncatetulip'stweezertypedumpteensunctuousness'undersecretariesuneventfulunityunrollsuphillusherettesvagueness'svariance'svenalvermifugevibrancy'svillainviscountcyvodkavulgarianswaitwant'swarrenwatchdogwaterloggedweariedweighingwestchesterwherrywidenwinterierwitticismswoodworkwretchedlyyam'sziggurats'


Appendix 3, Sample Results X1

aaabacaa haa raa aa'bab aab bab cab dab eab iab lab nab oab rab uaa ba'cac bac eac hab e nab e rab e taa 'sab i mab i tab b aab l eab l yab b eab b yab o oab o uaa geab r aab r iab soab thab a dab u hab u lab u nab u tab wy

ab d iab d uac adab a sac caac daac deab e dac e dac e hac e pac e sac e yab e lac h eab e t sab horab a teab i chab i deaa r onab i neaa ltoab jurab kerab b a dab l ahab b a sab l e rab l e sab a baab b éab n erab n orab b e sab o boab o deab o feab o ieab b e yab o o nab o o tab o rtab b ieab o u dab o u rab o u t

ab o veab b otab ac kab r ac ab r a mab r a raa staab r i sab r ouab r usab a d iab d ekab d elab a ftab u jaab d i cab d ouab u seab a jaab u t sab u zzab d u lab yanab ylaab yssab d u rab d u sab a loac aiaab e amab e arac cesac choac comac corac craac cusab e baab a naab e ggab a ndab e l lac e neaa r auab a s eac e s 'ac e to

ab e r tab e stab a s hab a d laab a eteab o lieab o lusab o ndsab b ac yab d y'saa h oroab o radab o rdaab o rdeaa a wwwab o rt oab b a 'sab b a teab o u ndab b a yeab a bbaab ac h aaa chenab e l l eab e l 'sab r a deab b e 'sab r a m oab r a m sab b é sab r egeab b e s sab e r t oab r oadab r ochab a nozab r uptab b e y sab sentab a risab so rbab squeab stinab surdab goveab hand


Appendix 4, Sample Result List of X2

Prefixes Gathered.abaaac

Samples from Result Listaa ab ac aa haa raa aa'bab aab bab cab dab e

ab iab lab nab oab rab uaa b a'cac bac eac hab en

ab erab etaa 'sab imab itab baab leab lyab beac edac ehac ep

ac esac eyab elac heab etsab horab ateab ichab ideaa ronab ineaa lto

ab jurab imesab yaneab ydosab cdefab iolaab ishaab asesac aciaab jectac adiaab adesac arusab jureaa rgauac cedeac censac centac ceptab deraac cess

ab dhurac cipeac ciusab lainac consab lardac cordac costab laufac crueab lazeac cuseab atedab diasab atesab lestab di'sab ditaab doolac eras

ab atisab azaiac eticab oardab duciac eyteab ductac cordionab ilitiesab scessesab schwungab utilonsac costingab lehnungac couchedab utmentsab ernethyac countedab ominablyac etab ulum

ab erconwayab ominatedac cessibleab ominatesac etonemiaac etosellaab botstownac cessionsab solutelyab bassidesab solutionab solutismac cidentalac cidentedac haemenesac cidentlyab solutistab solutoryac cipenserab reac tion

ab erdare'sab delhadizab erdeen'sab delkaderac claimingac harniansab ricosoffac clerlateac climatedac cidentallac cidentaly

ab surditiesab dujaparovab delrahmanab breviatedac arnaniansab iogenesisac clamationab derrahmanac cusationsab sorbinglyac climatise

ac climatizeac clivitiesab delkhalekab yssiniansac coceberryab origines'ab sorptionsab andonmentac commodateac customaryab errations

ac comodatedac customingac comodatesac companiedac companiesac companistac celeratedac celeratesac celeratorac complicesab senteeism
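The pattern visible in this sample (each word split once, directly after one of the gathered prefixes) can be sketched as below. This is only an illustration inferred from the sample output, not the project's X2 code: the function name `split_after_prefix` and the longest-match rule are assumptions introduced here.

```python
def split_after_prefix(word, prefixes):
    """Insert one morpheme boundary (a space) after a known prefix.

    Hypothetical helper: tries the longest prefix first and leaves the
    word untouched when no prefix matches or when the word *is* the prefix.
    """
    for prefix in sorted(prefixes, key=len, reverse=True):
        if word.startswith(prefix) and len(word) > len(prefix):
            return prefix + " " + word[len(prefix):]
    return word


# Prefixes listed in Appendix 4; the words are taken from the sample above.
gathered = {"ab", "aa", "ac"}
for w in ["ab", "aberdare's", "accidental", "zebra"]:
    print(split_after_prefix(w, gathered))
```

A word that equals a prefix (such as "ab") is returned whole, which matches the unsplit two-letter entries at the start of the sample.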


Appendix 5, Sample Result of X3

ab d el karimab so rb ent sab o riginalab b ot t abadac com odateab yss inia nab o rigine sac com pli ceac com pli shab r i dgmentab so rptionab so rptiveab hor renceac cor d anceab a te ment sab e r gwyllyab d i c atingab stain ingab huechundac ad emi c alac cor d natsab stemiousab stentionab d i c ationab o rt ive lyab stin enceab n egatingab n egationac coucheesac coucheurab r ogatingab str ac t edac count antab str ac t lyac count ethac count ingab r ogationab ject nessab o u halimaab e r nathy sab d u ct ion sac anthi ansac creditedab jur ationab n or mal lyab d el la tifac cretion sac anthylisab r upt nessab a nd on ingab e r rationac arnania nab d el salamac cuminateac cumulateab th eilungab treibungac curate lyab d el wahab

ab d el wahebab schnittsac cus ationac cus atoryab o ard shipab o ve boardab e r yswithab d ominal sab scond ersac cede ntemab scond ingab khazi a n sac cus ing lyac celerateab khazi a 'sac cus tom e dab a nd on nedab u n dance sab b reviateab u n dant lyab d era maneab d alla h 'sac cent uateab e n dsternab o lish ingac e p halousac cept ab l e ac cept ab l y ac cept a nceac e rbitiesab d u r ahmanab o lition sab b e rationab sent ionsab o minableab r a m o witzab o minablyac e t ab u l umab e r conwayab o minate dac ces s ibleab o minate sac e to ne miaac e to sellaab b ot s townac ces s ion sab so lute lyab b a s side sab so lutionab so lutismac cident alac cident edac h aemenesac cident lyab so lutistab so lutoryac cipe nserab r e ac tionab e r dare 's

ab d el hadizab e r deen 'sab d el kaderac claim ingac h arniansab r i cosoffac clerlateac climatedac cident al lac cident al yab surd itiesab d u japarovab d el rahmanab b reviate dac arnania n sab i ogenesisab e r crombieac cumulate dac cumulate sac cumulatorab r i dge mentab e r nethy 'sab o ve groundab o eocritusab o riginal sac clamationab d errahmanac cus ation sab so rb ing lyac climatiseac climatizeac clivitiesab d el khalekab yss inia n sac coceberryab o rigine s 'ab so rption sab a nd on mentac com modateac cus tom aryab e r ration sac com odate dac cus tom ingac com odate sac com paniedac com paniesac com panistac celerate dac celerate sac celeratorac com pli ce sab sent e e ismab d u l rahmanab b e ration sab o rt ion istab i gail shipab e l white 'sab hominableac ad emi c al s

ac cent uate dac cent uate sac ad emi c ianac cor d in g lyac ad emi c ienac e s todorusab stention sab r a m o vitchab e r ystwithac cept a nce sab o minabiliac cept a tionab n or mal ityab stin ent iaab e r ystwythab a nd on ed lyab e r gavennyab str ac t ingac count ab l e ac count ancyab str ac t ionac count ant sab u sive nessab str ac t iveab d era hmaneac anthaceaeab so lution sac ces s oriesac ces s oriusab str ac t o rsac ces s placeab o minatingac creditingab o minationab o u nd ing lyab a ntidas 'sab so lute nessab str ac t nessac ad emi c ian sab i t urientenac clamation sab e n cerragesab d ou japarovac cept a tion sac couchementac climatise dab a nd on ment sac climatize dab struse nessab r i dge ment sab o mination sab o u t ivenessac count ant 'sab b ot s bury'sab o rt ion ist sac cede ntibusac com mod ab l e ab b reviatingac com modate d


Appendix 6a, Sample of Prefixes X4

de ax dg ak dh ay ab di ad az al ba dn do d' dp dq dr du dw
dy ea eb ec ed ee e' ef eg eh ei ej ek el be em en eo ep eq
er es et eu ev ew ex ey ez b' fa am bi fd fe kh ki kl as kn
ko kr kt ku kv kw ky la l' ai cr le aa li at cs ct lk ll cu
ln lo lp lu au lv lx ly cx lz cz ma da m' mar ava m'a ma' mas dai
mat mau maw aiw max ave may maz dal dam mea mec dan med mee meg meh mei mel mem
dar m'e men mep mer me' mes ade met meu das mew mez avi mfu dat mho dau mia mic
dav mid mie daw mih mik mil mim min day mir mis mit daz mix miz avo avr dda mll
aja dea mme mnd mnl moa awa mob moc mod deb moe moh moi moj mok mol dec mon ded
moo mop dee mor mos mot mou mov mow moz awe def mpl mp' mra mrc deg mro

Appendix 6b, Sample of Results from running X4 on Evaluation Set D1:

a, ab out, af ter, al l, an, an d, an y, ar e, as,
at, be, be en, be fo re, bu t, by, ca n, co uld, di d,
do, fi rst, fo r, fr om, go od, gr ea t, ha d, ha s, ha ve,
he, he r, hi m, hi s, i, if, in, in to, is,
it, it s, kn ow, li k e, li t tle, ma d e, ma n, ma y, me,
mor e, mos t, mr, muc h, mus t, my, no, not, now,
of, on, one, only, or, othe r, our, out, over,
sai d, see, she, sho uld, so, some, such, tha n, tha t,
the, the i r, the m, the n, the r e, the s e, the y, thi s, tim e,
to, two, up, upon, us, ver y, was, we, wel l,
wer e, wha t, whe n, whi ch, who, wil l, wit h, would, you,
your


Appendix 6c, Sample of Results from running X4 on Evaluation Set D2:

ac ce le r at eac cu rst ac tio n's ad j ad ult'sae ropla n es 'af ter ca r e' sag ree a b ly ai rle t ter sal ex ic s 'al lo wan ce 'sal umam ha r ic 'san al ec t s'san gl ic a ni sm' san nual 's an thro popha g ousap at he t ic a l ly ap pel la t io nsap ril s'ar ch er yar mle t sar tier as phy xia t io nas sortin gat onal au di tio n'sau tobi ogr ap hi ca l ly avu ncu la r ba c tria ba l ki n gba n do le e rba r bar i ty'sba r rac uda ba s tio nsba z aa r's be fi t be li ev e be nef ic e nce' s be stia r ies ' bi bl io phi le s ' bl ac kl eg ge d bl ea t s' bl udg eo nin g bo li v ia pit ch er s pla i nch an t's pla t it udi nous ple a t plu mbe r poet as ter poli c e woma n poly pi portcu ll i s es ' postca r d potio ns prac tis ed prec e pts'


pref ab ric a t io n

Appendix 7a, Sample of Prefixes X5

Prefix

a' aa ab ac ad ae af ag ah ai aj ak
al am an ao ap aq ar as at au av aw
ax ay az b' ba be bh bi bj bl bo br
bu bw by bÃ c' ca cd ce ch ci cl cn
co cr cs ct cu cy cz d' kr ku kw ky
l' la le li ll lo lp lu lv lx ly m'
ma mb mc md me ml mm mn mo mp ms mu

Appendix 7b, Sample of Results from running X5 on Evaluation Set D1

ab out, af ter, al l, an, an d, an y, ar e, as, at, be, be en, be fo re, bu t,
by, ca n, co uld, di d, do, fi rst, fo r, fr om, go od, gr ea t, ha d, ha s, ha v e,
he, he r, hi m, hi s, i, if, in, in to, is, it, it s, kn ow, li k e,
li t tle, ma d e, ma n, ma y, me, mo re, mo st, mr, mu ch, mu st, my, no, not,
now, of, on, one, only, or, othe r, our, out, over, sai d, see, she,
sho uld, so, some, such, tha n, tha t, the, the i r, the m, the n, the r e, the s e, the y,
thi s, tim e, to, two, up, upon, us, ver y, was, we, wel l, wer e, wha t,
whe n, whi ch, who, wil l, wit h, would, you, your

Appendix 7c, Sample of Results from running X5 on Evaluation Set D2:

ac ce le r at e ac cu rst ac tio n's ad j ad ult's ae ropla n es ' af ter ca r e' s ag ree a b ly ai rle t ter s al ex ic s ' al lo wan ce 's al um am ha r ic ' s an al ec t s's ba l ki n g ba n do le e r ba r barit y's ba r rac uda ba s tio ns ba z aa r's be fi t be li ev e be nef ic e nce' s

br id e ca k e' s br oad ca s tin g br owbe at in g bu ff bu ll he a d ed nes s's bu rdo cks bu shba b ies ' bu zzer s' ca c kl es co nci er ge s co ndu it 's co nfu ci us's co nnec t io n's co nsoli d a t io ns' co nsume r co ntours co nven e co pyho lde r de at hle s sly de ce iv er s' de co r de fe c t s

de gr ee s' ex ac tit ude ex cr et io n ex il es ex ple t iv e' s ex ten uat io n ey ed r op fa c toriz ed fa l seh ood' s fa r rier 's nonvio le n t per am bu la t ors' per ip hras is per sec u tor's pes os' pha r is ai sm' s pho nec a l ls' phy sio lo gi sts' sau ter nes souven ir strep tomy ci n stupef ac tio n subl ieu ten an ts'

subten an ts' tel ep ho togr ap h ten de rfo ot ter mi t e' s tha i 's thi stle d o wn's thriv es thw ar ts tig r es s' tin g's unct uousnes s' unit y unroll s uphi ll ushe r et tes vag uen es s's var ia n ce 's wea r ied wei gh in g wes tch ester whe r ry wid e n win ter ier


wit tic i sms woodw ork

wret ch ed ly yam 's


Appendix 8a, Sample of Prefixes X6

uta ute uth uti uto uts utt utu uty utz uum uus uut uv uva uve uw ux uxe uy uya uye uys uyt uyu uz
uza uze uzh uzi uzu uzy uzz v'n v's va vac vae vah vai val vam van var vas vat vax vay vd ve vec ved
veg vel ven ver ves vet vex vey vez vi via vic vid vie vii vik vil vin vio vir vis vit viv vix vka vna
vo vod vol von vor vos vot vow voy vr vra vre vro vry vs vs' vsk vt vu vue vum vus vy vyn w'd w's
wa wah wak wal wam wan war was wat wax way wba wby wch wd wds we wea web wed wee weg wei wel wen wer
wes wey wfu wh wi wie wig win wk wks wl wls wly wm wn wnd wne wns wny wog wol woo wr ws ws' wse
wsy wth wul wy wyd wyn wzy x's xa xas xe xed xei xel xen xer xes xey xi xia xic xie xiffus xii xin xip
xir xis xit xiv xix xo xon xor xt xta xts xus xv xvi xx xy xyl y'd y's ya yae

Appendix 8b, Sample of Results from running X6 on Evaluation Set D1

a, ab ou t, af te r, al l, an, an d, an y, ar e, as, at, be, be en, be fo re, bu t, by, ca n, co ul d, di d, do, fi rs t, fo r, fro m,
go od, gr ea t, ha d, ha s, ha v e, he, he r, hi m, hi s, i, if, in, in to, is, it, it s, kno w, li k e, li t tl e, ma d e, ma n, ma y,
me, mo re, mo st, mr, mu ch, mu st, my, no, no t, no w, of, on, on e, on ly, or, ot he r, ou r, ou t, ov er, sa i d, se e, sh e,
sh o ul d, so, so m e, su ch, th a n, th a t, th e, th e i r, th e m, th e n, th e r e, th e s e, th e y, th i s, ti m e, to, two, up, up o n, us, ve r y, wa s


Appendix 8c, Reversed Sample of Results of X6 on D2:

tu ob a et ar el ec c a ts r u cc a s' no it ca jda s' tl u da 's en al po r ea s' er ac re t fa yl ba e er ga sr e t te l ri a 's ci xe l a s' ec n aw ol l a mu la s' ci ra h ma s' s tc e la n a s' msi n ac il gn a s' la u nn a su og a h po p or ht na yl l a c it eh ta p a sn o it al le p pa 's li r pa yr e h cr a st el mr a re i tr a no it ai xy hp sa gn it ro s sa la n ot a

s' no it id u a yl l a c ih p ar go ib o tu a ra l ucn uv a ai rt ca b gn ik la b re e lo dn ab s' yt i r ab ra b ad uca r ra b sn o it sa b s' ra a za b ti f e b ev ei le b s' ec n eci fe n eb 's ei ra i ts e b 's el ih p oi l bi b de gg e l kc a l b 's ta e lb gn in oe g du lb ai vi l ob s' re p po b dr av el uo b gn iyd n ar b s' en ob 's re v el it na c og r ac s' s se n kc i sr a c 's ya w at sa c

de zi h c e ta c s' ev ac

's ru at ne c 's il ba h c s' le c n ah c ci ta m si r ah c s' ta e hc na m ss e h c la c ir em ih c ci no rh c er ut c n ic sr e v ae lc re i uqil c gn it to l c s' ro t uj da o c s' la e ni h c o c no ic r eo c se z in ol o c st if mo c se d o mm oc se t as ne p mo c s' ev is ne h er

Appendix 8d, Re-Reversed Sample of Results of X6 on D2:

a bo ut a c ce le ra te a cc u r st ac ti on 's adj ad u lt 's ae r op la ne s' af t er ca re 's ag re e ab ly a ir l et t e rs a l ex ic s' a l lo wa n ce 's al um am h ar ic 's a n al e ct s 's a ng li ca n ism 's a nn u al 's an th ro p op h a go us a p at he ti c a l ly ap p el la ti o ns

ap r il s' a rm le ts a rt i er as ph yx ia ti on as s or ti ng a to n al a u di ti on 's a ut o bi og ra p hi c a l ly a vu ncu l ar b ac tr ia b al ki ng ba nd ol e er b ar ba r i ty 's b ar r acu da b as ti o ns b az a ar 's b e f it b el ie ve be n ef ice n ce 's b e st i ar ie s' b ib l io p hi le s'

b l a ck l e gg ed bl e at s' bl ud g eo ni ng bo l iv ia b op p er 's b ou le va rd b ra n dyi ng br e a st bo ne 's br i de ca ke 's br o ad c as ti ng b r owb ea ti ng bu ff bu llh e ad ed n es s 's bu rd o c ks bu sh b ab ie s' b uz z er s' c a ck l es ca le n de ri ng ca m e ll i as ca n de l a br u ms c an ti le v er s'

ca r go c a rs i ck n es s 's c as ta w ay s' c at e c h iz ed ca ve 's c en ta ur s' c h ab li s'

c ha n c el 's c ha r is m at ic ch e at 's c h e ss m an c hi me ri c al c hr on ic ci n c tu re cl ea v e rs c liqu i er c l ot ti ng c o ad ju t or 's c o c h in e al 's
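The relationship between Appendix 8c and Appendix 8d can be sketched as a character-level reversal of each segmented word: reversing the string restores the normal spelling while keeping every morpheme boundary in place. The snippet is a minimal illustration under that assumption; the helper names `re_reverse` and `re_reverse_all` are hypothetical and not taken from the project's code.

```python
def re_reverse(segmented_word):
    """Reverse a segmented word character by character.

    Spaces mark morpheme boundaries, so reversing the whole string turns a
    segmentation of the reversed word back into a segmentation of the
    original word.
    """
    return segmented_word[::-1]


def re_reverse_all(segmented_words):
    """Apply re_reverse to every entry of a result list."""
    return [re_reverse(w) for w in segmented_words]


# The first three entries of Appendix 8c, written as separate words.
reversed_sample = ["tu ob a", "et ar el ec c a", "s' no it ca"]
for original in re_reverse_all(reversed_sample):
    print(original)
```

Applied to those three entries, the reversal yields "a bo ut", "a c ce le ra te" and "ac ti on 's", which correspond to entries in the Appendix 8d sample.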


Appendix 9a: Sample of Prefixes X7

'd 'l 'm 'n 'r 's 't aa ab ac ad ae af
ag ah ai ak akr aku al am an ao ap aq ar
as at au av aw ax ay az ba bb be bi bo
bs bt bu by ca ce ch ci ck co cq cr cs
ct cy da dd de di dl dn do ds dt du dy
ea eb ec ed ee ef eg eh ei ek el em en
ens ent eny eo ep eq er ere erg es esk et eu
eum ev ew ex ey ez fa fe ff fi fo

Appendix 9b, Sample of Results from running X7 on Evaluation Set D1

c r is p e st cr uc i a l ly c ub ic c upb ea r er s' c ur t a il m e nt 's cyc la ma t es d a ff o d il s' da nd ru ff 's d ea th l es s ly d e c ei v er s' d e c or d ef ec ts d e gr ee s' d el ta 's d e mo ra li ze de pi la ti on 's d er i de de so la ti ng de t hr on ed d ew l ap 's d ic ta p ho ne s' di l d os d ip p er 's di s a ss oc ia ti ng di s co un ts d is gr u nt l es di sme m b er m e nt 's di s pu t es d is t as t ef ul n es s 's di v er ge n ce s' d oc ke ts

d ogwo od d o or ca s es d oz ed d ra wi n gs d ru g g i st d ue s' du r es s' e ag l et s' e di ti o ns e ggc u ps e l de rf lo w er 's e li si on 's e m b e zz le m e nt em pe r o rs e nc o d es e n gi ne e ns la v es e ntr ie s' e ps om 's er uc t at ed es ti ma t or s' e v al ua ti ng e x ac t it u de exc re ti on e xi l es exp le ti ve 's ex t en ua ti on ey ed r op f ac to r iz ed f al s e ho od 's f a rr i er 's fa ul t i ly fe i gn ed

f es ti v al s' f ie ldm i ce f i l m st r i ps f ir t hs fl i er f lo r i st s' f lu te s' fo ld f oo ti ng 's fo r eg r ou nd fo rkl i ft f os si l iz ed f oxh un ti ng f r es h e rs f r i zz l es fr y er 's f un da m en t al s' f u s ty g al li ng g ap ga sl i gh ts ga zi ng g e nt 's g h a r ry gi r d e rs gl ib g lu t en g od pa r e nt 's go od s' g r ab s' g r a nu le 's gr ea t e st g ri m a c es

g ua rd ed gu ll ie s' gu t tu r a ls h ag h an ke ri n gs ha rm on ic 's h atf u ls h ea d a ch i er h e ar th 's h ee h aw s' h en ba n es he te ro ge n e ou s ly h ip p os h og sh e a ds ho mi ng h oo t e rs h or s ew h ip s' h ou se f at h er hu n g ry h yd ra ul ic s' i di om i l lu mi na ti on i mmu ni za ti on im p in ge m e nt s' i mp ro b ab le i n bo rn inc om pe te n cy i nd e c ip he r a bi l i ty 's i n d iv id ua l is ts i n exo r ab ly i nf l ow 's


i n it ia te i nq ui si ti v e ly in sp ec t or s h ip s' in ta ke in t erj ec ti on s' in t er ri ng i nt ro v er si on 's in vo i ce ir re le va n ce s' iv i es

ja ni t or 's je r ky jo b b er j ou le s' ju mb le k ap pa s' ki bb u tz n ik ki nk ed kn i g ht 's k ri s' l a dd i es

l a mp s' la p e ls l at h es la xa ti on s' l ea rn le ga li z es l epr ec ha un 's l i a bi l i ty 's li ed l i g ht s' l in e ar

liq u o rs l iv id l o ck up 's lo ng b ow l or lo v e lo rn l ug ub r io us n es s' l ux ur i an

Appendix 9c, Sample of Results from running X7 on Evaluation Set D2

a bo ut ac ce le ra te acc u r st ac ti on 's adj ad u lt 's ae r op la ne s' af t er ca re 's ag re e ab ly a ir l et t e rs a l ex ic s' a to n al a u di ti on 's a ut o bi og ra p hi c a l ly avu ncu l ar b a ctr ia b al ki ng ba nd ol e er b ar ba r i ty 's b ar r acu da b as ti o ns b az a ar 's

b e f it b el ie ve be n ef ice n ce 's b e st i ar ie s' b ib l io p hi le s' b l a ck l e gg ed bl e at s' bl ud g eo ni ng bo l iv ia b op p er 's b ou le va rd b ra nd yi ng br e a st bo ne 's br i de ca ke 's br o ad c as ti ng b r owb ea ti ng bu ff bu llh e ad ed n es s 's bu rd o c ks bu sh b ab ie s' b uz z er s' c a ck l es

ca le n de ri ng ca m e ll i as ca n de l abr u ms c an ti le v er s' ca r go c a rs i ck n es s 's c as ta w ay s' c at e c h iz ed ca ve 's c en ta ur s' c h ab li s' c ha n c el 's c ha r is m at ic ch e at 's c h e ss m an c hi me ri c al c hr on ic ci n c tu re cl ea v e rs c liqu i er c l ot ti ng c o ad ju t or 's c o c h in e al 's

c oe r ci on c o lo ni z es c om fi ts co mm o d es c om p en sa t es c o mp re h en si ve 's co nc i er g es co nd u it 's co nf u c iu s 's co nn ec ti on 's con so li da ti on s' c o n su m er c on to u rs c on ve ne c o py ho l d er co r ns c or se 's c o tt on t a il co un te rp o i nt co v en


Appendix 10, Results of algorithm XX1

aboutaccelerate ac curstaction 'sad jad ult'saero planes'after care'sagreeably air lettersale xics'allow ance'salum am haric'sana lects'sang licanism'san nual'san thropophagousapathetic allyappellation sapril s'archery ar mletsar tierasp hyxiationas sortingat onalaudit ion'saut obiographicallyav uncularba ctriaba lkingband oleerbarbarity 'sba rracudabast ionsba zaar'sbe fitbelieve bene ficence'sbest iaries'bi bliophiles'blackleggedbleat s'bludgeon ingbolivia bo pper'sboulevard brand yingbreast bone'sbride cake'sbroadcast ingbrow beatingbuffbull headedness'sbur docksbus hbabies'buzzers'ca cklescalender ingcame lliasca ndelabrumsca ntilevers'cargo ca rsickness'scast aways'

ca techizedcave 'scentaur s'ch ablis'chancel 'sch arismaticche at'sche ssmanchime ricalchronic ci ncturecleave rscl iquierclo ttingcoadjutor 'scochin eal'scoercion colon izescom fitscom modescompensate scomprehensive 'scon ciergescon duit'sconfucius 'scon nection'scon solidations'con sumercontour scon venecopy holderco rnsco rse'scot tontailcounter pointcove ncrab grass'scravats cremation crisp estcrucial lycub iccup bearers'cur tailment'scyclamatesda ffodils'da ndruff'sdeath lesslydeceive rs'dec ordefects degrees 'delta 'sdemoralize de pilation'sder idedesolating dethroned dew lap'sdicta phones'di ldosdip per'sdis associatingdiscounts dis gruntlesdis memberment's

dispute sdistasteful ness'sdive rgences'doc ketsdog wooddoor casesdoze ddrawing sdrug gistdue s'du ress'eagle ts'edition seggcupselder flower'seli sion'sem bezzlementemperor sen codeseng ineenslave sentries 'epsom'ser uctatedest imators'eva luatingex actitudeexcretion exile sex pletive'sext enuationeye dropfacto rizedfalse hood'sfa rrier'sfa ultilyfeign edfe stivals'fi eldmicefilms tripsfirth sflier florist s'flute s'fol dfoot ing'sfore groundfor kliftfossil izedfox huntingfresh ersfr izzlesfry er'sfun damentals'fust ygalling ga pga slightsgazing ge nt'sgharrygird ersglib glut engod parent'sgoo ds'

gr abs'gr anule'sgreatest grim acesguard edgull ies'gut turalsha ghank eringsha rmonic'sha tfulshe adachierhearth 'she ehaws'he nbanesheterogeneous lyhip poshog sheadshoming hoot ershorse whips'house fatherhun gryhydra ulics'idiom illumination im munizationimp ingements'imp robablein bornin competencyind ecipherability'sind ividualistsinexorably in flow'sinitiate inquisitive lyinspector ships'intake in terjections'in terringin troversion'sin voiceirrelevances'iv iesjanitor 'sjerky job berjo ules'jumble


ka ppas'kibbutznikkin kedknight 'skr is'la ddieslamps 'la pelslath esla xations'le arnlegalize sle prechaun'sliability 'sli edlights 'line arliquor sli vidloc kup'slon gbowlor love lornlugubrious ness'luxuriantly madeira sma ids'ma lcontentedma ndarin'sma nometricma rgin'smarro ws


masse usema tinsma ypole'sme diatingmelodious ness'sme rcer'sme stizos'me zzosmi dshipmen'smine laying'smisadventure 'smi sfires'missus mo cha'smol ehills'mon keys'moon iermor phemicsmot ormouth washesmulberry mus catelmu tt'sna kednessnation alizationne arsightedness'ne gressesneuro sesni cerni le'sno cturne'snon violentnot aries'nuggets 'nymph oobliteration 'soccidentals'of feringok ayingonto logy'sop posingor dinalor thodontics'out classout rages'over growths'over statesoyster catcherspa ilfulspa lliation'spa njabipara blepa rdoningpa rsonagespa ssionflowers'path finder'spa wkiness'spe elers'pe ninsulape rambulators'pe riphrasispe rsecutor'speso s'

ph arisaism'sph onecalls'physiologists 'pi llars'pi npointedpi tcherspl ainchant'spl atitudinouspl eatplum berpo etasterpo licewomanpo lypipo rtcullises'post cardpo tionspractise dprecepts 'pre fabricationpre positionpre ssgangingpri ckedpri oress'pro crastinatepro gnosis'spro mptnesspropositi onsprotest erpr unerspull ets'pun netpursue spyre x'squ arantine'sque stionnaires'quixote ra conteurs'ra iling'sra neesrate 'sre ams're ccesre concilement'sre dbreastre fereedre futation'sre ifiesre lictre mountsre payments're proachedre servistre spondingre tardationre valuerevolve rs'riddle drill s'rive r'sro ndoro tarout ines'rueful

run tssa cking'ssa ges'sa lonssa ndals'sa rongssaute rnessc allopedsc ientologistsc owls'sc ripsc urf'sse ance'sse cretions'se esse neschal'sse pulchralse rvitudese xtonsh amelessnesssheath s'sh iver'ssh owrooms'si bilant'ssi ege'ssi lksi ttings'sk idpanssk ydiving'sslave rs'slim slothful ness'ssmall ssneer erso ciables'so lariums'so mnambulismso rtie'sso uvenirsp ectressp igotsp ivviestsp ools'sq uanderer'sst abilizedst alingst aresst ayedst iffenersst ockbreeders'st oolies'str aightensstr eptomycinst upefactionsub lieutenants'sub tenants'su ffragan'ssum mers'sun tanssup plicantsur prise'ssw ankssw ordfish

syndicaliststa borsta les'ta nnin'sta rsita xonomyte c'ste lephotographte nderfootte rmite'sth ai'sth ermoplastic'sth istledown'sthrive sth wartsti gress'ti ng'sto ntinestorments 'to ughnesstr acksuittr ammelledtr ansliteratingtr avelogue'str endsetters'tr ilogytr uncatetu lip'stweezertype dum pteensunctuous ness'un dersecretariesun eventfulun ityun rollsup hillus herettesva gueness'sva riance'sve nalve rmifugevi brancy'svi llainvi scountcyvo dkavu lgarianswa itwa nt'swa rrenwa tchdogwa terloggedwe ariedwe ighingwe stchesterwherrywidenwinterierwitticismswoodworkwretchedlyyam'sziggurats'


Appendix 11, Results of XX2

ab outaccelerateaccurstaction'sad jadult'saeroplanes'aftercare'sagree ablyair lettersalexics'allowance'sal umamharic'sanalects'sanglicanism'sannual'santhropophagousapathetic allyappellation saprils'archer yarmletsar tierasphyxiationas sortingatonalaudition'sautobiographical lyavuncularbactriabal kingbandoleerbarbarity'sbarracudabast ionsbazaar'sbe fitbelievebeneficence'sbestiaries'bibliophiles'blackleggedbleats'bludgeon ingbo liviabopper'sboulevardbrand yingbreastbone'sbridecake'sbroadcast ingbrow beatingbuffbullheadedness'sbur docksbushbabies'buzzers'cacklescalender ingcamelliascandelabrumscantilevers'car gocarsickness'scastaways'catechized

cave'scentaurs'chablis'chancel'scharismaticcheat'schessmanchimericalchroniccincturecleave rscliquierclottingcoadjutor'scochineal'scoercioncolonizescom fitscom modescompensate scomprehensive'sconcierge sconduit'sconfucius'sconnection'sconsolidations'consume rcontour sconvenecopy holdercornscorse'scotton tailcounter pointcove ncrabgrass'scravat scremationcrisp estcrucial lycubiccupbearers'curtailment'scyclamatesdaffodils'dandruff'sdeathlesslydeceivers'dec ordefect sdegrees'delta'sdemoralizedepilation'sder idedesolatingdethroneddewlap'sdictaphones'dildosdipper'sdis associatingdis countsdisgruntlesdismemberment'sdispute sdistastefulness's

divergences'docket sdog wooddoor casesdoze ddrawing sdrug gistdues'duress'eaglets'edition seggcupselderflower'selision'sembezzlementemperor sen codesengineenslave sentries'epsom'seructatedestimators'evaluatingexactitudeexcretionexile sexpletive'sextenuationeye dropfactorizedfalsehood'sfarrier'sfaultilyfeign edfestivals'fieldmicefilms tripsfirth sflierflorists'flutes'fol dfooting'sfore groundforkliftfossilizedfox huntingfresher sfrizzlesfryer'sfundamentals'fust ygall ingga pgas lightsgazinggent'sgharrygirdersglibglut engod parent'sgoods'grabs'granule'sgreat est

grim acesguard edgullies'guttural sha ghankering sharmonic'shatfulsheadachierhearth'sheehaws'henbanesheterogeneous lyhipposhogshead sho minghootershorsewhips'house fatherhungryhydraulics'idiomilluminationimmunizationimpingements'im probablein bornin competencyindecipherability'sindividualistsinexorablyinflow'sinitiateinquisitive lyinspector ships'in takeinterjections'inter ringintroversion'sin voiceirrelevances'iviesjanitor'sjerkyjob berjoules'jumblekappas'kibbutznikkinkedknight'skris'lad dieslamps'lap elslath eslaxations'lear nlegalize sleprechaun'sliability'sli edlights'line arliquor sliv idlockup's


long bowlo rlove lornlugubriousness'luxuriantlymadeira smaids'mal contentedmandarin'smanometricmargin'smarrow smasse usema tinsmaypole'smedia tingmelodiousness'smercer'smestizos'mezzosmidshipmen'sminelaying'smisadventure'smisfires'missusmocha'smolehills'monkeys'mooniermorphemicsmot ormouth washesmulberrymuscatelmutt'snaked nessnationalizationnearsightedness'negressesneuro sesnice rni le'snocturne'snon violentnotaries'nuggets'nymph oobliteration'soccidentals'offer ingokay ingontology'sop posingordinalorthodontics'out classoutrages'overgrowths'over statesoyster catcherspailfulspalliation'spanjabipar ablepardon ingparson ages

passionflowers'pathfinder'spawkiness'speelers'peninsulaperambulators'periphrasispersecutor'spesos'pharisaism'sphonecalls'physiologists'pillars'pin pointedpitcher splainchant'splatitudinouspl eatplum berpoetasterpolice womanpolypiportcullises'post cardpo tionspractise dprecepts'pre fabricationpre positionpressgangingprickedprioress'procrastinateprognosis'sprompt nesspro positionsprotest erprune rspullets'pun netpursue spyre x'squarantine'squestionnaires'quixoteraconteurs'railing'sraneesrat e'sreams'reccesreconcilement'sred breastreferee drefutation'sreifiesrelic tre mountsrepayments'reproach edreservistrespond ingretardationre valuerevolvers'riddle d

rills'rive r'sron doro taroutines'rue fulrun tssac king'ssages'salon ssandals'sarongssauternesscallopedscientologistscowls'sc ripscurf'sseance'ssecretions'se esseneschal'ssepulchralservitudesex tonshamelessnesssheaths'shiver'sshowrooms'sibilant'ssiege'ssilksittings'skid pansskydiving'sslavers'slimslothfulness'ssmall ssneer ersociables'solariums'somnambulismsortie'ssouvenirspectre sspigotspivviestspools'squanderer'sstabilizedstalin gstare sstay edstiffenersstockbreeders'stoolies'straighten sstreptomycinstupefactionsublieutenants'subtenants'suffragan'ssummers'suntanssupplicant

surprise'sswankssword fishsyndicaliststaborstales'tannin'star sitaxonomytec'stelephotographtender foottermite'sthai'sthermoplastic'sthistledown'sthrive sth wartstigress'ting'stontinestorments'tough nesstracksuittrammelledtransliteratingtravelogue'strendsetters'trilogytruncatetulip'stweezertype dumpteensunctuousness'under secretariesun eventfulunit yun rollsup hillusherettesvagueness'svariance'svena lvermifugevibrancy'svill ainviscountcyvodkavulgarianswa itwant'swarrenwatch dogwater loggedweariedweigh ingwest chesterwherrywi denwinterierwitticism swood workwretched lyyam'sziggurats'

Appendix 12, Results list for XX3

ab outaccelerateaccurstaction'sadjadult'saeroplanes'aftercare'sagree ablyair lettersalexics'allowance'sal umamharic'sanalects'sanglicanism'sannual'santhropophagousapathetic allyappellationsaprils'archeryarm letsar tierasphyxiationas sortingatonalaudition'sautobiographical lyavuncularbactriabalk ingbandoleerbarbarity'sbarracudabast ionsbazaar'sbe fitbelie vebeneficence'sbestiaries'bibliophiles'black leggedbleats'bludgeon ingb oliviabopper'sboulevardbran dyingbreastbone'sbridecake'sbroadcast ingbrow beatingbuffbullheadedness'sbur docksbushbabies'buzzers'cacklescalender ing

camelliascandelabrumscantilevers'c argocarsickness'scastaways'catechizedcave'scentaurs'chablis'chancel'scharismaticcheat'schess manchimericalchroniccincturecleave rscliquierclottingcoadjutor'scochineal'scoercioncolonizescom fitscom modescompensatescomprehensive'sconciergesconduit'sconfucius'sconnection'sconsolidations'consumercon toursconvenecopy holdercornscorse'scotton tailcounter pointc ovencrabgrass'scravatscremationcrisp estcrucial lycubiccupbearers'curtailment'scyclamatesdaffodils'dandruff'sdeathless lydeceivers'dec ordefectsdegrees'delta'sdemoralize

depilation'sde ridedesolatingdethroneddewlap'sdictaphones'dildosdipper'sdis associatingdis countsdisgruntlesdismemberment'sdisputesdistastefulness'sdivergences'docketsdogwooddoor casesdozeddrawingsdrug gistdues'duress'eaglets'edit ionseggcupselderflower'selision'sembezzlementemperorsen codesengineen slavesentries'epsom'seructatedestimators'evaluatingexactitudeexcretionex ilesexpletive'sextenuationeye dropfactorizedfalsehood'sfarrier'sfaultilyfeign edfestivals'field micefilm stripsfir thsflierflorists'flutes'f oldfooting'sfore groundfork lift

fossilizedfox huntingfreshersfrizzlesfryer'sfundamentals'fu stygall ingg apgas lightsgazinggent'sg harrygirdersg libglut engod parent'sgoods'grabs'granule'sgreat estgrim acesguard edgullies'gutturalsh aghankeringsharmonic'shatfulsheadachierh earth'sheehaws'henbanesheterogeneous lyhipposhogs headsho minghootershorsewhips'house fatherhungryhydraulics'idiomilluminationimmunizationimpingements'improbablein bornin competencyindecipherability'sindividualistsinexorablyinflow'sinitiateinquisitive lyinspector ships'in takeinterjections'inter ringintroversion's

in voiceirrelevances'iviesjanitor'sjerkyjob berjoules'j umblekappas'kibbutznikkinkedk night'skris'lad dieslamps'lap elslath eslaxations'l earnlegalizesleprechaun'sliability'sli edlights'line arliquorsliv idlockup'slong bowl orlove lornlugubriousness'luxuriant lymadeirasmaids'mal contentedmandarin'smanometricmargin'sm arrowsmasse usemat insmaypole'smedia tingmelodiousness'smercer'smestizos'mezzosmidshipmen'sminelaying'smisadventure'smisfires'miss usmocha'smolehills'monkeys'mooniermorphemicsmot ormouth washesmulberrymuscatel

mutt'snaked nessnationalizationnearsightedness'negress esneuro sesnicerni le'snocturne'snon violentnotaries'nuggets'nymphoobliteration'soccidentals'offer ingokay ingontology'sop posingordinalorthodontics'outclassoutrages'overgrowths'over statesoyster catcherspailfulspalliation'spanjabip arablepardon ingparson agespassionflowers'pathfinder'spawkiness'speelers'peninsulaperambulators'periphrasispersecutor'spesos'pharisaism'sphonecalls'physiologists'pillars'pin pointedpitchersplainchant'splatitudinouspl eatp lumberpoetasterpolicewomanpolypiportcullises'post cardpot ionspractisedprecepts'pre fabricationpre positionpressganging

prick edprioress'procrastinateprognosis'sprompt nesspro positionsprotest erprune rspullets'pun netpursuespyrex'squarantine'squestionnaires'quixoteraconteurs'railing'sraneesrate'sreams'reccesreconcilement'sredbreastrefereedrefutation'sreifiesrelictre mountsrepayments'reproach edreservistrespond ingretardationre valuerevolvers'riddledrills'river'sron doro taroutines'rue fulrun tssac king'ssages'salonssandals'sarongssauternesscallopedscientologistscowls'sc ripscurf'sseance'ssecretions'se esseneschal'ssepulchralservitudesex tonshameless ness

sheaths'shiver'sshowrooms'sibilant'ssiege'ssilksittings'skid pansskydiving'sslavers's limslothfulness'ssmallssneer ersociables'solariums'somnambulismsortie'ssouvenirspec tresspigotspivviestspools'squanderer'sstabilizedstalingstar esstay edstiffenersstockbreeders'stoolies'straightensstreptomycinstupefactionsublieutenants'subtenants'suffragan'ssummers'suntanssupplicantsurprise'sswankssword fishsyndicaliststaborstales'tannin'star sitaxonomytec'stelephotographtender foottermite'sthai'sthermoplastic'sthistledown'sthrivesth wartstigress'ting'stontinestorments'

tough nesstrack suittrammelledtransliteratingtravelogue'strendsetters'trilogytruncatetulip'stweezertyped

umpteensunctuousness'under secretariesun eventfulunityun rollsup hillusherettesvagueness'svariance'sven al

vermifugevibrancy'svilla inviscountcyvodkavulgarianswa itwant'swarrenwatch dogwater logged

weariedweigh ingwest chesterwherrywid enwinterierwitticismswoodworkwretched lyyam'sziggurats'

Appendix 13, Results list for XX4

aboutaccelerateac curstaction 'sad jad ult'saeroplanes 'after care'sagreeablyair lettersalex ics'allowance 'salumam haric'sanal ects'sang licanism'sannual 'san thropophagousapathetic allyappellation sapril s'archer yar mletsar tierasp hyxiationas sortingat onalaudit ion'sautobiographical lyav uncularba ctriabal kingband oleerbarbarity 'sbar racudabastion sbazaar 'sbe fitbelievebene ficence'sbest iaries'bi bliophiles'blackleggedbleat s'bludgeon ingboliviabo pper'sboulevardbrandy ingbreast bone'sbride cake'sbroadcastingbrow beatingbuffbull headedness'sbur docksbush babies'buzzers'cac klescalender ing

camelliascan delabrumscan tilevers'cargocar sickness'scast aways'cat echizedcave 'scentaur s'ch ablis'chancel 'schar ismaticcheat 'sche ssmanchimericalchronicci ncturecleave rscl iquierclo ttingcoadjutor 'scochin eal'scoercioncolon izescom fitscommode scompensate scomprehensive 'sconcierge sconduit 'sconfucius 'sconnect ion'scon solidations'consume rcontour sconvenecopy holderco rnsco rse'scotton tailcounter pointcove ncrab grass'scravat scremationcrisp estcrucial lycubiccup bearers'curtail ment'scyclamatesda ffodils'dan druff'sdeath lesslydeceiver s'dec ordefect sdegrees 'delta 'sdemoralize

de pilation'sder idedesolatingdethroneddew lap'sdicta phones'di ldosdipper 'sdis associatingdiscountsdis gruntlesdis memberment'sdispute sdistasteful ness'sdivergence s'docket sdog wooddoor casesdoze ddrawing sdruggistdues 'du ress'eagle ts'edition seggcupselder flower'seli sion'sem bezzlementemperor sen codesengineenslave sentries 'epsom'ser uctatedest imators'evaluatingexactitudeexcretionexile sex pletive'sextenuationeye dropfactor izedfalsehood 'sfarrier 'sfa ultilyfeign edfestival s'fi eldmicefilms tripsfirth sflierflorist s'flutes 'fol dfooting 'sforegroundfor klift

fossil izedfox huntingfresher sfr izzlesfry er'sfundamental s'fust ygallingga pgas lightsgazinggen t'sgharrygird ersglibglut engod parent'sgoods 'grab s'gran ule'sgreatestgrimace sguard edgullies 'guttural sha ghankering sha rmonic'shat fulshead achierhearth 'shee haws'hen banesheterogeneous lyhip poshogshead shominghoot ershorse whips'house fatherhungryhydraulic s'idiomilluminationim munizationimpinge ments'improbablein bornincompetencyinde cipherability'sindividual istsinexorablyin flow'sinitiateinquisitive lyinspectorship s'intakeinter jections'inter ringin troversion's

in voiceirrelevances'iv iesjanitor 'sjerkyjob berjo ules'jumbleka ppas'kibbutznikkin kedknight 'skr is'lad dieslamps 'lap elslathe slax ations'lear nlegalize sle prechaun'sliability 'sli edlights 'line arliquor sliv idloc kup'slong bowlo rlove lornlugubrious ness'luxuriantlymadeira smaids 'mal contentedman darin'sman ometricmargin 'smarrow smasse usemat insmay pole'smedia tingmelodious ness'smer cer'smes tizos'me zzosmid shipmen'smine laying'smisadventure 'smis fires'missusmo cha'smole hills'monkey s'moon iermor phemicsmot ormouth washesmulberrymus catel

mu tt'snaked nessnational izationnear sightedness'ne gressesneurosesnice rnil e'snocturne 'snon violentnot aries'nuggets 'nymph oobliteration 'soccidentals'offeringokay ingonto logy'sopposingor dinalor thodontics'out classoutrages 'over growths'over statesoyster catcherspail fulspalliation 'span jabiparablepardon ingparsonage spassion flowers'path finder'spaw kiness'speel ers'peninsulaper ambulators'per iphrasispersecutor 'spesos 'ph arisaism'sphone calls'physiologists 'pillar s'pin pointedpitcher splain chant'spla titudinousplea tplumb erpoet asterpolice womanpolypipor tcullises'postcardpotion spractise dprecepts 'pre fabricationpre positionpres sganging

prickedprior ess'proc rastinatepro gnosis'sprompt nessproposition sprotest erpruner spullet s'pun netpursue spyre x'squarantine 'squestionnaire s'quixotera conteurs'railing 'sran eesrat e'sreams 're ccesreconcile ment'sred breastreferee drefutation 'sre ifiesrelic tre mountsrepay ments'reproachedres ervistrespondingretardationre valuerevolver s'riddle drill s'rive r'sron doro taroutine s'ruefulrun tssacking 'ssages 'salon ssandal s'sar ongssaute rnessc allopedsci entologistsc owls'sc ripsc urf'sseance 'ssecretion s'see ssen eschal'ssepulchralservitudesextonshame lessness

sheaths 'shiver 'sshow rooms'si bilant'ssiege 'ssi lksitting s'skid panssky diving'sslave rs'slimslothful ness'ssmall ssneer ersociable s'solar iums'somnambulismso rtie'ssouvenirspectre ssp igotsp ivviestspool s'squander er'sstabilizedstalin gstare sstayedstiffen ersst ockbreeders'stool ies'straighten sstr eptomycinstupefactionsub lieutenants'sub tenants'su ffragan'ssummer s'sun tanssup plicantsurprise 'sswan kssword fishsyndicaliststab orstales 'tan nin'starsitax onomyte c'stel ephotographtender footter mite'sthai 'sther moplastic'sthistle down'sthrive sth wartsti gress'ting 'ston tinestorments '

tough nesstr acksuittram melledtr ansliteratingtravel ogue'strends etters'trilogytr uncatetulip 'stweezertype d

um pteensunctuous ness'under secretariesuneventfulunit yunroll suphillusher ettesvague ness'svariance 'svena l

ver mifugevi brancy'svilla invis countcyvodkavulgar ianswa itwan t'swarrenwatchdogwater logged

weariedweighingwest chesterwherrywide nwinter ierwitticism swood workwretchedlyya m'szig gurats'

Appendix 14a, Results for XX5

action 'sad jaeroplanes 'agree ablyair lettersallowance 'sal umannual 'sapathetic allyappellation sapril s'archer yar tieras sortingautobiographical lybal kingbarbarity 'sbast ionsbazaar 'sbe fitbleat s'bludgeon ingbo liviaboule vardbrand yingbroadcast ingbrow beatingbur dockscalender ingcamel liascar gocave 'scentaur s'chancel 'scheat 'scleave rscoadjutor 'scochin eal'scolon izescom fitscom modescompensate scomprehensive 'sconcierge sconduit 'sconfucius 'sconnect ion'sconsume rcontour scopy holdercotton tailcounter pointcove ncravat scrisp estcrucial lycub iccurtail ment's

deceiver s'dec ordefect sdegrees 'delta 'sder idedipper 'sdis associatingdis countsdispute sdistasteful ness'sdivergence s'docket sdog wooddoor casesdoze ddrawing sdrug gistdues 'eagle ts'edition semperor sen codesenslave sentries 'exile seye dropfactor izedfalsehood 'sfarrier 'sfeign edfestival s'films tripsfirth sflorist s'flutes 'fol dfooting 'sfore groundfossil izedfox huntingfresher sfundamental s'fust ygall ingga pgas lightsgird ersglut engod parent'sgoods 'grab s'great estgrim acesguard edgullies 'guttural sha g

hankering shearth 'sheterogeneous lyhogshead sho minghoot ershouse fatherhung ryhydraulic s'impinge ments'im probablein bornin competencyindividual istsinquisitive lyinspector ships'in takeinter ringin voicejanitor 'sjob berknight 'slad dieslamps 'lap elslath eslear nlegalize sliability 'sli edlights 'line arliquor sliv idlong bowlo rlove lornlugubrious ness'madeira smaids 'mal contentedmano metricmargin 'smarrow smasse usema tinsmedia tingmelodious ness'smisadventure 'smonkey s'moon iermot ormouth washesmul berrynaked nessnational izationneuro sesnice r

ni le'snocturne 'snon violentnuggets 'nymph oobliteration 'soffer ingokay ingop posingout classoutrages 'over statesoyster catcherspalliation 'spar ablepardon ingparson agespersecutor 'spesos 'physiologists 'pillar s'pin pointedpitcher spl eatplum berpolice womanpost cardpo tionspractise dprecepts 'pre fabricationpre positionprior ess'prompt nesspro positionsprotest erprune rspullet s'pun netpursue spyre x'squarantine 'squestionnaire s'railing 'srat e'sreams 'reconcile ment'sred breastreferee drefutation 'srelic tre mountsreproach edrespond ingretard ationre valuerevolver s'riddle d

rill s'rive r'sron doro taroutine s'rue fulrun tssac king'ssages 'salon ssandal s'saute rnessc ripseance 'ssecretion s'se essex tonsheaths 'shiver 'ssiege 'ssitting s'skid pansslave rs'slothful ness'ssmall ssneer ersociable s'spectre sspool s'squander er'sstalin gstare sstay edstiffen ersstool ies'straighten sstupe factionsummer s'surprise 'sswan kssword fishtales 'tar sitele photographtender footthai 'sthistle down'sthrive sth wartsting 'storments 'tough nesstulip 'stype dump teensunctuous ness'under secretariesun eventfulunit yun rollsup hillvariance 's

vena lvill ainvulgar ianswa itwatch dogwater loggedwear iedweigh ingwest chesterwi denwinter ierwitticism swood workwretched ly

Appendix 14b, Remainder List

aboutaccelerateaccurstadult'saftercare'salexics'amharic'sanalects'sanglicanism'santhropophagousarmletsasphyxiationatonalaudition'savuncularbactriabandoleerbarracudabelievebeneficence'sbestiaries'bibliophiles'blackleggedbopper'sbreastbone'sbridecake'sbuffbullheadedness'sbushbabies'buzzers'cacklescandelabrumscantilevers'carsickness'scastaways'catechizedchablis'charismaticchessmanchimericalchroniccincturecliquierclottingcoercionconsolidations'convenecornscorse'scrabgrass'scremationcupbearers'cyclamatesdaffodils'dandruff'sdeathlesslydemoralizedepilation'sdesolating

dethroneddewlap'sdictaphones'dildosdisgruntlesdismemberment'sduress'eggcupselderflower'selision'sembezzlementengineepsom'seructatedestimators'evaluatingexactitudeexcretionexpletive'sextenuationfaultilyfieldmiceflierforkliftfrizzlesfryer'sgazinggent'sgharryglibgranule'sharmonic'shatfulsheadachierheehaws'henbaneshipposhorsewhips'idiomilluminationimmunizationindecipherability'sinexorablyinflow'sinitiateinterjections'introversion'sirrelevances'iviesjerkyjoules'jumblekappas'kibbutznikkinkedkris'laxations'leprechaun'slockup's

luxuriantlymandarin'smaypole'smercer'smestizos'mezzosmidshipmen'sminelaying'smisfires'missusmocha'smolehills'morphemicsmuscatelmutt'snearsightedness'negressesnotaries'occidentals'ontology'sordinalorthodontics'overgrowths'pailfulspanjabipassionflowers'pathfinder'spawkiness'speelers'peninsulaperambulators'periphrasispharisaism'sphonecalls'plainchant'splatitudinouspoetasterpolypiportcullises'pressgangingprickedprocrastinateprognosis'squixoteraconteurs'raneesreccesreifiesrepayments'reservistsarongsscallopedscientologistscowls'scurf'sseneschal'ssepulchralservitudeshamelessness

showrooms'sibilant'ssilkskydiving'sslimsolariums'somnambulismsortie'ssouvenirspigotspivvieststabilizedstockbreeders'streptomycinsublieutenants'subtenants'suffragan'ssuntanssupplicantsyndicaliststaborstannin'staxonomytec'stermite'sthermoplastic'stigress'tontinestracksuittrammelledtransliteratingtravelogue'strendsetters'trilogytruncatetweezerusherettesvagueness'svermifugevibrancy'sviscountcyvodkawant'swarrenwherryyam'sziggurats'

Appendix 15a, Results for XX6

aboutadult 'sanglican ism'sarm letsa tonalbelie vebeneficence 'sblack leggedchess man

consolidation s'deathless lydismemberment 'sfault ilyfield micefork liftgent 'sg harryg lib

j umbleluxuriant lymercer 'smiss usnegress esprick edrepayment s'scowl s'shameless ness

s limsortie 'ssunt anstrack suitvagueness 'sviscount cywant 's

Appendix 15b, More Remainder List

accelerateaccurstaftercare'salexics'amharic'sanalects'santhropophagousasphyxiationaudition'savuncularbactriabandoleerbarracudabestiaries'bibliophiles'bopper'sbreastbone'sbridecake'sbuffbullheadedness'sbushbabies'buzzers'cacklescandelabrumscantilevers'carsickness'scastaways'catechizedchablis'charismaticchimericalchroniccincturecliquierclottingcoercionconvenecornscorse'scrabgrass'scremationcupbearers'cyclamates

daffodils'dandruff'sdemoralizedepilation'sdesolatingdethroneddewlap'sdictaphones'dildosdisgruntlesduress'eggcupselderflower'selision'sembezzlementengineepsom'seructatedestimators'evaluatingexactitudeexcretionexpletive'sextenuationflierfrizzlesfryer'sgazinggranule'sharmonic'shatfulsheadachierheehaws'henbaneshipposhorsewhips'idiomilluminationimmunizationindecipherability'sinexorablyinflow'sinitiate

interjections'introversion'sirrelevances'iviesjerkyjoules'kappas'kibbutznikkinkedkris'laxations'leprechaun'slockup'smandarin'smaypole'smestizos'mezzosmidshipmen'sminelaying'smisfires'mocha'smolehills'morphemicsmuscatelmutt'snearsightedness'notaries'occidentals'ontology'sordinalorthodontics'overgrowths'pailfulspanjabipassionflowers'pathfinder'spawkiness'speelers'peninsulaperambulators'periphrasispharisaism'sphonecalls'

plainchant'splatitudinouspoetasterpolypiportcullises'pressgangingprocrastinateprognosis'squixoteraconteurs'raneesreccesreifiesreservistsarongsscallopedscientologistscurf'sseneschal'ssepulchralservitudeshowrooms'sibilant'ssilkskydiving'ssolariums'somnambulismsouvenirspigotspivvieststabilizedstockbreeders'streptomycinsublieutenants'subtenants'suffragan'ssupplicantsyndicaliststaborstannin'staxonomytec'stermite's

thermoplastic'stigress'tontinestrammelledtransliterating

travelogue'strendsetters'trilogytruncatetweezer

usherettesvermifugevibrancy'svodkawarren

wherryyam'sziggurats'

Appendix 16, Results for XX7

accelerateaccurstaftercare'salexics'amharic'sanalects'santhropophagousasphyxiationaudition'savuncularbactriabandoleerbarracudabestiaries'bibliophiles'bopper'sbreastbone'sbridecake'sbuffbullheadedness'sbushbabies'buzzers'cacklescandelabrumscantilevers'carsickness'scastaways'catechizedchablis'charismaticchimericalchroniccincturecliquierclottingcoercionconvenecornscorse'scrabgrass'scremationcupbearers'cyclamatesdaffodils'dandruff'sdemoralizedepilation'sdesolatingdethroneddewlap'sdictaphones'dildosdisgruntles

duress'eggcupselderflower'selision'sembezzlementengineepsom'seructatedestimators'evaluatingexactitudeexcretionexpletive'sextenuationflierfrizzlesfryer'sgazinggranule'sharmonic'shatfulsheadachierheehaws'henbaneshipposhorsewhips'idiomilluminationimmunizationindecipherability'sinexorablyinflow'sinitiateinterjections'introversion'sirrelevances'iviesjerkyjoules'kappas'kibbutznikkinkedkris'laxations'leprechaun'slockup'smandarin'smaypole'smestizos'mezzosmidshipmen'sminelaying'smisfires'

mocha'smolehills'morphemicsmuscatelmutt'snearsightedness'notaries'occidentals'ontology'sordinalorthodontics'overgrowths'pailfulspanjabipassionflowers'pathfinder'spawkiness'speelers'peninsulaperambulators'periphrasispharisaism'sphonecalls'plainchant'splatitudinouspoetasterpolypiportcullises'pressgangingprocrastinateprognosis'squixoteraconteurs'raneesreccesreifiesreservistsarongsscallopedscientologistscurf'sseneschal'ssepulchralservitudeshowrooms'sibilant'ssilkskydiving'ssolariums'somnambulismsouvenirspigotspivviest

stabilizedstockbreeders'streptomycinsublieutenants'subtenants'suffragan'ssupplicantsyndicaliststaborstannin'staxonomytec'stermite'sthermoplastic'stigress'tontinestrammelledtransliteratingtravelogue'strendsetters'trilogytruncatetweezerusherettesvermifugevibrancy'svodkawarrenwherryyam'sziggurats'aboutadult 'sanglican ism'sarm letsa tonalbelie vebeneficence 'sblack leggedchess manconsolidation s'deathless lydismemberment 'sfault ilyfield micefork liftgent 'sg harryg libj umbleluxuriant lymercer 'smiss us

negress esprick edrepayment s'scowl s'shameless nesss limsortie 'ssunt anstrack suitvagueness 'sviscount cywant 'saction 'sad jaeroplanes 'agree ablyair lettersallowance 'sal umannual 'sapathetic allyappellation sapril s'archer yar tieras sortingautobiographical lybal kingbarbarity 'sbast ionsbazaar 'sbe fitbleat s'bludgeon ingbo liviaboule vardbrand yingbroadcast ingbrow beatingbur dockscalender ingcamel liascar gocave 'scentaur s'chancel 'scheat 'scleave rscoadjutor 'scochin eal'scolon izescom fitscom modescompensate scomprehensive 'sconcierge sconduit 'sconfucius 'sconnect ion'sconsume rcontour scopy holder

cotton tailcounter pointcove ncravat scrisp estcrucial lycub iccurtail ment'sdeceiver s'dec ordefect sdegrees 'delta 'sder idedipper 'sdis associatingdis countsdispute sdistasteful ness'sdivergence s'docket sdog wooddoor casesdoze ddrawing sdrug gistdues 'eagle ts'edition semperor sen codesenslave sentries 'exile seye dropfactor izedfalsehood 'sfarrier 'sfeign edfestival s'films tripsfirth sflorist s'flutes 'fol dfooting 'sfore groundfossil izedfox huntingfresher sfundamental s'fust ygall ingga pgas lightsgird ersglut engod parent'sgoods 'grab s'great estgrim aces

guard edgullies 'guttural sha ghankering shearth 'sheterogeneous lyhogshead sho minghoot ershouse fatherhung ryhydraulic s'impinge ments'im probablein bornin competencyindividual istsinquisitive lyinspector ships'in takeinter ringin voicejanitor 'sjob berknight 'slad dieslamps 'lap elslath eslear nlegalize sliability 'sli edlights 'line arliquor sliv idlong bowlo rlove lornlugubrious ness'madeira smaids 'mal contentedmano metricmargin 'smarrow smasse usema tinsmedia tingmelodious ness'smisadventure 'smonkey s'moon iermot ormouth washesmul berrynaked nessnational izationneuro sesnice r

ni le'snocturne 'snon violentnuggets 'nymph oobliteration 'soffer ingokay ingop posingout classoutrages 'over statesoyster catcherspalliation 'spar ablepardon ingparson agespersecutor 'spesos 'physiologists 'pillar s'pin pointedpitcher spl eatplum berpolice womanpost cardpo tionspractise dprecepts 'pre fabricationpre positionprior ess'prompt nesspro positionsprotest erprune rspullet s'pun netpursue spyre x'squarantine 'squestionnaire s'railing 'srat e'sreams 'reconcile ment'sred breastreferee drefutation 'srelic tre mountsreproach edrespond ingretard ationre valuerevolver s'riddle drill s'rive r'sron doro ta

routine s'rue fulrun tssac king'ssages 'salon ssandal s'saute rnessc ripseance 'ssecretion s'se essex tonsheaths 'shiver 'ssiege 'ssitting s'skid pansslave rs'

slothful ness'ssmall ssneer ersociable s'spectre sspool s'squander er'sstalin gstare sstay edstiffen ersstool ies'straighten sstupe factionsummer s'surprise 'sswan kssword fishtales '

tar sitele photographtender footthai 'sthistle down'sthrive sth wartsting 'storments 'tough nesstulip 'stype dump teensunctuous ness'under secretariesun eventfulunit yun rollsup hill

variance 'svena lvill ainvulgar ianswa itwatch dogwater loggedwear iedweigh ingwest chesterwi denwinter ierwitticism swood workwretched ly

Appendix 17, Prefix List used in Algorithms XX1-XX7

There are 363 prefixes in total; segmentation occurs after the last letter of the prefix.

a'aaabacadaeafagahaiajakalamanaoapaqarasatauavawaxayazb'babebhbibjblbobrbubwbybÃc'cacdcechci

clcncocrcsctcucyczd'dadbdddedhdidjdmdodpdrdsdudwdydze'eaebecedeeefegeheiejekelemeneoepeqeres

eteuevewexeyezfafefifjflfofrfsfufvfygageghgiglgngogrgugwgygzh'hahehihohshthuhwhyi'iaibicidig

iiikiliminioipiqirisitiviwizj'jajejijkjojukakckekhkikjklknkokrkukwkyl'lalelilllolplulvlxlym'

mambmcmdmemimlmmmnmompmsmumyn'nandnengninknonpnunyo'oaobocodoeofogohoiojokolomonoooporosotou

ovowoxoyozp'papcpepfphpiplpnpoprpsptpupypÃqaqsqurardrerhrirorurwryrÃs'sascseshsiskslsmsnsosp

sqsrstsusvswsyszt'tatbtctethtitotrtstutvtwtytzubudueuguhukulumunupurusutuzvavevivlvovrvuvywa

wewhwiwkwowrwuwvwyxaxexixlxsxtxuxvxxy'yayeyiylyoypyuyvyzzazezhzizlznzozuzvzw
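
The rule stated above (segment after the last letter of a matching prefix) can be sketched as follows. This is only an illustrative sketch, not the implementation used in Algorithms XX1-XX7, which apply further criteria before accepting a split. The function name `split_after_prefix` and the small sample set `SAMPLE_PREFIXES` are assumptions made for the example; the sample entries are taken from the list above.

```python
# Illustrative sketch only: the function name and sample set are hypothetical.
SAMPLE_PREFIXES = {"ab", "ac", "ad", "in", "re", "un"}  # a few entries from the list


def split_after_prefix(word, prefixes):
    """Split `word` after the last letter of the longest matching prefix.

    Returns [prefix, remainder] when a listed prefix starts the word and
    something is left over, otherwise [word] unchanged.
    """
    # Try the longest candidate lengths first so longer prefixes win.
    for length in sorted({len(p) for p in prefixes}, reverse=True):
        head = word[:length]
        if head in prefixes and len(word) > length:
            return [head, word[length:]]
    return [word]


if __name__ == "__main__":
    for w in ["about", "unrolls", "zebra"]:
        print(" ".join(split_after_prefix(w, SAMPLE_PREFIXES)))
```

With the sample set, `about` is split as `ab out`, which is the form in which it appears in the XX3 results list, while a word with no listed prefix is returned whole.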

Appendix 17b, Suffix List used in Algorithms XX1-XX7

There are 387 suffixes in total. Note that the suffixes are listed in reversed form, so segmentation occurs behind the first letter of the suffix.

'saaabacadaeafagahaiajakalamanaoaparasatauavawaxayazbabbbebiblbmbnbobrbubzcacd

cecicncocpcrcscud'dadcdddedhdidldndodrdudwdyeaebecedeeefegeheiekelemeneoeperes

eteuevewexeyezfafefffiflfnfofpfrfsfugagdgegggigkgngogrguhahchdhehghihkhnhohphr

hshthziaibicidieifigihiiijikiliminioipiqirisitiuiviwixiyizjejijjjnjokakckekikl

knkokrkskukwkyl'lalbldlelflhlilllmlolrlsltlulvlwlym'mamemgmhmimlmmmomrmsmumwmy

n'nandnengnhninlnmnnnonrntnunwnyoaobocodoeofogohoiojokolomonoooporosotouovoyoz

papepiplpmpoppprpsptpupzqaqcqeqnqoqur'rarcrdrergrhrirkrorrrsrtrurys'sasbscsdse

sfsgshsiskslsmsnsospsrssstsusvswsyt'tatbtctdtetfthtitltmtntotptrtstttutwtxtzua

ubudueufuhuiujukulumunuoupurusutuzvavevivovxwawewowtxaxexixlxnxoxrxuxxxyy'yayb

ycydyeyfygyhyiyjykylymynyoypyrysytyuyvywyxyzzazeziznzozrzsztzuzz¦Ã©ÃÃl
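
One possible reading of the reversed-suffix note above is sketched below: the word is reversed, a listed entry is matched at the start of the reversed word, and the cut is mapped back to the original spelling. This is an assumption about how the list is meant to be used, not the implementation of Algorithms XX1-XX7. The function name `split_before_suffix` and the sample set `SAMPLE_REVERSED_SUFFIXES` are hypothetical; the sample entries are taken from the list above.

```python
# Illustrative sketch only: the function name and sample set are hypothetical.
# Entries are stored reversed, e.g. "s'" stands for the ending "'s".
SAMPLE_REVERSED_SUFFIXES = {"s'", "de", "gn", "yl"}


def split_before_suffix(word, reversed_suffixes):
    """Split `word` in front of a suffix whose reversed form is listed.

    Returns [stem, suffix] when a match is found and a stem remains,
    otherwise [word] unchanged.
    """
    reversed_word = word[::-1]
    # Try the longest candidate lengths first so longer suffixes win.
    for length in sorted({len(s) for s in reversed_suffixes}, reverse=True):
        tail = reversed_word[:length]
        if tail in reversed_suffixes and len(word) > length:
            return [word[:-length], word[-length:]]
    return [word]


if __name__ == "__main__":
    for w in ["janitor's", "guarded", "vodka"]:
        print(" ".join(split_before_suffix(w, SAMPLE_REVERSED_SUFFIXES)))
```

Storing the suffixes reversed lets the same start-of-string lookup used for prefixes be reused on the reversed word.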