The candidate confirms that the work submitted is their own and the appropriate credit has been
given where reference has been made to the work of others.
I understand that failure to attribute material which is obtained from another source may be
considered as plagiarism.
(Signature of student)_________________________________
Unsupervised Machine Learning
software for Morphology
Challenge
Atif Akhtar
Computer Science
Session 2008 / 2009
Summary
This project is based on the Morphology Challenge, in particular the 2005 challenge. It aims to
develop unsupervised statistical machine learning software that will be able to segment an
input list of words into morphemes.
There are two phases to the implementation, and within each there are seven algorithms, each a
refinement of the previous one. At the end, the scores from the evaluations of the algorithms are
compared and a conclusion drawn.
Contents Page
1. Introduction
1.1 Aim
1.2 Relevance to Degree
1.3 Morphology Challenge
1.4 Minimum Requirements
1.5 Methodology
1.6 Schedule
1.7 Objectives
2. Background Research
2.1 Introduction
2.2 Unsupervised Morphemes Segmentation
2.3 SUMA-Simple Unsupervised Morphology Analysis Algorithm
2.4 A Simpler, Intuitive Approach to Morpheme Induction
2.5 Starting Point
3. Implementation/Evaluation
3.1 Phase I
3.1.1 X1
3.1.2 X2
3.1.3 X3
3.1.4 X4
3.1.5 X5
3.1.6 X6
3.1.7 X7
3.2 Summary of Evaluations
3.2.1 Issues Encountered
3.3 Phase II
3.3.1 Summary of Development
3.3.2 XX1
3.3.3 XX2
3.3.4 XX3
3.3.5 XX4
3.3.6 XX5
3.3.7 XX6
3.3.8 XX7
4. Summary of Evaluations
4.1 Extensions/Issues Encountered
4.2 Conclusion
Appendix
A Personal Reflection
B Plan of Schedule
B2 Actual Plan
C Shows the comparison of the 4 major algorithms.
D Diagram shows flow of algorithm for the main program
E Example Tables
F Diagram shows the running of the XX7 algorithm.
G Diagram shows the word aftercare’s
H Shows the evaluation results of all the algorithms
1 Evaluation List D1
2 Sample Result List of X2
3 Sample Results X1
4 Sample Result List of X2
5 Sample Result of X3
6a Sample of Prefixes X4
6b Sample of Results from running X4 on Evaluation Set D1:
6c Sample of Results from running X4 on Evaluation Set D2:
7a Sample of Prefixes X5
7b Sample of Results from running X5 on Evaluation Set D1
7c Sample of Results from running X5 on Evaluation Set D2:
8a Sample of Prefixes X6
8b Sample of Results from running X6 on Evaluation Set D1
8c Reversed Sample of Results of X6 on D2:
8d Re-Reversed Sample of Results of X6 on D2:
9a Sample of Prefixes X7
9b Sample of Results from running X7 on Evaluation Set D1
9c Sample of Results from running X7 on Evaluation Set D2
10 Results of algorithm XX1
11 Results of XX2
12 Results list for XX3
13 Results list for XX4
14a Results for XX5
14b Remainder List
15a Results for XX6
15b More Remainder List
16 Results for XX7
17a Prefix List used in Algorithms XX1-XX7
17b Suffix List used in Algorithms XX1-XX7
1.1 Aim
The aim is to develop unsupervised statistical machine learning software that will be able to
segment words of a language into the smallest possible meaningful units, i.e. morphemes.
Morphemes are commonly known as basic vocabulary units that can be used for different tasks
such as text understanding, machine translation, statistical language modeling and information
retrieval.
1.2 Relevance to Degree
This problem relates to the Artificial Intelligence strand of my degree. The modules involved
include AI22: Fundamentals of Artificial Intelligence, AI32: Natural Language Processing and
SE20: Object Oriented Software Engineering.
This problem is taken from the International Morphology Challenge competition that has been
running for the last 4 years.
1.3 Morphology Challenge
The Morphology Challenge is essentially what has been described above, but it has variations and
follows a methodology. The general outline is that participants are given a list of words, along
with their frequencies, collected from corpora of the relevant language. They then use this
information to construct an algorithm that can, at the simplest challenge level, segment the words
as well as possible into morphemes (the 2005 challenge). The accuracy of the software is
measured by an evaluation script provided for each challenge, which compares the output of the
software with a gold standard provided by the organizers of the Morphology Challenge.
So for example if the word list for the 2005 Morphology Challenge contained:
Reading
Read
Reads
Red
Rope
Then the output expected from the program would be:
Read ing
Read
Read s
Red
Rope
As the years went on, the challenge increased in complexity: further languages were added and
further analysis was required on the word list, i.e. just segmenting into morphemes would no
longer be the goal. In the case of Morphology Challenge 2007, the words had to be classified
into a category. For example, if the words in the list were:
Boots
Boot
Foot
Feet
The output expected would be:
Boot +plural
Boot
Foot +plural
Feet
1.4 Minimum Requirements
• To derive a suggested affix list from the English dataset of words.
• To produce a piece of software that will be able to segment the words in the English
dataset based on the affix list, i.e. to follow the Morphology Challenge 2005 task.
• To evaluate the output from the software using the evaluation script provided by the
Morphology Challenge organizers. This is vital to judge the accuracy of results.
• To comply with the rules of the Morphology Challenge, i.e. the software must be
unsupervised and language independent, so code relating to English grammar cannot
be used. It must be a generic solution.
1.5 Methodology
The first thing to do will be to look at the proceedings of the 2005 Morphology Challenge
contest. From this, an insight can be gained into how the participants went about constructing a
solution, and it will help to generate ideas on how to make a start on the design. A rapid
prototyping approach will be taken, as opposed to an incremental stage approach (the Waterfall
method). After implementing the basic algorithm, it will be refined numerous times in order to
get better results. There will be 3 phases for each iteration:
1. Design - generating goals for what the algorithm is expected to do.
2. Implementation - actually implementing the algorithm in Java.
3. Evaluation - similar to a reflection: what was gained from implementing the algorithm, and
was it a success? This will also indicate how to progress to the next prototype.
Thus the evaluation and implementation will run concurrently. This is felt to be a good
approach as the project is quite practical: reading up great amounts of theory will not prove
very productive, whereas a practical approach will give quicker results and reflections on the
algorithm from which improvements can be made. It will also highlight what can actually be
implemented, rather than doing heavy design work and reaching a stage where the algorithm
cannot be implemented.
1.6 Schedule
Appendix B displays the schedule planned at the start of the project and Appendix C displays the
actual plan. The actual plan looks quite different from the original, but the implementation was
split into two parts overall, so Figure 1 in Appendix C serves as the first iteration described in the
plan in Appendix B and Figure 2 in Appendix C serves as the second iteration. It is to be noted
that in total there were 14 iterations: 7 in the first part and a further 7 in the second.
List of Milestones
Task                      Start        Finish
Mid-Term Report           15/11/2008   19/12/2008
Completion of Phase 1     14/01/2009   10/03/2009
Project Presentation      18/03/2009   18/03/2009
Completion of Phase 2     10/03/2009   12/04/2009
Final Write-Up            14/04/2009   29/04/2009
1.7 Objectives
As the project is based on the Morphology Challenge, the Morphology Challenge website
provides plentiful resources, arguably sufficient for what needs to be achieved.
The first step of research will be to investigate the website and to read through the proceedings of
the participants over the number of years. This will help to stimulate ideas of how to go about
structuring an initial algorithm.
Once the background research phase is completed, the objective is to come up with a basic,
perhaps vague, algorithm that would be able to explore the semantics of segmenting a dataset of
words, and then to implement this software in the programming language Java.
Java was chosen firstly due to familiarity with it, and secondly because, being object
oriented, it provides greater flexibility and reusability. It also has a very simple and clean
structure, which makes it easier to write the program.
2. Background Literature Review
2.1 Introduction
The term ‘morphology’ was first coined by the German poet, novelist, playwright and
philosopher Johann Wolfgang von Goethe early in the nineteenth century, in a biological context.
Its etymology is Greek: morph- means ‘shape, form’, and morphology is the study of form or forms.
In linguistics “morphology refers to the mental system involved in word formation or to the
branch of linguistics that deal with words, their internal structure, and how they are formed”.
(Aronoff and Fudeman, 2005, pp. 1-2)
Though words are generally considered as being the smallest units of syntax, it is clear that in
most languages, words can be related to other words through rules. For example, English
speakers understand that the words dog, dogs, and dog catcher are closely related. English
speakers recognize these relations from their tacit knowledge of the rules of word formation in
English. The rules understood by the speaker reflect specific patterns in the way words are
created from smaller units and how those smaller units work together in speech. In this way,
morphology is the branch of linguistics that studies patterns of word formation within and across
languages, and attempts to devise rules that model the knowledge of the speakers of those
languages.
Every language has its own unique structures. Beginning with the sound system to meaning
(semantics), they form the foundation of a language. Acquiring a language implies acquiring all
those structures.
“Morphology is an area that studies structures, forms and categorizations of words.”
(Jalaluddin, 2008, p. 109)
Affixes.
Malay has prefixes, suffixes, circumfixes and infixes, while in English prefixes and suffixes are
more prominent. One difference between Malay and English affixes is that English affixes can
produce negative meanings, for example im-, dis-, mal- and ir-. These affixes transform positive
meanings into negative ones: we have possible and impossible, or obedient and disobedient.
This phenomenon does not exist in Malay.
Prepositions exist in both Malay and English. However, their usage may sometimes be influenced
by culture.
Plural inflections- ‘s’ and ‘es’.
Inflections are affixes added to a root word to indicate a grammatical meaning. In English, -s is
added to book (books) to indicate plurality, and -ed, as in walked or talked, to indicate past tense.
Inflection, however, does not exist in Austronesian languages, including Malay. In English there
are three markers to indicate plurality: -s, -es and -ies. Plural inflection becomes more
complicated when it is influenced by phonological rules. Words ending with the consonant sound
/tʃ/ take the plural inflection -es, for example ostrich, ostriches. However, this does not occur
with words that end in /t/, as in accident, accidents.
In the Malay language, by comparison, plurality is indicated by cardinal or ordinal words. Some
examples of Malay cardinal words are semua (all), sebahagian (some) and tiap (every), while
ordinal words are kedua (second), ketiga (third), keempat (fourth) and many others (Asmah 1986).
Plurality can also be indicated by adding the prefix ber- to words of measurement, which then
undergo a reduplication process, for example berjam-jam (hours), berhari-hari (days after days),
berbulan-bulan (month after month) and many others. It is clear that the Malay language and
English have different forms to indicate plurality.
Adverbs are easily identified in English with the –ly marker as the clue.
Syntax
Syntax is one of the main areas of linguistics in which sentence structures and patterns are
analyzed. Although English and Malay share the basic structure that is ‘subject-verb-object’,
there are numerous other differences between the two languages such as the usage of copula ‘be’,
subject-verb agreement, articles, determiner and relative pronouns.
Copula ‘be’.
Within the English grammatical system, the form of copula ‘be’ is vital within a sentence,
as it links the subject of a sentence with a predicate. There are three forms of copula ‘be’ in the
present tense: ‘am’ for the first person singular, ‘is’ for third person singular subjects, and ‘are’
for ‘you’ and for plural subjects. As for the past tense, ‘was’ is used for singular subjects (I, he,
she, it) while ‘were’ is used for plural subjects (you, we, they), including ‘you’ as second person
singular.
Determiners.
In the grammatical arrangement of the English language, the indefinite articles a and an and the
definite article the are among the differing types of determiner applied to premodify a head noun
in a noun phrase.
Relative pronouns
In English, relative pronouns are used to connect one part of a sentence to another. Relative
pronouns refer to nouns that have been cited previously in the sentence. The following
are the 5 relative pronouns in the English language: that, which, who, whom, and whose.
‘Who’, ‘whose’ and ‘whom’ are used when referring to people; ‘which’ is used to refer
to things, places or ideas; and ‘that’ can be used to refer to people or things.
(Jalaluddin, 2008, pp. 109-115)
The main focus of the background reading was the proceedings from the Morphology Challenge,
in particular the 2005 one.
A few previous approaches have been described in (Goldsmith, 2001), but there are generally 4
ways of attempting a solution:
1. Identification of morpheme boundaries using transitional probabilities
2. Identification of morpheme internal bigrams or trigrams
3. Discovery of relationships between pairs of words.
4. Information theoretic approach to minimize the number of letters in the morphemes of the
language.
2.2 Unsupervised Morphemes Segmentation
The first paper examined was (Rehman and Hussain, 2005). The approach laid out in this
report had two stages:
• Learning Model
• Segmentation
The learning model phase is where a model is built from a corpus, from which a list of
morphemes can be derived; in the second stage these morphemes are used to segment the words.
The paper describes two basic types of morphemes: roots and affixes. The root is the main part
of the word, and the affixes are prefixes and suffixes.
An important point to observe in the paper is that limits on the lengths of words have been set.
For a part of a word to qualify as a prefix or suffix, it must be at most 3 characters long. On the
other hand, the length of a root morpheme is limited to 13 characters.
This is something to keep in mind: the purpose of setting limits is to prevent complications and to
exclude anomalies, so that a better understanding can be had of exactly what is being analysed. It
focuses the sample set between known thresholds. Though this could aid understanding and be a
good help in debugging initially, the words beyond the threshold would not be analysed, which
would affect the reliability of the solution.
The learning model was implemented in Microsoft Access, and it took the most frequent words
from a given corpus as possible morphemes. The learning model, however, will not extract words
that have a frequency of less than 7 in the corpus; this can be seen as another control mechanism.
The reason such mechanisms could be so important is that their main purpose would be to refine
the solution and give better results: mistakes may have been made in the spellings of words, and
the frequency of such words would presumably be quite low, so filtering out some of the
infrequent words could remove some noise. However, in the proceedings the stated purpose of
this measure was to speed up processing.
This also leads to another issue: running time. How long would the software take to extract
morphemes from a given dataset?
Another factor to consider is what effect adding limits on the morphemes would have on the
F-Measure at the end.
An example is given in the paper as to how the algorithm operates, given a set of words:
1. Ab
2. Abacus
3. About
4. Abreast
5. Again
6. Bargain
The algorithm starts off with the first letter of the first word and checks the occurrence of that
letter within the list. The occurrence of ‘A’ at the beginning of a word in the list is 5. The
program then searches for ‘A’ as a separate word in the list on its own. If it is found, ‘A’ is
accepted as a possible segmentation point; otherwise it is ignored.
The algorithm then moves on to the second letter and looks for occurrences of ‘ab’ in the rest of
the list. It finds it 4 times, but it also finds ‘ab’ as a word on its own, so it adds ‘ab’ to an
initially empty list of valid segmentations that it can use in the segmentation phase. In this way
the algorithm makes its way through all the entries of the word list, either adding them to the
list of valid segmentations or ignoring them.
Segmentation Phase
The segmentation phase is implemented in Visual Basic. The segmentation model aims to
separate the prefix, suffix and root morpheme of each word. Firstly, the algorithm checks each
character of a word from first to last. If a leading part of the word is found in the valid segments
list, and the remaining characters are also found in the list, then the separated segment is treated
as a possible prefix. The rest of the string is then passed on to the suffix-assessing part, which
carries out a similar role except that it starts from the end of the word and works its way up to a
maximum of 3 trailing characters, using the list to determine suffixes in the same way as it does
for prefixes.
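A simplified sketch of this segmentation phase (the original was written in Visual Basic; the names here are illustrative, and only the leading segment is checked against the list, whereas the paper also requires the remainder to appear in it):

```java
import java.util.*;

public class Segmenter {
    // Split a word into "prefix root suffix" using the valid-segment list;
    // suffixes are limited to 3 trailing characters, as in the paper.
    public static String segment(String word, Set<String> validSegments) {
        StringBuilder out = new StringBuilder();
        String rest = word;
        // Prefix: longest leading substring found in the valid-segment list.
        for (int len = rest.length() - 1; len >= 1; len--) {
            String head = rest.substring(0, len);
            if (validSegments.contains(head)) {
                out.append(head).append(' ');
                rest = rest.substring(len);
                break;
            }
        }
        // Suffix: up to 3 trailing characters found in the valid-segment list.
        for (int len = Math.min(3, rest.length() - 1); len >= 1; len--) {
            String tail = rest.substring(rest.length() - len);
            if (validSegments.contains(tail)) {
                return out + rest.substring(0, rest.length() - len) + " " + tail;
            }
        }
        return out + rest;
    }

    public static void main(String[] args) {
        Set<String> segs = new HashSet<>(Arrays.asList("read", "s", "ing"));
        System.out.println(segment("reading", segs));  // read ing
        System.out.println(segment("reads", segs));    // read s
    }
}
```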
Points Gathered From this report
• This report gave an initial understanding of how one might go about working on a solution.
• It identified two possible stages in a solution: affix gathering and segmentation.
• Limits set on character lengths simplify the analysis of words; however, this could affect the
overall reliability of the solution, so a compromise may have to be sought between filtering
out noise and reliability.
• Filtering also helped speed up processing.
• A dictionary sort was used on the valid segment list; its importance is not yet fully understood.
2.3 SUMA-Simple Unsupervised Morphology Analysis Algorithm
(Dang, M and Choudri, S. 2005)
Key Terms:
• Successor Variety - the number of different letter combinations succeeding a
substring of a word across the word list; used for prefix gathering.
• Predecessor Variety - the number of different letter combinations preceding a
substring of a word across the word list; used for suffix gathering.
• Peaks - the substrings at which the maximum successor or predecessor variety value occurs.
In this report a slightly different method was used to produce a solution from the previously
explored method. This method focused heavily on language pattern and structural recognition;
the main strategy is to record successor and predecessor varieties. However, like the previous
solution, there were two stages:
• Affix Gathering
• Segmentation
The varieties can be illustrated with an example. If the word list consists of the words:
1. Reading 2. Reads 3. Red 4. Rope 5. Ripe 6. Read
The algorithm, like the previous one, starts with the first letter of the first word, so it examines
the ‘R’ in ‘Reading’; it then counts the possible different letters following ‘R’ in the word list,
so in this case the algorithm would record a value of 3 (e, o, i).
The algorithm would then move on to the next letter and search for the different letter
combinations after ‘RE’, recording those values. In this way the algorithm goes through the
whole word and then moves on to the next word, recording successor varieties.
Once the successor varieties have been recorded, the algorithm searches for ‘peaks’, i.e. the
substrings with the highest successor variety counts.
The table below illustrates this with the word under examination to be ‘READABLE‘ in the list
above:
Substring   Successor Variety   Following letters
R           3                   E, O, I
RE          2                   A, D
REA         1                   D
READ        3                   A, I, S
READA       1                   B
READAB      1                   L
READABL     1                   E
READABLE    1                   (end of word)
Here it can be seen that two peaks occur: one at the beginning of the word and one at ‘READ’.
The report indicates that adding the two substrings above to the valid segment list, and using
that list to segment the words, did not yield great accuracy on evaluation.
The reason for this is simple to contemplate: many words start with ‘R’, and though it is true
that the successor variety there is very high, taking it into the valid segment list would imply
that every word beginning with ‘R’ should be segmented after its first letter, which gives an
incorrect segmentation for most words.
This solution was then refined: predecessor variety values were collected in the same manner as
the successor variety values, peaks were found in the same way, and the results improved
slightly. This was explained by the words in the list being more heavily suffixed than prefixed.
The end product was a hybrid of the two, and it was this that produced the best results of the
three methods used.
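The successor variety count at the heart of this method can be sketched as follows (an illustrative simplification; here the end of a word counts as one possible successor, and the class and method names are assumed):

```java
import java.util.*;

public class SuccessorVariety {
    // Successor variety of a substring: the number of distinct letters
    // (or end-of-word) that follow it across the whole word list.
    public static int successorVariety(String substring, List<String> words) {
        Set<String> successors = new HashSet<>();
        for (String w : words) {
            if (w.startsWith(substring)) {
                if (w.length() == substring.length()) {
                    successors.add("");   // the word ends here
                } else {
                    successors.add(w.substring(substring.length(), substring.length() + 1));
                }
            }
        }
        return successors.size();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "reading", "reads", "red", "rope", "ripe", "read");
        System.out.println(successorVariety("r", words));    // 3 (e, o, i)
        System.out.println(successorVariety("re", words));   // 2 (a, d)
    }
}
```

A peak search would then compare these counts across successive prefixes of a word, looking for positions where the count is locally highest.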
Points Realised from Report
• Again the solution is structured in two stages; this is beginning to seem like a fundamental
structure.
• It is not only words beginning with ‘R’ that would be segmented after the initial letter: every
word would be, since the highest successor count almost always occurs at a word's first
letter, where the greatest number of combinations is possible.
• A practical distinction is made between gathering prefixes and suffixes; in the previous
report such a distinction was not made, and ‘affixes’ were gathered more generally.
• To gather suffixes, a simple reversal of the prefix gathering technique is needed. This holds
for any technique used to gather suffixes, or vice versa.
• A hybrid approach gave the most accurate result in the evaluation.
• Fewer limits seem to have been set in this solution than in the previous report.
2.4 A Simpler, Intuitive Approach to Morpheme Induction (Pitler, E and Keshava,S 2005)
This approach is more suited to Indo-European languages due to their concatenative morphology.
There are two bases for it:
1. Finding words that appear as substrings of other words.
2. Detecting changes in transitional probabilities, originally proposed by (Harris, 1955), who
stated that peaks would be present when, given a word, one checks which other words in the
corpus share the same starting string. Based on this approach, (Hafer and Weiss, 1974) were
able to develop an algorithm that achieved 91% precision.
The four stages in this solution are:
1. Building Lexicographic Trees
2. Scoring Morphemes
3. Filtering
4. Segmenting
The first three stages can be thought of as the affix gathering stage.
Building Trees
The algorithm starts off by creating two trees:
• Forward Tree
• Backward Tree -Mirror of the forward tree.
If the alphabet contains b letters and the longest word in the corpus is d letters long, then each
tree is b-way branching and d levels deep.
Each node of the tree represents a letter, so any path from the root to a node represents a
substring of a word: in the case of the forward tree, the substring beginning the word, and in the
case of the backward tree, the substring ending the word. The nodes contain the frequency of the
corresponding substring.
The purpose of these trees is to calculate conditional probabilities that, given a substring, predict
the next part of it. For example, the forward tree can be used to gather conditional probabilities
of suffixes, i.e. Pr(s|book), which is calculated by dividing the frequency of words starting with
‘books’ by the frequency of words starting with ‘book’. Similarly, the backward tree can be used
to calculate Pr(o|oks), which is done by dividing the frequency of words ending with ‘ooks’ by
the frequency of the words ending with ‘oks’.
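These probabilities can be computed without materialising the trees, since only substring frequencies are needed. The sketch below (names illustrative, not from the paper) uses a map of prefix frequencies in place of the forward tree; the backward tree would be handled identically on reversed words:

```java
import java.util.*;

public class TransitionProbability {
    // The frequency of each prefix across the corpus plays the role of the
    // forward tree's node counts; a plain map stands in for the tree here.
    public static Map<String, Integer> prefixCounts(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words)
            for (int i = 1; i <= w.length(); i++)
                counts.merge(w.substring(0, i), 1, Integer::sum);
        return counts;
    }

    // Pr(next | given), e.g. Pr(s|book): frequency of words starting
    // "books" divided by frequency of words starting "book".
    public static double forwardPr(String next, String given,
                                   Map<String, Integer> counts) {
        int joint = counts.getOrDefault(given + next, 0);
        int base  = counts.getOrDefault(given, 0);
        return base == 0 ? 0.0 : (double) joint / base;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("book", "books", "booking", "booked");
        Map<String, Integer> counts = prefixCounts(words);
        System.out.println(forwardPr("s", "book", counts));  // 0.25
    }
}
```

In the example, four words begin with ‘book’ and one begins with ‘books’, so Pr(s|book) = 1/4.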
Scoring
After building the trees, two lists of morphemes are formed: a prefix list and a suffix list. The
suffix list is filled by scanning each word in the list in increasing substring order, starting from
the end of the word. For a substring to be accepted into the suffix list, it must fulfil 3 conditions.
If the substring ‘xy’ is speculated to be a suffix of the word ‘vwxy’, then:
1. ‘vw’ must also be a word in the corpus
2. Pr(w|v) = 1
3. Pr(x|vw) < 1
The first condition is quite logical: it makes use of the fact that suffixes are commonly added
onto the ends of words, so in order to claim a reliable suffix the remainder of the string must
exist as a word. I.e. if the word being examined is ‘books’ and the suffix under speculation is ‘s’,
it makes sense to require ‘book’ as a separate word in the list in order to give credibility to the
suffix. The second condition implies that the stem (‘vw’) has only one parent, thereby
identifying it as a true stem, and the third condition states that there can be multiple children,
i.e. different suffixes applied after the stem.
In the same way there are 3 conditions for a prefix under speculation to gain entry into the prefix
list. They are simply the reverse of the suffix conditions.
If an affix passes the 3 conditions then it is given a score of 19 points, and if it fails then 1 point
is deducted. Once every word has been considered, the strings with positive final scores are
granted entry into either the prefix or the suffix list. The reason behind these numbers is that a
string will only end with a positive score if it passed the tests at least 5 percent, i.e. 1/(1+19), of
the times it appeared.
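A minimal sketch of the suffix test and scoring (names are illustrative; the Pr values are derived from prefix frequencies as described above, and the single-parent condition is treated as failing for one-letter stems, where ‘v’ would be empty):

```java
import java.util.*;

public class SuffixScorer {
    // Score candidate suffixes: +19 when the three conditions hold for an
    // occurrence, -1 otherwise; positive totals enter the suffix list.
    public static Map<String, Integer> scoreSuffixes(List<String> words) {
        Map<String, Integer> prefixFreq = new HashMap<>();
        for (String w : words)
            for (int i = 1; i <= w.length(); i++)
                prefixFreq.merge(w.substring(0, i), 1, Integer::sum);
        Set<String> wordSet = new HashSet<>(words);

        Map<String, Integer> scores = new HashMap<>();
        for (String w : words) {
            for (int cut = 1; cut < w.length(); cut++) {
                String stem = w.substring(0, cut);    // "vw"
                String suffix = w.substring(cut);     // "xy"
                boolean stemIsWord = wordSet.contains(stem);       // condition 1
                boolean oneParent = cut >= 2 && prefixFreq.get(stem)
                        .equals(prefixFreq.get(stem.substring(0, cut - 1))); // Pr(w|v) = 1
                boolean branches = prefixFreq.get(stem) >
                        prefixFreq.getOrDefault(stem + suffix.charAt(0), 0); // Pr(x|vw) < 1
                scores.merge(suffix,
                        (stemIsWord && oneParent && branches) ? 19 : -1,
                        Integer::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("book", "books", "booked", "cat");
        System.out.println(scoreSuffixes(words).get("s"));  // 19
    }
}
```

Here ‘s’ passes all three conditions in ‘books’ (stem ‘book’ is a word with one parent and several children), so it ends with a positive score.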
Filtering
Sometimes an affix in a list is a combination of two affixes that are also present in the list; this
could give rise to incorrect segmentation later on, so the filtering process checks for such affixes
and discards them.
Segmentation
This phase segments the words using the affix lists. However, an issue of how to segment a word
may occur if more than one affix could apply: in the word ‘quietness’, for instance, the question
arises whether to segment it as ‘quietnes s’ or ‘quiet ness’. This is solved by recalling the
conditional probabilities from the tree-constructing phase and choosing the segmentation with
the highest probability.
Points Gained from report:
• Quite a thorough approach; inaccuracy is greatly reduced by the reliability measures that
were put in place, such as the 3 conditions.
• There is more than one way to segment a word, and one can be more accurate than another.
• Even after affixes have been entered into the appropriate lists there may still be some
discrepancies, as described in the Filtering stage.
• Conditional probabilities are perhaps something more advanced than successor varieties; a
better solution?
2.5 Starting Point
Having reviewed the background literature, I feel a rough prototype should be made as soon as
possible. Though the literature has provided a few solutions, I deem them at this stage too
complex to implement straight away. I feel I need to find my own feet first, though I will be
looking closely at the SUMA report, as in theory it seems simpler than the conditional
probability approach. The programming aspect will be quite important, because if I am not able
to implement the solution then reading all the literature will be useless. If I start off with a rough
program used to experiment with the code, I can build up from that and implement more
complex algorithms.
3. Implementation & Evaluation
3.1 Phase I
The rapid prototyping methodology was applied, producing a series of prototypes of increasing
sophistication. Below are presented the details of the implementation and evaluation of each
successive stage, rather than evaluating all of them in a separate chapter.
It is to be noted that the implementation is split into two parts, each part containing a set of
algorithms. The algorithms in the second set are analysed in greater depth, as by that point there
is more information to hand with which to make an analysis. As the first set of experiments was
the starting point, it is not possible to analyse it as much.
The first issue that needs to be looked at is the datasets used. In this implementation a variety of
datasets were used; most are described in the appropriate algorithm sections, but it is important
to give an outline of them, as it will help in understanding the algorithms.
This structure reflects the rapid prototyping approach that was taken: there was not one
implementation and one evaluation but a constant implementation-evaluation cycle, so it is felt
that this gives a better account of the procedure taken.
Also, a brief description of the evaluation is vital for clarity purposes.
Evaluation Description
Evaluation was done via the Evaluation.perl script obtained from the Morphology Challenge
2005 website. It compares the Results List output from the algorithm to the gold standard file,
also provided on the website. The gold standard file contains the standard of segmentation being
sought, against which accuracy results are given.
After the comparison, it gives 3 results relating to the suggested file:
Precision: the number of hits divided by the sum of the number of hits and insertions.
A hit is when a boundary placed in the suggested file matches a boundary placed in
a word in the gold standard file, i.e. it is correctly placed.
An insertion is when an incorrect boundary has been placed in a word, i.e. the boundary
placed in the suggested file is not present in the gold standard word.
Recall: the number of hits divided by the sum of the number of hits and deletions. A
deletion is when a boundary should be in a word but is not there.
F-Measure: Harmonic mean of precision and recall.
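These three measures can be expressed directly (a trivial sketch; the hit, insertion and deletion counts would come from the evaluation script):

```java
public class Metrics {
    // Precision = hits / (hits + insertions)
    public static double precision(int hits, int insertions) {
        return (double) hits / (hits + insertions);
    }
    // Recall = hits / (hits + deletions)
    public static double recall(int hits, int deletions) {
        return (double) hits / (hits + deletions);
    }
    // F-Measure = harmonic mean of precision and recall
    public static double fMeasure(double p, double r) {
        return 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        double p = precision(80, 20);        // 0.8
        double r = recall(80, 80);           // 0.5
        System.out.println(fMeasure(p, r));  // ≈ 0.615
    }
}
```

Note that because the F-Measure is a harmonic rather than arithmetic mean, it is pulled towards the lower of the two scores.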
There are two methods of evaluating the output, Precision and Recall; the F-Measure is the
harmonic mean of the two. To keep the evaluation simple and easy to follow, the Precision result
will be the one analysed, though the Morphology Challenge website states that the contestant
with the highest F-Measure will be the winner, and if a tie occurs then the contestant with the
highest precision wins. For this reason the F-Measure will also be included in the results but not
analysed in depth; it serves as a means of combining the two evaluation methods. A point to note
is that this evaluation script was not run on the initial algorithms, as it was thought there were
not sufficient grounds to give an accurate result; this may have been due to the lack of prefixes
calculated or the size of the dataset being too small.
Datasets
Definition of Data List: the list of words given on the Morphology Challenge 2005 webpage.
The original list is alphabetical and contains frequencies before each word. The size and entries
of this list are subject to modification, i.e. samples can be taken from this list and still referred to
as the Data List. This list is used to extract affixes and to check for the presence of words
(Implementation Part II).
Definition of Evaluation List in the context of this implementation: The set of words on which
the segmenting module segments.
The 4 major algorithms: X4, X5, X6, X7.
There were two major Evaluation Lists in this implementation, the first being D1 and the latter
D2. It should be noted that segmentation was not performed on these evaluation lists until
mid-way through development (algorithms X1-X3 in particular), as it was only then realised that
there was no common evaluation list against which to compare the output.
D1
This served as the first comparison of the results obtained from the 4 major algorithms developed.
It consisted of the 200 most frequently occurring words and was used at first to see the
differences between the applications of the major 4. (See Appendix 1, Evaluation List D1)
D2
Having seen that the evaluation on the D1 dataset was not highly productive (details given further
on), this dataset was introduced. It consisted of exactly the same words that the evaluation
script checks for and scores depending on the segmentation. From this it was thought a
much greater reliability of results could be had, though not necessarily greater accuracy. See
Appendix 2, Evaluation List D2.
3.1.1 X1
Aim of algorithm: test Segmentation and File I/O procedures.
Sample of Results: Appendix 3, Sample Results X1
Size of Data List: 1000
Size actually used: 1000
Evaluation List: Same as Data List
Pseudo code:
Read in the first 1000 words from the word list and store in array.
Sort array from shortest to longest length.
For k = 0 to size of array - 1:
    String A = array[k]
    For j = k+1 to size of array - 1:
        String B = array[j]
        If B contains A:
            Segment B at position A.length
        Else:
            Do nothing.
Output the segmented array.
The first algorithm implemented, known as X1, was a simple implementation whose basis was to
check all words against each other in the Data List, which at this initial stage was set to 1000
words, irrespective of numerical frequency or alphabetical order. The first 1000 words were taken
into an array and that array was sorted from the shortest to the longest word. This was done
because the algorithm assumes that the shorter word will be part of the longer word, e.g. if the
word list was:
Read
Reads
Reading
Then it makes sense to compare Read with Reads and Reading, and to segment after Read in each
word, but not to compare Reading with Reads or Read. Arranging the list in such a way simplified
matters, as the program can work through it systematically in a single pass.
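The steps above can be sketched in Python. This is an interpretation of the pseudo code: it assumes, as the example suggests, that the shorter word is a prefix of the longer one, and segments after the longest matching shorter word rather than at every containment:

```python
def x1_segment(words):
    # Sort shortest to longest so each word is compared only
    # against the shorter words that precede it in the list.
    arr = sorted(words, key=len)
    segmented = []
    for j, b in enumerate(arr):
        cut = 0
        for a in arr[:j]:
            if b.startswith(a):
                cut = max(cut, len(a))  # segment at position A.length
        segmented.append(b[:cut] + " " + b[cut:] if cut else b)
    return segmented
```

On the example list, x1_segment(["read", "reads", "reading"]) gives ["read", "read s", "read ing"].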
This prototype was also made to test the segmenting semantics, to get an idea of how the words
could be segmented. The mechanics of all the later algorithms are based on this first prototype to
quite an extent; it was the first stepping stone. An important point to mention is that in
algorithms X1-X4 the Evaluation List (the one that is segmented) is the same as the sample taken
from the Data List, so in the case of X1 the words that are segmented are the first 1000. See
Appendix 3, Sample Results X1. The sample of the results shows a lot of segmenting, quite
randomly spread out, but the point of running this algorithm was to get an initial idea of how
things would run.
Evaluation: too early
3.1.2 X2
Aim of Algorithm: Test SUMA algorithm.
Sample Results: Appendix 4, Sample Results X2
Size of initial Data List: 1000
Size actually used: 1000
Evaluation List: Same as Data List.
Prefixes found: 3
The second algorithm looked at was the one described in the SUMA report (Dang, M. and
Choudri, S. 2005). This algorithm is based predominantly on successor varieties: each substring
of a word is considered in increasing order, the substrings are compared and the differences in
the following letter are counted, so for example:
Read
Reads
Reading
The system first compares the substring ‘R’ of Read with the substring ‘R’ of Reads; for
reliability, substrings of length 1 are not counted. Once the substring has grown to ‘Re’ in both
words, the system compares the next letter of the two words, and if the letters differ it
increments the successor variety count for the substring ‘Re’. Once it has checked that substring
against Reads it moves on to the next word in the list to compare the same substring, ‘Re’. This
continues until all entries have been checked; it then comes back to the first word, extends the
substring to ‘Rea’ and repeats the process, counting the differences and excluding the
similarities. Once it has finished comparing the whole first word it continues on to the next word,
and so on.
After this the program calculates the maximum successor value for each word and adds the
corresponding word to the ‘prefix’ list which is then run on the Evaluation List and segmented at
every occurrence of a prefix entry.
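The successor variety idea can be sketched as counting, for each substring, how many distinct letters follow it across the word list. This is a simplification of the pairwise comparison described above, and the names are illustrative:

```python
from collections import defaultdict

def successor_varieties(words):
    # For each substring (prefix) of length >= 2, record the distinct
    # letters that follow it anywhere in the word list.
    followers = defaultdict(set)
    for w in words:
        for i in range(2, len(w)):  # length-1 substrings skipped for reliability
            followers[w[:i]].add(w[i])
    return {stem: len(s) for stem, s in followers.items()}
```

For ["read", "reads", "reading"] the substring "read" has successor variety 2 ('s' and 'i'), making it the most promising prefix; the maximum-variety substring of each word would then be added to the prefix list.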
So to review, X2 was applied to 1000 words irrespective of order or frequency. The output
(Appendix 4, Result List of X2) was interesting to see, as it was the first application of the SUMA
algorithm: it listed the prefixes detected, ‘aa, ab, ac’, and then simply segmented the 1000 words
wherever these strings occurred. This gave an indication that the algorithm was indeed working,
but the prefixes detected did not seem very prefix-like and only 3 were detected. The reason for
this was judged to be the sample Data List: 1000 words were far too few to give a reliable
calculation of prefixes, so for the next run of the algorithm the sample set was increased.
The pseudo code for this algorithm is shown below, the differences between the versions are
highlighted in bold:
Pseudo Code:
Read in first 1000 words from word list; save in array.
Remove the frequency in front of each word, leaving just the word on each line.
Sort the array from shortest to longest.
Run the comparison module, storing successor varieties for each part of a word.
Calculate maximum successor variety for each word.
Add to Affix array.
Run segmentation module based on the Affix array, segmenting where a word from the Affix array
appears.
Evaluation: Not enough grounds to run the script; the dataset was far too small, and vaster
datasets need to be considered to give a greater variety of prefixes.
3.1.3 X3
Aim of Algorithm: Increase Data List to achieve improved accuracy in segmentation.
Sample Results: Appendix 5, Sample Results X3.
Size of initial Data List:100,000
Size actually used: 100,000
Number of Prefixes found: -
Evaluation List: Same as Data List
Pseudo code:
Read in first 100,000 words from word list; save in array.
Remove the frequency, leaving just the word on each line.
Sort array from shortest to longest.
Run comparison module; store successor varieties for each part of a word.
Calculate maximum successor variety for each word.
Add to Affix array.
Run segmentation module based on the Affix array, segmenting where a word from the Affix array
appears.
This algorithm was identical to X2 except that the dataset was increased to 100,000 words. In
theory this would give the prefix gathering module a much wider selection to work with and
hence a more reliable calculation of prefixes, which would lead to more accurate segmentation.
The segmenting module was kept the same as in X2.
After running the algorithm on the first 100,000 words from the Data List, still unsorted by any
means other than shortest to longest once read into the system, the results were observed
(Appendix 5, Sample Result X3). Quite a lot of prefixes were generated, in fact thousands; some
prefixes were very strange, but what was even stranger were the words. There were clear
discrepancies in the actual words, i.e. they were not proper words (Appendices 4 & 5), and at
this point it was highlighted that the corpus this word list was drawn from was itself drawn from
a number of sources. This was not known before, and it was established that the word list did
indeed contain words that could be considered typos, which resulted in unnecessary noise and
unclear, difficult comparisons.
So at this point it was decided to try and filter out some of the ‘noise’ that was being caused by
inaccurate entries in the word list so a better comparison could be made.
Evaluation: Not enough grounds: words segmented all over the place, no way of comparing,
unsorted, very messy, needs tidying up. Data sample good but too much noise!
3.1.4 X4
Aim of Algorithm: Reduction of noise words in the Data List.
Sample of Results: Appendices 6a, 6b, 6c
Size of initial dataset: 100,000
Size of data actually used, i.e. frequency >1: 62,261
Number of Prefixes found: 2476
Evaluation List: D1 & D2
Pseudo code:
Read in first 100,000 words from Data List; save in array.
Extract the frequency.
Compare the frequency with the ASCII value 49, i.e. ‘1’.
If frequency > 49 (1):
    Store corresponding word in new array
Else:
    Ignore the word; do not store in new array
Use new array for prefix gathering.
Sort new array from shortest to longest.
Run comparison module; store successor varieties for each part of a word.
Calculate maximum successor variety for each word.
Add to Affix array.
Run segmentation module based on the Affix array, segmenting where a word from the Affix array
appears.
This algorithm is similar to X3, but it is the first run where a threshold has been introduced, as
it was observed in the previous run that there were too many inaccurate words. In previous
algorithms the word frequency was ignored, but the threshold relies on the word frequency, hence
the algorithm had to be modified to allow this change. The sample set was kept at 100,000 words.
Running this decreased the number of prefixes detected significantly, which can be taken as a
positive effect: the fewer the prefixes, perhaps the greater the reliability, though a
compromise has to be sought.
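The frequency threshold step can be sketched as follows. This sketch assumes each Data List line has the form "frequency word"; the report compares ASCII character values, whereas this sketch parses the integer directly:

```python
def filter_by_frequency(lines, threshold):
    # Keep only words whose frequency exceeds the threshold,
    # discarding the frequency field itself.
    kept = []
    for line in lines:
        freq, word = line.split(maxsplit=1)
        if int(freq) > threshold:
            kept.append(word)
    return kept
```

For example, filter_by_frequency(["12 the", "3 cat", "1 zyx"], 1) keeps "the" and "cat", while a threshold of 10 keeps only "the".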
However, from a sample of 100,000 words there were still a good 2000+ prefixes detected
(Appendix 6a, Sample of Prefixes X4). It was hard to make comparisons with the output from X3,
though, because the words were not in any order. To make comparisons easier between the
prefixes detected and the segmentation of the words, it was decided that some order would have
to be imposed; the obvious choice being alphabetical order.
Evaluation on D1: Not done. D1 was not vast enough to run the script on; it contains only 200
words, most of them single morphemes, so running the evaluation script would not be constructive.
The list was nevertheless segmented, see Appendix 6b.
Evaluation on D2 (Appendix 6c):
Number of words in gold standard: 532 (type count)
Number of words in data set: 532 (type count)
Number of words evaluated: 531 (99.81% of all words in data set)
Morpheme boundary detections statistics:
F-measure: 24.84%
Precision: 18.73%
Recall: 36.85%
This was the output from running the evaluation perl script. It shows a very low F-measure and
Precision. This can be attributed to the fact that there were so many prefixes detected and also
that the algorithm segments each word with every prefix that is detected within it.
3.1.5 X5
Aim of Algorithm: Further reduction in noise
Sample of Results: Appendix 7a, 7b, 7c.
Size of initial dataset: 100,000
Size of data actually used, i.e. frequency >10: 25,781
Number of Prefixes found: 199
Evaluation List: D1 & D2
Pseudo code:
Read in first 100,000 words from word list; save in array.
Extract the frequency.
Compare the frequency with the ASCII value 97, i.e. 10.
If frequency > 97 (10):
    Store corresponding word in new array
Else:
    Ignore the word; do not store in new array
Use new array for prefix gathering.
Sort new array from shortest to longest.
Run comparison module; store successor varieties for each part of a word.
Calculate maximum successor variety for each word.
Add to Affix array.
Run segmentation module based on the Affix array, segmenting where a word from the Affix array
appears.
This algorithm was made and run in parallel with X4; the difference between X4 and X5 is that
for X5 the threshold was increased to >10, i.e. only words that occurred more than 10 times were
used for prefix extraction. This was done to see how different the results would be between the
two sets; if there was not a great difference then X5 could be used as the new standard to save
computational effort. X5 detected 199 prefixes, and a look over the results (Appendix 7) shows
that there was not a lot of difference in the actual segmentation. Again, however, comparison was
extremely difficult, as the prefix list was not sorted at this stage, nor was the dataset the same:
the words segmented in X4 were those that occurred more than once, while the words segmented
in X5 were only those that occurred more than 10 times. Because of this discrepancy, a common
dataset was needed to make a good comparison possible, along with a sorted list of prefixes, as
there were so many.
Evaluation on D1: Not done; unproductive, although the list was segmented, see Appendix 7b.
Evaluation on D2 (Appendix 7c):
Number of words in gold standard: 532 (type count)
Number of words in data set: 532 (type count)
Number of words evaluated: 531 (99.81% of all words in data set)
Morpheme boundary detections statistics:
F-measure: 24.51%
Precision: 18.33%
Recall: 36.98%
There does not seem to be much difference between the results of X4 and X5; therefore, for any
further development X5 will be used as the standard, as it saves a lot of computational effort:
the Data List for X5 is about two-thirds smaller than the Data List for X4.
D1
The above descriptions may have seemed confusing with regard to the Evaluation and Data Lists.
The running of the Evaluation Lists D1 & D2 was done at the end of developing the algorithms,
although the results are included above. Algorithms X1-X4 segmented the Evaluation List, which
was the same as the Data List used to gather prefixes.
The reason for creating D1 was that initially the algorithms would segment the same input set
that they took in, and by the time algorithm X4 had been developed this segmentation covered a
good 100,000 words, or even with the threshold, tens of thousands. Also, the sample X4 took in
would not necessarily be the same as the sample segmented by X5, so comparison between them
was very difficult.
It was clear that a common dataset was needed, and D1 was introduced (Appendix 1,
Evaluation List D1).
It made comparison a lot easier, as the dataset was in alphabetical order and did not span
several pages as the previous Evaluation Lists did, but could fit onto a single sheet.
The evaluation script, however, was not run on the results from D1, because it was seen from the
results that this was not a suitable sample for the script and would not give a fair account of
the accuracy of the algorithms.
This is because the D1 dataset comprises only 200 words, the majority being single morphemes
themselves, so segmenting them would be unproductive and would not really shed light on the
capacity of the algorithm.
This is what prompted the introduction of D2, and naturally, as D2 contained exactly the same
words as the gold standard file, the evaluation script was run on it.
3.1.6 X6
Aim of Algorithm: Gather Suffixes
Sample of Results: Appendices 8a, 8b, 8c, 8d
Size of initial dataset: 100,000
Size of data actually used, i.e. frequency >1: 62,261
Number of Suffixes found: 3338
Evaluation Lists: D1 & D2
Pseudo Code:
Read in first 100,000 words from word list; save in array.
Extract the frequency.
Compare the frequency with the ASCII value 49, i.e. ‘1’.
If frequency > 49 (1):
    Store corresponding word in new array
Else:
    Ignore the word; do not store in new array
Reverse each word in the array.
Use reversed array for prefix gathering.
Sort new array from shortest to longest.
Run comparison module; store successor varieties for each part of a word.
Calculate maximum successor variety for each word.
Add to Affix array.
Run segmentation module based on the Affix array, segmenting where a word from the Affix array
appears.
Re-reverse the output dataset so the evaluation script can be run.
This algorithm followed the same steps as X4, but it served as a means to gather not successor
variety but predecessor variety values, i.e. to determine suffixes (Appendix 8a). Given the goal
of this algorithm it was the reversal of X4, and a threshold was also set so that suffixes were
gathered from words occurring more than once in a Data List of 100,000 words.
This was applied to the D1 Evaluation List, i.e. the sorted 200 most frequently occurring words,
and also to D2; however, as mentioned above, the evaluation script was only run on the output
from D2 (Appendix 8d).
A key note is that the same affix gathering technique was used for the predecessor varieties as
for the successor varieties. This was made possible by reversing each word in the array and then
gathering successor varieties, which in effect were predecessor varieties. This is why ‘successor
variety’ has not been changed to ‘predecessor variety’ in the pseudo code. The reversed output of
the algorithm is also available, see Appendix 8c.
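The reversal trick can be sketched like this. Here gather_prefixes stands for whatever prefix extractor is in use; the toy gatherer in the example is purely illustrative:

```python
def gather_suffixes(words, gather_prefixes):
    # Reverse every word, run the existing prefix gatherer on the
    # reversed words, then reverse the found 'prefixes' back so
    # they read as suffixes.
    reversed_words = [w[::-1] for w in words]
    reversed_affixes = gather_prefixes(reversed_words)
    return [a[::-1] for a in reversed_affixes]
```

With a toy gatherer that returns each distinct 3-letter prefix, ["reading", "walking"] yields ["ing"], the shared suffix.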
D1 evaluation: not done, results of segmentation (Appendix 8b).
D2 evaluation (Appendix 8d):
Number of words in gold standard: 532 (type count)
Number of words in data set: 532 (type count)
Number of words evaluated: 532 (100.00% of all words in data set)
Morpheme boundary detections statistics:
F-measure: 28.53%
Precision: 18.93%
Recall: 57.92%
There is a slight improvement in the F-Measure in this evaluation; the precision remains similar.
There is no drastic change.
3.1.7 X7
Aim of Algorithm: Gather reliable suffixes.
Sample of results: Appendices 9a, 9b, 9c
Size of initial dataset: 100,000
Size of data actually used, i.e. frequency >10: 25,781
Number of Suffixes found: 373 (Appendix 9a)
Pseudo code:
Read in first 100,000 words from word list; save in array.
Extract the frequency.
Compare the frequency with the ASCII value 97, i.e. 10.
If frequency > 97 (10):
    Store corresponding word in new array
Else:
    Ignore the word; do not store in new array
Reverse each word in the array.
Use reversed array for prefix gathering.
Sort new array from shortest to longest.
Run comparison module; store successor varieties for each part of a word.
Calculate maximum successor variety for each word.
Add to Affix array.
Run segmentation module based on the Affix array, segmenting where a word from the Affix array
appears.
Re-reverse the output dataset so the evaluation script can be run.
This algorithm was the same as X6 but with a threshold of >10, i.e. it inspects only words that
occurred more than 10 times. It can be said that X6 and X7 had the same purpose as X4 and X5,
the only difference being that they gather suffixes rather than prefixes.
This was applied initially to the dataset D1 for comparison, but the evaluation script was only
run on the D2 dataset.
D1 evaluation: not done. Sample of Results (Appendix 9b)
D2 evaluation (Appendix 9c):
Number of words in gold standard: 532 (type count)
Number of words in data set: 532 (type count)
Number of words evaluated: 532 (100.00% of all words in data set)
Morpheme boundary detections statistics:
F-measure: 28.13%
Precision: 18.70%
Recall: 56.75%
Again there is an increase in F-measure, in a similar way to the prefix algorithms (X4 & X5), and
the same behaviour in terms of affix reduction: far fewer suffixes are detected when the threshold
is increased to 10 (remember the threshold is set and increased to filter out noise and inaccurate
words from the initial dataset). However, the evaluation results show that the accuracy remains
essentially unchanged, so to conclude, any further development would be done using the prefix
and suffix lists generated from X5 and X7 (Appendices 7 & 9), due to the reduction in
computational effort and the similar results produced. There would be no point in using
algorithms X4 and X6, as it would be time consuming.
3.2 Summary of Evaluations
All 4 algorithms were compared on the dataset D1. In X4 and X6 there were far more prefixes and
suffixes gathered than in X5 and X7, which is directly linked to the frequency threshold. There
was not a lot of difference between the segmentations of X4 and X5, and similarly between those
of X6 and X7.
An interesting observation is that in algorithms X4 and X6 the affixes gathered went up to 3
letters, whereas in X5 and X7, with the greater filtering threshold, the affixes went up to 2
letters. It would be interesting to investigate whether that could be the maximum number of
letters required to segment to a reasonably accurate extent.
Also, in all algorithms a check was set to disallow affixes of length less than 2, which serves
as a filtering mechanism for noise. However, this means the apostrophe character, an obvious
separator, is never taken into the affix array, so perhaps a compromise has to be sought:
allowing affixes of fewer than 2 characters, but only when they have a very high frequency, at
the cost of a greater chance of noise.
To make a better evaluation, the evaluation script from the Morphology challenge website was
run and the results gave an indication of the accuracy of each method.
When the comparison was made on the D1 dataset it was clearer than previous comparisons, but
there was not a lot of breadth: the 200 most frequently occurring words were also single
morphemes themselves, so segmenting them did not give a good result, as any segmentation was
bound to be incorrect because the words were already in the smallest morpheme possible.
So to improve this analysis the same 4 algorithms were applied to dataset 2, D2.
D2
This dataset consisted of the same words that are used in the evaluation program. It was thought
a good idea to use them as there would be diversity in the type of words, i.e. not just small
words but larger words that would illustrate the proper segmenting limits of the algorithms.
Also, from this dataset a more reliable evaluation result could be had.
This is because the evaluation dataset consists of 532 words and it looks through the dataset being
compared to it for those 532 words and gives accuracy results on how those words have been
segmented. In initial algorithms like X1 and X2 the result of such an evaluation would be quite
low, the dataset being only 1000 words, but it would definitely improve with X3 as the dataset
was increased to 100,000. For algorithms X4 and X5 the improvement in the evaluation result
would depend on the thresholds used, as the sample is also 100,000 words. The dataset D1 would
not be much help for the evaluation, as there would not be many words in common between them,
but the dataset D2 would be as precise as possible, as all the words would match.
Appendix C shows the summary of the 4 major algorithms
Only the 4 major algorithms are included in the summary table as they were seen to be the fittest
among the rest.
The percentage of affixes gathered does seem to be slightly higher for the suffixes, which
perhaps explains the slightly higher F-Measure found with them.
The values between each prefix and suffix set are quite similar, hence the decision to base any
improvements on X5 and X7. The reason can be illustrated by looking at the size of the dataset
used for affix extraction with the >1 threshold: it is more than double that of the >10 set, so
using the latter cuts computation by at least half.
Appendix D shows the different components of the program and the general flow.
The diagram is a general one to illustrate the workings of the algorithms implemented so far.
The Input class is called for all the file input, so the Evaluation List and the Data List are
read in through it. They are then filtered if need be (algorithms X4-X7); when gathering suffixes
the array can also be passed to the Reversal class, which simply reverses a given array list.
The AffixGather class then gathers either prefixes or suffixes depending on which algorithm is
being run. The affix list is then trimmed to remove any trailing or leading spaces, which makes
sure no extra spaces have been added to the affixes. The Segmentation class takes in the trimmed
affix list and the evaluation list; it carries out segmentation and can pass the result back for
trimming, which double checks the entries for any whitespace. The two arrays are then passed to
the Output class, which writes them to a file. The diagram does not exhaustively include the
variables, but it gives an outline of how the algorithm is implemented.
Putting everything into different classes allows the re-use of code, which makes programming
easier. The diagram does not imply that the Reversal class can only be called after filtering;
many times in implementing the above algorithms I have had to pass the files to the Reversal
class without filtering. Every class is accessible independently; the diagram below shows the
general flow of the algorithms.
3.21 Issues Encountered
A number of issues were encountered throughout this implementation. Apart from many
programming bugs that had to be solved to give accurate results, one significant issue was
getting the evaluation script to work.
When first run, the evaluation script would skip some words in the suggested file. After adding
the ‘-trace’ argument on the command line it showed the output of each word according to hits,
deletions and insertions. It kept giving an error to the effect of:
Error (evaluation .perl): Mismatch in string Comparison. Char t vs. .
This caused a lot of problems, until it was realised that it was due to extra spaces left at the
ends of words by segmentation, so all the word lists had to be trimmed to remove trailing and
leading spaces. After this the evaluation script ran fine; I think this point perhaps should be
made on the Morphology Challenge website if it is not already.
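The fix amounts to stripping leading and trailing whitespace from every entry before the lists reach the script. A minimal sketch:

```python
def trim_entries(word_list):
    # Remove leading/trailing spaces left over from segmentation,
    # but keep the internal spaces that mark morpheme boundaries.
    return [w.strip() for w in word_list]
```

For example, trim_entries(["read s ", " walk ing"]) gives ["read s", "walk ing"], preserving the boundary spaces inside each entry.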
Another confusing problem arose when X6 was reached: to extract the suffixes, the Evaluation
List and the Data List had to be reversed. It was here that I got confused and did not realise
that the suffixes output were actually in reverse order; I then used these suffixes to segment
the Evaluation List the right way around. To correct this the procedure had to be repeated: all
three lists had to be reversed, and the segmentation carried out with the Evaluation List and
Suffix List still reversed. Only at the end could the output Result List be re-reversed to show
the results.
There was also the issue of the time the algorithm took to filter the Data List, gather the affix
lists and then segment. At first everything was in one class and a run took a very long time, on
the order of hours. To solve this, the different parts of the program were called separately, and
once the Data List had been filtered and saved to a file it could be re-used, so there was no need
to run the whole program again. For example, the Data List for X5 was the same as the Data List
for X7, so there was no need to gather it again.
3.3 Implementation/Evaluation Phase II
From this point onwards it is important to note the following assumptions for the remaining
algorithms:
1. Three sets of word lists will be input into each.
a. Evaluation List, this is the same as the D2 dataset, the 532 words that the
evaluation script looks for in the results list. This will be the set list to be
segmented and the success of the algorithm will be based on how well this list of
words is segmented.
b. The Data list, this is the collection of words from the original Data List obtained
from the Morphology Challenge website. However it contains strictly only
words that have frequency greater than 10.
c. The affix list, this could either be the Prefix List or the Suffix list depending on
which algorithm is being run.
2. Previously, the algorithms X5 and X7 were used not only for affix extraction but also for
segmentation, but this is no longer the case. These algorithms have been adapted to serve
the purpose of creating the Data List and the Affix Lists only. They need not be run more
than once: running either one produces the Data List, and running both produces the Affix
Lists, which can be consistently re-used. This saves a lot of time and computational effort.
3. The upcoming algorithms isolate the segmentation module and work on trying to
construct a good knowledge base; this is the reason why in the adapted code for X5 and
X7 the segmentation section has been removed. There was no knowledge base that could
differentiate between a good segmentation and a bad segmentation; it would simply add
segmentation wherever the affixes were detected. It would use all the affixes present to
segment resulting in lots of segmentations on a single word.
4. The Prefix and Suffix lists used for the upcoming algorithms were not the same ones
generated by X5 and X7. There was a great error noticed in those lists which is explained
in the Summary of Evaluations section (4.1). The lists can be viewed in Appendix 17 for
Prefixes and Appendix 17b for Suffixes.
The aim of this section is to try and develop some sort of knowledge base that can differentiate
between applying different segmentations whereby more accurate segmentations could be
attained.
As the algorithms X5 and X7 had the highest results, the prefixes and suffixes they have gathered
will be used as the basis so in effect the algorithms described below reuse code from X5 and X7
but hold a more comprehensive segmentation module.
Diagram to illustrate this:
Each algorithm takes as input 3 files, evaluation word list (words that will be segmented), data
list (all words that occurred more than 10 times in the corpus, this was obtained from running
algorithm X5) and an affix list.
So for example:
Appendix E.A Shows the running of the XX1 algorithm, the input and output in relation to lists.
This is a typical example of the first algorithm, XX1; it takes in the Prefix List and Data List
derived using X5, together with the Evaluation List, and the role of XX1 is segmentation. The
remaining algorithms differ only in taking the Suffix List rather than the Prefix List as input,
and in the process of segmentation. On each Evaluation Result List the evaluation script will be
run and the scores recorded for analysis.
A very important point: when an algorithm is referred to, for example XX1, although it has
adapted code from X5 or X7, it should be thought of as a whole algorithm in its own right. It is
important not to be misled by the diagram above and assume XX1 is just the segmentation part;
the whole diagram is XX1, although the main focus of its creation is to enhance segmentation.
The purpose of the diagram is to give clarity over the input and output.
The reason clarity is so important here is that when the algorithms X4-X7 were evaluated in
Table 1 they were thought of not as segmentation modules but as complete algorithms; the
upcoming algorithms should be thought of in the same manner, as combinations of adapted old
and new algorithms.
3.31 Summary of Development
Appendix E Shows the incremental stages of each algorithm and how each one relies on the other.
This diagram shows the process of developing the algorithms, it can be seen that they are all
based on one another; each is like a fine tuning of the one before. The final algorithm is a
combination of XX5 and XX6 known as XX7.
The structure of explaining the algorithms is similar to before, but the results will be looked
at more closely, and Global and Local Issues will be stated: Global being issues thought to
affect all of the algorithms, Local being issues thought to be specific to one algorithm. A
comment on whether each algorithm was an improvement will also be made.
3.3.2 XX1
Result List: Appendix 10, Results of XX1
Pseudo Code:
Read in Data List
Read in Evaluation List
Read in Prefix List
For int i = 0; i < words.length-1; i++
    For int j = 0; j < prefixList.length-1; j++
        If words[i].contains(prefix[j])
            int prefixStart, prefixEnd
            suggestedSegment = words[i].substring(0, prefixStart)
                + words[i].substring(prefixStart, prefixEnd) + " "
                + words[i].substring(prefixEnd);
            String beforePref
            If dataList.contains(beforePref)
                count += 1
            Else
                count -= 1
            Store prefix score
            Store suggested segment
        Else
            Do nothing
    End loop
    Find max prefix score
    If max > 0
        Add corresponding segment to outputArray
    Else
        Add original word to outputArray
End loop
This pseudo code explains the function of algorithm XX1. The three lists are read in, and the algorithm then iterates over the Evaluation List, checking whether any entry from the prefix list is present in the word. It investigates any occurrence of a prefix, stores its starting and ending positions in the word and from these creates a suggested segment string. It then divides the suggested string into two substrings: the substring before the segmentation and the substring after it.
N.B. The term segmentation refers to the location where the empty string " ", i.e. a space, is inserted.
The substring before the segmentation is known as beforePref and is checked for presence in the Data List. Depending on whether the substring is found, the count value is either increased or decreased by 1. At the end of the loop all the suggested segmentations are stored with their corresponding count values. The count array is then checked for the maximum count value, which indicates the best prefix for that word. The prefix is only accepted if the maximum count is greater than 0; this ensures that the beforePref substring is present in the Data List and hence was given a +1, indicating a positive score. If no count value greater than 0 is found, the original non-segmented word is added to the result.
The suggested segment corresponding to that prefix is then used as the final segmentation. After
this the algorithm moves on to the next word in the Evaluation List.
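The loop described above could be sketched as follows. This is a minimal illustrative sketch in Python with invented names (segment_xx1, data_list, etc.); the report does not show the project's actual code, so this is an assumption about its shape, not the implementation itself.

```python
def segment_xx1(word, prefixes, data_list):
    # Sketch of XX1: try every prefix found in the word, score each
    # suggested split by whether beforePref is in the Data List, and
    # keep the best-scoring split only if its score is positive.
    best_score, best_seg = 0, word
    for prefix in prefixes:
        pos = word.find(prefix)
        if pos == -1:
            continue
        prefix_end = pos + len(prefix)
        if prefix_end >= len(word):      # nothing would follow the space
            continue
        before = word[:prefix_end]       # beforePref
        score = 1 if before in data_list else -1
        if score > best_score:           # max count must exceed 0
            best_score = score
            best_seg = before + " " + word[prefix_end:]
    return best_seg
```

For example, with a Data List containing "agree", segment_xx1("agreeably", ["agree"], {"agree"}) would split the word after "agree"; a word matching no scoring prefix is returned unchanged.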
The purpose of these enhancements was to increase the integrity of the segmentation process.
The different prefixes that applied to a word could be seen and a count value was used to measure
the accuracy. This served as a means of increasing the reliability of the segmentation rather than
just segmenting any prefixes detected in a word as was done in the previous algorithms.
Appendix E1 shows the algorithm XX1 running on a word:
The prefix column shows all the prefixes found in the word and the suggestion column shows what the suggested segment would be if the corresponding prefix were used. The count is based on whether the substring to the left-hand side of the segmentation (beforePref) is in the Data List.
The program selects the segment ‘Aero planes’’. Though it is not perfect compared to the gold standard, as a space is missing after the ‘e’, there is a correct segmentation between ‘Aero’ and ‘planes’. The second required segmentation, after ‘Aeroplane’, has also been detected and given a count score of 1, which implies that the correct segments are being detected. However, the first suggestion picks up an incorrect segmentation, the reason being that ‘Ae’ is detected as a word in the Data List even though ‘Ae’ is not a valid word; this adds noise to the results, and the problem is not isolated to this word.
To enhance this algorithm a further test should check both parts of a word, which would add to the authenticity of a segmentation. For example, ‘Ae’ may be picked up as a word, but it is highly unlikely that ‘roplanes’ will be, so the count value should decrease and the split should not be picked as a possible segmentation.
The same goes for most of the words segmented after the first two letters (Appendix 10, Results of XX1), such as words with prefixes like ‘se’ and ‘so’. In fact, most of the words in this output are segmented after the first two letters simply because those two letters have been detected as words in the Data List.
Appendix E2 highlights another problem that was encountered:
This table shows the example of the word ‘agreeably’; according to the semantics of the algorithm the correct segmentation should have been chosen. However, though the correct segmentation was detected, it was given a score of -1 because the word ‘agree’ is not present in the Data List. As a reminder, the Data List comprises all words that had a frequency greater than 10 in the corpus. This check should have been a good means of testing segmentations, but this example and the previous one begin to suggest that the dataset is perhaps not as reliable as would have been expected.
The results gained from running the evaluation script on the output were:
Number of words in gold standard: 532 (type count)
Number of words in data set: 532 (type count)
Number of words evaluated: 531 (99.81% of all words in data set)
Morpheme boundary detections statistics:
F-measure: 21.74%
Precision: 28.54%
Recall: 17.56%
The results are an improvement compared to the previous algorithm runs where there was lower
precision (Table 1). This can be attributed to the added complexity of the segmentation stage.
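For reference, the F-measure reported by the evaluation script is the harmonic mean of precision and recall. A quick sketch (assuming the standard definition; f_measure is an illustrative name) reproduces the reported figure:

```python
def f_measure(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# XX1's reported Precision (28.54%) and Recall (17.56%) give
# an F-measure of about 21.74%, matching the script's output.
xx1_f = round(100 * f_measure(0.2854, 0.1756), 2)
```

This also makes clear why a high precision alone is not enough: a low recall drags the harmonic mean down sharply.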
Improvements gained from XX1:
1. Considers different segmentation suggestions and gives each a score based on the presence of the first part of the word; the algorithm is therefore based on a more comprehensive evaluation of the segmentations than before.
2. Improvement in precision.
Global Issues highlighted from XX1:
1. The dataset contains entries that are not real words.
2. The dataset is missing perfectly good English words, which hinders this algorithm and any other that checks for the presence of words in the Data List; a larger sample of reliable words would definitely help.
Local Issues highlighted from XX1:
1. Only considers the first part of the word, which may not be an actual word yet still receive a count value of +1, causing the segmentation to be accepted.
2. Does not consider more than one segmentation at a time.
3. Does not consider the substring after the segmentation (afterPref).
3.3.3 XX2
Result List: Appendix 11, Results of XX2
Pseudo Code:
Read in Data List
Read in Evaluation List
Read in Prefix List
For int i = 0; i < words.length-1; i++
    For int j = 0; j < prefixList.length-1; j++
        If words[i].contains(prefix[j])
            int prefixStart, prefixEnd
            suggestedSegment = words[i].substring(0, prefixStart)
                + words[i].substring(prefixStart, prefixEnd) + " "
                + words[i].substring(prefixEnd);
            String beforePref, afterPref
            If dataList.contains(beforePref)
                count += 1
            Else
                count -= 1
            If dataList.contains(afterPref)
                count += 1
            Else
                count -= 1
            Store prefix score
            Store suggested segment
        Else
            Do nothing
    End loop
    Find max prefix score
    If max > 0
        Add corresponding segment to outputArray
    Else
        Add original word to outputArray
End loop
From the pseudo code it can be seen that the substring after the segmentation (afterPref) is also checked against the Data List. Scores of +1 and -1 are still allocated depending on success. A point to note is that, because of how the scores are allocated, a segmentation will only be accepted if both beforePref and afterPref are present in the Data List. This is a stricter measure than before, where only the beforePref substring needed to be found.
Because of this measure, the majority of words from XX1 that were incorrectly split right at the start have been corrected, e.g. ‘unit y’ as opposed to ‘un ity’, and ‘weigh ing’ as opposed to ‘we ighing’ (Appendix 11, Results of XX2).
Appendix E3 shows a problem found in XX2:
Though the XX1 output ‘action ‘s’ is closer to the gold standard than the result of XX2, it is interesting to see that both segmentations that could lead to the gold standard are present here: for suggestion 2 the substring ‘ion’s’ is not detected in the dataset, and for suggestion 5 there is no ‘s in the dataset. If these substrings were present, the count value would be +1 for both.
This is similar to the problem shown in Table E2, but the Data List cannot be expected to contain ‘ion’s’, as it is not a word on its own. Then again, the Data List contains all sorts of entries, from single letters to strings that are not words at all; this seems to be an ongoing problem. Because this algorithm checks both parts of a word against the Data List it may not split every word, but the results can be seen as more reliable due to the stricter constraint.
Though ‘s is not in the Data List, if the algorithm were run with the suffix list the words would have to be reversed, and s’ is in fact an entry in the Data List, so it is quite possible that a correct segmentation would then be applied. Backing this point up further, hardly any of the words ending in ‘s are segmented in the results, which again suggests this algorithm is limited by the integrity of the Data List.
The results gained from running the evaluation script on the output were:
F-measure: 23.64%
Precision: 60.43%
Recall: 14.69%
The precision has increased drastically from the results of XX1. The main reason, as explained above, is that fewer words are segmented at the start of the word. However, a reasonable F-Measure still needs to be attained; solving the problem highlighted in Table E3 could help.
Improvements gained from XX2:
1. An even more comprehensive means of evaluating the count variable: both parts of a word are considered.
2. Improvement in precision
Global Issues highlighted from XX2:
1. Dataset entries still causing a problem.
Local Issues highlighted from XX2:
1. Substrings such as ion’s and ‘s are not included in the Data List; the second part of a word is generally not in the dataset.
2. Does not consider more than one segmentation at a time.
3.3.4 XX3
The only differences between this algorithm and XX2 are that reversed word lists are read in and that the suffix list, rather than the prefix list, is used to derive the suggested segmentations.
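The reversal trick can be sketched in a few lines (illustrative Python, not the project's code): reversing every string turns suffix matching into prefix matching, so the XX2 machinery can be reused unchanged.

```python
def reverse_all(words):
    # Reverse each word so a trailing suffix becomes a leading substring
    return [w[::-1] for w in words]

# e.g. the suffix "ing" reversed is "gni", which is a prefix
# of the reversed word: "walking" reversed is "gniklaw"
```

After segmentation the output words would be reversed back to restore normal reading order.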
Results List: Appendix 12, Results of XX3
Pseudo code:
Read in Data List
Read in Evaluation List
Read in Suffix List
Reverse Data List, Evaluation List, Suffix List
For int i = 0; i < words.length-1; i++
    For int j = 0; j < suffixList.length-1; j++
        If words[i].contains(suffix[j])
            int suffixStart, suffixEnd
            suggestedSegment = words[i].substring(0, suffixStart)
                + words[i].substring(suffixStart, suffixEnd) + " "
                + words[i].substring(suffixEnd);
            String beforeSuff, afterSuff
            If dataList.contains(beforeSuff)
                count += 1
            Else
                count -= 1
            If dataList.contains(afterSuff)
                count += 1
            Else
                count -= 1
            Store suffix score
            Store suggested segment
The results gained from running the evaluation script on the output were:
Morpheme boundary detections statistics:
F-measure: 19.53%
Precision: 55.83%
Recall: 11.83%
It can be seen that the precision is again quite high. The results set (Appendix 12, Results for XX3) is quite similar to that of XX2, but this algorithm would have been expected to segment the words with ‘s endings; it does not seem to have done so.
Appendix E4 highlights the problem relative to XX3:
This example is similar to E3 in that the suggestions are quite accurate, but there is trouble allocating the score. The first suggestion did not get through because ‘ings’ does not exist as a word on its own, and for the second suggestion ‘s is not a word in the Data List, so the point was not given.
This algorithm has not been fruitful: the results show a reduction in Precision and F-Measure. XX2 will therefore be taken forward for further fine tuning.
3.3.5 XX4
Results List: Appendix 13, Results for XX4
Pseudo Code:
This algorithm builds on XX2; the pseudo code is the same as that of XX2 but with a slight difference in the scoring section.
If dataList.contains(beforePref)
    count += beforePref.length()
Else
    count -= beforePref.length()
If dataList.contains(afterPref)
    count += 1
Else
    count -= 1
Store prefix score
Store suggested segment
By the time this algorithm was developed it had been realised, from the previous evaluations, that the Data List was not as reliable as first thought, and it was unclear whether obtaining a further set of word lists from another source was permitted. It is also true that parts of words such as ‘ing’s’ are not proper words, so in an ideal Data List such entries would not be present, and a different approach was needed to overcome this. This led to the question: what difference would it make if the beforePref had to be a proper word but the afterPref did not necessarily have to be in the Data List?
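The scoring change can be sketched as follows (illustrative Python names, not the project's code): the left part is now weighted by its length while the right part still contributes only +/-1, so a long valid beforePref can outweigh an afterPref that is missing from the Data List.

```python
def score_xx4(before, after, data_list):
    # beforePref: length-weighted reward or penalty
    score = len(before) if before in data_list else -len(before)
    # afterPref: unchanged +/-1 contribution
    score += 1 if after in data_list else -1
    return score

# "summer's" split as "summer" + "'s": even though "'s" is absent
# from the Data List, the split still scores 6 - 1 = 5.
```

The example comment shows why words like summer's can now be segmented even though their second part never appears in the Data List.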
The results gained from running the evaluation script on the output were:
F-measure: 41.92%
Precision: 57.37%
Recall: 33.03%
This was the greatest score achieved so far, with quite high precision too. The score given was proportional to the longest beforePref segment that could be found in the dataset. This was especially aimed at words that received no segmentation at all in the XX2 results, e.g. footing’s, summer’s, surprises’; the focus was to try to get some sort of segmentation out of them, as there was no ‘s or ing’s in the dataset, so an alternative like this algorithm was needed.
From the results list (Appendix 13, Results for XX4), it can be seen that this algorithm has been successful in splitting these types of words: summer’s has been split into summer ‘s, surprises’ into surprise ‘s and footing’s into footing ‘s. As there are numerous such words in the results list, it can be assumed that they contributed highly to the result score.
However, there are still some words that are not segmented correctly.
Appendix E5 illustrates a general problem:
This table highlights the issue of counts with the same value: though the correct segmentation was detected, it was not chosen, because the full word reproached was also found in the Data List. That made it the last entry in the array, so when the maximum was calculated the most recent entry was taken. There were a few occurrences of this type of problem in the Results List; however, the results still show an improvement.
Another problem highlighted by the results list is illustrated in the next example.
Appendix E6 illustrates a problem specific to algorithm XX4:
This table highlights a flaw in the algorithm. The segmentation chosen is tel ephotograph: tel is not a word, but it appears in the Data List so it is given +3, while ephotograph is nowhere near being a word yet only -1 is subtracted. This seems to be a side effect of the algorithm, as its main focus was the words that were not getting segmented in the previous algorithms, e.g. summer’s. It had been decided that the beforePref substring had to be a word but the afterPref did not; the suggestion chosen here does force a reconsideration of that strategy. However, not many words are affected by this problem, which is why the precision is still at a good level, but for a good generic solution this strategy will have to be modified.
Improvements from XX4:
1. F-Measure has doubled and also Precision increased, definitely scratching the surface of a
better approach.
Global Issues highlighted from XX4:
1. Dataset still causing problems but methods to work around it will need to be implemented.
Local Issues highlighted from XX4:
1. If the full word appears in the Data List it is given maximum count hence not segmented e.g.
E5 above.
2. If the afterPref substring is not detected in the Data List, the count value decreases by at most 1. This can give rise to quite obviously incorrect segmentations, e.g. tel ephotograph, av uncular.
3.3.6 XX5
Results List: Appendix 14, Results for XX5
Pseudo Code:
If dataList.contains(beforePref)
    count += beforePref.length()
    If count == word[i].length()
        count = 0
Else
    count -= beforePref.length()
If dataList.contains(afterPref)
    count += afterPref.length()
Else
    count -= afterPref.length()
Store prefix score
Store suggested segment
If word unchanged, store in Remainder array
To improve algorithm XX4 further, and in particular to address the local issues drawn from its results, three modifications were proposed:
1. If the length of the beforePref segment equals the length of the word, set the count to 0. This prevents the word going un-segmented, e.g. example E5.
2. To solve the second Local Issue in XX4, the count evaluation is changed to be proportional to both sides of the segmentation.
3. As an extra measure, words that are not segmented at all are stored in a separate file for further analysis. These words are known as the ‘Remainder’ words.
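A sketch of the modified scoring (illustrative Python names, assuming the pseudo code above): the whole-word guard stops an un-split word from outranking genuine splits, and both sides are now length-weighted.

```python
def score_xx5(word, before, after, data_list):
    # Guard: a beforePref spanning the whole word must not beat real splits
    if before == word:
        return 0
    # Both sides are now length-weighted (modification 2)
    score = len(before) if before in data_list else -len(before)
    score += len(after) if after in data_list else -len(after)
    return score

# Words whose best score never exceeds 0 are left unchanged and
# written out to the separate 'Remainder' list (modification 3).
```

For example, "broadcasting" split into "broad" and "casting", with both halves in the Data List, would score 5 + 7 = 12.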
The results gained from running the evaluation script on the output were:
Number of words in gold standard: 532 (type count)
Number of words in data set: 308 (type count)
Number of words evaluated: 308 (100.00% of all words in data set)
Morpheme boundary detections statistics:
F-measure: 59.62%
Precision: 71.43%
Recall: 51.16%
This is quite an improvement; the Precision has again been increased and so has the F-Measure. However, this result does not consider the ‘Remainder’ set of words (Appendix 14b, Remainder List), so the evaluation was re-run with the remainder words included:
Number of words in gold standard: 532 (type count)
Number of words in data set: 532 (type count)
Number of words evaluated: 532 (100.00% of all words in data set)
Morpheme boundary detections statistics:
F-measure: 40.85%
Precision: 71.43%
Recall: 28.61%
The precision is still the same, but the F-Measure has dropped due to the reduction in Recall. As Recall is not the focus of evaluation, it can be said that this overall algorithm has produced improved results compared with the previous algorithms.
It can be seen from the results set (Appendix 14a, results for XX5) that words like reproached,
responding, agreeably that were not segmented in XX4 have been segmented correctly.
Appendix E7 shows an example of a problem.
This example shows the word broadcasting, which was output as ‘Broadcasting’ by XX4. This algorithm segments it as ‘Broad casting’, which is closer to the gold standard than before, so that is an improvement; the problem is that even though it detects the corresponding gold segmentation (suggestions 4 & 9) it does not carry out both of them. This has been a limitation throughout the results. If the algorithm were run again it could segment based on the other option too, but for now segmentations like this have improved the score while limiting progress towards greater accuracy.
Also the ‘remainder words’ contained many words that should be segmented (Appendix 14b, Remainder List).
Appendix E8 highlights another problem.
This example shows a word on which algorithm XX5 did not detect even one correct segmentation, hence it was shifted to the Remainder List. There are a few words present in the Results List (Appendix 14a) that suffer from the same problem.
Improvements from XX5:
1. Higher precision and F-Measure achieved.
Global Issues highlighted from XX5:
1. More than one segmentation not considered
Local Issues highlighted from XX5:
1. Does not detect correct segmentations on some words, indicating limitations of the prefix list.
3.3.7 XX6
Results List: Appendix 15, Results for XX6
Pseudo Code:
Read in Remainder List
Read in Data List
Read in Suffix List
If dataList.contains(beforeSuff)
    count += beforeSuff.length()
    If count == word[i].length()
        count = 0
Else
    count -= beforeSuff.length()
If dataList.contains(afterSuff)
    count += afterSuff.length()
Else
    count -= afterSuff.length()
Store suffix score
Store suggested segment
If word unchanged, store in More Remainder List
This algorithm focused on the local issue from XX5: the main aim was to achieve segmentation of the words in the Remainder List, for which the prefixes had not detected any correct segmentations. The algorithm is the same as XX5, the only difference being that the suffix list, instead of the prefix list, is used to generate suggested segmentations.
Results:
Number of words in gold standard: 532 (type count)
Number of words in data set: 34 (type count)
Number of words evaluated: 33 (97.06% of all words in data set)
Morpheme boundary detections statistics:
F-measure: 63.41%
Precision: 78.79%
Recall: 53.06%
The evaluation script was run on the segmented words from the Remainder List and it can be seen
here that the F-Measure has risen along with the Precision which illustrates greater accuracy.
Though there were only 34 words that were segmented, it still shows an improvement. The
words that were not segmented were added to the ‘More Remainders’ List (Appendix 15b, More
Remainder).
However, a few odd words have been segmented incorrectly after running this algorithm.
Table E9 – incorrectly segmented word in XX6
It can be seen from this table that the word believe is segmented incorrectly by the first suggestion, the reason being that the segment ‘belie’ is detected as a word in the Data List, which seems to be a recurring problem.
Global Issues highlighted from XX6:
1. Data List integrity
Local Issues highlighted from XX6:
1. None detected
Improvements from XX6:
1. Increase in precision, more words segmented.
3.3.8 XX7
Results List: Appendix 16, Final Result XX7
This is the combination of algorithms XX5 and XX6: XX5 was run, then XX6 was run on the words that had been added to the ‘Remainder’ word list. The three result lists were then combined: the words segmented by XX5, the words segmented by XX6, and the non-segmented More Remainder List.
Appendix F illustrates the running of the final algorithm, XX7.
To achieve the final result, the outputs from running the XX6 (segmented list and non-segmented
list) need to be combined with the output from XX5. This will give the overall result of all 532
words.
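The XX7 pipeline amounts to chaining the two earlier passes. A sketch (illustrative Python; it assumes each pass returns its segmented words plus the words it left untouched, which the report describes but does not show as code):

```python
def run_xx7(eval_words, run_xx5, run_xx6):
    # Prefix pass first; XX5 returns (segmented, remainder)
    segmented_5, remainder = run_xx5(eval_words)
    # Suffix pass only on the words XX5 could not segment
    segmented_6, more_remainder = run_xx6(remainder)
    # Final result: both segmented lists plus the untouched leftovers
    return segmented_5 + segmented_6 + more_remainder
```

The design choice here is sequential fallback: the cheaper prefix pass handles most words, and the suffix pass only sees the 34-word remainder.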
Results
Evaluation of segmentation in file "finaltest.txt" against gold standard segmentation in file "eval.txt":
Number of words in gold standard: 532 (type count)
Number of words in data set: 532 (type count)
Number of words evaluated: 532 (100.00% of all words in data set)
Morpheme boundary detections statistics:
F-measure: 44.32%
Precision: 72.14%
Recall: 31.99%
This is the highest result from all the algorithms, though it is lower than the output of XX6, which is due to XX6 being applied to only 34 words. As a combination of XX5 and XX6 it is quite good. It was only logical to use XX6 on the words that XX5 did not segment, as XX5 relied on prefixes and XX6 on suffixes. There are still 190 words in the More Remainder List; it is possible that they are already in their shortest morphemes, but a look at the list (Appendix 15b, More Remainder List) shows there are still words that could be segmented. These are outlined in the Extending section in the following chapter.
4. Summary of Evaluations/ Conclusion
Appendix H shows the results of all the major algorithms developed.
This table reviews the scores from all the algorithms developed. The results show a significant improvement from XX1 onwards. This is credited to the scoring system implemented in the segmentation module, rather than applying every single affix found in a word; see Appendices X1 and 3-8.
Algorithms X5 and X7 were adapted for use as methods of extracting prefixes and suffixes once the segmentation module was made independent. The end algorithm certainly seems to work well with the English language.
To review, the strategy was to implement algorithms of increasing sophistication; at each step an evaluation was made and the most noticeable issues were stated. The next step then focused on trying to solve the problems found in the previous algorithm.
4.1 Extending/Issues
The end algorithm XX7 gives a pretty good score but there are still issues within this overall
algorithm that were not solved due to time constraints.
1. The integrity of the Data List was an ongoing problem. At the start of the first implementation it was thought that the supplied Data List was filled with complete, correct English words. As development progressed it was realised that this was not the case and that the Data List contained many entries that are not words. Measures were taken to filter out this noise, such as in X5 and X7, where only words with a frequency greater than 10 were included. Further measures were added from the XX4 algorithm onwards to work around the uncertainty over what exactly the Data List contains. Even now it would be more accurate to describe it as a collection of strings than as a collection of words, given how many inconsistencies it holds. A more reliable Data List would probably have allowed a better solution to be developed, or the current solution to work better. The filtering procedure applied to the Data List is quite simple and could certainly be improved; a more sophisticated filtering approach would make the solution better.
2. The affix lists were gathered using algorithms X5 and X7 from a sample set of 100,000 words, of which only 25,781 were used for affix extraction. The problem discovered when implementing the new segmentation module, which relied heavily on the already-extracted affix lists, was that words after the letter ‘n’ had never been examined: the original Data List was alphabetically sorted and the 100,000-word sample only reached words starting with ‘n’. To overcome this the sample set was increased to the whole Data List of 167,374 words. The number of words with a frequency greater than 10 was then 43,329, almost double the sample used before, which shows just how much coverage had been lost. The prefix total increased from 199 to 363, and the suffixes from 373 to 387. The results for the X5 and X7 algorithms were left as they were; there seemed little point re-running their segmentation, since by the time this problem was noticed the separate segmentation module was being developed, and that module (XX1-XX7) used the new Data, Prefix and Suffix Lists, extracted with the adapted X5 and X7 code, as its basis. It would not have made much difference if X5 and X7 had used the enlarged lists; the segmentation within them simply split on every affix detected, leaving many spaces.
3. The segmentation does not consider more than one correct suggestion at a time, i.e. example E7. However, it would not take much more effort to have XX7 check for any similar count values and apply those segmentations too, in a recursive way.
4. The More Remainder List (Appendix 15b, More Remainder List) mostly contains words that are made up of other words plus a suffix at the end, for example aftercare’s. If it were split into ‘after’ and ‘care’s’, the word care’s would not be detected in the Data List, which is why the word remains un-segmented to this point. To extend the algorithm, it could check where the first segment is detected, between after and care’s, then check for a valid split within the second segment ‘care’s’ on its own; if one exists it can infer that the split between after and care’s is also valid. This could work as a recursive algorithm, suggested by the following pseudo code:
Read in Prefix, Suffix, Evaluation and Data Lists
Run XX5
For each suggestion:
    Split into beforePref and afterPref
    If beforePref is in the Data List, give a score of beforePref.length()
    Run XX6 on afterPref
    If a valid segment is present, accept the segmentation after beforePref and within afterPref
Appendix G shows the expected running of the algorithm described by this pseudo code on aftercare’s. The overall score for dividing care’s is greater than 0, so both segmentations would be accepted. This would be the solution for words of this type.
The pseudo code could also include running XX5 on the beforePref segment, in case there are possible segmentations there too: in all, run XX5 on the segment before the proposed split and XX6 on the segment after. That would be a good, thorough approach.
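The recursive idea could be sketched directly (illustrative Python; segment_once stands in for a single XX5/XX6-style pass that adds at most one space, an assumption made here for illustration):

```python
def segment_recursive(word, segment_once):
    # Apply one pass; if it produced a split, recurse into both halves
    result = segment_once(word)
    if " " not in result:
        return result
    left, right = result.split(" ", 1)
    return segment_recursive(left, segment_once) + " " + \
           segment_recursive(right, segment_once)

# With a pass that maps "aftercare's" -> "after care's" and
# "care's" -> "care 's", the word ends up fully segmented.
```

A split found inside a half thus validates the outer split too, which is exactly the inference proposed for aftercare’s above.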
5. Successor variety is the main basis of affix extraction. If more sophisticated methods were used, or even combined with successor variety, for example the conditional probabilities described in the background reading, the score could rise considerably.
6. The algorithm could be extended to work on other languages. I had considered testing the program on Turkish and Finnish, although this was not a requirement of the Challenge, but it was not possible due to the sheer quantity of entries in their Data Lists: the files kept causing my system to crash and I was not even able to store them in a text file. The Turkish Data List contains more than 500,000 entries and the Finnish one in excess of 1 million, and there was not enough time left to handle them. Even if I had managed to store them, running the algorithm would certainly take a long time; prefix gathering with X5 on the full English dataset took longer than 30 minutes, and these lists would take at least three times as long. It would have been good to test them; however, nothing in my algorithm hard-codes material specific to the English language, so it remains a generic solution, and even though I have not been able to test it on another language I am confident it would give an acceptable result.
Beyond segmentation, the other tasks on the Morphology Challenge website could also be considered.
4.2 Conclusion
Appendix I shows the scores from the Morphology Challenge website and I have added my score
at the bottom.
Looking at the scores, I feel I have done well to achieve a good score based on very simple techniques and intuition; seeing that most of the entries into the challenge came from teams of professional researchers, I think I did well to achieve the score I did, and with more sophisticated methods I would expect an even higher score. The scores in bold in the table highlight the contestants whose scores mine was either greater than or close to: the total comes to 6/12 for my F-Measure, and for Precision the scores in bold number 11/12, which seems very good.
A good outcome has been achieved, though there is still great room for improvement. The solution I have developed is generic and fulfils the unsupervised criterion. I feel I have exceeded my minimum requirements: the minimum requirement was to develop some sort of solution, and I have developed one that is actually quite good. I did have to rush the ending, as the schedule I first stated in my mid-term report turned out to be quite ambitious. I did not realise how long the programming would take, how many problems would be encountered (especially in debugging), or how many iterations would be required.
Bibliography
Asmah Hj. Omar. 1986. Nahu Mutakhir Melayu. Kuala Lumpur: Dewan Bahasa dan Pustaka.
Aronoff, M. and Fudeman, K. 2005. What is Morphology? Wiley-Blackwell. pp. 1-2.
Dang, M. and Choudri, S. 2005. Simple Unsupervised Morphology Analysis Algorithm (SUMAA). In Proceedings of MorphoChallenge 2005. Accessed from http://www.cis.hut.fi/morphochallenge2005/P09_DangChoudri.pdf on 23/04/2009.
Goldsmith, J. 2001. Unsupervised Learning of the Morphology of a Natural Language. Computational Linguistics, pp. 153-189. Accessed from http://acl.ldc.upenn.edu/J/J01/J01-2001.pdf on 15/04/2009.
Jalaluddin, N. H. 2008. European Journal of Social Sciences, pp. 109-115. http://www.eurojournals.com/ejss_7_2_09.pdf. Accessed 24/04/2009.
Jurafsky, D. and Martin, J. 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall. p. 377.
Harris, Z. 1955. From Phoneme to Morpheme. Language, 31(2):190–222.
Hafer, M. A. and Weiss, S. F. 1974. Word Segmentation by Letter Successor Varieties. Information Storage and Retrieval, 10:371–385.
Rehman, K. and Hussain, I. 2005. Unsupervised Morphemes Segmentation. In Proceedings of MorphoChallenge 2005. Accessed from http://www.cis.hut.fi/morphochallenge2005/P10_RehmanHussain.pdf on 20/02/2009.
Morphology Challenge 2005 homepage: http://www.cis.hut.fi/morphochallenge2005/. Last accessed 24/04/2009.
Appendix A – Personal reflection
I have found this to be a very challenging project; I think I underestimated the amount of work
it required. It is important to set weekly goals and actually stick to them. I recall I became
quite laid back mid-way and was also ill for a few days, which set my mid-term report behind
schedule. Looking at the schedules (Appendices B and B2), it can be seen that I was quite naïve
in my initial plan: I took a very high-level approach whereas I should have taken a low-level
one. Even now I am rushing through the last part of my report. However, I can say I have learnt
a lot from this project. I have become a much more confident programmer, and my organisational
skills have improved despite a few setbacks; the best advice I can give is that being organised
is a must.
When it comes to actually implementing the algorithms, it is important to set aside an allowance
for debugging, as some bugs can be very frustrating and time-consuming to solve. The final
write-up in particular takes a lot longer than one might think. I was advised by my supervisor
to start it earlier, but I was too involved in the programming. I think with a better plan of
action I would have been more successful.
Appendix B- Plan of Schedule
Appendix B2- Actual Plan
Phase I
Phase II
Appendix C- Shows the comparison of the 4 major algorithms.
Algorithm   Initial Dataset   Actual Sample Used   Threshold   Affixes Gathered   Affixes as % of Sample   F-Measure %
X4          100,000           62,261               >1          2,476              3.976807                 24.84
X5          100,000           25,781               >10         199                0.771886                 24.51
X6          100,000           62,261               >1          3,338              5.361302                 28.53
X7          100,000           25,781               >10         373                1.446802                 28.13
Appendix D- Diagram shows flow of algorithm for the main program
Appendix E- Figure Shows Development steps.
Input
    evalName: String; dataName: String; evalSize: int; dataSize: int
    getEval(evalName, evalSize); getData(dataName, dataSize)

AffixGather
    dataList: String[]
    gAffix(dataList)

Filter
    frequency: int; dataName: String[]
    filterDataList(frequency, dataName)

Segmentation
    affixList: String[]; evalList: String[]
    segment(affixList, evalList)

Trim
    _array: String[]
    trimArray(_array)

Reversal
    array: String[]
    reverseArray(array)

Output
    affixArray: String[]; resultArray: String[]
    outputLists(affixArray, resultArray)
Appendix E.A- Shows the development chain of the XX algorithms and the running of XX1.

Development chain: X5 provides the basis for XX1, and each algorithm from XX1 through XX6 in turn
provides the basis for the next (XX1, XX2, XX3, XX4, XX5, XX6, XX7), with XX7 as the final
algorithm.

Running of XX1:
1. Input: the D2 data set (the Prefix and Data Lists).
2. Gather the prefix list and filter the Data List.
3. Segmentation phase, run over the Evaluation List.
4. Output: the Evaluation Result List.
Appendix E1- Table shows the running of algorithm XX1 on the word aeroplanes'.

Suggestion   Segmentation     Prefix   Count
1            ae roplanes'     ae       1
2            aeroplan es'     an       -1
3            aer oplanes'     er       -1
4            aeropla nes'     la       -1
5            aeroplane s'     ne       1
6            aerop lanes'     op       -1
7            aeropl anes'     pl       -1
8            aero planes'     ro       1
9            aeroplanes '     s'       -1

Segment chosen: aero planes'     Gold standard: aero plane s'
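The selection step illustrated above can be sketched as follows. This is a minimal illustrative sketch, not the project's actual code: `gathered_prefixes` is a hypothetical set standing in for the filtered affix list, and a candidate split is assumed to score +1 when the two letters ending its left-hand part appear in that list, -1 otherwise.

```python
# Sketch of an XX1-style split selection (assumption: a split scores +1 if the
# bigram ending the left part is a gathered affix, -1 otherwise).

def score_splits(word, gathered_prefixes):
    """Score every internal split point of `word`."""
    results = []
    for i in range(2, len(word)):
        left, right = word[:i], word[i:]
        bigram = left[-2:]
        score = 1 if bigram in gathered_prefixes else -1
        results.append((score, left, right))
    return results

def segment(word, gathered_prefixes):
    """Return the best-scoring split, or the whole word if none scores positively."""
    best = max(score_splits(word, gathered_prefixes),
               key=lambda t: t[0], default=(None,))
    if best[0] is None or best[0] < 0:
        return [word]
    return [best[1], best[2]]
```

With a toy affix set containing "ro", `segment("aeroplanes'", {"ro"})` returns the split chosen in the table above, `["aero", "planes'"]`.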
Appendix E2- Table highlights a problem discovered in algorithm XX1, on the word agreeably.

Suggestion    Prefix   Count
agreeab ly    ab       -1
ag reeably    ag       -1
agreeabl y    bl       -1
agreea bly    ea       -1
agree ably    ee       -1
agr eeably    gr       -1
agreeably     ly       -1

Segment chosen: agreeably     Gold standard: agree ab ly
Appendix E3- Highlights a problem found in algorithm XX2, on the word action's.

Suggestion   Segmentation   Prefix   Count
1            ac tion's      ac       0
2            act ion's      ct       0
3            actio n's      io       -2
4            action' s      n'       0
5            action 's      on       0
6            acti on's      ti       -2

Segment chosen: action's     Gold standard: act ion 's
Appendix E4- Highlights the problem relative to algorithm XX3, on the word footing's (processed in reverse).

Suggestion   Segmentation   Prefix   Count
1            s'gn itoof     gn       -2
2            s'gnit oof     it       0
3            s'gni toof     ni       0
4            s'gnitoo f     oo       0
5            s' gnitoof     s'       0
6            s'gnito of     to       0

Segment chosen: footing's     Gold standard: foot ing 's
Appendix E5- Illustrating a general problem, on the word reproached.

Suggestion   Segmentation    Prefix   Count
1            re proached     re       2-1=1
2            rep roached     ep       3-1=2
3            repr oached     pr       -1-1=-2
4            repro ached     ro       -1+1=0
5            reproa ched     oa       -1-1=-2
6            reproach ed     ch       8+2=10
7            reproache d     he       -9+1=-8
8            reproached      ed       +10

Segment chosen: reproached     Gold standard: reproach ed
Appendix E6- Illustrates a problem relative to algorithm XX4, on the word telephotograph.

Suggestion   Segmentation        Prefix   Count
1            te lephotograph     te       2-1=1
2            tel ephotograph     el       3-1=2
3            tele photograph     le       -4+1=-3
4            telep hotograph     ep       -5-10=-15
5            teleph otograph     ph       -6-9=-15
6            telepho tograph     ho       -7-8=-15
7            telephot ograph     ot       -8-7=-15
8            telephoto graph     to       -9+1=-8
9            telephotog raph     og       -10-5=-15
10           telephotogr aph     gr       -11-4=-15

Segment chosen: tel ephotograph     Gold standard: tele photo graph
Appendix E7- Example of a problem from XX5, on the word broadcasting.

Suggestion   Segmentation      Prefix   Count
1            br oadcasting     br       -2-10=-12
2            bro adcasting     ro       -3-9=-12
3            broa dcasting     oa       -4-8=-12
4            broad casting     ad       +5+7=+12
5            broadc asting     dc       +6-6=0
6            broadca sting     ca       -7+5=-2
7            broadcas ting     as       -8+4=-4
8            broadcast ing     st       +9+3=+12
9            broadcasting      ng       +12 >> 0

Segment chosen: broad casting     Gold standard: broad cast ing
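The signed scores in the Count column above (for example +5+7=+12 for broad casting) combine a contribution from each side of the split. The following is a rough sketch of this style of scoring, under stated assumptions rather than a copy of the project's code: a side is taken to contribute +length when it is recognised (here via the hypothetical predicates `is_prefix` and `is_suffix`) and -length otherwise, and the highest-scoring split is kept only when its score is positive.

```python
# Sketch of XX5-style signed length scoring: each side of a candidate split
# contributes +len if recognised as an affix/stem, -len otherwise.

def best_split(word, is_prefix, is_suffix):
    """Return the highest-scoring (left, right) split, or (word,) if none is positive."""
    best_score, best = float("-inf"), (word,)
    for i in range(2, len(word) - 1):
        left, right = word[:i], word[i:]
        score = (len(left) if is_prefix(left) else -len(left)) \
              + (len(right) if is_suffix(right) else -len(right))
        if score > best_score:
            best_score, best = score, (left, right)
    return best if best_score > 0 else (word,)
```

With toy predicates recognising "broad" and "casting", the split broad|casting scores +5+7=+12 and is chosen, mirroring the table above.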
Appendix E8- Highlighting another problem with XX5, on the word adult's.

Suggestion   Prefix   Count
ad ult's     ad       2-5=-3
adu lt's     du       -3-4=-7
adult' s     t'       -6+1=-5
adul t's     ul       -4-3=-7

Gold standard: adult 's
Appendix E9- Example of an incorrectly segmented word from algorithm XX6, on the word believe.

Suggestion   Segmentation   Suffix   Count
1            belie ve       ev       +5+2=7
2            beli eve       ve       -4+3=-1
3            bel ieve       ei       3-4=-1
4            be lieve       il       2-5=-3
5            b elieve       le       1-6=-5

Segment chosen: belie ve     Gold standard: believe
Appendix F- Diagram shows the running of the XX7 algorithm:
1. The Evaluation List is run through XX5, giving Result 1 (308 words); the un-segmented words
   form the Remainder List (224 words).
2. The Remainder List is run through XX6, giving Result 2 (34 words); the un-segmented words form
   a further remainder list (190 words).
3. Result 1, Result 2 and the remaining un-segmented words are combined into the Final Result
   List.

Appendix G- Diagram shows the running of the algorithm described in point 4 of the extensions on
the word aftercare's:
aftercare's is first split into "after" (count +5) and "care's". "after" is detected in the list,
while "care's" is not, so "care's" is split further into "care" (count +4), which is in the list,
and "'s" (count -2), which is not. The resulting segmentation is: after care 's.

Appendix H- Shows the evaluation results of all the algorithms.
Algorithm   Size of Evaluation Set   F-Measure   Precision
X4          532                      24.84%      18.73%
X5          532                      24.51%      18.33%
X6          532                      28.53%      18.93%
X7          532                      28.13%      18.70%
XX1         532                      21.74%      28.54%
XX2         532                      23.64%      60.43%
XX3         532                      19.53%      55.83%
XX4         532                      41.92%      57.37%
XX5         532                      40.85%      71.43%
XX6         34                       63.41%      78.79%
XX7         532                      44.32%      72.14%
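The F-Measure figures reported throughout are the harmonic mean of precision and recall (recall is not tabulated here); a minimal sketch of the standard formula:

```python
# F-Measure as the harmonic mean of precision and recall,
# F = 2PR / (P + R), as used in the Morphology Challenge evaluation.

def f_measure(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because it is a harmonic mean, a high precision (such as XX7's 72.14%) can still yield a modest F-Measure when recall is low.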
Appendix I- Shows the results of the contestants participating in Morphology Challenge 2005 on
the English language.

Name         F-Measure   Precision
A1           49.8        44.7
A2a          66.6        67.7
A2b          62.4        55.2
A3           32.0        24.1
A4a          61.7        62.6
A4b          58.5        61.2
A5           53.8        50.6
A6           76.8        76.2
A7           48.0        47.1
A8           36.2        32.5
A9           28.5        22.9
A10          43.7        37.5
A11          45.7        57.1
A12a         55.7        57.6
Atif (XX7)   44.32       72.14

Table shows the F-Measure (%) and Precision (%) of the participants' entries to the Morphology
Challenge for the English language. (Adapted from Kurimo et al, 2005.)
Appendix 1, Evaluation List D1
a about after all an and any are as at be been before but by can could did do first for from
good great had has have he her him his i if in into is it its know like little made man may me
more most mr much must my no not now of on one only or other our out over said see she should
so some such than that the their them then there these they this time to two up upon us very
was we well were what when which who will with would you your
Appendix 2, Evaluation List, D2
aboutaccelerateaccurstaction'sadjadult'saeroplanes'aftercare'sagreeablyairlettersalexics'allowance'salumamharic'sanalects'sanglicanism'sannual'santhropophagousapatheticallyappellationsaprils'archeryarmletsartier
asphyxiationassortingatonalaudition'sautobiographicallyavuncularbactriabalkingbandoleerbarbarity'sbarracudabastionsbazaar'sbefitbelievebeneficence'sbestiaries'bibliophiles'blackleggedbleats'bludgeoningboliviabopper'sboulevardbrandying
breastbone'sbridecake'sbroadcastingbrowbeatingbuffbullheadedness'sburdocksbushbabies'buzzers'cacklescalenderingcamelliascandelabrumscantilevers'cargocarsickness'scastaways'catechizedcave'scentaurs'chablis'chancel'scharismaticcheat'schessman
chimericalchroniccincturecleaverscliquierclottingcoadjutor'scochineal'scoercioncolonizescomfitscommodescompensatescomprehensive'sconciergesconduit'sconfucius'sconnection'sconsolidations'consumercontoursconvenecopyholdercornscorse's
cottontailcounterpointcovencrabgrass'scravatscremationcrispestcruciallycubiccupbearers'curtailment'scyclamatesdaffodils'dandruff'sdeathlesslydeceivers'decordefectsdegrees'delta'sdemoralizedepilation'sderidedesolatingdethroneddewlap'sdictaphones'dildosdipper'sdisassociatingdiscountsdisgruntlesdismemberment'sdisputesdistastefulness'sdivergences'docketsdogwooddoorcasesdozeddrawingsdruggistdues'duress'eaglets'editionseggcupselderflower'selision'sembezzlementemperorsencodesengineenslavesentries'epsom's
eructatedestimators'evaluatingexactitudeexcretionexilesexpletive'sextenuationeyedropfactorizedfalsehood'sfarrier'sfaultilyfeignedfestivals'fieldmicefilmstripsfirthsflierflorists'flutes'foldfooting'sforegroundforkliftfossilizedfoxhuntingfreshersfrizzlesfryer'sfundamentals'fustygallinggapgaslightsgazinggent'sgharrygirdersglibglutengodparent'sgoods'grabs'granule'sgreatestgrimacesguardedgullies'gutturalshaghankeringsharmonic'shatfulsheadachierhearth's
heehaws'henbanesheterogeneouslyhipposhogsheadshominghootershorsewhips'housefatherhungryhydraulics'idiomilluminationimmunizationimpingements'improbableinbornincompetencyindecipherability'sindividualistsinexorablyinflow'sinitiateinquisitivelyinspectorships'intakeinterjections'interringintroversion'sinvoiceirrelevances'iviesjanitor'sjerkyjobberjoules'jumblekappas'kibbutznikkinkedknight'skris'laddieslamps'lapelslatheslaxations'learnlegalizesleprechaun'sliability'sliedlights'linearliquorslivid
lockup'slongbowlorlovelornlugubriousness'luxuriantlymadeirasmaids'malcontentedmandarin'smanometricmargin'smarrowsmasseusematinsmaypole'smediatingmelodiousness'smercer'smestizos'mezzosmidshipmen'sminelaying'smisadventure'smisfires'missusmocha'smolehills'monkeys'mooniermorphemicsmotormouthwashesmulberrymuscatelmutt'snakednessnationalizationnearsightedness'negressesneurosesnicernile'snocturne'snonviolentnotaries'nuggets'nymphoobliteration'soccidentals'offeringokayingontology'sopposingordinalorthodontics'
outclassoutrages'overgrowths'overstatesoystercatcherspailfulspalliation'spanjabiparablepardoningparsonagespassionflowers'pathfinder'spawkiness'speelers'peninsulaperambulators'periphrasispersecutor'spesos'pharisaism'sphonecalls'physiologists'pillars'pinpointedpitchersplainchant'splatitudinouspleatplumberpoetasterpolicewomanpolypiportcullises'postcardpotionspractisedprecepts'prefabricationprepositionpressgangingprickedprioress'procrastinateprognosis'spromptnesspropositionsprotesterprunerspullets'punnetpursuespyrex's
quarantine'squestionnaires'quixoteraconteurs'railing'sraneesrate'sreams'reccesreconcilement'sredbreastrefereedrefutation'sreifiesrelictremountsrepayments'reproachedreservistrespondingretardationrevaluerevolvers'riddledrills'river'srondorotaroutines'ruefulruntssacking'ssages'salonssandals'sarongssauternesscallopedscientologistscowls'scripscurf'sseance'ssecretions'seesseneschal'ssepulchralservitudesextonshamelessnesssheaths'shiver'sshowrooms'
sibilant'ssiege'ssilksittings'skidpansskydiving'sslavers'slimslothfulness'ssmallssneerersociables'solariums'somnambulismsortie'ssouvenirspectresspigotspivviestspools'squanderer'sstabilizedstalingstaresstayedstiffenersstockbreeders'stoolies'straightensstreptomycinstupefactionsublieutenants'subtenants'suffragan'ssummers'suntanssupplicantsurprise'sswanksswordfishsyndicaliststaborstales'tannin'starsitaxonomytec'stelephotographtenderfoottermite'sthai'sthermoplastic'sthistledown's
thrivesthwartstigress'ting'stontinestorments'toughnesstracksuittrammelledtransliteratingtravelogue'strendsetters'trilogytruncatetulip'stweezertypedumpteensunctuousness'undersecretariesuneventfulunityunrollsuphillusherettesvagueness'svariance'svenalvermifugevibrancy'svillainviscountcyvodkavulgarianswaitwant'swarrenwatchdogwaterloggedweariedweighingwestchesterwherrywidenwinterierwitticismswoodworkwretchedlyyam'sziggurats'
Appendix 3, Sample Results X1
aaabacaa haa raa aa'bab aab bab cab dab eab iab lab nab oab rab uaa ba'cac bac eac hab e nab e rab e taa 'sab i mab i tab b aab l eab l yab b eab b yab o oab o uaa geab r aab r iab soab thab a dab u hab u lab u nab u tab wy
ab d iab d uac adab a sac caac daac deab e dac e dac e hac e pac e sac e yab e lac h eab e t sab horab a teab i chab i deaa r onab i neaa ltoab jurab kerab b a dab l ahab b a sab l e rab l e sab a baab b éab n erab n orab b e sab o boab o deab o feab o ieab b e yab o o nab o o tab o rtab b ieab o u dab o u rab o u t
ab o veab b otab ac kab r ac ab r a mab r a raa staab r i sab r ouab r usab a d iab d ekab d elab a ftab u jaab d i cab d ouab u seab a jaab u t sab u zzab d u lab yanab ylaab yssab d u rab d u sab a loac aiaab e amab e arac cesac choac comac corac craac cusab e baab a naab e ggab a ndab e l lac e neaa r auab a s eac e s 'ac e to
ab e r tab e stab a s hab a d laab a eteab o lieab o lusab o ndsab b ac yab d y'saa h oroab o radab o rdaab o rdeaa a wwwab o rt oab b a 'sab b a teab o u ndab b a yeab a bbaab ac h aaa chenab e l l eab e l 'sab r a deab b e 'sab r a m oab r a m sab b é sab r egeab b e s sab e r t oab r oadab r ochab a nozab r uptab b e y sab sentab a risab so rbab squeab stinab surdab goveab hand
Appendix 4, Sample Result List of X2
Prefixes Gathered.abaaac
Samples from Result Listaa ab ac aa haa raa aa'bab aab bab cab dab e
ab iab lab nab oab rab uaa b a'cac bac eac hab en
ab erab etaa 'sab imab itab baab leab lyab beac edac ehac ep
ac esac eyab elac heab etsab horab ateab ichab ideaa ronab ineaa lto
ab jurab imesab yaneab ydosab cdefab iolaab ishaab asesac aciaab jectac adiaab adesac arusab jureaa rgauac cedeac censac centac ceptab deraac cess
ab dhurac cipeac ciusab lainac consab lardac cordac costab laufac crueab lazeac cuseab atedab diasab atesab lestab di'sab ditaab doolac eras
ab atisab azaiac eticab oardab duciac eyteab ductac cordionab ilitiesab scessesab schwungab utilonsac costingab lehnungac couchedab utmentsab ernethyac countedab ominablyac etab ulum
ab erconwayab ominatedac cessibleab ominatesac etonemiaac etosellaab botstownac cessionsab solutelyab bassidesab solutionab solutismac cidentalac cidentedac haemenesac cidentlyab solutistab solutoryac cipenserab reac tion
ab erdare'sab delhadizab erdeen'sab delkaderac claimingac harniansab ricosoffac clerlateac climatedac cidentallac cidentaly
ab surditiesab dujaparovab delrahmanab breviatedac arnaniansab iogenesisac clamationab derrahmanac cusationsab sorbinglyac climatise
ac climatizeac clivitiesab delkhalekab yssiniansac coceberryab origines'ab sorptionsab andonmentac commodateac customaryab errations
ac comodatedac customingac comodatesac companiedac companiesac companistac celeratedac celeratesac celeratorac complicesab senteeism
Appendix 5, Sample Result of X3
ab d el karimab so rb ent sab o riginalab b ot t abadac com odateab yss inia nab o rigine sac com pli ceac com pli shab r i dgmentab so rptionab so rptiveab hor renceac cor d anceab a te ment sab e r gwyllyab d i c atingab stain ingab huechundac ad emi c alac cor d natsab stemiousab stentionab d i c ationab o rt ive lyab stin enceab n egatingab n egationac coucheesac coucheurab r ogatingab str ac t edac count antab str ac t lyac count ethac count ingab r ogationab ject nessab o u halimaab e r nathy sab d u ct ion sac anthi ansac creditedab jur ationab n or mal lyab d el la tifac cretion sac anthylisab r upt nessab a nd on ingab e r rationac arnania nab d el salamac cuminateac cumulateab th eilungab treibungac curate lyab d el wahab
ab d el wahebab schnittsac cus ationac cus atoryab o ard shipab o ve boardab e r yswithab d ominal sab scond ersac cede ntemab scond ingab khazi a n sac cus ing lyac celerateab khazi a 'sac cus tom e dab a nd on nedab u n dance sab b reviateab u n dant lyab d era maneab d alla h 'sac cent uateab e n dsternab o lish ingac e p halousac cept ab l e ac cept ab l y ac cept a nceac e rbitiesab d u r ahmanab o lition sab b e rationab sent ionsab o minableab r a m o witzab o minablyac e t ab u l umab e r conwayab o minate dac ces s ibleab o minate sac e to ne miaac e to sellaab b ot s townac ces s ion sab so lute lyab b a s side sab so lutionab so lutismac cident alac cident edac h aemenesac cident lyab so lutistab so lutoryac cipe nserab r e ac tionab e r dare 's
ab d el hadizab e r deen 'sab d el kaderac claim ingac h arniansab r i cosoffac clerlateac climatedac cident al lac cident al yab surd itiesab d u japarovab d el rahmanab b reviate dac arnania n sab i ogenesisab e r crombieac cumulate dac cumulate sac cumulatorab r i dge mentab e r nethy 'sab o ve groundab o eocritusab o riginal sac clamationab d errahmanac cus ation sab so rb ing lyac climatiseac climatizeac clivitiesab d el khalekab yss inia n sac coceberryab o rigine s 'ab so rption sab a nd on mentac com modateac cus tom aryab e r ration sac com odate dac cus tom ingac com odate sac com paniedac com paniesac com panistac celerate dac celerate sac celeratorac com pli ce sab sent e e ismab d u l rahmanab b e ration sab o rt ion istab i gail shipab e l white 'sab hominableac ad emi c al s
ac cent uate dac cent uate sac ad emi c ianac cor d in g lyac ad emi c ienac e s todorusab stention sab r a m o vitchab e r ystwithac cept a nce sab o minabiliac cept a tionab n or mal ityab stin ent iaab e r ystwythab a nd on ed lyab e r gavennyab str ac t ingac count ab l e ac count ancyab str ac t ionac count ant sab u sive nessab str ac t iveab d era hmaneac anthaceaeab so lution sac ces s oriesac ces s oriusab str ac t o rsac ces s placeab o minatingac creditingab o minationab o u nd ing lyab a ntidas 'sab so lute nessab str ac t nessac ad emi c ian sab i t urientenac clamation sab e n cerragesab d ou japarovac cept a tion sac couchementac climatise dab a nd on ment sac climatize dab struse nessab r i dge ment sab o mination sab o u t ivenessac count ant 'sab b ot s bury'sab o rt ion ist sac cede ntibusac com mod ab l e ab b reviatingac com modate d
Appendix 6a, Sample of Prefixes X4
deaxdgakdhayabdiadazalbadndod'dpdqdrdudw
dyeaebecedeee'efegeheiejekelbeemeneoepeq
ereseteuevewexeyezb'faambifdfekhkiklaskn
kokrktkukvkwkylal'aicrleaaliatcsctlkllcu
lnlolpluaulvlxlycxlzczmadam'maravam'ama'masdai
matmaumawaiwmaxavemaymazdaldammeamecdanmedmeemegmehmeimelmem
darm'emenmepmerme'mesademetmeudasmewmezavimfudatmhodaumiamic
davmidmiedawmihmikmilmimmindaymirmismitdazmixmizavoavrddamll
ajadeammemndmnlmoaawamobmocmoddebmoemohmoimojmokmoldecmonded
moomopdeemormosmotmoumovmowmozawedefmplmp'mramrcdegmro
Appendix 6b, Sample of Results from running X4 on Evaluation Set D1:
aab outaf ter al lan an dan yar eas
at be be en be fo rebu tby ca n co ulddi d
do fi rstfo rfr omgo odgr ea t ha d ha s ha ve
he he r hi mhi siifin in tois
it it skn owli k eli t tle ma d ema n ma y me
mor emos tmrmuc h mus tmynonotnow
ofononeonly orothe r ouroutover
sai dsee she sho uldsosomesuch tha n tha t
the the i rthe m the n the r ethe s ethe y thi stim e
totwoupuponusver ywas wewel l
wer ewha t whe n whi ch who wil lwit hwouldyou
your
Appendix 6c, Sample of Results from running X4 on Evaluation Set D2:
ac ce le r at eac cu rst ac tio n's ad j ad ult'sae ropla n es 'af ter ca r e' sag ree a b ly ai rle t ter sal ex ic s 'al lo wan ce 'sal umam ha r ic 'san al ec t s'san gl ic a ni sm' san nual 's an thro popha g ousap at he t ic a l ly ap pel la t io nsap ril s'ar ch er yar mle t sar tier as phy xia t io nas sortin gat onal au di tio n'sau tobi ogr ap hi ca l ly avu ncu la r ba c tria ba l ki n gba n do le e rba r bar i ty'sba r rac uda ba s tio nsba z aa r's be fi t be li ev e be nef ic e nce' s be stia r ies ' bi bl io phi le s ' bl ac kl eg ge d bl ea t s' bl udg eo nin g bo li v ia pit ch er s pla i nch an t's pla t it udi nous ple a t plu mbe r poet as ter poli c e woma n poly pi portcu ll i s es ' postca r d potio ns prac tis ed prec e pts'
pref ab ric a t io n
Appendix 7a, Sample of Prefixes X5
Prefixa'aaabacadaeafagahaiajak
alamanaoapaqarasatauavaw
axayazb'babebhbibjblbobr
bubwbybÃc'cacdcechciclcn
cocrcsctcucyczd'krkukwky
l'lalelilllolplulvlxlym'
mambmcmdmemlmmmnmompmsmu
Appendix 7b, Sample of Results from running X5 on Evaluation Set D1
ab outaf ter al lan an dan yar eas at be be en be fo rebu t
by ca n co ulddi ddo fi rstfo rfr omgo odgr ea t ha d ha s ha v e
he he r hi mhi siifin in tois it it skn owli k e
li t tle ma d ema n ma y me mo remo stmrmu ch mu stmy nonot
nowofononeonly orothe r ouroutover sai dsee she
sho uldsosome such tha n tha t the the i rthe m the n the r ethe s ethe y
thi stim etotwoupuponusver ywas wewel lwer ewha t
whe n whi ch who wil lwit hwouldyouyour
Appendix 7c, Sample of Results from running X5 on Evaluation Set D2:
ac ce le r at e ac cu rst ac tio n's ad j ad ult's ae ropla n es ' af ter ca r e' s ag ree a b ly ai rle t ter s al ex ic s ' al lo wan ce 's al um am ha r ic ' s an al ec t s's ba l ki n g ba n do le e r ba r barit y's ba r rac uda ba s tio ns ba z aa r's be fi t be li ev e be nef ic e nce' s
br id e ca k e' s br oad ca s tin g br owbe at in g bu ff bu ll he a d ed nes s's bu rdo cks bu shba b ies ' bu zzer s' ca c kl es co nci er ge s co ndu it 's co nfu ci us's co nnec t io n's co nsoli d a t io ns' co nsume r co ntours co nven e co pyho lde r de at hle s sly de ce iv er s' de co r de fe c t s
de gr ee s' ex ac tit ude ex cr et io n ex il es ex ple t iv e' s ex ten uat io n ey ed r op fa c toriz ed fa l seh ood' s fa r rier 's nonvio le n t per am bu la t ors' per ip hras is per sec u tor's pes os' pha r is ai sm' s pho nec a l ls' phy sio lo gi sts' sau ter nes souven ir strep tomy ci n stupef ac tio n subl ieu ten an ts'
subten an ts' tel ep ho togr ap h ten de rfo ot ter mi t e' s tha i 's thi stle d o wn's thriv es thw ar ts tig r es s' tin g's unct uousnes s' unit y unroll s uphi ll ushe r et tes vag uen es s's var ia n ce 's wea r ied wei gh in g wes tch ester whe r ry wid e n win ter ier
wit tic i sms woodw ork
wret ch ed ly yam 's
Appendix 8a, Sample of Prefixes X6
utauteuthutiutoutsuttutuutyutzuumuusuutuvuvauveuwuxuxeuyuyauyeuysuytuyuuz
uzauzeuzhuziuzuuzyuzzv'nv'svavacvaevahvaivalvamvanvarvasvatvaxvayvdvevecved
vegvelvenvervesvetvexveyvezviviavicvidvieviivikvilvinviovirvisvitvivvixvkavna
vovodvolvonvorvosvotvowvoyvrvravrevrovryvsvs'vskvtvuvuevumvusvyvynw'dw's
wawahwakwalwamwanwarwaswatwaxwaywbawbywchwdwdsweweawebwedweewegweiwelwenwer
wesweywfuwhwiwiewigwinwkwkswlwlswlywmwnwndwnewnswnywogwolwoowrwsws'wse
wsywthwulwywydwynwzyx'sxaxasxexedxeixelxenxerxesxeyxixiaxicxiexiffusxiixinxip
xirxisxitxivxixxoxonxorxtxtaxtsxusxvxvixxxyxyly'dy'syayae
Appendix 8b, Sample of Results from running X6 on Evaluation Set D1
aab ou taf te r al lan an dan yar eas at be be en be fo re bu tby ca n co ul d di ddo fi rs tfo rfro m
go od gr ea t ha d ha s ha v ehe he r hi mhi siif in in to is it it skno wli k eli t tl e ma d ema n ma y
me mo re mo st mr mu ch mu st my no no tno wof on on e on ly or ot he r ou rou tov er sa i dse e sh e
sh o ul d so so m e su ch th a n th a t th e th e i rth e m th e n th e r eth e s eth e y th i sti m eto twoup up o n us ve r ywa s
Appendix 8c, Reversed Sample of Results of X6 on D2:
tu ob a et ar el ec c a ts r u cc a s' no it ca jda s' tl u da 's en al po r ea s' er ac re t fa yl ba e er ga sr e t te l ri a 's ci xe l a s' ec n aw ol l a mu la s' ci ra h ma s' s tc e la n a s' msi n ac il gn a s' la u nn a su og a h po p or ht na yl l a c it eh ta p a sn o it al le p pa 's li r pa yr e h cr a st el mr a re i tr a no it ai xy hp sa gn it ro s sa la n ot a
s' no it id u a yl l a c ih p ar go ib o tu a ra l ucn uv a ai rt ca b gn ik la b re e lo dn ab s' yt i r ab ra b ad uca r ra b sn o it sa b s' ra a za b ti f e b ev ei le b s' ec n eci fe n eb 's ei ra i ts e b 's el ih p oi l bi b de gg e l kc a l b 's ta e lb gn in oe g du lb ai vi l ob s' re p po b dr av el uo b gn iyd n ar b s' en ob 's re v el it na c og r ac s' s se n kc i sr a c 's ya w at sa c
de zi h c e ta c s' ev ac
's ru at ne c 's il ba h c s' le c n ah c ci ta m si r ah c s' ta e hc na m ss e h c la c ir em ih c ci no rh c er ut c n ic sr e v ae lc re i uqil c gn it to l c s' ro t uj da o c s' la e ni h c o c no ic r eo c se z in ol o c st if mo c se d o mm oc se t as ne p mo c s' ev is ne h er
Appendix 8d, Re-Reversed Sample of Results of X6 on D2:
a bo ut a c ce le ra te a cc u r st ac ti on 's adj ad u lt 's ae r op la ne s' af t er ca re 's ag re e ab ly a ir l et t e rs a l ex ic s' a l lo wa n ce 's al um am h ar ic 's a n al e ct s 's a ng li ca n ism 's a nn u al 's an th ro p op h a go us a p at he ti c a l ly ap p el la ti o ns
ap r il s' a rm le ts a rt i er as ph yx ia ti on as s or ti ng a to n al a u di ti on 's a ut o bi og ra p hi c a l ly a vu ncu l ar b ac tr ia b al ki ng ba nd ol e er b ar ba r i ty 's b ar r acu da b as ti o ns b az a ar 's b e f it b el ie ve be n ef ice n ce 's b e st i ar ie s' b ib l io p hi le s'
b l a ck l e gg ed bl e at s' bl ud g eo ni ng bo l iv ia b op p er 's b ou le va rd b ra n dyi ng br e a st bo ne 's br i de ca ke 's br o ad c as ti ng b r owb ea ti ng bu ff bu llh e ad ed n es s 's bu rd o c ks bu sh b ab ie s' b uz z er s' c a ck l es ca le n de ri ng ca m e ll i as ca n de l a br u ms c an ti le v er s'
ca r go c a rs i ck n es s 's c as ta w ay s' c at e c h iz ed ca ve 's c en ta ur s' c h ab li s'
c ha n c el 's c ha r is m at ic ch e at 's c h e ss m an c hi me ri c al c hr on ic ci n c tu re cl ea v e rs c liqu i er c l ot ti ng c o ad ju t or 's c o c h in e al 's
Appendix 9a: Sample of Prefixes X7
'd'l'm'n'r's'taaabacadaeaf
agahaiakakrakualamanaoapaqar
asatauavawaxayazbabbbebibo
bsbtbubycacechcickcocqcrcs
ctcydadddedidldndodsdtdudy
eaebecedeeefegeheiekelemen
ensentenyeoepeqerereergesesketeu
eumevewexeyezfafefffifo
Appendix 9b, Sample of Results from running X7 on Evaluation Set D1
c r is p e st cr uc i a l ly c ub ic c upb ea r er s' c ur t a il m e nt 's cyc la ma t es d a ff o d il s' da nd ru ff 's d ea th l es s ly d e c ei v er s' d e c or d ef ec ts d e gr ee s' d el ta 's d e mo ra li ze de pi la ti on 's d er i de de so la ti ng de t hr on ed d ew l ap 's d ic ta p ho ne s' di l d os d ip p er 's di s a ss oc ia ti ng di s co un ts d is gr u nt l es di sme m b er m e nt 's di s pu t es d is t as t ef ul n es s 's di v er ge n ce s' d oc ke ts
d ogwo od d o or ca s es d oz ed d ra wi n gs d ru g g i st d ue s' du r es s' e ag l et s' e di ti o ns e ggc u ps e l de rf lo w er 's e li si on 's e m b e zz le m e nt em pe r o rs e nc o d es e n gi ne e ns la v es e ntr ie s' e ps om 's er uc t at ed es ti ma t or s' e v al ua ti ng e x ac t it u de exc re ti on e xi l es exp le ti ve 's ex t en ua ti on ey ed r op f ac to r iz ed f al s e ho od 's f a rr i er 's fa ul t i ly fe i gn ed
f es ti v al s' f ie ldm i ce f i l m st r i ps f ir t hs fl i er f lo r i st s' f lu te s' fo ld f oo ti ng 's fo r eg r ou nd fo rkl i ft f os si l iz ed f oxh un ti ng f r es h e rs f r i zz l es fr y er 's f un da m en t al s' f u s ty g al li ng g ap ga sl i gh ts ga zi ng g e nt 's g h a r ry gi r d e rs gl ib g lu t en g od pa r e nt 's go od s' g r ab s' g r a nu le 's gr ea t e st g ri m a c es
g ua rd ed gu ll ie s' gu t tu r a ls h ag h an ke ri n gs ha rm on ic 's h atf u ls h ea d a ch i er h e ar th 's h ee h aw s' h en ba n es he te ro ge n e ou s ly h ip p os h og sh e a ds ho mi ng h oo t e rs h or s ew h ip s' h ou se f at h er hu n g ry h yd ra ul ic s' i di om i l lu mi na ti on i mmu ni za ti on im p in ge m e nt s' i mp ro b ab le i n bo rn inc om pe te n cy i nd e c ip he r a bi l i ty 's i n d iv id ua l is ts i n exo r ab ly i nf l ow 's
i n it ia te i nq ui si ti v e ly in sp ec t or s h ip s' in ta ke in t erj ec ti on s' in t er ri ng i nt ro v er si on 's in vo i ce ir re le va n ce s' iv i es
ja ni t or 's je r ky jo b b er j ou le s' ju mb le k ap pa s' ki bb u tz n ik ki nk ed kn i g ht 's k ri s' l a dd i es
l a mp s' la p e ls l at h es la xa ti on s' l ea rn le ga li z es l epr ec ha un 's l i a bi l i ty 's li ed l i g ht s' l in e ar
liq u o rs l iv id l o ck up 's lo ng b ow l or lo v e lo rn l ug ub r io us n es s' l ux ur i an
Appendix 9c, Sample of Results from running X7 on Evaluation Set D2
a bo ut ac ce le ra te acc u r st ac ti on 's adj ad u lt 's ae r op la ne s' af t er ca re 's ag re e ab ly a ir l et t e rs a l ex ic s' a to n al a u di ti on 's a ut o bi og ra p hi c a l ly avu ncu l ar b a ctr ia b al ki ng ba nd ol e er b ar ba r i ty 's b ar r acu da b as ti o ns b az a ar 's
b e f it b el ie ve be n ef ice n ce 's b e st i ar ie s' b ib l io p hi le s' b l a ck l e gg ed bl e at s' bl ud g eo ni ng bo l iv ia b op p er 's b ou le va rd b ra nd yi ng br e a st bo ne 's br i de ca ke 's br o ad c as ti ng b r owb ea ti ng bu ff bu llh e ad ed n es s 's bu rd o c ks bu sh b ab ie s' b uz z er s' c a ck l es
ca le n de ri ng ca m e ll i as ca n de l abr u ms c an ti le v er s' ca r go c a rs i ck n es s 's c as ta w ay s' c at e c h iz ed ca ve 's c en ta ur s' c h ab li s' c ha n c el 's c ha r is m at ic ch e at 's c h e ss m an c hi me ri c al c hr on ic ci n c tu re cl ea v e rs c liqu i er c l ot ti ng c o ad ju t or 's c o c h in e al 's
c oe r ci on c o lo ni z es c om fi ts co mm o d es c om p en sa t es c o mp re h en si ve 's co nc i er g es co nd u it 's co nf u c iu s 's co nn ec ti on 's con so li da ti on s' c o n su m er c on to u rs c on ve ne c o py ho l d er co r ns c or se 's c o tt on t a il co un te rp o i nt co v en
Appendix 10, Results of algorithm XX1
aboutaccelerate ac curstaction 'sad jad ult'saero planes'after care'sagreeably air lettersale xics'allow ance'salum am haric'sana lects'sang licanism'san nual'san thropophagousapathetic allyappellation sapril s'archery ar mletsar tierasp hyxiationas sortingat onalaudit ion'saut obiographicallyav uncularba ctriaba lkingband oleerbarbarity 'sba rracudabast ionsba zaar'sbe fitbelieve bene ficence'sbest iaries'bi bliophiles'blackleggedbleat s'bludgeon ingbolivia bo pper'sboulevard brand yingbreast bone'sbride cake'sbroadcast ingbrow beatingbuffbull headedness'sbur docksbus hbabies'buzzers'ca cklescalender ingcame lliasca ndelabrumsca ntilevers'cargo ca rsickness'scast aways'
ca techizedcave 'scentaur s'ch ablis'chancel 'sch arismaticche at'sche ssmanchime ricalchronic ci ncturecleave rscl iquierclo ttingcoadjutor 'scochin eal'scoercion colon izescom fitscom modescompensate scomprehensive 'scon ciergescon duit'sconfucius 'scon nection'scon solidations'con sumercontour scon venecopy holderco rnsco rse'scot tontailcounter pointcove ncrab grass'scravats cremation crisp estcrucial lycub iccup bearers'cur tailment'scyclamatesda ffodils'da ndruff'sdeath lesslydeceive rs'dec ordefects degrees 'delta 'sdemoralize de pilation'sder idedesolating dethroned dew lap'sdicta phones'di ldosdip per'sdis associatingdiscounts dis gruntlesdis memberment's
dispute sdistasteful ness'sdive rgences'doc ketsdog wooddoor casesdoze ddrawing sdrug gistdue s'du ress'eagle ts'edition seggcupselder flower'seli sion'sem bezzlementemperor sen codeseng ineenslave sentries 'epsom'ser uctatedest imators'eva luatingex actitudeexcretion exile sex pletive'sext enuationeye dropfacto rizedfalse hood'sfa rrier'sfa ultilyfeign edfe stivals'fi eldmicefilms tripsfirth sflier florist s'flute s'fol dfoot ing'sfore groundfor kliftfossil izedfox huntingfresh ersfr izzlesfry er'sfun damentals'fust ygalling ga pga slightsgazing ge nt'sgharrygird ersglib glut engod parent'sgoo ds'
gr abs'gr anule'sgreatest grim acesguard edgull ies'gut turalsha ghank eringsha rmonic'sha tfulshe adachierhearth 'she ehaws'he nbanesheterogeneous lyhip poshog sheadshoming hoot ershorse whips'house fatherhun gryhydra ulics'idiom illumination im munizationimp ingements'imp robablein bornin competencyind ecipherability'sind ividualistsinexorably in flow'sinitiate inquisitive lyinspector ships'intake in terjections'in terringin troversion'sin voiceirrelevances'iv iesjanitor 'sjerky job berjo ules'jumble
ka ppas'kibbutznikkin kedknight 'skr is'la ddieslamps 'la pelslath esla xations'le arnlegalize sle prechaun'sliability 'sli edlights 'line arliquor sli vidloc kup'slon gbowlor love lornlugubrious ness'luxuriantly madeira sma ids'ma lcontentedma ndarin'sma nometricma rgin'smarro ws
masse usema tinsma ypole'sme diatingmelodious ness'sme rcer'sme stizos'me zzosmi dshipmen'smine laying'smisadventure 'smi sfires'missus mo cha'smol ehills'mon keys'moon iermor phemicsmot ormouth washesmulberry mus catelmu tt'sna kednessnation alizationne arsightedness'ne gressesneuro sesni cerni le'sno cturne'snon violentnot aries'nuggets 'nymph oobliteration 'soccidentals'of feringok ayingonto logy'sop posingor dinalor thodontics'out classout rages'over growths'over statesoyster catcherspa ilfulspa lliation'spa njabipara blepa rdoningpa rsonagespa ssionflowers'path finder'spa wkiness'spe elers'pe ninsulape rambulators'pe riphrasispe rsecutor'speso s'
ph arisaism'sph onecalls'physiologists 'pi llars'pi npointedpi tcherspl ainchant'spl atitudinouspl eatplum berpo etasterpo licewomanpo lypipo rtcullises'post cardpo tionspractise dprecepts 'pre fabricationpre positionpre ssgangingpri ckedpri oress'pro crastinatepro gnosis'spro mptnesspropositi onsprotest erpr unerspull ets'pun netpursue spyre x'squ arantine'sque stionnaires'quixote ra conteurs'ra iling'sra neesrate 'sre ams're ccesre concilement'sre dbreastre fereedre futation'sre ifiesre lictre mountsre payments're proachedre servistre spondingre tardationre valuerevolve rs'riddle drill s'rive r'sro ndoro tarout ines'rueful
run tssa cking'ssa ges'sa lonssa ndals'sa rongssaute rnessc allopedsc ientologistsc owls'sc ripsc urf'sse ance'sse cretions'se esse neschal'sse pulchralse rvitudese xtonsh amelessnesssheath s'sh iver'ssh owrooms'si bilant'ssi ege'ssi lksi ttings'sk idpanssk ydiving'sslave rs'slim slothful ness'ssmall ssneer erso ciables'so lariums'so mnambulismso rtie'sso uvenirsp ectressp igotsp ivviestsp ools'sq uanderer'sst abilizedst alingst aresst ayedst iffenersst ockbreeders'st oolies'str aightensstr eptomycinst upefactionsub lieutenants'sub tenants'su ffragan'ssum mers'sun tanssup plicantsur prise'ssw ankssw ordfish
syndicaliststa borsta les'ta nnin'sta rsita xonomyte c'ste lephotographte nderfootte rmite'sth ai'sth ermoplastic'sth istledown'sthrive sth wartsti gress'ti ng'sto ntinestorments 'to ughnesstr acksuittr ammelledtr ansliteratingtr avelogue'str endsetters'tr ilogytr uncatetu lip'stweezertype dum pteensunctuous ness'un dersecretariesun eventfulun ityun rollsup hillus herettesva gueness'sva riance'sve nalve rmifugevi brancy'svi llainvi scountcyvo dkavu lgarianswa itwa nt'swa rrenwa tchdogwa terloggedwe ariedwe ighingwe stchesterwherrywidenwinterierwitticismswoodworkwretchedlyyam'sziggurats'
Appendix 11, Results of XX2
ab outaccelerateaccurstaction'sad jadult'saeroplanes'aftercare'sagree ablyair lettersalexics'allowance'sal umamharic'sanalects'sanglicanism'sannual'santhropophagousapathetic allyappellation saprils'archer yarmletsar tierasphyxiationas sortingatonalaudition'sautobiographical lyavuncularbactriabal kingbandoleerbarbarity'sbarracudabast ionsbazaar'sbe fitbelievebeneficence'sbestiaries'bibliophiles'blackleggedbleats'bludgeon ingbo liviabopper'sboulevardbrand yingbreastbone'sbridecake'sbroadcast ingbrow beatingbuffbullheadedness'sbur docksbushbabies'buzzers'cacklescalender ingcamelliascandelabrumscantilevers'car gocarsickness'scastaways'catechized
cave'scentaurs'chablis'chancel'scharismaticcheat'schessmanchimericalchroniccincturecleave rscliquierclottingcoadjutor'scochineal'scoercioncolonizescom fitscom modescompensate scomprehensive'sconcierge sconduit'sconfucius'sconnection'sconsolidations'consume rcontour sconvenecopy holdercornscorse'scotton tailcounter pointcove ncrabgrass'scravat scremationcrisp estcrucial lycubiccupbearers'curtailment'scyclamatesdaffodils'dandruff'sdeathlesslydeceivers'dec ordefect sdegrees'delta'sdemoralizedepilation'sder idedesolatingdethroneddewlap'sdictaphones'dildosdipper'sdis associatingdis countsdisgruntlesdismemberment'sdispute sdistastefulness's
divergences'docket sdog wooddoor casesdoze ddrawing sdrug gistdues'duress'eaglets'edition seggcupselderflower'selision'sembezzlementemperor sen codesengineenslave sentries'epsom'seructatedestimators'evaluatingexactitudeexcretionexile sexpletive'sextenuationeye dropfactorizedfalsehood'sfarrier'sfaultilyfeign edfestivals'fieldmicefilms tripsfirth sflierflorists'flutes'fol dfooting'sfore groundforkliftfossilizedfox huntingfresher sfrizzlesfryer'sfundamentals'fust ygall ingga pgas lightsgazinggent'sgharrygirdersglibglut engod parent'sgoods'grabs'granule'sgreat est
grim acesguard edgullies'guttural sha ghankering sharmonic'shatfulsheadachierhearth'sheehaws'henbanesheterogeneous lyhipposhogshead sho minghootershorsewhips'house fatherhungryhydraulics'idiomilluminationimmunizationimpingements'im probablein bornin competencyindecipherability'sindividualistsinexorablyinflow'sinitiateinquisitive lyinspector ships'in takeinterjections'inter ringintroversion'sin voiceirrelevances'iviesjanitor'sjerkyjob berjoules'jumblekappas'kibbutznikkinkedknight'skris'lad dieslamps'lap elslath eslaxations'lear nlegalize sleprechaun'sliability'sli edlights'line arliquor sliv idlockup's
long bowlo rlove lornlugubriousness'luxuriantlymadeira smaids'mal contentedmandarin'smanometricmargin'smarrow smasse usema tinsmaypole'smedia tingmelodiousness'smercer'smestizos'mezzosmidshipmen'sminelaying'smisadventure'smisfires'missusmocha'smolehills'monkeys'mooniermorphemicsmot ormouth washesmulberrymuscatelmutt'snaked nessnationalizationnearsightedness'negressesneuro sesnice rni le'snocturne'snon violentnotaries'nuggets'nymph oobliteration'soccidentals'offer ingokay ingontology'sop posingordinalorthodontics'out classoutrages'overgrowths'over statesoyster catcherspailfulspalliation'spanjabipar ablepardon ingparson ages
passionflowers'pathfinder'spawkiness'speelers'peninsulaperambulators'periphrasispersecutor'spesos'pharisaism'sphonecalls'physiologists'pillars'pin pointedpitcher splainchant'splatitudinouspl eatplum berpoetasterpolice womanpolypiportcullises'post cardpo tionspractise dprecepts'pre fabricationpre positionpressgangingprickedprioress'procrastinateprognosis'sprompt nesspro positionsprotest erprune rspullets'pun netpursue spyre x'squarantine'squestionnaires'quixoteraconteurs'railing'sraneesrat e'sreams'reccesreconcilement'sred breastreferee drefutation'sreifiesrelic tre mountsrepayments'reproach edreservistrespond ingretardationre valuerevolvers'riddle d
rills'rive r'sron doro taroutines'rue fulrun tssac king'ssages'salon ssandals'sarongssauternesscallopedscientologistscowls'sc ripscurf'sseance'ssecretions'se esseneschal'ssepulchralservitudesex tonshamelessnesssheaths'shiver'sshowrooms'sibilant'ssiege'ssilksittings'skid pansskydiving'sslavers'slimslothfulness'ssmall ssneer ersociables'solariums'somnambulismsortie'ssouvenirspectre sspigotspivviestspools'squanderer'sstabilizedstalin gstare sstay edstiffenersstockbreeders'stoolies'straighten sstreptomycinstupefactionsublieutenants'subtenants'suffragan'ssummers'suntanssupplicant
surprise'sswankssword fishsyndicaliststaborstales'tannin'star sitaxonomytec'stelephotographtender foottermite'sthai'sthermoplastic'sthistledown'sthrive sth wartstigress'ting'stontinestorments'tough nesstracksuittrammelledtransliteratingtravelogue'strendsetters'trilogytruncatetulip'stweezertype dumpteensunctuousness'under secretariesun eventfulunit yun rollsup hillusherettesvagueness'svariance'svena lvermifugevibrancy'svill ainviscountcyvodkavulgarianswa itwant'swarrenwatch dogwater loggedweariedweigh ingwest chesterwherrywi denwinterierwitticism swood workwretched lyyam'sziggurats'
Appendix 12, Results list for XX3
ab outaccelerateaccurstaction'sadjadult'saeroplanes'aftercare'sagree ablyair lettersalexics'allowance'sal umamharic'sanalects'sanglicanism'sannual'santhropophagousapathetic allyappellationsaprils'archeryarm letsar tierasphyxiationas sortingatonalaudition'sautobiographical lyavuncularbactriabalk ingbandoleerbarbarity'sbarracudabast ionsbazaar'sbe fitbelie vebeneficence'sbestiaries'bibliophiles'black leggedbleats'bludgeon ingb oliviabopper'sboulevardbran dyingbreastbone'sbridecake'sbroadcast ingbrow beatingbuffbullheadedness'sbur docksbushbabies'buzzers'cacklescalender ing
camelliascandelabrumscantilevers'c argocarsickness'scastaways'catechizedcave'scentaurs'chablis'chancel'scharismaticcheat'schess manchimericalchroniccincturecleave rscliquierclottingcoadjutor'scochineal'scoercioncolonizescom fitscom modescompensatescomprehensive'sconciergesconduit'sconfucius'sconnection'sconsolidations'consumercon toursconvenecopy holdercornscorse'scotton tailcounter pointc ovencrabgrass'scravatscremationcrisp estcrucial lycubiccupbearers'curtailment'scyclamatesdaffodils'dandruff'sdeathless lydeceivers'dec ordefectsdegrees'delta'sdemoralize
depilation'sde ridedesolatingdethroneddewlap'sdictaphones'dildosdipper'sdis associatingdis countsdisgruntlesdismemberment'sdisputesdistastefulness'sdivergences'docketsdogwooddoor casesdozeddrawingsdrug gistdues'duress'eaglets'edit ionseggcupselderflower'selision'sembezzlementemperorsen codesengineen slavesentries'epsom'seructatedestimators'evaluatingexactitudeexcretionex ilesexpletive'sextenuationeye dropfactorizedfalsehood'sfarrier'sfaultilyfeign edfestivals'field micefilm stripsfir thsflierflorists'flutes'f oldfooting'sfore groundfork lift
fossilizedfox huntingfreshersfrizzlesfryer'sfundamentals'fu stygall ingg apgas lightsgazinggent'sg harrygirdersg libglut engod parent'sgoods'grabs'granule'sgreat estgrim acesguard edgullies'gutturalsh aghankeringsharmonic'shatfulsheadachierh earth'sheehaws'henbanesheterogeneous lyhipposhogs headsho minghootershorsewhips'house fatherhungryhydraulics'idiomilluminationimmunizationimpingements'improbablein bornin competencyindecipherability'sindividualistsinexorablyinflow'sinitiateinquisitive lyinspector ships'in takeinterjections'inter ringintroversion's
in voiceirrelevances'iviesjanitor'sjerkyjob berjoules'j umblekappas'kibbutznikkinkedk night'skris'lad dieslamps'lap elslath eslaxations'l earnlegalizesleprechaun'sliability'sli edlights'line arliquorsliv idlockup'slong bowl orlove lornlugubriousness'luxuriant lymadeirasmaids'mal contentedmandarin'smanometricmargin'sm arrowsmasse usemat insmaypole'smedia tingmelodiousness'smercer'smestizos'mezzosmidshipmen'sminelaying'smisadventure'smisfires'miss usmocha'smolehills'monkeys'mooniermorphemicsmot ormouth washesmulberrymuscatel
mutt'snaked nessnationalizationnearsightedness'negress esneuro sesnicerni le'snocturne'snon violentnotaries'nuggets'nymphoobliteration'soccidentals'offer ingokay ingontology'sop posingordinalorthodontics'outclassoutrages'overgrowths'over statesoyster catcherspailfulspalliation'spanjabip arablepardon ingparson agespassionflowers'pathfinder'spawkiness'speelers'peninsulaperambulators'periphrasispersecutor'spesos'pharisaism'sphonecalls'physiologists'pillars'pin pointedpitchersplainchant'splatitudinouspl eatp lumberpoetasterpolicewomanpolypiportcullises'post cardpot ionspractisedprecepts'pre fabricationpre positionpressganging
prick edprioress'procrastinateprognosis'sprompt nesspro positionsprotest erprune rspullets'pun netpursuespyrex'squarantine'squestionnaires'quixoteraconteurs'railing'sraneesrate'sreams'reccesreconcilement'sredbreastrefereedrefutation'sreifiesrelictre mountsrepayments'reproach edreservistrespond ingretardationre valuerevolvers'riddledrills'river'sron doro taroutines'rue fulrun tssac king'ssages'salonssandals'sarongssauternesscallopedscientologistscowls'sc ripscurf'sseance'ssecretions'se esseneschal'ssepulchralservitudesex tonshameless ness
sheaths'shiver'sshowrooms'sibilant'ssiege'ssilksittings'skid pansskydiving'sslavers's limslothfulness'ssmallssneer ersociables'solariums'somnambulismsortie'ssouvenirspec tresspigotspivviestspools'squanderer'sstabilizedstalingstar esstay edstiffenersstockbreeders'stoolies'straightensstreptomycinstupefactionsublieutenants'subtenants'suffragan'ssummers'suntanssupplicantsurprise'sswankssword fishsyndicaliststaborstales'tannin'star sitaxonomytec'stelephotographtender foottermite'sthai'sthermoplastic'sthistledown'sthrivesth wartstigress'ting'stontinestorments'
tough nesstrack suittrammelledtransliteratingtravelogue'strendsetters'trilogytruncatetulip'stweezertyped
umpteensunctuousness'under secretariesun eventfulunityun rollsup hillusherettesvagueness'svariance'sven al
vermifugevibrancy'svilla inviscountcyvodkavulgarianswa itwant'swarrenwatch dogwater logged
weariedweigh ingwest chesterwherrywid enwinterierwitticismswoodworkwretched lyyam'sziggurats'
Appendix 13, Results list for XX4
aboutaccelerateac curstaction 'sad jad ult'saeroplanes 'after care'sagreeablyair lettersalex ics'allowance 'salumam haric'sanal ects'sang licanism'sannual 'san thropophagousapathetic allyappellation sapril s'archer yar mletsar tierasp hyxiationas sortingat onalaudit ion'sautobiographical lyav uncularba ctriabal kingband oleerbarbarity 'sbar racudabastion sbazaar 'sbe fitbelievebene ficence'sbest iaries'bi bliophiles'blackleggedbleat s'bludgeon ingboliviabo pper'sboulevardbrandy ingbreast bone'sbride cake'sbroadcastingbrow beatingbuffbull headedness'sbur docksbush babies'buzzers'cac klescalender ing
camelliascan delabrumscan tilevers'cargocar sickness'scast aways'cat echizedcave 'scentaur s'ch ablis'chancel 'schar ismaticcheat 'sche ssmanchimericalchronicci ncturecleave rscl iquierclo ttingcoadjutor 'scochin eal'scoercioncolon izescom fitscommode scompensate scomprehensive 'sconcierge sconduit 'sconfucius 'sconnect ion'scon solidations'consume rcontour sconvenecopy holderco rnsco rse'scotton tailcounter pointcove ncrab grass'scravat scremationcrisp estcrucial lycubiccup bearers'curtail ment'scyclamatesda ffodils'dan druff'sdeath lesslydeceiver s'dec ordefect sdegrees 'delta 'sdemoralize
de pilation'sder idedesolatingdethroneddew lap'sdicta phones'di ldosdipper 'sdis associatingdiscountsdis gruntlesdis memberment'sdispute sdistasteful ness'sdivergence s'docket sdog wooddoor casesdoze ddrawing sdruggistdues 'du ress'eagle ts'edition seggcupselder flower'seli sion'sem bezzlementemperor sen codesengineenslave sentries 'epsom'ser uctatedest imators'evaluatingexactitudeexcretionexile sex pletive'sextenuationeye dropfactor izedfalsehood 'sfarrier 'sfa ultilyfeign edfestival s'fi eldmicefilms tripsfirth sflierflorist s'flutes 'fol dfooting 'sforegroundfor klift
fossil izedfox huntingfresher sfr izzlesfry er'sfundamental s'fust ygallingga pgas lightsgazinggen t'sgharrygird ersglibglut engod parent'sgoods 'grab s'gran ule'sgreatestgrimace sguard edgullies 'guttural sha ghankering sha rmonic'shat fulshead achierhearth 'shee haws'hen banesheterogeneous lyhip poshogshead shominghoot ershorse whips'house fatherhungryhydraulic s'idiomilluminationim munizationimpinge ments'improbablein bornincompetencyinde cipherability'sindividual istsinexorablyin flow'sinitiateinquisitive lyinspectorship s'intakeinter jections'inter ringin troversion's
in voiceirrelevances'iv iesjanitor 'sjerkyjob berjo ules'jumbleka ppas'kibbutznikkin kedknight 'skr is'lad dieslamps 'lap elslathe slax ations'lear nlegalize sle prechaun'sliability 'sli edlights 'line arliquor sliv idloc kup'slong bowlo rlove lornlugubrious ness'luxuriantlymadeira smaids 'mal contentedman darin'sman ometricmargin 'smarrow smasse usemat insmay pole'smedia tingmelodious ness'smer cer'smes tizos'me zzosmid shipmen'smine laying'smisadventure 'smis fires'missusmo cha'smole hills'monkey s'moon iermor phemicsmot ormouth washesmulberrymus catel
mu tt'snaked nessnational izationnear sightedness'ne gressesneurosesnice rnil e'snocturne 'snon violentnot aries'nuggets 'nymph oobliteration 'soccidentals'offeringokay ingonto logy'sopposingor dinalor thodontics'out classoutrages 'over growths'over statesoyster catcherspail fulspalliation 'span jabiparablepardon ingparsonage spassion flowers'path finder'spaw kiness'speel ers'peninsulaper ambulators'per iphrasispersecutor 'spesos 'ph arisaism'sphone calls'physiologists 'pillar s'pin pointedpitcher splain chant'spla titudinousplea tplumb erpoet asterpolice womanpolypipor tcullises'postcardpotion spractise dprecepts 'pre fabricationpre positionpres sganging
prickedprior ess'proc rastinatepro gnosis'sprompt nessproposition sprotest erpruner spullet s'pun netpursue spyre x'squarantine 'squestionnaire s'quixotera conteurs'railing 'sran eesrat e'sreams 're ccesreconcile ment'sred breastreferee drefutation 'sre ifiesrelic tre mountsrepay ments'reproachedres ervistrespondingretardationre valuerevolver s'riddle drill s'rive r'sron doro taroutine s'ruefulrun tssacking 'ssages 'salon ssandal s'sar ongssaute rnessc allopedsci entologistsc owls'sc ripsc urf'sseance 'ssecretion s'see ssen eschal'ssepulchralservitudesextonshame lessness
sheaths 'shiver 'sshow rooms'si bilant'ssiege 'ssi lksitting s'skid panssky diving'sslave rs'slimslothful ness'ssmall ssneer ersociable s'solar iums'somnambulismso rtie'ssouvenirspectre ssp igotsp ivviestspool s'squander er'sstabilizedstalin gstare sstayedstiffen ersst ockbreeders'stool ies'straighten sstr eptomycinstupefactionsub lieutenants'sub tenants'su ffragan'ssummer s'sun tanssup plicantsurprise 'sswan kssword fishsyndicaliststab orstales 'tan nin'starsitax onomyte c'stel ephotographtender footter mite'sthai 'sther moplastic'sthistle down'sthrive sth wartsti gress'ting 'ston tinestorments '
tough nesstr acksuittram melledtr ansliteratingtravel ogue'strends etters'trilogytr uncatetulip 'stweezertype d
um pteensunctuous ness'under secretariesuneventfulunit yunroll suphillusher ettesvague ness'svariance 'svena l
ver mifugevi brancy'svilla invis countcyvodkavulgar ianswa itwan t'swarrenwatchdogwater logged
weariedweighingwest chesterwherrywide nwinter ierwitticism swood workwretchedlyya m'szig gurats'
Appendix 14a, Results for XX5
action 'sad jaeroplanes 'agree ablyair lettersallowance 'sal umannual 'sapathetic allyappellation sapril s'archer yar tieras sortingautobiographical lybal kingbarbarity 'sbast ionsbazaar 'sbe fitbleat s'bludgeon ingbo liviaboule vardbrand yingbroadcast ingbrow beatingbur dockscalender ingcamel liascar gocave 'scentaur s'chancel 'scheat 'scleave rscoadjutor 'scochin eal'scolon izescom fitscom modescompensate scomprehensive 'sconcierge sconduit 'sconfucius 'sconnect ion'sconsume rcontour scopy holdercotton tailcounter pointcove ncravat scrisp estcrucial lycub iccurtail ment's
deceiver s'dec ordefect sdegrees 'delta 'sder idedipper 'sdis associatingdis countsdispute sdistasteful ness'sdivergence s'docket sdog wooddoor casesdoze ddrawing sdrug gistdues 'eagle ts'edition semperor sen codesenslave sentries 'exile seye dropfactor izedfalsehood 'sfarrier 'sfeign edfestival s'films tripsfirth sflorist s'flutes 'fol dfooting 'sfore groundfossil izedfox huntingfresher sfundamental s'fust ygall ingga pgas lightsgird ersglut engod parent'sgoods 'grab s'great estgrim acesguard edgullies 'guttural sha g
hankering shearth 'sheterogeneous lyhogshead sho minghoot ershouse fatherhung ryhydraulic s'impinge ments'im probablein bornin competencyindividual istsinquisitive lyinspector ships'in takeinter ringin voicejanitor 'sjob berknight 'slad dieslamps 'lap elslath eslear nlegalize sliability 'sli edlights 'line arliquor sliv idlong bowlo rlove lornlugubrious ness'madeira smaids 'mal contentedmano metricmargin 'smarrow smasse usema tinsmedia tingmelodious ness'smisadventure 'smonkey s'moon iermot ormouth washesmul berrynaked nessnational izationneuro sesnice r
ni le'snocturne 'snon violentnuggets 'nymph oobliteration 'soffer ingokay ingop posingout classoutrages 'over statesoyster catcherspalliation 'spar ablepardon ingparson agespersecutor 'spesos 'physiologists 'pillar s'pin pointedpitcher spl eatplum berpolice womanpost cardpo tionspractise dprecepts 'pre fabricationpre positionprior ess'prompt nesspro positionsprotest erprune rspullet s'pun netpursue spyre x'squarantine 'squestionnaire s'railing 'srat e'sreams 'reconcile ment'sred breastreferee drefutation 'srelic tre mountsreproach edrespond ingretard ationre valuerevolver s'riddle d
rill s'rive r'sron doro taroutine s'rue fulrun tssac king'ssages 'salon ssandal s'saute rnessc ripseance 'ssecretion s'se essex tonsheaths 'shiver 'ssiege 'ssitting s'skid pansslave rs'slothful ness'ssmall ssneer ersociable s'spectre sspool s'squander er'sstalin gstare sstay edstiffen ersstool ies'straighten sstupe factionsummer s'surprise 'sswan kssword fishtales 'tar sitele photographtender footthai 'sthistle down'sthrive sth wartsting 'storments 'tough nesstulip 'stype dump teensunctuous ness'under secretariesun eventfulunit yun rollsup hillvariance 's
vena lvill ainvulgar ianswa itwatch dogwater loggedwear iedweigh ingwest chesterwi denwinter ierwitticism swood workwretched ly
Appendix 14b, Remainder List
aboutaccelerateaccurstadult'saftercare'salexics'amharic'sanalects'sanglicanism'santhropophagousarmletsasphyxiationatonalaudition'savuncularbactriabandoleerbarracudabelievebeneficence'sbestiaries'bibliophiles'blackleggedbopper'sbreastbone'sbridecake'sbuffbullheadedness'sbushbabies'buzzers'cacklescandelabrumscantilevers'carsickness'scastaways'catechizedchablis'charismaticchessmanchimericalchroniccincturecliquierclottingcoercionconsolidations'convenecornscorse'scrabgrass'scremationcupbearers'cyclamatesdaffodils'dandruff'sdeathlesslydemoralizedepilation'sdesolating
dethroneddewlap'sdictaphones'dildosdisgruntlesdismemberment'sduress'eggcupselderflower'selision'sembezzlementengineepsom'seructatedestimators'evaluatingexactitudeexcretionexpletive'sextenuationfaultilyfieldmiceflierforkliftfrizzlesfryer'sgazinggent'sgharryglibgranule'sharmonic'shatfulsheadachierheehaws'henbaneshipposhorsewhips'idiomilluminationimmunizationindecipherability'sinexorablyinflow'sinitiateinterjections'introversion'sirrelevances'iviesjerkyjoules'jumblekappas'kibbutznikkinkedkris'laxations'leprechaun'slockup's
luxuriantlymandarin'smaypole'smercer'smestizos'mezzosmidshipmen'sminelaying'smisfires'missusmocha'smolehills'morphemicsmuscatelmutt'snearsightedness'negressesnotaries'occidentals'ontology'sordinalorthodontics'overgrowths'pailfulspanjabipassionflowers'pathfinder'spawkiness'speelers'peninsulaperambulators'periphrasispharisaism'sphonecalls'plainchant'splatitudinouspoetasterpolypiportcullises'pressgangingprickedprocrastinateprognosis'squixoteraconteurs'raneesreccesreifiesrepayments'reservistsarongsscallopedscientologistscowls'scurf'sseneschal'ssepulchralservitudeshamelessness
showrooms'sibilant'ssilkskydiving'sslimsolariums'somnambulismsortie'ssouvenirspigotspivvieststabilizedstockbreeders'streptomycinsublieutenants'subtenants'suffragan'ssuntanssupplicantsyndicaliststaborstannin'staxonomytec'stermite'sthermoplastic'stigress'tontinestracksuittrammelledtransliteratingtravelogue'strendsetters'trilogytruncatetweezerusherettesvagueness'svermifugevibrancy'sviscountcyvodkawant'swarrenwherryyam'sziggurats'
Appendix 15a, Results for XX6
aboutadult 'sanglican ism'sarm letsa tonalbelie vebeneficence 'sblack leggedchess man
consolidation s'deathless lydismemberment 'sfault ilyfield micefork liftgent 'sg harryg lib
j umbleluxuriant lymercer 'smiss usnegress esprick edrepayment s'scowl s'shameless ness
s limsortie 'ssunt anstrack suitvagueness 'sviscount cywant 's
Appendix 15b, Remainder List after XX6
accelerateaccurstaftercare'salexics'amharic'sanalects'santhropophagousasphyxiationaudition'savuncularbactriabandoleerbarracudabestiaries'bibliophiles'bopper'sbreastbone'sbridecake'sbuffbullheadedness'sbushbabies'buzzers'cacklescandelabrumscantilevers'carsickness'scastaways'catechizedchablis'charismaticchimericalchroniccincturecliquierclottingcoercionconvenecornscorse'scrabgrass'scremationcupbearers'cyclamates
daffodils'dandruff'sdemoralizedepilation'sdesolatingdethroneddewlap'sdictaphones'dildosdisgruntlesduress'eggcupselderflower'selision'sembezzlementengineepsom'seructatedestimators'evaluatingexactitudeexcretionexpletive'sextenuationflierfrizzlesfryer'sgazinggranule'sharmonic'shatfulsheadachierheehaws'henbaneshipposhorsewhips'idiomilluminationimmunizationindecipherability'sinexorablyinflow'sinitiate
interjections'introversion'sirrelevances'iviesjerkyjoules'kappas'kibbutznikkinkedkris'laxations'leprechaun'slockup'smandarin'smaypole'smestizos'mezzosmidshipmen'sminelaying'smisfires'mocha'smolehills'morphemicsmuscatelmutt'snearsightedness'notaries'occidentals'ontology'sordinalorthodontics'overgrowths'pailfulspanjabipassionflowers'pathfinder'spawkiness'speelers'peninsulaperambulators'periphrasispharisaism'sphonecalls'
plainchant'splatitudinouspoetasterpolypiportcullises'pressgangingprocrastinateprognosis'squixoteraconteurs'raneesreccesreifiesreservistsarongsscallopedscientologistscurf'sseneschal'ssepulchralservitudeshowrooms'sibilant'ssilkskydiving'ssolariums'somnambulismsouvenirspigotspivvieststabilizedstockbreeders'streptomycinsublieutenants'subtenants'suffragan'ssupplicantsyndicaliststaborstannin'staxonomytec'stermite's
thermoplastic'stigress'tontinestrammelledtransliterating
travelogue'strendsetters'trilogytruncatetweezer
usherettesvermifugevibrancy'svodkawarren
wherryyam'sziggurats'
Appendix 16, Results for XX7
accelerateaccurstaftercare'salexics'amharic'sanalects'santhropophagousasphyxiationaudition'savuncularbactriabandoleerbarracudabestiaries'bibliophiles'bopper'sbreastbone'sbridecake'sbuffbullheadedness'sbushbabies'buzzers'cacklescandelabrumscantilevers'carsickness'scastaways'catechizedchablis'charismaticchimericalchroniccincturecliquierclottingcoercionconvenecornscorse'scrabgrass'scremationcupbearers'cyclamatesdaffodils'dandruff'sdemoralizedepilation'sdesolatingdethroneddewlap'sdictaphones'dildosdisgruntles
duress'eggcupselderflower'selision'sembezzlementengineepsom'seructatedestimators'evaluatingexactitudeexcretionexpletive'sextenuationflierfrizzlesfryer'sgazinggranule'sharmonic'shatfulsheadachierheehaws'henbaneshipposhorsewhips'idiomilluminationimmunizationindecipherability'sinexorablyinflow'sinitiateinterjections'introversion'sirrelevances'iviesjerkyjoules'kappas'kibbutznikkinkedkris'laxations'leprechaun'slockup'smandarin'smaypole'smestizos'mezzosmidshipmen'sminelaying'smisfires'
mocha'smolehills'morphemicsmuscatelmutt'snearsightedness'notaries'occidentals'ontology'sordinalorthodontics'overgrowths'pailfulspanjabipassionflowers'pathfinder'spawkiness'speelers'peninsulaperambulators'periphrasispharisaism'sphonecalls'plainchant'splatitudinouspoetasterpolypiportcullises'pressgangingprocrastinateprognosis'squixoteraconteurs'raneesreccesreifiesreservistsarongsscallopedscientologistscurf'sseneschal'ssepulchralservitudeshowrooms'sibilant'ssilkskydiving'ssolariums'somnambulismsouvenirspigotspivviest
stabilizedstockbreeders'streptomycinsublieutenants'subtenants'suffragan'ssupplicantsyndicaliststaborstannin'staxonomytec'stermite'sthermoplastic'stigress'tontinestrammelledtransliteratingtravelogue'strendsetters'trilogytruncatetweezerusherettesvermifugevibrancy'svodkawarrenwherryyam'sziggurats'aboutadult 'sanglican ism'sarm letsa tonalbelie vebeneficence 'sblack leggedchess manconsolidation s'deathless lydismemberment 'sfault ilyfield micefork liftgent 'sg harryg libj umbleluxuriant lymercer 'smiss us
negress esprick edrepayment s'scowl s'shameless nesss limsortie 'ssunt anstrack suitvagueness 'sviscount cywant 'saction 'sad jaeroplanes 'agree ablyair lettersallowance 'sal umannual 'sapathetic allyappellation sapril s'archer yar tieras sortingautobiographical lybal kingbarbarity 'sbast ionsbazaar 'sbe fitbleat s'bludgeon ingbo liviaboule vardbrand yingbroadcast ingbrow beatingbur dockscalender ingcamel liascar gocave 'scentaur s'chancel 'scheat 'scleave rscoadjutor 'scochin eal'scolon izescom fitscom modescompensate scomprehensive 'sconcierge sconduit 'sconfucius 'sconnect ion'sconsume rcontour scopy holder
cotton tailcounter pointcove ncravat scrisp estcrucial lycub iccurtail ment'sdeceiver s'dec ordefect sdegrees 'delta 'sder idedipper 'sdis associatingdis countsdispute sdistasteful ness'sdivergence s'docket sdog wooddoor casesdoze ddrawing sdrug gistdues 'eagle ts'edition semperor sen codesenslave sentries 'exile seye dropfactor izedfalsehood 'sfarrier 'sfeign edfestival s'films tripsfirth sflorist s'flutes 'fol dfooting 'sfore groundfossil izedfox huntingfresher sfundamental s'fust ygall ingga pgas lightsgird ersglut engod parent'sgoods 'grab s'great estgrim aces
guard edgullies 'guttural sha ghankering shearth 'sheterogeneous lyhogshead sho minghoot ershouse fatherhung ryhydraulic s'impinge ments'im probablein bornin competencyindividual istsinquisitive lyinspector ships'in takeinter ringin voicejanitor 'sjob berknight 'slad dieslamps 'lap elslath eslear nlegalize sliability 'sli edlights 'line arliquor sliv idlong bowlo rlove lornlugubrious ness'madeira smaids 'mal contentedmano metricmargin 'smarrow smasse usema tinsmedia tingmelodious ness'smisadventure 'smonkey s'moon iermot ormouth washesmul berrynaked nessnational izationneuro sesnice r
ni le'snocturne 'snon violentnuggets 'nymph oobliteration 'soffer ingokay ingop posingout classoutrages 'over statesoyster catcherspalliation 'spar ablepardon ingparson agespersecutor 'spesos 'physiologists 'pillar s'pin pointedpitcher spl eatplum berpolice womanpost cardpo tionspractise dprecepts 'pre fabricationpre positionprior ess'prompt nesspro positionsprotest erprune rspullet s'pun netpursue spyre x'squarantine 'squestionnaire s'railing 'srat e'sreams 'reconcile ment'sred breastreferee drefutation 'srelic tre mountsreproach edrespond ingretard ationre valuerevolver s'riddle drill s'rive r'sron doro ta
routine s'rue fulrun tssac king'ssages 'salon ssandal s'saute rnessc ripseance 'ssecretion s'se essex tonsheaths 'shiver 'ssiege 'ssitting s'skid pansslave rs'
slothful ness'ssmall ssneer ersociable s'spectre sspool s'squander er'sstalin gstare sstay edstiffen ersstool ies'straighten sstupe factionsummer s'surprise 'sswan kssword fishtales '
tar sitele photographtender footthai 'sthistle down'sthrive sth wartsting 'storments 'tough nesstulip 'stype dump teensunctuous ness'under secretariesun eventfulunit yun rollsup hill
variance 'svena lvill ainvulgar ianswa itwatch dogwater loggedwear iedweigh ingwest chesterwi denwinter ierwitticism swood workwretched ly
Appendix 17, Prefix List used in Algorithms XX1-XX7
There are 363 prefixes in total, and segmentation occurs immediately after the last letter of the matched prefix.
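The prefix rule can be sketched as follows. This is an illustrative sketch only, not the project's actual code: the real XX1–XX7 algorithms use the full 363-entry list above, while a tiny hypothetical sample stands in here.

```python
def segment_by_prefix(word, prefixes):
    """Split word after the longest matching prefix, if any.

    Returns the word with a space inserted after the prefix,
    or the word unchanged when no prefix matches.
    """
    # Try longer prefixes first so a longer match wins over a shorter one.
    for p in sorted(prefixes, key=len, reverse=True):
        # Require at least one letter to remain after the prefix.
        if word.startswith(p) and len(word) > len(p):
            return p + " " + word[len(p):]
    return word

sample_prefixes = ["un", "re", "in"]  # hypothetical sample, not the real list
print(segment_by_prefix("unrolls", sample_prefixes))  # un rolls
```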
a'aaabacadaeafagahaiajakalamanaoapaqarasatauavawaxayazb'babebhbibjblbobrbubwbybÃc'cacdcechci
clcncocrcsctcucyczd'dadbdddedhdidjdmdodpdrdsdudwdydze'eaebecedeeefegeheiejekelemeneoepeqeres
eteuevewexeyezfafefifjflfofrfsfufvfygageghgiglgngogrgugwgygzh'hahehihohshthuhwhyi'iaibicidig
iiikiliminioipiqirisitiviwizj'jajejijkjojukakckekhkikjklknkokrkukwkyl'lalelilllolplulvlxlym'
mambmcmdmemimlmmmnmompmsmumyn'nandnengninknonpnunyo'oaobocodoeofogohoiojokolomonoooporosotou
ovowoxoyozp'papcpepfphpiplpnpoprpsptpupypÃqaqsqurardrerhrirorurwryrÃs'sascseshsiskslsmsnsosp
sqsrstsusvswsyszt'tatbtctethtitotrtstutvtwtytzubudueuguhukulumunupurusutuzvavevivlvovrvuvywa
wewhwiwkwowrwuwvwyxaxexixlxsxtxuxvxxy'yayeyiylyoypyuyvyzzazezhzizlznzozuzvzw
Appendix 17b, Suffix List used in Algorithms XX1-XX7
In total there are 387 suffixes. Note that they are listed in reversed form (for example, the suffix 's appears as s'), so the segmentation point falls immediately before the suffix when the word is read normally.
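Because the entries are stored reversed, matching is most naturally done against the reversed word. A minimal sketch, again with a hypothetical sample list in place of the 387-entry list above:

```python
def segment_by_suffix(word, reversed_suffixes):
    """Split a word before a suffix stored in reversed form.

    Each entry in reversed_suffixes is a suffix written backwards
    (e.g. "'s" is stored as "s'"), so the reversed word is matched
    against it directly.
    """
    rev = word[::-1]
    # Try longer suffixes first so a longer match wins.
    for s in sorted(reversed_suffixes, key=len, reverse=True):
        # Require at least one letter to remain before the suffix.
        if rev.startswith(s) and len(word) > len(s):
            stem_len = len(word) - len(s)
            return word[:stem_len] + " " + word[stem_len:]
    return word

sample = ["s'", "gni"]  # reversed forms of 's and ing (hypothetical sample)
print(segment_by_suffix("want's", sample))    # want 's
print(segment_by_suffix("weighing", sample))  # weigh ing
```

These two examples reproduce the splits "want 's" and "weigh ing" seen in the result lists above.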
'saaabacadaeafagahaiajakalamanaoaparasatauavawaxayazbabbbebiblbmbnbobrbubzcacd
cecicncocpcrcscud'dadcdddedhdidldndodrdudwdyeaebecedeeefegeheiekelemeneoeperes
eteuevewexeyezfafefffiflfnfofpfrfsfugagdgegggigkgngogrguhahchdhehghihkhnhohphr
hshthziaibicidieifigihiiijikiliminioipiqirisitiuiviwixiyizjejijjjnjokakckekikl
knkokrkskukwkyl'lalbldlelflhlilllmlolrlsltlulvlwlym'mamemgmhmimlmmmomrmsmumwmy
n'nandnengnhninlnmnnnonrntnunwnyoaobocodoeofogohoiojokolomonoooporosotouovoyoz
papepiplpmpoppprpsptpupzqaqcqeqnqoqur'rarcrdrergrhrirkrorrrsrtrurys'sasbscsdse
sfsgshsiskslsmsnsospsrssstsusvswsyt'tatbtctdtetfthtitltmtntotptrtstttutwtxtzua
ubudueufuhuiujukulumunuoupurusutuzvavevivovxwawewowtxaxexixlxnxoxrxuxxxyy'yayb
ycydyeyfygyhyiyjykylymynyoypyrysytyuyvywyxyzzazeziznzozrzsztzuzz¦Ã©ÃÃl