Wilson Wong, Wei Liu and Mohammed Bennamoun
School of Computer Science and Software Engineering
University of Western Australia
Enhanced Integrated Scoring for Cleaning Dirty Texts (ISSAC v2)
• Authors: Wilson Wong, Wei Liu and Mohammed Bennamoun (University of Western Australia)
• Presented By: Benjamin Johnston (University of Technology, Sydney)
INTRODUCTIONS
1. Background
2. Problems & Challenges
3. Solution
4. Evaluations
5. Future Works
INDEX
itme:
• time, with i and t swapped
• item, with m and e swapped
• ITME, Institute of Electronics Materials Technology in Warsaw, Poland
• it’s me, with missing ’s
BACKGROUND
• These three errors are interrelated: spelling errors (“Splling erors”), abbreviations (“Abbre”) and improper casing (“IMPROPER cAsing”).
• Research on them has traditionally been carried out separately.
3 types of errors
BACKGROUND
• Spelling error detection and correction:
  - Minimum edit distance (Damerau-Levenshtein, Wagner-Fischer, etc.)
  - Similarity key (SOUNDEX, Metaphone, Double Metaphone, Daitch-Mokotoff, etc.)
• Abbreviation expansion: most research has been carried out in the area of named-entity recognition. Techniques rely on:
  - Letter casing, e.g. “NASA”
  - Use of periods, e.g. “U.S.A.”
  - Use of parentheses, e.g. “North Atlantic Treaty Organisation (NATO)”
  - Number of letters in words
Spelling error and abbreviation
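The surface cues for abbreviations (letter casing, periods, parentheses) can be sketched with regular expressions. This is a minimal illustration under stated assumptions, not the method of any particular named-entity recogniser: the pattern names and the `find_parenthesised` helper are hypothetical.

```python
import re

# Hypothetical patterns, one per surface cue.
ALL_CAPS = re.compile(r"\b[A-Z]{2,}\b")           # letter casing, e.g. "NASA"
WITH_PERIODS = re.compile(r"\b(?:[A-Z]\.){2,}")   # periods, e.g. "U.S.A."
LONG_THEN_SHORT = re.compile(r"((?:[A-Z][a-z]+ ){2,})\(([A-Z]{2,})\)")


def find_parenthesised(text):
    """Return (abbreviation, expansion) pairs found as "Long Form (LF)"."""
    pairs = []
    for match in LONG_THEN_SHORT.finditer(text):
        words = match.group(1).split()
        abbr = match.group(2)
        tail = words[-len(abbr):]  # the last len(abbr) capitalised words
        # Accept the pair only if the word initials spell the abbreviation.
        if len(tail) == len(abbr) and all(w[0] == c for w, c in zip(tail, abbr)):
            pairs.append((abbr, " ".join(tail)))
    return pairs
```

Such patterns only catch well-formed abbreviations; ad-hoc ones like “ty” or “u” carry none of these cues.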
BACKGROUND
Letter casing
• Case restoration: improper casing in words is detected and restored. Common approaches include:
  - Using N-grams to predict the most likely case (lower case LC, mixed case MC, upper case UC) of a token based on its local context.
  - Relying on the unambiguous introduction of ambiguous tokens. The ambiguity of “Riders” is reduced when we encounter “John Riders” in the same text.
[Figure: N-gram case prediction for the token following “new”. The subsequent token “information” is likely to be LC and is categorized into LC; “york” is less likely to be LC.]
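The N-gram approach to case restoration can be sketched with bigram counts. This is an illustrative toy, not a published system: the training sentences, the function names and the fallback to LC for unseen bigrams are all assumptions.

```python
from collections import Counter, defaultdict


def case_class(token):
    """Classify a token as LC (lower), UC (upper) or MC (mixed case)."""
    if token.islower():
        return "LC"
    if token.isupper():
        return "UC"
    return "MC"


def train(sentences):
    """Count the case class of each token given the previous token (bigram context)."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for prev, cur in zip(tokens, tokens[1:]):
            model[(prev.lower(), cur.lower())][case_class(cur)] += 1
    return model


def predict(model, prev, cur):
    """Return the most frequent case class for cur after prev (assumed default: LC)."""
    counts = model.get((prev.lower(), cur.lower()))
    if not counts:
        return "LC"
    return counts.most_common(1)[0][0]


model = train(["I moved to New York", "we need new information"])
```

With this toy model, `predict(model, "new", "york")` yields MC while `predict(model, "new", "information")` yields LC.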
PROBLEMS & CHALLENGES
• Test data are either artificial or not-so-dirty dirty text.
• Techniques are isolated.
Existing techniques, their accuracies and test data
np, ty
Example of dirty texts
PROBLEMS & CHALLENGES
• Ad-hoc abbreviations, common in the Internet era, pose extra challenges (e.g. “ty”, “u”).
Mi Teaser constantly REMINDS mer that eduction is an inerrant asper of LIFO. She sad, "Few yrs in school will ensue a beater LIFO for u". 2/16 [Aspell 0.50.3]
MI Teacher kinsman REMINDS meek that education is an important speak of life. She sad, "Few yes in Scholl will ensure a better LIFO for U". 5/16 [http://www.spellcheck.net]
Mi Teacher constantly REMINDS me that education is an important aspect of life. She sad, "Few yrs in Scholl will ensure a better LIFO for u". 8/16 [MS Office Word 2003]
Mi Teacer konstanly REMINDS mee that edicotion is an inporrant aspek of lifu. She sad, "Few yrs in scholl will ensur a beter liFO for u". 16 errors [Original]
Examples of existing applications
PROBLEMS & CHALLENGES
• Techniques for abbreviation expansion, etc., that are based on patterns and static dictionaries face problems with expansion.
• Integrated approaches for automatically correcting all three types of errors are rare.
• The accuracy of corrections by the existing isolated techniques can be further improved.
• The accuracy of existing techniques (individual or integrated) on extremely challenging dirty texts (e.g. chat records) has yet to be demonstrated.
PROBLEMS & CHALLENGES
Challenges to be addressed
SOLUTION
ISSAC v2
Suggestions and rank by Aspell
Expansions for abbreviations by Stands4.com
Google’s page count and spell check
Domain corpora (i.e. dirty texts collection)
• Our solution must take the following into consideration:
  - Integrated approach (for all 3 types of errors)
  - High accuracy
  - Automatic (i.e. no user involvement)
  - Evaluations using real-world dirty texts
Overview
SOLUTION
Aspell
• Each error term is fed into Aspell, which generates a list of suggestions for it.
SOLUTION
Stands4.com
• Stands4.com is consulted for possible expansions for each erroneous term.
• Local copy is maintained for future use.
SOLUTION
• Google’s ability to search for phrases.
• The page count that Google returns.
• Google’s suggestions for spelling errors in queries.
SOLUTION
The set of suggestions S contains:
  - m expansions, all with rank 1
  - n suggestions by Aspell, according to their original rank
  - the error term itself
  - Google’s suggestion
$s_{i,j}$ = jth suggestion with rank i in the set S
Notations
SOLUTION
Notations
itme → time, item, Institute of Electronics Materials Technology, …
• We use the neighbouring words to disambiguate and identify the most ideal suggestion from S for automatic correction.
• The left and right words are considered as context.
“shipping itme frame”
Left word, l = “shipping”
Right word, r = “frame”
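Extracting the context words l and r, and pairing them with each suggestion, can be sketched as follows. The helper names are hypothetical, and the quoted "l s r" phrase format is an assumption about how a phrase search might be issued; the slides do not specify the query format.

```python
def context(tokens, index):
    """Return the left word l and right word r around tokens[index] ("" at a boundary)."""
    left = tokens[index - 1] if index > 0 else ""
    right = tokens[index + 1] if index + 1 < len(tokens) else ""
    return left, right


def candidate_phrases(left, suggestions, right):
    """Build one quoted phrase "l s r" per suggestion s (assumed query format)."""
    return ['"' + " ".join(w for w in (left, s, right) if w) + '"' for s in suggestions]


tokens = "shipping itme frame".split()
l, r = context(tokens, 1)                          # l = "shipping", r = "frame"
phrases = candidate_phrases(l, ["time", "item"], r)
```

Each phrase could then be scored by how often it occurs, so that “shipping time frame” can be compared against “shipping item frame”.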
SOLUTION
ISSAC v2
Reuse factor, $RF(e, s_{i,j}) \in \{0, 1\}$
Abbreviation factor, $AF(e, s_{i,j}) \in \{0, 1\}$
Domain significance, $DS(l, s_{i,j}, r) \in (0, 1)$
General significance, $GS(l, s_{i,j}, r) \in (0, 1)$
Normalized edit distance, $NED(e, s_{i,j}) \in (0, 1]$
Original rank by Aspell, $i^{-1} \in (0, 1]$
Different weights in ISSAC
SOLUTION
• The list of suggestions S is re-ranked using a score, NS, computed for each suggestion.
• Individual weights contribute to the overall ranking of each suggestion.
• The suggestion with the highest NS is taken as the most suitable replacement given the surrounding context.
Correction using ISSAC
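The re-ranking step can be sketched as below. The formula for NS is not reproduced in this transcript, so the sketch assumes NS is the unweighted sum of the six weights (inverse Aspell rank, NED, RF, AF, DS, GS). The `Candidate` type and all numeric values are hypothetical and for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    rank: int    # i, the original rank (1 = best)
    ned: float   # normalized edit distance, in (0, 1]
    rf: int      # reuse factor, 0 or 1
    af: int      # abbreviation factor, 0 or 1
    ds: float    # domain significance, in (0, 1)
    gs: float    # general significance, in (0, 1)


def ns(c):
    """Assumed combination: plain sum of the six weights."""
    return 1.0 / c.rank + c.ned + c.rf + c.af + c.ds + c.gs


def best(candidates):
    """Pick the suggestion with the highest NS."""
    return max(candidates, key=ns)


# Made-up weights for the error "itme" in the context "shipping itme frame".
item = Candidate("item", rank=1, ned=0.8, rf=0, af=0, ds=0.2, gs=0.1)
time = Candidate("time", rank=2, ned=0.8, rf=0, af=0, ds=0.6, gs=0.7)
choice = best([item, time])
```

With these made-up values, the contextual weights DS and GS outweigh the better original rank of “item”, so “time” is selected.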
SOLUTION
• Heuristic: the correct replacement should not deviate too far from the error.
[Figure: NED plotted against edit distance (1 to 18) for the error “itme”, with example suggestions item, time, it me, timer and Tim.]
Edit distance
SOLUTION
Reuse and abbreviation factors
• Abbreviation factor: if a suggestion is a potential expansion for an abbreviation (i.e. the error term), AF yields 1, and 0 otherwise.
• The abbreviation dictionary is consulted.
• Reuse factor: RF returns 1 if the suggestion appears in the spelling dictionary, and 0 otherwise.
• There are two types of entries in the spelling dictionary:
  - Suggestions by Google for spelling errors, automatically updated every time Google suggests a replacement for an error.
  - Suggestions for errors provided by users (optional).
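The two binary factors amount to dictionary lookups, sketched below. The dictionary layout (error term mapped to a set of known replacements), the sample entries and the `remember` helper are assumptions for illustration; the slides do not describe how the dictionaries are stored.

```python
# Hypothetical dictionaries: error term -> set of known replacements.
ABBREVIATIONS = {"u": {"you"}, "yrs": {"years"}}
SPELLING = {"sad": {"said"}}


def af(error, suggestion, abbreviations=ABBREVIATIONS):
    """Abbreviation factor: 1 if suggestion is a known expansion of error, else 0."""
    return 1 if suggestion in abbreviations.get(error.lower(), set()) else 0


def rf(error, suggestion, spelling=SPELLING):
    """Reuse factor: 1 if suggestion is recorded for error in the spelling dictionary, else 0."""
    return 1 if suggestion in spelling.get(error.lower(), set()) else 0


def remember(error, suggestion, spelling=SPELLING):
    """Record a replacement (e.g. one suggested by Google or a user) for later reuse."""
    spelling.setdefault(error.lower(), set()).add(suggestion)
```

Calling `remember("scholl", "school")` makes `rf("scholl", "school")` return 1 on later occurrences of the same error.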
SOLUTION
DS is defined in terms of four quantities A, B, C and D, where B, D > 0. Its limiting behaviour:
• $\lim_{A \to 0} DS = 0$: $s_{i,j}$ is not common, both individually and in context.
• $\lim_{(A,C) \to (B,D)} DS = 0.3$: $s_{i,j}$ occurs very frequently, both individually and in context, but nearly all documents contain the term (i.e. too common).
• $\lim_{(A,C) \to (B,1)} DS = 1$: $s_{i,j}$ occurs very frequently and appears exclusively in only a few documents.
Domain significance
SOLUTION
GS is defined in terms of four quantities A, B, C and D, where B, D > 0 and B < D. Its limiting behaviour:
• $\lim_{A \to 0} GS = 0$: $s_{i,j}$ appears very rarely in context.
• $\lim_{(A,C) \to (B,D)} GS = 0.3$: $s_{i,j}$ appears often in context and often individually (i.e. the term is very common).
• $\lim_{(A,C) \to (B,A)} GS = 1$: $s_{i,j}$ appears often in context, and its individual appearance approaches its appearance in context (i.e. the term is exclusive to the context).
General significance
EVALUATIONS
Accuracy of ISSAC
• Evaluation data (700 chat sessions, 3313 errors) are actual chat records between agents and customers provided by 247Customer.com.
$\text{Accuracy} = \dfrac{N_{\text{correct\_replacement}}}{N_{\text{total\_error\_identified\_by\_aspell}}}$
EVALUATIONS
Accuracy of ISSAC
EVALUATIONS
• Cause 1 (≈0.8%): the accuracy of correction by ISSAC is bounded by the coverage of S produced by Aspell.
  - Due to the absence of the correct replacement from the list of suggestions produced by Aspell.
  - For example, the correct replacement for “dotn” is not present in the list of suggestions by Aspell.
When ISSAC doesn’t work
EVALUATIONS
• Cause 2 (≈0.7%): due to two flaws related to l and r:
  - Neighbouring words are not correctly spelt. Example: “morel iberal return”.
  - The left and right words are inadequate. Example: “both ocats <”.
• Cause 3 (≈0.5%): two anomalies where ISSAC does not apply:
  - Suggestions that are equally likely to be the correct replacement. Example: “Cheng” or “Cheung” in the context of “Janice Cheng <”.
  - Contrasting disagreement among weights.
When ISSAC doesn’t work
FUTURE WORKS
My teacher constantly reminds me that education is an important aspect of life. She said, “Few years in school will ensure a better Life for you". 15/16 [ISSAC v2]
Mi Teacer konstanly REMINDS mee that edicotion is an inporrant aspek of lifu. She sad, "Few yrs in scholl will ensur a beter liFO for u". 16 errors [Original]
• Look for solutions to overcome the 3 causes to improve the accuracy.
• Carry out evaluations on larger data sets.
• Evaluate ISSAC in terms of time complexity.
THANK YOU
• Widely adopted classes of techniques for detecting and correcting spelling errors:
  - Minimum edit distance
  - Similarity key (phonetic algorithms)
• Minimum edit distance: the minimal number of insertions, deletions, substitutions and transpositions needed to transform one string into the other. Example: “wear” → “beard” requires a minimum of 2 operations.
Damerau-Levenshtein, Wagner-Fischer, etc.
BACKGROUND
Spelling error
wear → bear (substitute ‘w’ with ‘b’) → beard (insert ‘d’)
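A minimum edit distance that counts insertions, deletions, substitutions and transpositions of adjacent letters can be sketched with dynamic programming. The version below is the restricted Damerau-Levenshtein variant (optimal string alignment); the function name is chosen here for illustration.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]
```

Here `osa_distance("wear", "beard")` is 2, matching the example, while “itme” is at distance 1 from both “time” and “item”, which is why edit distance alone cannot choose between them.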
BACKGROUND
• Similarity key: map every string into a key such that similarly spelled strings will have identical keys. The key, computed for each spelling error, will act as a pointer to all similarly pronounced words (i.e. soundslike) in the dictionary.
SOUNDEX, Metaphone, Double Metaphone, etc.
wear → w006 → w6
ware → w060 → w6
Spelling error
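The two-step mapping in the example (wear → w006 → w6) can be reproduced with a simplified Soundex-style key. This is a sketch only: it uses the standard Soundex letter groups but, as on the slide, does not pad or truncate the key to a fixed length. The function names are hypothetical.

```python
# Standard Soundex letter groups; vowels and h, w, y map to "0".
CODES = {}
for letters, digit in (("aeiouyhw", "0"), ("bfpv", "1"), ("cgjkqsxz", "2"),
                       ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def raw_key(word):
    """First letter followed by the digit code of every other letter, e.g. "w006"."""
    word = word.lower()
    if not word:
        return ""
    return word[0] + "".join(CODES.get(ch, "0") for ch in word[1:])


def similarity_key(word):
    """Collapse adjacent repeated digits and drop zeros, e.g. "w006" -> "w6"."""
    raw = raw_key(word)
    if not raw:
        return ""
    out = [raw[0]]
    prev = ""
    for digit in raw[1:]:
        if digit != prev and digit != "0":
            out.append(digit)
        prev = digit
    return "".join(out)
```

Both “wear” and “ware” map to the key w6, so a lookup by key retrieves each as a candidate for the other.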