An Unsupervised Approach for the Detection of Outliers in Corpora

David Guthrie, Louise Guthrie, Yorick Wilks

The University of Sheffield


Corpora in CL

• Increasingly common in computational linguistics to use textual resources gathered automatically

o IR, scraping the Web, etc.

• Construct corpora from specific blogs, bulletin boards, websites (Wikipedia, RottenTomatoes)


Corpora Can Contain Errors

• IR and scraping can lead to errors in precision

• Can contain entries that might be considered spam:

o Advertising

o Gibberish messages

o (More subtly) information that is an opinion rather than a fact, rants about political figures


Difficult to verify

• The quality of corpora has a dramatic impact on the results of QA, ASR, TC, etc.

• Creation and validation of corpora have generally relied on humans


Goals

• Improve the consistency and quality of corpora

• Automatically identify and remove text from corpora that does not belong


Approach

• Treat the problem as a type of outlier detection

• We aim to find pieces of text in a corpus that differ significantly from the majority of text in that corpus and thus are ‘outliers’


Method

• Characterize each piece of text (document, segment, paragraph, …) in our corpus as a vector of features

• Use these vectors to construct a matrix, X, with one row per piece of text in the corpus and one column per feature
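
A minimal sketch of building such a matrix, assuming three illustrative surface features. The function names and the features themselves are hypothetical stand-ins, not the feature set used in this work:

```python
def surface_features(text):
    """Three illustrative surface features (hypothetical, for demonstration only)."""
    words = text.split()
    n_words = max(len(words), 1)  # guard against empty text
    avg_word_len = sum(len(w) for w in words) / n_words
    type_token_ratio = len({w.lower() for w in words}) / n_words
    exclamations_per_word = text.count("!") / n_words
    return [avg_word_len, type_token_ratio, exclamations_per_word]


def feature_matrix(pieces):
    """Return X as a list of rows: one row per piece of text, one column per feature."""
    return [surface_features(p) for p in pieces]


pieces = [
    "The council approved the budget on Tuesday.",
    "Shares fell sharply after the announcement.",
    "When the fuse contacts the balloon, watch out!!!",
]
X = feature_matrix(pieces)
print(len(X), "rows x", len(X[0]), "columns")
```

Each row of X can then be compared against the others, whatever features are actually chosen.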


Feature Matrix

X

Represent each piece of text as a vector of features


Characterizing Text

• 158 features computed for every piece of text (many of which have been used successfully for genre classification by Biber, Kessler, Argamon, …)

o Simple Surface Features

o Readability Measures

o POS Distributions (RASP)

o Vocabulary Obscurity

o Emotional Affect (General Inquirer Dictionary)
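
As an illustration of one feature family, here is a sketch of a single readability measure, the Automated Readability Index. The slides do not say which readability measures were used, so this choice is an assumption, and the regex-based word and sentence splitting is a rough approximation:

```python
import re


def automated_readability_index(text):
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0  # nothing to measure
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


simple = automated_readability_index("The cat sat. The dog ran.")
dense = automated_readability_index(
    "Computational linguistics increasingly utilizes automatically gathered textual resources."
)
print(simple < dense)
```

A value like this would fill one column of X; the other feature families would each contribute their own columns.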


Feature Matrix

X

Identify outlying text


Outliers are ‘hidden’


SDE

• Use the Stahel-Donoho Estimator (SDE) to identify outliers

o Project the data down to one dimension and measure the outlyingness of each piece of text in that dimension

o For every piece of text, the goal is to find a projection of the data that maximizes its robust z-score

o Especially suited to data with a large number of dimensions (features)
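
The projection step can be sketched as follows, assuming the robust z-score is the absolute distance from the median divided by the unscaled median absolute deviation (some definitions multiply the MAD by a consistency constant, which is omitted here). The toy data and function name are hypothetical:

```python
import statistics


def robust_z_scores(X, a):
    """Robust z-score of every row of X after projecting it onto direction a."""
    proj = [sum(x * w for x, w in zip(row, a)) for row in X]
    med = statistics.median(proj)
    mad = statistics.median([abs(p - med) for p in proj])
    if mad == 0:
        raise ValueError("MAD is zero in this direction; robust z-score is undefined")
    return [abs(p - med) / mad for p in proj]


# Toy data: the last row is unremarkable along the first axis
# but far from the rest along the second.
X = [[0.0, 0.0], [1.0, 0.1], [-1.0, -0.1], [0.5, 0.0], [-0.5, 0.1], [0.2, 3.0]]
z_first_axis = robust_z_scores(X, [1.0, 0.0])
z_second_axis = robust_z_scores(X, [0.0, 1.0])
print(z_first_axis[-1], z_second_axis[-1])
```

The same point can look ordinary in one projection and extreme in another, which is why the search over directions matters.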


Robust z-score of the furthest point is < 3


Robust z-score for the triangles in this projection is > 12 standard deviations


SDE
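
For reference, the Stahel-Donoho outlyingness in its standard textbook form, written with the symbols defined on this slide (a sketch assuming the usual definition, with the maximum taken over all unit-length directions):

```latex
SD(x_i) \;=\; \max_{\lVert a \rVert = 1}
  \frac{\left| x_i a - \operatorname{median}_j \left( x_j a \right) \right|}
       {\operatorname{mad}_j \left( x_j a \right)}
```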

• Where a is a direction (unit-length vector) and

• x_i a is the projection of row x_i onto direction a

• mad is the median absolute deviation
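
A minimal sketch of approximating this outlyingness by taking the maximum robust z-score over a finite set of random unit directions. Random sampling of directions, the default of 500 directions, and the function names are assumptions for illustration; the slides do not state how directions are chosen:

```python
import math
import random
import statistics


def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def sde_outlyingness(X, n_directions=500, seed=0):
    """Approximate SD(x_i): largest robust z-score of each row over sampled directions."""
    rng = random.Random(seed)
    dim = len(X[0])
    best = [0.0] * len(X)
    for _ in range(n_directions):
        a = random_unit_vector(dim, rng)
        proj = [sum(x * w for x, w in zip(row, a)) for row in X]
        med = statistics.median(proj)
        mad = statistics.median([abs(p - med) for p in proj])
        if mad == 0:
            continue  # skip degenerate directions
        for i, p in enumerate(proj):
            best[i] = max(best[i], abs(p - med) / mad)
    return best


# Synthetic check: 50 Gaussian rows plus one planted row far from the rest.
rng = random.Random(1)
X = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(50)]
X.append([10.0, 10.0, 10.0])  # planted outlier at index 50
sd = sde_outlyingness(X)
print(max(range(len(sd)), key=sd.__getitem__))
```

More directions give a tighter approximation of the true maximum at a proportionally higher cost.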


Outliers have a large SD

• The distances SD(x_i) for each piece of text are then sorted, and all pieces of text above a cutoff are marked as outliers

• We use
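
The sort-and-threshold step can be sketched as below. The cutoff value 3.0 and the scores are placeholders chosen for illustration, not the cutoff or data used by the authors:

```python
def flag_outliers(sd, cutoff):
    """Return indices whose outlyingness exceeds the cutoff, most outlying first."""
    ranked = sorted(range(len(sd)), key=lambda i: sd[i], reverse=True)
    return [i for i in ranked if sd[i] > cutoff]


sd_scores = [1.2, 0.8, 9.5, 2.1, 4.0]  # hypothetical SD(x_i) values
flagged = flag_outliers(sd_scores, cutoff=3.0)  # cutoff is an assumed placeholder
print(flagged)  # [2, 4]
```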


Experiments

• In each experiment we randomly select 50 segments of text from the Gigaword corpus of newswire and insert one piece of text from a different source to act as an ‘outlier’

• Measure the accuracy of automatically identifying the inserted segment as an outlier

• We varied the size of the pieces of text from 100 to 1000 words
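
The shape of this evaluation can be sketched on synthetic data. Everything here is an assumption made for illustration: Gaussian feature vectors stand in for text segments, the scorer checks only the coordinate axes rather than running the full SDE, and a trial counts as correct when the inserted row receives the highest score:

```python
import random
import statistics


def max_axis_robust_z(X):
    """Stand-in scorer: largest robust z-score of each row over the coordinate axes."""
    scores = [0.0] * len(X)
    for col in zip(*X):
        med = statistics.median(col)
        mad = statistics.median([abs(v - med) for v in col])
        if mad == 0:
            continue
        for i, v in enumerate(col):
            scores[i] = max(scores[i], abs(v - med) / mad)
    return scores


def run_trials(n_trials=200, n_inliers=50, n_features=5, shift=6.0, seed=0):
    """Insert one shifted row among n_inliers rows and count how often it ranks first."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_inliers)]
        outlier = [rng.gauss(shift, 1.0) for _ in range(n_features)]
        position = rng.randrange(n_inliers + 1)
        X.insert(position, outlier)
        scores = max_axis_robust_z(X)
        top = max(range(len(scores)), key=scores.__getitem__)
        if top == position:
            hits += 1
    return hits / n_trials


accuracy = run_trials()
chance = 1 / 51  # probability of picking the inserted row by guessing
print(f"accuracy={accuracy:.3f}, chance={chance:.4f}")
```

Real text segments are much harder to separate than this synthetic shift, so the numbers printed here say nothing about the results reported in the talk.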


Anarchist Cookbook

• Very different genre from newswire. The writing is much more procedural (e.g. instructions to build telephone phreaking devices) and also very informal (e.g. “When the fuse contacts the balloon, watch out!!!”)

• Randomly insert one segment from the Anarchist Cookbook and attempt to identify outliers. This is repeated 200 times for each segment size (100, 500, and 1,000 words)


Cookbook Results

• Remember that we are not using any training data, and there is only a 1/51 (1.96%) chance of guessing the outlier correctly


Machine Translations

• 35 thousand words of Chinese news articles were hand-picked (Wei Liu) and translated into English using Google's Chinese-to-English translation engine

• Similar genre to English newswire but translations are far from perfect and so the language use is very odd

• 200 test collections are created for each segment size as before


MT Results


Conclusions and Future Work

• Outlier detection can be a valuable tool for corpus linguistics (if we want a homogeneous corpus)

o Automatically clean corpora

o Does not require training data or human annotation

• This method can be used reliably for relatively large pieces of text (1000 words)

o Threshold could be adjusted to ensure high precision at the expense of recall

• Looking at ways to increase accuracy by more intelligently picking directions for SDE and the cutoff to use for outliers
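
The precision/recall trade-off mentioned above can be illustrated with a small sketch. The scores, labels, and cutoff values are invented for the example and are not results from this work:

```python
def precision_recall(scores, is_outlier, cutoff):
    """Precision and recall of flagging every item whose score exceeds the cutoff."""
    flagged = [s > cutoff for s in scores]
    true_positives = sum(f and t for f, t in zip(flagged, is_outlier))
    n_flagged = sum(flagged)
    n_true = sum(is_outlier)
    precision = true_positives / n_flagged if n_flagged else 1.0
    recall = true_positives / n_true if n_true else 1.0
    return precision, recall


scores = [9.0, 7.5, 4.0, 3.5, 2.0, 1.0]               # hypothetical outlyingness values
is_outlier = [True, True, False, True, False, False]  # hypothetical ground truth

low_cutoff = precision_recall(scores, is_outlier, cutoff=3.0)
high_cutoff = precision_recall(scores, is_outlier, cutoff=5.0)
print(low_cutoff, high_cutoff)
```

Raising the cutoff from 3.0 to 5.0 in this toy case removes the one false positive but also misses one true outlier.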