Linguistic research with large annotated web corpora
Felix Bildhauer and Roland Schäfer
German Grammar and Linguistics (FU Berlin)
HPSG 2013 pre-conference tutorial
August 26, Berlin

Overview
Why use web corpora?
Web corpus construction: Overview
Data collection
Boilerplate detection
Document filtering
Duplication
Linguistic post-processing
Evaluation

Why use web corpora?

Why corpora from the web?
• Size: alleviates data sparseness problems with rare phenomena
• Content: registers not found in traditional corpora (forum discussions, blogs, fan-fiction etc.)
• Replicability: research with a static corpus vs. research based on search engine results
• Availability: data can be locally processed with preferred tools
• Cost: (virtually) free

Some potential drawbacks of web corpora
• Noise: from properties of web documents and from imperfect processing
• Metadata (of the kind linguists are interested in): not encoded in web documents in a reliable way

Web corpus construction: Overview

Web corpus construction: workflow
Data collection
↓
Removal of markup, scripts etc.
↓
Detection/removal of “boilerplate”
↓
Detection/removal of “non-texts”
↓
Detection/removal of (near-)duplicates
↓
Linguistic post-processing
These steps involve design decisions which affect the properties of the final corpus.

Workflow II
Data collection: Which sampling procedure/crawling strategy?
↓
Removal of markup, scripts etc.
↓
Detection/removal of “boilerplate”: What should count as “boilerplate”?
↓
Detection/removal of “non-texts”: Which documents contribute “good text”?
↓
Detection/removal of (near-)duplicates: Which amount of duplication is acceptable?
↓
Linguistic post-processing: How do we treat non-standard orthography, spelling errors, non-words etc.?

Data collection

Links
• The web consists of pages and links between them.
• Each page has an in-degree (no. of links to that page),
• and an out-degree (no. of links on that page).
• Pages with a higher in-degree are easier to find, but not necessarily the more interesting ones for corpus construction.
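
The two degree notions can be illustrated on a toy link list. This is a minimal sketch: the page names and the function `degrees` are made up for illustration and are not part of any crawler mentioned in the tutorial.

```python
from collections import Counter

def degrees(links):
    """Return (in_degree, out_degree) Counters for a list of (source, target) links."""
    out_degree = Counter(src for src, _ in links)
    in_degree = Counter(dst for _, dst in links)
    return in_degree, out_degree

# Hypothetical toy web of four pages.
links = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
in_degree, out_degree = degrees(links)

print(in_degree["c"], out_degree["a"])  # 3 2
print(in_degree["d"])                   # 0: no page links to "d"
```

Page "d" shows why in-degree matters for crawling: a crawler that only follows links can never reach it unless it is a seed.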

The structure of the web
[Figure: bow-tie structure of the web with the components IN, SCC, OUT, TUBE, and TENDRIL; Manning et al., 2009, 427]
Contrary to common intuition, Broder et al. [2000] found that the IN, OUT, SCC, and TENDRIL components are not extremely different in size.
A more detailed report on the sizes: Serrano et al. [2007].

Inaccessible web content (deep web)
• explicitly hidden from spiders (Robots Exclusion, intentional fooling/spoofing)
• requiring login (social networks, some forums, intranet gateways)
• in-degree = 0
• in-degree > 0, but too far away from any seed URL (which are mostly search-engine indexed pages; depends on crawling strategy)

Static and dynamic web content
• By definition, static web content is any page which is not generated by a database/content generation system, but edited manually.
• Today, most content is dynamic (CMS, blogs, etc.), and the static/dynamic distinction should not play a great role in crawling for corpus construction.
• The problem is rather the separation of (static or dynamic) linguistically relevant from irrelevant content (time tables, product listings, etc.).

Top-level domains and national dialects
• It is common practice to crawl national TLDs for content in the corresponding national language.
• Cook and Hirst [2012] conclude that the TLDs .ca and .uk indeed represent national variants of English.
• Problems with TLD crawls:
  • In some countries, .com has a high prestige: http://elpais.com, http://www.lavanguardia.com/, http://www.marca.com/
  • TLDs of countries with more than one official language (e.g., Indonesia, Spain) yield relatively fewer results in the target language.

Breadth-first crawling
Common practice (e.g., WaCky, COW): follow all links on a page
Pro:
• effective: yields lots of data within a short time
• open source software available
Con:
• very susceptible to page rank: no uniform random sample
• unknown sampling bias

Alternative: Random walks
Not yet used for web corpora: follow one link on a page
Pro:
• known sampling bias: can be corrected for
• uniform random sample of web documents
Con:
• much less effective than BF
• probably not suitable for giga-token corpora
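
The difference between the two strategies can be sketched on an in-memory toy graph. Everything here is an assumption for illustration: the graph, the function names, and the choice to stop a walk at a dead end. A real crawler fetches pages over the network, and the plain walk below does not include the bias correction mentioned above.

```python
import random
from collections import deque

def breadth_first_crawl(graph, seed, limit):
    """Visit up to `limit` pages, following all links on each page."""
    seen, queue, order = {seed}, deque([seed]), []
    while queue and len(order) < limit:
        page = queue.popleft()
        order.append(page)
        for target in graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

def random_walk(graph, seed, steps, rng):
    """Follow one randomly chosen link per page for at most `steps` steps."""
    page, path = seed, [seed]
    for _ in range(steps):
        targets = graph.get(page, [])
        if not targets:  # dead end: stop the walk
            break
        page = rng.choice(targets)
        path.append(page)
    return path

# Hypothetical toy web: page -> pages it links to.
graph = {"seed": ["a", "b"], "a": ["c"], "b": ["c", "d"], "c": ["seed"], "d": []}

print(breadth_first_crawl(graph, "seed", limit=10))
print(random_walk(graph, "seed", steps=5, rng=random.Random(0)))
```

Breadth-first crawling collects every reachable page quickly, while the walk yields at most one new page per step, which is the efficiency gap named in the Con list.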

Boilerplate detection

Boilerplate detection
Boilerplate is text accompanying the main text on a web page:
• navigation bars
• banners
• advertisements
• other layout elements
• copyright notes
Boilerplate is typically not text produced by a person on a particular occasion, but rather
• generated by content management systems
• similar or identical on many web pages of the same website
• similar or identical across web sites

Why detect boilerplate?
Boilerplate elements bias the frequency of linguistic items in the final corpus.
• One of the most frequent tokens in an experimental German web corpus: mehr ‘more’, as in read more...
• One of the most frequent sentences in an experimental English web corpus: You are not allowed to post new content in the forum.

Text (I–III)
[Figures not preserved]

What should count as boilerplate? (I–X)
[Figures not preserved]

Automatic detection of boilerplate
Boilerplate must be detected automatically.
Machine learning techniques:
• Manually annotate a number of paragraphs: boilerplate yes/no
• Calculate a number of features from each paragraph
  • e.g., number of text characters, number of HTML tags, position in HTML document
• Train a classifier to reproduce the human's decisions
  • we use an artificial neural network (multilayer perceptron)
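
The feature step can be sketched as follows. This is not the authors' feature set or code: the three features are only the examples named above, the regex-based tag matching is a crude stand-in for a real HTML parser, and all names are hypothetical. The resulting dictionaries would be the input to whatever classifier is trained.

```python
import re

# Crude tag pattern; an assumption that is good enough for a toy example.
TAG = re.compile(r"<[^>]+>")

def paragraph_features(html_paragraph, index, total):
    """Compute a few simple features for one paragraph of an HTML document."""
    tags = TAG.findall(html_paragraph)
    text = TAG.sub("", html_paragraph)
    return {
        "n_text_chars": len(text.strip()),
        "n_tags": len(tags),
        # 0.0 = first paragraph in the document, 1.0 = last paragraph
        "relative_position": index / max(total - 1, 1),
    }

feats = paragraph_features('<p>Read <a href="/more">more</a></p>', index=0, total=3)
print(feats)
```

A short paragraph with many tags relative to its text, such as the "Read more" link here, is the kind of pattern a classifier can learn to associate with boilerplate.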

Automatic detection of boilerplate (II)
Automatic detection is not perfect:
• some boilerplate will not be discovered (recall < 1)
• some paragraphs will be mistakenly classified as boilerplate (precision < 1)
Keep this in mind when working with web corpora, and double-check any implausible frequency figures.
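
For reference, precision and recall of a boilerplate classifier can be computed from gold and predicted labels as in this small sketch. The label lists are invented for illustration, with `True` meaning "boilerplate".

```python
def precision_recall(gold, predicted):
    """Precision and recall for the positive class (True = boilerplate)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels for five paragraphs.
gold = [True, True, True, False, False]
predicted = [True, True, False, True, False]

precision, recall = precision_recall(gold, predicted)
print(precision, recall)
```

Here one boilerplate paragraph is missed (recall below 1) and one text paragraph is wrongly flagged (precision below 1), which are exactly the two error types listed above.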

Boilerplate: removal vs. flagging
“Classic” approach (e.g., WaCky, COW2012): remove boilerplate.
• Alternative: do not remove, but flag (possibly with a confidence score)
• Pro: users do not have to rely on:
  • the corpus designer's definition of boilerplate
  • the performance of an automatic classifier
• Con:
  • increase in corpus size

Document filtering

Document filtering
Ideally, the final corpus should contain only “good” documents.
“Good”:
• only documents in the target language
• documents containing predominantly text (i.e., coherent and connected text)
This excludes certain document types:
• lists (e.g., company names, vocabulary items)
• tag clouds
• etc.

Example: a “good” document
[Figure not preserved]

Distinguishing the “good” from the “bad” (I)
• Classification has to be performed automatically
• ML algorithms usually need manually annotated training data
• But: the decision is difficult even for humans and arbitrary to some extent
Experiment: 3 raters, 1000 random documents from a German web corpus, using a 5-point scale:
• −2, −1: document should not be in the corpus
• 1, 2: document should be in the corpus
• 0: undecided/document might or might not be in the corpus

Distinguishing the “good” from the “bad” (II)
Results (after a training phase in which 100 documents were rated together, with several hours of discussion of borderline cases):
statistic      early 500   late 500   all 1,000
raw            0.566       0.300      0.433
κ (raw)        0.397       0.303      0.367
ICC(C, 1)      0.756       0.679      0.725
raw (r ≥ 0)    0.900       0.762      0.831
raw (r ≥ 1)    0.820       0.674      0.747
κ (r ≥ 0)      0.673       0.625      0.660
κ (r ≥ 1)      0.585       0.555      0.598
κ (r ≥ 2)      0.546       0.354      0.498
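
To make the statistics concrete, here is a hedged sketch of raw agreement and Cohen's κ for two raters, on the full scale and after binarizing at a threshold such as r ≥ 0. The ratings are invented, and the slides do not say which κ variant was used for three raters, so the two-rater Cohen's κ below is an assumption, not a reproduction of the reported figures.

```python
from collections import Counter

def raw_agreement(a, b):
    """Proportion of items on which both raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Cohen's kappa for two raters (undefined if expected agreement is 1)."""
    n = len(a)
    p_o = raw_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

def binarize(ratings, threshold):
    """Map ratings on the 5-point scale to accept (True) / reject (False)."""
    return [r >= threshold for r in ratings]

# Hypothetical ratings of eight documents on the scale -2..2.
rater_a = [2, 1, 0, -1, -2, 2, 1, -1]
rater_b = [2, 2, 1, -1, -1, 1, 0, 0]

print(raw_agreement(rater_a, rater_b))
print(cohen_kappa(binarize(rater_a, 0), binarize(rater_b, 0)))
```

As in the table, agreement on the binarized decision is much higher than agreement on the exact scale value.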

“Good” vs. “bad” documents: lists (I–III)
[Figures not preserved]

Acceptable results?
• values below 0.68 [Krippendorff, 1980]
• even considering criticism of Krippendorff's magic number [Carletta, 1996, Bayerl and Paul, 2011]: uncomfortably low for a “gold standard”
• more confusion on late data (lower overall quality)
• worse: disagreement between corpus designers
• acceptance at the threshold ≥ 0: A: 78.4%, R: 73.8%, S: 84.9%

General method and idea
• simple metric with known properties
• language-independent, unsupervised... does not involve an obviously difficult design decision
• strategy for cleansing: high recall for everyone, accept mediocre precision
• for retained documents: use as annotation in the final corpus, allow corpus users to “set” precision

Implementation
• based on the “frequent/short word” method in language identification [Grefenstette, 1995]
• similar to WaCky [Baroni et al., 2009], but without manually compiled lists of function words
• totally unsupervised procedure for crawled data predominantly in a single language (TLD crawl):
  • training: get the weighted mean and standard deviation of the relative frequencies of the most frequent words (“profile”)
  • production: for the top m of them, calculate the “standardized” negative deviation for each document
  • clamped and added up: the Badness score
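
The procedure above can be sketched roughly as follows. This is an illustration under several assumptions, not the authors' implementation: documents are weighted by their length, the clamp value of 5.0 is arbitrary, words with zero standard deviation are skipped, and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def relative_frequencies(tokens):
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}

def train_profile(documents, n_words):
    """Length-weighted mean and standard deviation of the relative
    frequencies of the n_words most frequent words (the 'profile')."""
    totals = Counter(t for doc in documents for t in doc)
    words = [w for w, _ in totals.most_common(n_words)]
    weights = [len(doc) for doc in documents]  # assumption: weight = document length
    weight_sum = sum(weights)
    freqs = [relative_frequencies(doc) for doc in documents]
    profile = {}
    for w in words:
        mean = sum(wt * f.get(w, 0.0) for wt, f in zip(weights, freqs)) / weight_sum
        var = sum(wt * (f.get(w, 0.0) - mean) ** 2 for wt, f in zip(weights, freqs)) / weight_sum
        profile[w] = (mean, sqrt(var))
    return profile

def badness(tokens, profile, clamp=5.0):
    """Sum of clamped standardized negative deviations from the profile."""
    freqs = relative_frequencies(tokens)
    score = 0.0
    for w, (mean, sd) in profile.items():
        if sd == 0:  # assumption: skip words without variation
            continue
        z = (mean - freqs.get(w, 0.0)) / sd  # positive if w is rarer than expected
        score += min(max(z, 0.0), clamp)
    return score

# Hypothetical tokenized training documents.
training = [
    ["the", "cat", "of", "the", "house"],
    ["the", "dog", "of", "a", "man"],
    ["the", "end", "of", "the", "day"],
]
profile = train_profile(training, n_words=2)
print(badness(["the", "way", "of", "the", "world"], profile))
print(badness(["berlin", "paris", "rome"], profile))
```

A list-like document without frequent function words receives a higher score than running text, which is the intended behaviour of such a metric.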

Duplication

Duplicate sentences in a corpus concordance
Query: made a donation
[Figures not preserved]

Near duplication and inclusion
• Many web pages are not perfect duplicates of others, but just very similar.
• One typical source: slightly edited texts from news agencies in several newspapers.
• Also, some web pages fully contain other web pages (inclusion).
• The inclusion situation is made worse by typical blog and content management systems.

Failure of perfect duplicate hashing/fingerprinting
D1: Yesterday, I wrote a letter to my parents.
D2: Yesterday I wrote a letter to my parents.
• Hashes (e.g., SHA1: very strong, maybe too strong for the given task):
  • 12d952 23 3869faead1e1aa6e02b98256e35f8
  • 5ad8468f b45fe07ef420fd87 d360087514d1f1
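
The effect is easy to reproduce with Python's standard `hashlib` module. The digests printed by this sketch depend on the exact bytes hashed (encoding, trailing whitespace), so they need not match the values shown on the slide.

```python
import hashlib

d1 = "Yesterday, I wrote a letter to my parents."
d2 = "Yesterday I wrote a letter to my parents."

def sha1_hex(text):
    """SHA-1 digest of the UTF-8 encoded text as a hex string."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

print(sha1_hex(d1))
print(sha1_hex(d2))
print(sha1_hex(d1) == sha1_hex(d2))  # False
```

A single missing comma changes the whole digest, so exact hashing finds perfect duplicates only.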

Solution (in principle): Jaccard coefficients
• One can measure the similarity of two documents as the Jaccard coefficient of the sets of their n-grams [Manning et al., 2009, 61, 438].
• Procedure:
  • Create the documents' word/token n-grams.
  • Calculate the Jaccard coefficient of the two sets.
  • If it is above a certain threshold, delete the shorter of the two documents.

Example: High similarity with JC over token bigrams I
D1: Yesterday, I wrote a letter to my parents.
FP(D1) = {(yesterday; ,), (,; i), (i; wrote), (wrote; a), (a; letter), (letter; to), (to; my), (my; parents), (parents; .)}
D2: Yesterday I wrote a letter to my parents.
FP(D2) = {(yesterday; i), (i; wrote), (wrote; a), (a; letter), (letter; to), (to; my), (my; parents), (parents; .)}

Example: High similarity with JC over token bigrams II
FP(D1) = {(yesterday; ,), (,; i), (i; wrote), (wrote; a), (a; letter), (letter; to), (to; my), (my; parents), (parents; .)}
FP(D2) = {(yesterday; i), (i; wrote), (wrote; a), (a; letter), (letter; to), (to; my), (my; parents), (parents; .)}
$J(D_1, D_2) = \frac{|D_1 \cap D_2|}{|D_1 \cup D_2|} = \frac{7}{10} = 0.7$

Example: Low similarity with JC over token bigrams
FP(D2) = {(yesterday; i), (i; wrote), (wrote; a), (a; letter), (letter; to), (to; my), (my; parents), (parents; .)}
D3: Yesterday I read a book to my niece.
FP(D3) = {(yesterday; i), (i; read), (read; a), (a; book), (book; to), (to; my), (my; niece), (niece; .)}
$J(D_2, D_3) = \frac{|D_2 \cap D_3|}{|D_2 \cup D_3|} = \frac{2}{14} \approx 0.14$
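
The worked examples can be checked with a few lines of code. The tokenizer here is an assumption (lowercasing, punctuation marks as separate tokens) chosen so that it reproduces the bigram sets shown on the slides.

```python
import re

def tokens(text):
    """Lowercased word tokens, with each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def bigrams(text):
    toks = tokens(text)
    return set(zip(toks, toks[1:]))

def jaccard(a, b):
    """Jaccard coefficient of two non-empty sets."""
    return len(a & b) / len(a | b)

d1 = "Yesterday, I wrote a letter to my parents."
d2 = "Yesterday I wrote a letter to my parents."
d3 = "Yesterday I read a book to my niece."

print(jaccard(bigrams(d1), bigrams(d2)))  # 0.7
print(round(jaccard(bigrams(d2), bigrams(d3)), 2))  # 0.14
```

Note that `jaccard` divides by zero if both sets are empty, so very short documents need separate handling.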

Can it be done?
• 10,000,000 documents
• $\frac{10{,}000{,}000^2}{2} = 5 \times 10^{13}$ comparisons
• at 1 µs per comparison: ≈ 579 days
• But 1 µs is faster than available processors can do it.
Solution:
• Do not calculate the JC for each pair of documents.
• Instead, estimate the JC [Broder et al., 1997].
• The technique is known as w-shingling.
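
The arithmetic and the estimation idea can be sketched together. The estimator below is a simple MinHash-style sketch using seeded SHA-1 hashes; this is a stand-in chosen for illustration and not necessarily the exact scheme of Broder et al. [1997]. The function names and the number of hash functions are assumptions.

```python
import hashlib

# Cost of the naive approach, as on the slide.
n_docs = 10_000_000
comparisons = n_docs ** 2 // 2
days = comparisons * 1e-6 / 86_400  # at 1 microsecond per comparison
print(comparisons, round(days))     # 50000000000000 579

def shingles(tokens, w):
    """Set of all contiguous token sequences of length w."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def _hash(shingle, seed):
    data = (str(seed) + "\x00" + " ".join(shingle)).encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def signature(shingle_set, num_hashes):
    """Smallest hash value per seeded hash function (set must be non-empty)."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = shingles("yesterday i wrote a letter to my parents .".split(), 2)
b = shingles("yesterday i read a book to my niece .".split(), 2)
print(estimate_jaccard(signature(a, 200), signature(b, 200)))
```

Each document is reduced once to a short fixed-length signature, and signatures are much cheaper to store and compare than full n-gram sets. Avoiding the quadratic number of pairs additionally requires grouping documents by shared signature values, which this sketch leaves out.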

Near-duplicate detection with w-shingling
w-shingling does not tell us which amount of duplication is acceptable in a corpus.
• Setting a particular JC as a threshold for keeping/discarding a document is ultimately an arbitrary choice
  • Design decision by corpus builders.
• Web corpora will probably always contain a certain amount of duplication
  • many other corpora do, too
• Strategies to cope with duplication:
  • If compatible with the research question: work with sentence-wise uniq'ed corpora.
  • Or work with unique (not too short) concordance lines.
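
Sentence-wise uniq'ing can be as simple as the following sketch. The normalization (lowercasing and collapsing whitespace) is an assumption; whether it is appropriate depends on the research question.

```python
def unique_sentences(sentences):
    """Yield each sentence once, keeping the first occurrence in order."""
    seen = set()
    for sentence in sentences:
        key = " ".join(sentence.lower().split())  # assumed normalization
        if key not in seen:
            seen.add(key)
            yield sentence

# Hypothetical concordance lines.
hits = [
    "She made a donation to the club.",
    "He made a donation last year.",
    "She made a donation  to the club.",
]
print(list(unique_sentences(hits)))
```

Keeping the set of seen keys in memory works for concordance results; for a giga-token corpus, hashing the keys or sorting on disk would be the usual workaround.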

Linguistic post-processing

Noise in web corpora: sources
1. properties of web documents, e.g.
  • quasi-spontaneous writing situations with (often) casual language, typos, non-standard spellings, improper use of whitespace and punctuation
  • text contributed by non-native speakers
  • a large lexicon with many out-of-vocabulary items
2. the structure of the WWW and its technologies, e.g.
  • duplicated text, boilerplate
3. shortcomings in post-processing, e.g.
  • HTML/code-stripping, tokenization

Example: spelling variants
Correct form (frequency): übernimmt (297440)
Misspelled form          N    Edit distance
ubernimmt                81   1
überninmmt               6    1
übernimmnt               1    1
überniemt                6    1
öbernimmt                6    1
übernimnmt               5    1
übernimmet               11   1
überniehmt               2    2
überniemt                6    1
übernihmt                17   1
überniiiiiiiimmmmmmmt    1    12
überniommt               4    1
übernimrnt               4    2
überniummt               2    1
überninnt                2    2
übernimmtt               2    1
...
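
The slide does not say which edit distance was used; assuming plain Levenshtein distance (insertions, deletions, substitutions, each at cost 1), several of the table values can be reproduced with this sketch.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (unit costs)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution or match
            ))
        previous = current
    return previous[-1]

for variant in ["ubernimmt", "überniehmt", "übernimrnt", "überninnt"]:
    print(variant, edit_distance("übernimmt", variant))
```

The comparison works on Unicode code points, so "ü" counts as one character only if both strings use the same (precomposed) normalization form.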

Example: text produced by non-native speakers
[Figure not preserved]

Noise in linguistic post-processing
Most available natural language processing tools expect standard written language as input.
• Many tokenizers expect proper use of whitespace, casing and punctuation.
• Part-of-speech taggers expect properly tokenized text as input.
• POS taggers usually perform worse on unknown items (tokenization errors, misspellings, out-of-vocabulary items).
• Proper tokenization and POS tagging are normally the basis for higher-level linguistic annotation (parsing, sense disambiguation etc.).

Should noise be “normalized away”?
What exactly is “noise” in the first place?
• No general definition of “noise” that suits all purposes
• What counts as noise in one task may be valuable data in another task

Can noise be “normalized away”?
In large web corpora, noise would have to be eliminated automatically, without human interaction.
• viable in some cases:
  • e.g., correcting run-together words
    a lot.I don't think → a lot. I don't think
• but not an easy task in other cases:
  • e.g., spelling correction
    I didn't spend much time with Bibble. → Bubble?
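
The run-together case is simple enough to sketch with a regular expression. The pattern is an assumption made for this example (lowercase letter, sentence-final punctuation, uppercase letter) and not the rule used for any particular corpus.

```python
import re

# Assumed pattern: ".", "!" or "?" directly between a lowercase and an uppercase letter.
RUN_TOGETHER = re.compile(r"(?<=[a-z])([.!?])(?=[A-Z])")

def split_run_together(text):
    """Insert a space after sentence-final punctuation that lacks one."""
    return RUN_TOGETHER.sub(r"\1 ", text)

print(split_run_together("a lot.I don't think"))  # a lot. I don't think
print(split_run_together("see www.example.com"))  # unchanged
```

Even this easy case has false positives: a name such as "Node.JS" would be split, which is one reason to record such changes as annotation rather than overwrite the original text.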

Non-destructive normalization
Our approach: do as much as possible non-destructively (i.e., do not alter the original data).
• Normalization-as-annotation
• But there are limits to this approach imposed by the indexing/query architecture.

Non-destructive orthographic normalization
Original text: spelling error, POS-tagging error, no lemma
Word          POS   Lemma
The           DT    the
FA            NP    FA
does          VBZ   do
abosolutley   JJ    <unknown>
nothing       NN    nothing
to            TO    to
help          VB    help
Clubs         NNS   club
Normalized text: standard spelling, correct POS and lemma, original spelling preserved as metadata
Word          POS   Lemma
The           DT    the
FA            NP    FA
does          VBZ   do
<norm from="abosolutley">
absolutely    ADV   absolutely
</norm>
nothing       NN    nothing
to            TO    to
help          VB    help
Clubs         NNS   club
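
One way to represent normalization-as-annotation in code is to keep the original spelling as an optional field on each token and emit the `<norm>` element only when it is set. The `Token` class and `to_vertical` function are hypothetical names for this sketch, and attribute values are not escaped here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    word: str
    pos: str
    lemma: str
    original: Optional[str] = None  # set only if the word was normalized

def to_vertical(tokens):
    """Render tokens one per line, wrapping normalized ones in <norm>."""
    lines = []
    for t in tokens:
        row = f"{t.word}\t{t.pos}\t{t.lemma}"
        if t.original is not None:
            lines += [f'<norm from="{t.original}">', row, "</norm>"]
        else:
            lines.append(row)
    return "\n".join(lines)

tokens = [
    Token("does", "VBZ", "do"),
    Token("absolutely", "ADV", "absolutely", original="abosolutley"),
    Token("nothing", "NN", "nothing"),
]
print(to_vertical(tokens))
```

Because the original form survives as an attribute, users can still query for the misspelling itself, subject to what the indexing and query software supports.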

Quality of linguistic annotation in web corpora
Automatic annotation may be a concern, e.g.:
• Do not take type counts in web corpora at face value.
  • There are many hapax legomena due to tokenization issues.
• POS tagging of web corpora does not quite reach accuracy levels in the upper 90% range
  • but many POS taggers do not reach such accuracy levels on traditional, out-of-domain texts either
But:
• Depending on the specific research question, some or all of these problems may be irrelevant.
• Anyway, in many scenarios, there is no alternative to working with web corpora.
• And we (and others) are working to improve it.

Evaluation

What's in a web corpus
Crawled corpus:
• No stratified sampling of documents
• Exact composition of the final corpus not known beforehand.
• Establish corpus composition after corpus construction.

Web document classification
• COWCat: classification scheme based on Sharoff [2006]
• Five dimensions: Authorship, Mode, Audience, Aim, Domain
• Annotation guidelines available from: http://hpsg.fu-berlin.de/cow/files/cowcat2013.pdf

UKCOW2012: Mode
Type             %      95% CI (±%)
Written          90.5   2.8
Spoken           0.7    0.8
Quasi-Spont.     6.6    2.4
Blogmix          2.2    1.4

UKCOW2012: Audience
Type             %      95% CI (±%)
General          87.4   3.2
Informed         7.3    2.5
Professional     5.3    2.2

UKCOW2012: Authorship
Type             %      95% CI (±%)
Single, female   7.8    2.6
Single, male     21.6   4.0
Multiple         13.8   3.3
Corporate        26.0   4.2
Unknown          30.8   4.5

UKCOW2012: Aim
Type             %      95% CI (±%)
Recommendation   10.2   2.9
Instruction      1.9    1.3
Information      10.7   3.0
Discussion       75.5   4.2
Fiction          1.7    1.2

Extrinsic evaluation: Biemann et al. [2013]
Task-oriented evaluation: collocation extraction
• Work by Stefan Evert
• Different corpora (web and traditional) compared against a gold standard
name               size (tokens)   corpus type        basic unit
BNC                0.1 G           reference corpus   text
WP500              0.2 G           Wikipedia          fragment
Wackypedia         1.0 G           Wikipedia          article
ukWaC              2.1 G           web corpus         web page
WebBase            3.3 G           web corpus         paragraph
UKCOW              4.0 G           web corpus         sentence
LCC                0.9 G           web corpus         sentence
LCC (f ≥ k)        0.9 G           web n-grams        n-gram
Web1T5 (f ≥ 40)    1000.0 G        web n-grams        n-gram
[Columns “POS tagged” and “lemmatized”: marks not preserved]

Verb-particle combinations: Biemann et al. [2013] (II)
• Gold standard: 3,078 English verb-particle combinations, manually classified as non-compositional (carry on, knock out) or compositional (bring together, peer out)
• Extracted co-occurrence counts for the word pairs (3-word span to the right of the verb)
• Clean, annotated web corpora seem to be a valid replacement for a traditional reference corpus such as the BNC.
• Diversity may be a relevant factor: Web corpora of more than 1G words are needed to rival the 100M-word BNC.
• Web1T5: sheer size cannot make up for “messy” content and lack of annotation.
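
Counting co-occurrences in a 3-word span to the right of the verb can be sketched as below. This is not the evaluation code of Biemann et al. [2013]: the tag test (`pos` starting with "V"), the use of word forms instead of lemmas, and the particle list are all assumptions for illustration.

```python
from collections import Counter

def cooccurrences(tagged_tokens, particles, span=3):
    """Count (verb, particle) pairs where the particle occurs within
    `span` tokens to the right of the verb."""
    counts = Counter()
    for i, (word, pos) in enumerate(tagged_tokens):
        if not pos.startswith("V"):
            continue
        for other, _ in tagged_tokens[i + 1:i + 1 + span]:
            if other in particles:
                counts[(word, other)] += 1
    return counts

# Hypothetical tagged sentence and particle list.
sentence = [("they", "PRP"), ("carry", "VB"), ("it", "PRP"), ("on", "RP"), (".", "SENT")]
counts = cooccurrences(sentence, particles={"on", "out", "together"})
print(counts)
```

The raw pair counts would then be turned into association scores and ranked against the gold standard, a step this sketch leaves out.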

References
M. Baroni, S. Bernardini, A. Ferraresi, and E. Zanchetta. The WaCky Wide Web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226, 2009.
P. S. Bayerl and K. I. Paul. What determines inter-coder agreement in manual annotations? A meta-analytic investigation. Computational Linguistics, 37(4):699–725, 2011.
C. Biemann, F. Bildhauer, S. Evert, D. Goldhahn, U. Quasthoff, R. Schäfer, J. Simon, L. Swiezinski, and T. Zesch. Scalable construction of high-quality web corpora. Special issue of JLCL, 2013.
A. Broder, R. Kumar, F. Maghoul, P. Raghavan, R. Stata, A. Tomkins, and J. L. Wiener. Graph structure in the web. In Proceedings of the 9th International World Wide Web conference on Computer Networks: The International Journal of Computer and Telecommunications Networking, pages 309–320. North-Holland Publishing Co, 2000.
A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the Web. Technical Note 1997-115, SRC, Palo Alto, July 25 1997.
J. Carletta. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249–254, 1996.
P. Cook and G. Hirst. Do web-corpora from top-level domains represent national varieties of English? In Proceedings of the 11th International Conference on the Statistical Analysis of Textual Data, pages 281–293, Liège, 2012.
G. Grefenstette. Comparing two language identification schemes. In Proceedings of the 3rd International Conference on Statistical Analysis of Textual Data (JADT 1995), pages 263–268, Rome, 1995.
K. Krippendorff. Content Analysis: An Introduction to its Methodology. Sage Publications, Beverly Hills, 1980.
C. Manning, P. Raghavan, and H. Schütze. An Introduction to Information Retrieval. CUP, Cambridge, 2009.
M. A. Serrano, A. Maguitman, M. Boguñá, S. Fortunato, and A. Vespignani. Decoding the structure of the WWW: A comparative analysis of Web crawls. ACM Trans. Web, 1(2), 2007.
S. Sharoff. Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini, editors, WaCky! Working papers on the Web as Corpus, pages 63–98. GEDIT, Bologna, 2006.