Linguistic research with large annotated web corpora

Felix Bildhauer and Roland Schäfer

German Grammar and Linguistics (FU Berlin)

HPSG 2013 pre-conference tutorial

August 26, Berlin


Overview

• Why use web corpora?
• Web corpus construction: Overview
• Data collection
• Boilerplate detection
• Document filtering
• Duplication
• Linguistic post-processing
• Evaluation

Felix Bildhauer and Roland Schäfer 2013, German Grammar and Linguistics (FU Berlin)


Why use web corpora?


Why corpora from the web?

• Size: alleviates data sparseness problems with rare phenomena
• Content: registers not found in traditional corpora (forum discussions, blogs, fan fiction, etc.)
• Replicability: research with a static corpus vs. research based on search engine results
• Availability: data can be processed locally with preferred tools
• Cost: (virtually) free



Some potential drawbacks of web corpora

• Noise: from properties of web documents and from imperfect processing
• Metadata (of the kind linguists are interested in): not encoded in web documents in a reliable way




Web corpus construction: Overview




Web corpus construction: workflow

Data collection
↓
Removal of markup, scripts, etc.
↓
Detection/removal of "boilerplate"
↓
Detection/removal of "non-texts"
↓
Detection/removal of (near-)duplicates
↓
Linguistic post-processing

These steps involve design decisions which affect the properties of the final corpus.




Workflow II

Data collection
↓
Removal of markup, scripts, etc.
↓
Detection/removal of "boilerplate"
↓
Detection/removal of "non-texts"
↓
Detection/removal of (near-)duplicates
↓
Linguistic post-processing

• Which sampling procedure/crawling strategy?
• What should count as "boilerplate"?
• Which documents contribute "good text"?
• How much duplication is acceptable?
• How do we treat non-standard orthography, spelling errors, non-words, etc.?



Data collection


Links

• The web consists of pages and links between them.
• Each page has an in-degree (number of links pointing to that page) and an out-degree (number of links on that page).
• Pages with a higher in-degree are easier to find, but they are not necessarily the more interesting ones for corpus construction.
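As a small illustration of the two notions, the following sketch counts in-degrees and out-degrees for a toy link graph. The edge list and the function name `degrees` are made up for this example and do not come from any crawler.

```python
from collections import Counter


def degrees(links):
    """Return (in_degree, out_degree) counters for a list of (source, target) links."""
    out_degree = Counter(source for source, _ in links)
    in_degree = Counter(target for _, target in links)
    return in_degree, out_degree


# Hypothetical toy web: three pages and four links.
links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
in_degree, out_degree = degrees(links)

# Page "c" is linked to by "a" and "b"; page "a" links to "b" and "c".
print(in_degree["c"], out_degree["a"])
```

In a real crawl the in-degree of a page can only be estimated, because it depends on pages that may never be visited.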



The structure of the web

[Figure: structure of the web graph with the components IN, SCC, OUT, TUBE, and TENDRIL; Manning et al., 2009, 427]

Contrary to common intuition, Broder et al. [2000] found that the IN, OUT, SCC, and TENDRIL components are not extremely different in size.
A more detailed report on the sizes: Serrano et al. [2007].



Inaccessible web content (deep web)

• explicitly hidden from spiders (Robots Exclusion, intentional fooling/spoofing)
• requiring login (social networks, some forums, intranet gateways)
• in-degree = 0
• in-degree > 0, but too far away from any seed URL (seed URLs are mostly search-engine indexed pages; depends on crawling strategy)



Static and dynamic web content

• By definition, static web content is any page which is not generated by a database/content generation system, but edited manually.
• Today, most content is dynamic (CMS, blogs, etc.), and the static/dynamic distinction should not play a great role in crawling for corpus construction.
• The problem is rather the separation of (static or dynamic) linguistically relevant from irrelevant content (time tables, product listings, etc.).



Top-level domains and national dialects

• It is common practice to crawl national TLDs for content in the corresponding national language.
• Cook and Hirst [2012] conclude that the TLDs .ca and .uk indeed represent national variants of English.
• Problems with TLD crawls:
  • In some countries, .com has high prestige: http://elpais.com, http://www.lavanguardia.com/, http://www.marca.com/
  • TLDs of countries with more than one official language (e.g., Indonesia, Spain) yield relatively fewer results in the target language.


Breadth-first crawling

Common practice (e.g., WaCky, COW): follow all links on a page

Pro:
• effective: yields lots of data within a short time
• open source software available

Con:
• very susceptible to page rank: no uniform random sample
• unknown sampling bias
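The idea of following all links on a page can be sketched on an in-memory toy graph. This is only an illustration under stated assumptions: `toy_web`, `breadth_first_crawl` and `max_pages` are hypothetical names, and no real crawler (downloading, politeness, URL normalization) is modelled.

```python
from collections import deque


def breadth_first_crawl(graph, seeds, max_pages):
    """Visit pages in breadth-first order, following all links on each page."""
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)  # a real crawler would download the page here
        for target in graph.get(page, []):
            if target not in seen:  # enqueue every link target not seen before
                seen.add(target)
                queue.append(target)
    return visited


# Hypothetical toy web: page -> list of pages it links to.
toy_web = {"seed": ["a", "b"], "a": ["c"], "b": ["c", "d"], "c": [], "d": ["seed"]}
order = breadth_first_crawl(toy_web, ["seed"], max_pages=10)
print(order)
```

Pages that many other pages link to enter the queue early, which is one way to see why such a crawl is not a uniform random sample.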



Alternative: Random walks

Not yet used for web corpora: follow one link on a page

Pro:
• known sampling bias: can be corrected for
• uniform random sample of web documents

Con:
• much less effective than BF (breadth-first crawling)
• probably not suitable for giga-token corpora
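For contrast with breadth-first crawling, a plain random walk can be sketched as follows. The graph, the function name `random_walk`, and the restart-at-dead-ends rule are assumptions made for this example; the correction for the known sampling bias is not shown.

```python
import random


def random_walk(graph, start, steps, seed=0):
    """Follow one randomly chosen link per page for a fixed number of steps."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    page = start
    path = [page]
    for _ in range(steps):
        targets = graph.get(page, [])
        if targets:
            page = rng.choice(targets)  # follow exactly one link
        else:
            page = start  # dead end: restart (a simplification for this sketch)
        path.append(page)
    return path


# Hypothetical toy web: page -> list of pages it links to.
toy_web = {"seed": ["a", "b"], "a": ["c"], "b": ["c", "d"], "c": [], "d": ["seed"]}
path = random_walk(toy_web, "seed", steps=20)
print(path)
```

Each step yields at most one new page, which illustrates why this strategy collects data much more slowly than following all links.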




Boilerplate detection





Boilerplate is text accompanying the main text on a web page:
• navigation bars
• banners
• advertisements
• other layout elements
• copyright notes

Boilerplate is typically not text produced by a person on a particular occasion, but rather
• generated by content management systems
• similar or identical on many web pages of the same website
• similar or identical across websites




Why detect boilerplate?

Boilerplate elements bias the frequency of linguistic items in the final corpus.
• One of the most frequent tokens in an experimental German web corpus: mehr 'more', as in read more...
• One of the most frequent sentences in an experimental English web corpus: You are not allowed to post new content in the forum.




Text (I–III)

[Figures not reproduced]

What should count as boilerplate? (I–X)

[Figures not reproduced]




Automatic detection of boilerplate

Boilerplate must be detected automatically.

Machine learning techniques:
• Manually annotate a number of paragraphs: boilerplate yes/no
• Calculate a number of features for each paragraph
  • e.g., number of text characters, number of HTML tags, position in the HTML document
• Train a classifier to reproduce the human's decisions
  • we use an artificial neural network (multilayer perceptron)
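The feature step above could look roughly like the following sketch. It computes only the three example features named on the slide, uses a crude regular expression instead of a real HTML parser, and all names (`paragraph_features`, `relative_position`, and so on) are hypothetical; the actual feature set used for the COW corpora is not specified here.

```python
import re

# Crude tag matcher; a real system would use a proper HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def paragraph_features(paragraph_html, index, n_paragraphs):
    """Compute a few simple features for one paragraph of an HTML document."""
    tags = TAG_RE.findall(paragraph_html)
    text = TAG_RE.sub("", paragraph_html)
    return {
        "n_text_chars": len(text.strip()),  # number of text characters
        "n_tags": len(tags),                # number of HTML tags
        # position in the document, scaled to the range 0..1
        "relative_position": index / max(n_paragraphs - 1, 1),
    }


features = paragraph_features('<a href="/more">read more</a>', index=9, n_paragraphs=10)
print(features)
```

Feature dictionaries of this kind, paired with the manual yes/no annotations, would then serve as training data for the classifier.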




Automatic detection of boilerplate (II)

Automatic detection is not perfect:
• some boilerplate will not be discovered (recall < 1)
• some paragraphs will be mistakenly classified as boilerplate (precision < 1)

Keep this in mind when working with web corpora, and double-check any implausible frequency figures.
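To make the two error types concrete, here is a minimal sketch that computes precision and recall for the boilerplate class from parallel lists of gold and predicted labels. The label lists are invented for illustration.

```python
def precision_recall(gold, predicted):
    """gold, predicted: parallel lists of booleans (True = boilerplate)."""
    pairs = list(zip(gold, predicted))
    tp = sum(g and p for g, p in pairs)            # boilerplate found
    fp = sum((not g) and p for g, p in pairs)      # text wrongly flagged
    fn = sum(g and (not p) for g, p in pairs)      # boilerplate missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical annotations for five paragraphs.
gold = [True, True, True, False, False]
predicted = [True, True, False, True, False]
precision, recall = precision_recall(gold, predicted)
print(precision, recall)
```

Here one boilerplate paragraph is missed and one text paragraph is wrongly flagged, so both values are 2/3.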




Boilerplate: removal vs. flagging

• "Classic" approach (e.g., WaCky, COW2012): remove boilerplate.
• Alternative: do not remove, but flag (possibly with a confidence score)
• Pro: users do not have to rely on
  • the corpus designer's definition of boilerplate
  • the performance of an automatic classifier
• Con:
  • increase in corpus size



Document filtering


Ideally, the final corpus should contain only "good" documents.

"Good":
• only documents in the target language
• documents containing predominantly text (i.e., coherent and connected text)

This excludes certain document types:
• lists (e.g., company names, vocabulary items)
• tag clouds
• etc.



Example: a "good" document

[Figure not reproduced]



Distinguishing the "good" from the "bad" (I)

• Classification has to be performed automatically
• ML algorithms usually need manually annotated training data
• But: the decision is difficult even for humans and arbitrary to some extent

Experiment: 3 raters, 1,000 random documents from a German web corpus, using a 5-point scale:
• −2, −1: document should not be in the corpus
• 1, 2: document should be in the corpus
• 0: undecided/document might or might not be in the corpus



Distinguishing the "good" from the "bad" (II)

Results (after a training phase in which 100 documents were rated together, with several hours of discussion of borderline cases):

statistic     early 500   late 500   all 1,000
raw           0.566       0.300      0.433
κ (raw)       0.397       0.303      0.367
ICC(C, 1)     0.756       0.679      0.725
raw (r ≥ 0)   0.900       0.762      0.831
raw (r ≥ 1)   0.820       0.674      0.747
κ (r ≥ 0)     0.673       0.625      0.660
κ (r ≥ 1)     0.585       0.555      0.598
κ (r ≥ 2)     0.546       0.354      0.498



"Good" vs. "bad" documents: lists (I–III)

[Figures not reproduced]



Acceptable results?

• values below 0.68 [Krippendorff, 1980]
• even considering criticism of Krippendorff's magic number [Carletta, 1996, Bayerl and Paul, 2011]: uncomfortably low for a "gold standard"
• more confusion on late data (lower overall quality)
• worse: disagreement between corpus designers
• acceptance at the threshold ≥ 0: A: 78.4%, R: 73.8%, S: 84.9%



General method and idea

• simple metric with known properties
• language-independent, unsupervised...; does not involve an obviously difficult design decision
• strategy for cleansing: high recall for everyone, accept mediocre precision
• for retained documents: use as annotation in the final corpus, allow corpus users to "set" precision



Implementation

• based on the "frequent/short word" method in language identification [Grefenstette, 1995]
• similar to WaCky [Baroni et al., 2009], but without manually compiled lists of function words
• totally unsupervised procedure for crawled data predominantly in a single language (TLD crawl):
  • training: get weighted mean and standard deviation of relative frequencies of the most frequent words ("profile")
  • production: calculate for the top m of them the "standardized" negative deviation for each document
  • clamped and added up: the Badness score
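The following is a rough sketch of how such a score could be computed, not the authors' implementation. Several details are assumptions made for illustration only: an unweighted mean is used (the slide mentions a weighted mean without giving the weights), the clamp value of 5.0 is arbitrary, the profile size plays the role of m, and the function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_frequencies(tokens):
    """Relative frequency of each word type in one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


def train_profile(documents, n_words):
    """Mean and standard deviation of the relative frequencies of the
    n_words most frequent words (assumption: unweighted mean)."""
    overall = Counter(token for doc in documents for token in doc)
    top_words = [word for word, _ in overall.most_common(n_words)]
    per_doc = [relative_frequencies(doc) for doc in documents]
    return {
        word: (mean(r.get(word, 0.0) for r in per_doc),
               pstdev([r.get(word, 0.0) for r in per_doc]))
        for word in top_words
    }


def badness(tokens, profile, clamp=5.0):
    """Sum of clamped standardized negative deviations from the profile."""
    rel = relative_frequencies(tokens)
    score = 0.0
    for word, (mu, sigma) in profile.items():
        if sigma == 0:
            continue  # no variation in training data: skip this word
        deviation = (mu - rel.get(word, 0.0)) / sigma  # > 0 if under-represented
        score += min(max(deviation, 0.0), clamp)
    return score


# Hypothetical miniature training corpus.
training_docs = [
    ["the", "cat", "and", "the", "dog"],
    ["the", "man", "and", "a", "dog"],
    ["a", "cat", "and", "the", "man"],
]
profile = train_profile(training_docs, n_words=2)
print(badness(training_docs[0], profile), badness(["xx", "yy", "zz"], profile))
```

A document that lacks the frequent function words of the profile gets a higher score than a document that contains them, which is the intuition behind using the score as a filter or as an annotation.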



Duplication


Duplicate sentences in a corpus concordance

Query: made a donation

[Concordance figure not reproduced]

Near duplication and inclusion

• Many web pages are not perfect duplicates of others, but just very similar.
• One typical source: slightly edited texts from news agencies in several newspapers.
• Also, some web pages fully contain other web pages (inclusion).
• The inclusion situation is made worse by typical blog and content management systems.



Failure of perfect duplicate hashing/fingerprinting

D_1: Yesterday, I wrote a letter to my parents.
D_2: Yesterday I wrote a letter to my parents.

• Hashes (e.g., SHA1: very strong, maybe too strong for the given task):
  • 12d952 23 3869faead1e1aa6e02b98256e35f8
  • 5ad8468f b45fe07ef420fd87 d360087514d1f1
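The effect can be reproduced with Python's standard `hashlib` module. This is a minimal sketch; the exact digests depend on the byte encoding and on details such as trailing newlines, so no particular hash values are claimed here.

```python
import hashlib

d1 = "Yesterday, I wrote a letter to my parents."
d2 = "Yesterday I wrote a letter to my parents."

# SHA1 digests of the UTF-8 encoded sentences.
h1 = hashlib.sha1(d1.encode("utf-8")).hexdigest()
h2 = hashlib.sha1(d2.encode("utf-8")).hexdigest()

print(h1)
print(h2)
print(h1 == h2)  # a single missing comma yields a completely different hash
```

Because a cryptographic hash only detects byte-identical documents, it cannot express that the two sentences are almost the same.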



Solution, in principle: Jaccard coefficients

• One can measure the similarity of two documents as the Jaccard coefficient of the sets of their n-grams [Manning et al., 2009, 61, 438].
• Procedure:
  • Create the documents' word/token n-grams.
  • Calculate the Jaccard coefficient of the two sets.
  • If it is above a certain threshold, delete the shorter of the two documents.
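The procedure can be sketched in a few lines for the two example sentences D_1 and D_2. The simple regular-expression tokenizer, the function names, and the threshold of 0.5 are assumptions for this illustration.

```python
import re


def token_bigrams(text):
    """Lower-case the text, split it into word and punctuation tokens,
    and return the set of adjacent token pairs."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return set(zip(tokens, tokens[1:]))


def jaccard(a, b):
    """Jaccard coefficient of two sets: |a ∩ b| / |a ∪ b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


d1 = "Yesterday, I wrote a letter to my parents."
d2 = "Yesterday I wrote a letter to my parents."

similarity = jaccard(token_bigrams(d1), token_bigrams(d2))
print(similarity)

THRESHOLD = 0.5  # hypothetical threshold, chosen only for this example
if similarity > THRESHOLD:
    kept = max(d1, d2, key=len)  # delete the shorter document, keep the longer
    print(kept)
```

The two sentences share 7 of 10 distinct bigrams, so the coefficient is 0.7 and, with this threshold, the shorter sentence would be discarded.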



Example: High similarity with JC over token bigrams (I)

D_1: Yesterday, I wrote a letter to my parents.
FP(D_1) = {(yesterday; ,), (,; i), (i; wrote), (wrote; a), (a; letter), (letter; to), (to; my), (my; parents), (parents; .)}

D_2: Yesterday I wrote a letter to my parents.
FP(D_2) = {(yesterday; i), (i; wrote), (wrote; a), (a; letter), (letter; to), (to; my), (my; parents), (parents; .)}



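Example: High similarity with JC over token bigrams (II)

FP(D_1) = {(yesterday; ,), (,; i), (i; wrote), (wrote; a), (a; letter), (letter; to), (to; my), (my; parents), (parents; .)}
FP(D_2) = {(yesterday; i), (i; wrote), (wrote; a), (a; letter), (letter; to), (to; my), (my; parents), (parents; .)}

$J(D_1, D_2) = \frac{|D_1 \cap D_2|}{|D_1 \cup D_2|} = \frac{7}{10} = 0.7$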



Example: Low similarity with JC over token bigrams

FP(D_2) = {(yesterday; i), (i; wrote), (wrote; a), (a; letter), (letter; to), (to; my), (my; parents), (parents; .)}

D_3: Yesterday I read a book to my niece.
FP(D_3) = {(yesterday; i), (i; read), (read; a), (a; book), (book; to), (to; my), (my; niece), (niece; .)}

$J(D_2, D_3) = \frac{|D_2 \cap D_3|}{|D_2 \cup D_3|} = \frac{2}{14} \approx 0.14$



Can it be done?

• 10,000,000 documents
• $\frac{10{,}000{,}000^2}{2} = 5 \times 10^{13}$ comparisons
• at 1 µs per comparison: ≈ 579 days
• But 1 µs is faster than available processors can do it.

Solution:
• Do not calculate the JC for each pair of documents.
• Instead, estimate the JC [Broder et al., 1997].
• The technique is known as w-shingling.
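One common way to estimate a Jaccard coefficient from short per-document signatures is a MinHash-style scheme, sketched below. Whether this matches the implementation used for the corpora discussed here is an assumption; the function names, the use of salted SHA1 as a family of hash functions, and the parameters `w` and `num_hashes` are choices made only for this example.

```python
import hashlib


def shingles(text, w=2):
    """Set of w-token shingles of a whitespace-tokenized, lower-cased text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def minhash_signature(shingle_set, num_hashes=200):
    """For each of num_hashes salted hash functions, keep the smallest hash
    value over all shingles. Requires a non-empty shingle set."""
    if not shingle_set:
        raise ValueError("document has no shingles")
    signature = []
    for seed in range(num_hashes):
        salt = str(seed).encode("utf-8")
        signature.append(
            min(hashlib.sha1(salt + s.encode("utf-8")).hexdigest() for s in shingle_set)
        )
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which two documents agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


sig1 = minhash_signature(shingles("Yesterday, I wrote a letter to my parents."))
sig2 = minhash_signature(shingles("Yesterday I wrote a letter to my parents."))
print(estimate_jaccard(sig1, sig2))
```

Each document is reduced once to a fixed-length signature, and comparing two signatures is much cheaper than intersecting full n-gram sets. The estimate becomes more reliable as `num_hashes` grows.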



Near-duplicate detection with w-shingling

w-shingling does not tell us how much duplication is acceptable in a corpus.
• Setting a particular JC as a threshold for keeping/discarding a document is ultimately an arbitrary choice
  • Design decision by corpus builders.
• Web corpora will probably always contain a certain amount of duplication
  • many other corpora do, too
• Strategies to cope with duplication:
  • If compatible with the research question: work with sentence-wise uniq'ed corpora.
  • Or work with unique (not too short) concordance lines.
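Sentence-wise deduplication can be sketched as follows. The normalization applied before comparison (collapsing whitespace and lower-casing) is an assumption for this example, as is the function name.

```python
def unique_sentences(sentences):
    """Keep the first occurrence of each sentence, preserving the original order."""
    seen = set()
    result = []
    for sentence in sentences:
        # Comparison key: collapse whitespace and ignore case (an assumption).
        key = " ".join(sentence.split()).lower()
        if key not in seen:
            seen.add(key)
            result.append(sentence)
    return result


# Hypothetical concordance lines for the query "made a donation".
hits = ["I made a donation.", "Thank you!", "I made a  donation."]
print(unique_sentences(hits))
```

The same function can be applied to concordance lines; very short lines would need to be excluded first, since they are often legitimately repeated.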




Linguistic post-processing




Noise in web corpora: sources

1. properties of web documents, e.g.
   • quasi-spontaneous writing situations with (often) casual language, typos, non-standard spellings, improper use of whitespace and punctuation
   • text contributed by non-native speakers
   • a large lexicon with many out-of-vocabulary items
2. the structure of the WWW and its technologies, e.g.
   • duplicated text, boilerplate
3. shortcomings in post-processing, e.g.
   • HTML/code stripping, tokenization




Example: spelling variants

Correct form (frequency): übernimmt (297440)

Misspelled form          N    edit distance
ubernimmt                81   1
überninmmt               6    1
übernimmnt               1    1
überniemt                6    1
öbernimmt                6    1
übernimnmt               5    1
übernimmet               11   1
überniehmt               2    2
überniemt                6    1
übernihmt                17   1
überniiiiiiiimmmmmmmt    1    12
überniommt               4    1
übernimrnt               4    2
überniummt               2    1
überninnt                2    2
übernimmtt               2    1
...
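The edit distances in the table are consistent with the Levenshtein distance (insertions, deletions and substitutions of single characters), which is an assumption about the measure used. A minimal sketch of that computation:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


correct = "übernimmt"
for variant in ["ubernimmt", "überniehmt", "übernimrnt", "überninnt"]:
    print(variant, levenshtein(correct, variant))
```

For these four variants the computed distances are 1, 2, 2 and 2, matching the table.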




Example: text produced by non-native speakers

[Figure not reproduced]




Noise in linguistic post-processing

Most available natural language processing tools expect standard written language as input.
• Many tokenizers expect proper use of whitespace, casing and punctuation.
• Part-of-speech taggers expect properly tokenized text as input.
• POS taggers usually perform worse on unknown items (tokenization errors, misspellings, out-of-vocabulary items).
• Proper tokenization and POS tagging are normally the basis for higher-level linguistic annotation (parsing, sense disambiguation, etc.).




Should noise be "normalized away"?

What exactly is "noise" in the first place?
• No general definition of "noise" that suits all purposes
• What counts as noise in one task may be valuable data in another task




Can noise be "normalized away"?

In large web corpora, noise would have to be eliminated automatically, without human interaction.
• viable in some cases:
  • e.g. correcting run-together words: a lot.I don't think → a lot. I don't think
• but not an easy task in other cases:
  • e.g. spelling correction: I didn't spend much time with Bibble. → Bubble?
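The run-together case is simple enough to sketch with a regular expression. The pattern below is a deliberately naive assumption (sentence-final punctuation between a lower-case and an upper-case letter) and would mis-handle some inputs, for example domain names written with a capital letter after the dot.

```python
import re

# Sentence-final punctuation directly between a lower-case and an upper-case letter.
RUN_TOGETHER = re.compile(r"(?<=[a-zäöüß])([.!?])(?=[A-ZÄÖÜ])")


def split_run_together(text):
    """Insert a space after sentence-final punctuation that is glued to the next word."""
    return RUN_TOGETHER.sub(r"\1 ", text)


print(split_run_together("a lot.I don't think"))
```

Spelling correction has no comparably simple rule: deciding whether "Bibble" is a typo or a proper name requires knowledge that a pattern cannot supply.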




Non-destructive normalization

Our approach: do as much as possible non-destructively (i.e., do not alter the original data).
• Normalization-as-annotation
• But there are limits to this approach imposed by the indexing/query architecture.




Non-destructive orthographic normalization

Original text: spelling error, POS-tagging error, no lemma

Word          POS   Lemma
The           DT    the
FA            NP    FA
does          VBZ   do
abosolutley   JJ    <unknown>
nothing       NN    nothing
to            TO    to
help          VB    help
Clubs         NNS   club

Normalized text: standard spelling, correct POS and lemma, original spelling preserved as metadata

Word          POS   Lemma
The           DT    the
FA            NP    FA
does          VBZ   do
<norm from="abosolutley">
absolutely    ADV   absolutely
</norm>
nothing       NN    nothing
to            TO    to
help          VB    help
Clubs         NNS   club
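How such an annotation layer might be generated is sketched below. The function name `annotate`, the tab-separated output, and the `corrections` lookup table are hypothetical; the sketch only mirrors the layout of the normalized example and says nothing about how corrections are actually found.

```python
def annotate(tokens, corrections):
    """Render (word, pos, lemma) triples in vertical format, wrapping
    normalized tokens in a <norm from="..."> element that keeps the
    original spelling.

    corrections maps an original word to its (word, pos, lemma) replacement.
    Words containing quotes or angle brackets would need escaping, which
    this sketch does not do."""
    lines = []
    for word, pos, lemma in tokens:
        if word in corrections:
            norm_word, norm_pos, norm_lemma = corrections[word]
            lines.append(f'<norm from="{word}">')
            lines.append(f"{norm_word}\t{norm_pos}\t{norm_lemma}")
            lines.append("</norm>")
        else:
            lines.append(f"{word}\t{pos}\t{lemma}")
    return lines


tokens = [
    ("The", "DT", "the"),
    ("FA", "NP", "FA"),
    ("does", "VBZ", "do"),
    ("abosolutley", "JJ", "<unknown>"),
    ("nothing", "NN", "nothing"),
]
corrections = {"abosolutley": ("absolutely", "ADV", "absolutely")}
lines = annotate(tokens, corrections)
print("\n".join(lines))
```

Because the original spelling survives as an attribute, a user can still query for the misspelled form if the query architecture exposes that attribute.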




Quality of linguistic annotation in web corpora

Automatic annotation may be a concern, e.g.
• Do not take type counts in web corpora at face value.
  • There are many hapax legomena due to tokenization issues.
• POS tagging of web corpora does not quite reach accuracy levels in the upper 90% range
  • but many POS taggers do not reach such accuracy levels on traditional, out-of-domain texts either

But:
• Depending on the specific research question, some or all of these problems may be irrelevant.
• Anyway, in many scenarios, there is no alternative to working with web corpora.
• And we (and others) are working to improve it.



Evaluation


What's in a web corpus

Crawled corpus:
• No stratified sampling of documents
• Exact composition of the final corpus not known beforehand.
• Establish corpus composition after corpus construction.



Web document classification

• COWCat: classification scheme based on Sharoff [2006]
• Five dimensions: Authorship, Mode, Audience, Aim, Domain
• Annotation guidelines available from: http://hpsg.fu-berlin.de/cow/files/cowcat2013.pdf


UKCOW2012: Mode

Type           %      95% CI (±%)
Written        90.5   2.8
Spoken         0.7    0.8
Quasi-Spont.   6.6    2.4
Blogmix        2.2    1.4



UKCOW2012: Audience

Type           %      95% CI (±%)
General        87.4   3.2
Informed       7.3    2.5
Professional   5.3    2.2



UKCOW2012: Authorship

Type             %      95% CI (±%)
Single, female   7.8    2.6
Single, male     21.6   4.0
Multiple         13.8   3.3
Corporate        26.0   4.2
Unknown          30.8   4.5



UKCOW2012: Aim

Type             %      95% CI (±%)
Recommendation   10.2   2.9
Instruction      1.9    1.3
Information      10.7   3.0
Discussion       75.5   4.2
Fiction          1.7    1.2



Extrinsic evaluation: Biemann et al. [2013]

Task-oriented evaluation: collocation extraction
• Work by Stefan Evert
• Different corpora (web and traditional) compared against a gold standard

name              size (tokens)   corpus type        POS tagged   lemmatized   basic unit
BNC               0.1 G           reference corpus   ?            ?            text
WP500             0.2 G           Wikipedia          ?            ?            fragment
Wackypedia        1.0 G           Wikipedia          ?            ?            article
ukWaC             2.1 G           web corpus         ?            ?            web page
WebBase           3.3 G           web corpus         ?            ?            paragraph
UKCOW             4.0 G           web corpus         ?            ?            sentence
LCC               0.9 G           web corpus         ?            ?            sentence
LCC (f ≥ k)       0.9 G           web n-grams        ?            ?            n-gram
Web1T5 (f ≥ 40)   1000.0 G        web n-grams        ?            ?            n-gram



Verb-particle combinations: Biemann et al. [2013] (II)

• Gold standard: 3,078 English verb-particle combinations, manually classified as non-compositional (carry on, knock out) or compositional (bring together, peer out)
• Extracted co-occurrence counts for the word pairs (3-word span to the right of the verb)
• Clean, annotated web corpora seem to be a valid replacement for a traditional reference corpus such as the BNC.
• Diversity may be a relevant factor: web corpora of more than 1G words are needed to rival the 100M-word BNC.
• Web1T5: sheer size cannot make up for "messy" content and lack of annotation.



References I

M. Baroni, S. Bernardini, A. Ferraresi, and E. Zanchetta. The WaCky Wide Web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226, 2009.

P. S. Bayerl and K. I. Paul. What determines inter-coder agreement in manual annotations? A meta-analytic investigation. Computational Linguistics, 37(4):699–725, 2011.

C. Biemann, F. Bildhauer, S. Evert, D. Goldhahn, U. Quasthoff, R. Schäfer, J. Simon, L. Swiezinski, and T. Zesch. Scalable construction of high-quality web corpora. Special issue of JLCL, 2013.

A. Broder, R. Kumar, F. Maghoul, P. Raghavan, R. Stata, A. Tomkins, and J. L. Wiener. Graph structure in the web. In Proceedings of the 9th International World Wide Web conference on Computer Networks: The International Journal of Computer and Telecommunications Networking, pages 309–320. North-Holland Publishing Co, 2000.

A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the Web. Technical Note 1997-115, SRC, Palo Alto, July 25 1997.

J. Carletta. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249–254, 1996.

P. Cook and G. Hirst. Do web-corpora from top-level domains represent national varieties of English? In Proceedings of the 11th International Conference on the Statistical Analysis of Textual Data, pages 281–293, Liège, 2012.

G. Grefenstette. Comparing two language identification schemes. In Proceedings of the 3rd International conference on Statistical Analysis of Textual Data (JADT 1995), pages 263–268, Rome, 1995.

K. Krippendorff. Content Analysis: An Introduction to its Methodology. Sage Publications, Beverly Hills, 1980.



References II

C. Manning, P. Raghavan, and H. Schütze. An Introduction to Information Retrieval. CUP, Cambridge, 2009.

M. A. Serrano, A. Maguitman, M. Boguñá, S. Fortunato, and A. Vespignani. Decoding the structure of the WWW: A comparative analysis of Web crawls. ACM Trans. Web, 1(2), 2007.

S. Sharoff. Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini, editors, WaCky! Working papers on the Web as Corpus, pages 63–98. GEDIT, Bologna, 2006.

