The SemEval-2007 WePS Evaluation
Establishing a Benchmark for the Web People Search Task

Javier Artiles, Julio Gonzalo
UNED NLP & IR Group, Madrid, Spain (nlp.uned.es/~{javier, julio})

Satoshi Sekine
CS Department, New York University, USA (nlp.cs.nyu.edu/sekine)

Aarhus, 19 Sep 2008
The WePS Task
The Web People Search problem
The WePS 1 Task
Input: the first 100 search results for a person name.
Output: a clustering of those results according to the actual people they refer to.
John Smith 1 (Captain)
Captain John Smith - www.apva.org
John Smith Wikipedia - en.wikipedia.org/wiki…
…
John Smith 2 (Labour leader)
BBC: Labour leader John Smith – news.bbc.co.uk…
John Smith Wikipedia - en.wikipedia.org/wiki…
John Smith 3 (IBM researcher)
John Smith 4 (Film director)
John Smith 5 (Shoe company)
John Smith 6 (Writer)
…
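As a toy sketch of this input/output contract (the document indices and cluster labels follow the John Smith example above; the exact data format is illustrative only):

```python
# Input: the top (up to 100) search results for a person name.
# Each result is a (title, url) pair here.
results = [
    ("Captain John Smith", "www.apva.org"),
    ("BBC: Labour leader John Smith", "news.bbc.co.uk"),
    ("John Smith - Wikipedia", "en.wikipedia.org/wiki/John_Smith"),
]

# Output: a clustering of result indices, one cluster per real person.
# Note that a single page (e.g. a Wikipedia disambiguation page) may
# mention several people, and so may appear in more than one cluster.
clustering = {
    "John Smith 1 (Captain)": {0, 2},
    "John Smith 2 (Labour leader)": {1, 2},
}

# Every retrieved document should end up in at least one cluster.
assert {i for c in clustering.values() for i in c} == set(range(len(results)))
```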
The WePS Task
Person names are a frequent type of Web search: approximately 30% of queries to Web search engines include a person name.
But names can be very ambiguous: according to the U.S. Census Bureau, 90,000 distinct names are shared by 100 million people.
We can find:
– High ambiguity (e.g. 82 different people in 100 pages that mention “Martha Edwards”).
– Monopolized names (e.g. the top 100+ results for the search “Scarlett Johansson” all mention the famous actress).
A task with a clear application.
Why the WePS Task?
Also a relevant multilingual task!

Connections with traditional WSD, but also some exciting differences:
– Unknown number of “senses” (sense discrimination).
– Much higher average ambiguity…
– … but sharper boundaries between senses.
– A document might refer to different people with the same ambiguous name (a multiclass problem).

Receiving increasing attention from the IR/IE research community and from companies:
– ZoomInfo people search engine (www.zoominfo.com).
– Spock started a similar challenge just a few months ago (www.spock.com).
Data: training and test datasets
Training                                  Test
name source   av. entities  av. docs      name source   av. entities  av. docs
Wikipedia         23.14       99.00       Wikipedia         56.50       99.30
ECDL06            15.30       99.20       ACL06             31.00       98.40
WEB03 *            5.90       47.20       Census            50.30       99.10
total av.         10.76       71.02       total av.         45.93       98.93
Random selection of names from different sources (Wikipedia, US Census, CS conferences).
For each person name, we retrieved at most the top 100 documents (Yahoo! API).
Manual clustering of each set of documents.
* Gideon S. Mann, "Multidocument Statistical Fact Extraction and Fusion", Johns Hopkins University, 2006.
Different name sources should provide different ambiguity scenarios.
But we found high and unpredictable variability across test cases. This affected the balance between training and test sets, and added an (unintentional) challenge for systems.
Evaluation measures and Baselines
Purity: rewards clusters with less noise (fewer documents from other people).
Inverse Purity: rewards grouping all the elements of a category together.
F-measure (α = 0.5): harmonic mean of Purity and Inverse Purity.
Example baseline clusterings over 6 documents:
– Scattered (each document in its own cluster): P = 1.00, IP = 0.48, F0.5 = 0.65
– Joined (all documents in one cluster): P = 0.50, IP = 1.00, F0.5 = 0.67
– Combined (an all-in cluster plus one singleton cluster per document): P = 0.75, IP = 1.00, F0.5 = 0.86
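As a rough sketch (function names are mine; clusters and gold categories are represented as lists of sets of document ids, so overlapping clusterings are allowed), the measures can be computed as:

```python
def purity(clusters, gold):
    """Weighted average, over clusters, of the best overlap with a
    gold category: Purity = (1/n) * sum_i max_j |C_i ∩ L_j|."""
    n = sum(len(c) for c in clusters)  # clusters may overlap
    return sum(max(len(c & g) for g in gold) for c in clusters) / n

def inverse_purity(clusters, gold):
    # Same formula with the roles of clusters and categories swapped.
    return purity(gold, clusters)

def f_measure(clusters, gold, alpha=0.5):
    """F(alpha) = 1 / (alpha/P + (1-alpha)/IP); alpha = 0.5 gives
    the harmonic mean of Purity and Inverse Purity."""
    p, ip = purity(clusters, gold), inverse_purity(clusters, gold)
    return 1.0 / (alpha / p + (1.0 - alpha) / ip)
```

For instance, with a hypothetical gold standard of two people, `[{1, 2, 3}, {4, 5, 6}]`, the "joined" clustering `[{1, 2, 3, 4, 5, 6}]` gets P = 0.50, IP = 1.00, F0.5 ≈ 0.67.
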
Results
16 groups submitted results (the largest single task at SemEval).
Other issues
Current standard clustering evaluation measures can be cheated (see the combined baseline).
We therefore adapted the B-Cubed measure. See Enrique Amigó, Julio Gonzalo and Javier Artiles (2007), "Evaluation metrics for clustering tasks: a comparison based on formal constraints". http://nlp.uned.es/docs/amigo2007a.pdf
Inter-annotator agreement?
We produced a double annotation of the WePS test data.
Result: no significant swaps in the system ranking.
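For reference, a minimal sketch of the original (non-overlapping) B-Cubed precision and recall; the adapted version in the cited paper extends this to overlapping clusterings. Function and variable names here are mine:

```python
def bcubed_precision(cluster_of, person_of):
    """Average, over documents, of the fraction of same-cluster
    documents that really refer to the same person.
    cluster_of / person_of: dicts mapping document id -> label."""
    docs = list(cluster_of)
    total = 0.0
    for d in docs:
        same_cluster = [e for e in docs if cluster_of[e] == cluster_of[d]]
        correct = sum(1 for e in same_cluster if person_of[e] == person_of[d])
        total += correct / len(same_cluster)
    return total / len(docs)

def bcubed_recall(cluster_of, person_of):
    # Recall is precision with system clusters and gold persons swapped.
    return bcubed_precision(person_of, cluster_of)
```

Unlike Purity, B-Cubed scores each document individually, which is what blocks the combined-baseline cheat: putting every document in both a singleton and an all-in cluster no longer gets rewarded.
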
WePS 2
Clustering task (group documents by person) + Information Extraction task (extract person attributes)
Workshop in April 2009 (together with WWW 2009), in Madrid.
More info: http://nlp.uned.es/weps