807 - text analytics massimo poesio lecture 7: wikipedia for text analytics


807 - TEXT ANALYTICS

Massimo Poesio

Lecture 7 Wikipedia for Text Analytics

WIKIPEDIA

• Wikipedia is a free, multilingual encyclopedia project supported by the non-profit Wikimedia Foundation
• Wikipedia's articles have been written collaboratively by volunteers around the world
• Almost all of its articles can be edited by anyone who can access the Wikipedia website

"The free encyclopedia that anyone can edit"

---- http://en.wikipedia.org/wiki/Wikipedia

WIKIPEDIA

• Wikipedia is
  1. domain independent - it has a large coverage
  2. up-to-date - to process current information
  3. multilingual - to process information in many languages

• Title
• Abstract
• Infoboxes
• Geo-coordinates
• Categories
• Images
• Links
  - Other languages
  - Other wiki pages
  - To the web
• Redirects
• Disambiguations

WIKIPEDIA FOR TEXT ANALYTICS

• Wikipedia has proven an extremely useful resource for text analytics, being used for
  - Text classification / clustering
  - Enriching documents through 'Wikification'
  - NER
  - Relation extraction
  - …

Wikipedia as Thesaurus for text classification / clustering

• Unlike other standard ontologies such as WordNet and MeSH, Wikipedia itself is not a structured thesaurus
• However, it is more…
  - Comprehensive: it contains 12 million articles (2.8 million in the English Wikipedia)
  - Accurate: a study by Giles (2005) found that Wikipedia can compete with Encyclopædia Britannica in accuracy
  - Up to date: current and emerging concepts are absorbed in a timely manner

Giles, J. 2005. Internet encyclopaedias go head to head. Nature 438, 900-901.

Wikipedia as Thesaurus

• Moreover, Wikipedia has a well-formed structure
  - Each article describes only a single concept
    (example: the Wikipedia article that describes the concept "Artificial intelligence")
  - The title of the article is a short and well-formed phrase, like a term in a traditional thesaurus
  - Equivalent concepts are grouped together by redirect links
    (example: "AI" is redirected to its equivalent concept "Artificial Intelligence")
  - It contains a hierarchical categorization system in which each article belongs to at least one category
    (example: the concept "Artificial Intelligence" belongs to four categories: Artificial intelligence, Cybernetics, Formal sciences, and Technology in society)
  - Polysemous concepts are disambiguated by Disambiguation Pages
    (example: the different meanings that "Artificial intelligence" may refer to are listed in its disambiguation page)

WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING

• Objective: use information in Wikipedia to improve the performance of text classifiers / clustering systems
• A number of possibilities
  - Use similarity between documents and Wikipedia pages on a given topic as a feature for text classification
  - Use WIKIFICATION to enrich documents
  - Use the Wikipedia category system as category repertoire

Using Wikipedia Categories for text classification

WIKIPEDIA FOR TEXT CLASSIFICATION

• Automatic identification of the topic/category of a text (e.g. computer science, psychology)
  - Books
  - Learning objects

Example: "The United States was involved in the Cold War"

Concept / category                       Weight
United States                            0.3793
Cold War                                 0.3111
Vietnam War                              0.0023
World War I                              0.0023
Communism                                0.0027
Ronald Reagan                            0.0027
Michail Gorbachev                        0.0023
Cat: Wars Involving the United States    0.00779
Cat: Global Conflicts                    0.00779

USING WIKIPEDIA FOR TEXT CLASSIFICATION

• Either directly use Wikipedia categories, or map one's categories to Wikipedia categories

• Use the documents associated with those categories as training documents
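The idea of using category documents as training data can be sketched as follows. This is only an illustrative sketch, not the method used in the lecture: the two categories, their toy "articles", and the nearest-centroid classifier over bag-of-words vectors are all assumptions made for the example.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lower-case, whitespace-tokenised term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical stand-ins for the articles filed under two Wikipedia categories.
category_docs = {
    "Cold War": [
        "the united states and the soviet union competed during the cold war",
        "communism and the arms race shaped the cold war",
    ],
    "Painting": [
        "giotto painted frescoes in padua",
        "the painter worked on frescoes and panel painting",
    ],
}

def train(category_docs):
    """Sum the term counts of each category's documents into one centroid."""
    centroids = {}
    for category, docs in category_docs.items():
        total = Counter()
        for doc in docs:
            total.update(bag_of_words(doc))
        centroids[category] = total
    return centroids

def classify(text, centroids):
    """Return the category whose centroid is most similar to the text."""
    vec = bag_of_words(text)
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))

centroids = train(category_docs)
predicted = classify("The United States was involved in the Cold War", centroids)
print(predicted)
```

In practice one would replace the toy documents with the articles actually associated with each category and use a stronger classifier; the point here is only that the category system supplies labelled training text for free.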

TEXT WIKIFICATION

Wikification = adding links to Wikipedia pages to documents

[Figure: Text -> WIKIFICATION -> Wikipedia. Example sentence: "Giotto was called to work in Padua, and also in Rimini"]

(May 2012, Truc-Vien T. Nguyen)

Wikification pipeline

[Figure: Wikification pipeline. Stages shown: Raw (hyper)text -> Decomposition -> Clean text -> Keyword Extraction (Candidate Extraction, Candidate Ranking) -> Text with selected keywords -> Word Sense Disambiguation (Extract Sense Definitions from Sense Inventory; Knowledge-based: Lesk-like Definition Overlap; Data-driven: Naive Bayes trained on Wikipedia; Voting) -> Annotated text -> Recomposition -> (Hyper)text with linked keywords]

Keyword Extraction

• Finding important words/phrases in raw text
• Two-stage process
  - Candidate extraction
    • Typical methods: n-grams, noun phrases
  - Candidate ranking
    • Rank the candidates by importance
    • Typical methods
      - Unsupervised: information theoretic
      - Supervised: machine learning using positional and linguistic features
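As a minimal sketch of the first stage, candidate extraction with n-grams can look like the following. The function name `ngram_candidates`, the regular-expression tokeniser and the choice of `max_n` are assumptions made for illustration only.

```python
import re

def ngram_candidates(text, max_n=3):
    """Return every word n-gram of length 1..max_n as a candidate phrase."""
    tokens = re.findall(r"\w+", text)
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            candidates.append(" ".join(tokens[i:i + n]))
    return candidates

cands = ngram_candidates("Giotto was called to work in Padua", max_n=2)
print(cands)
```

Every candidate produced this way still has to be scored in the ranking stage; most n-grams are not keywords.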

Keyword Extraction using Wikipedia

1. Candidate extraction
• Semi-controlled vocabulary
  - Wikipedia article titles and anchor texts (surface forms)
    • E.g. "USA", "US" = "United States of America"
  - More than 2,000,000 terms/phrases
  - Vocabulary is broad (e.g. "the", "a" are included)
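Restricting candidates to a semi-controlled vocabulary amounts to a dictionary lookup over the n-grams of the text. The sketch below assumes a tiny hand-written dictionary (`SURFACE_FORMS`) in place of the two million titles and anchor texts; the names and the lower-casing are illustrative choices, not part of the original system.

```python
import re

# Hypothetical miniature dictionary: surface form -> Wikipedia article title.
SURFACE_FORMS = {
    "usa": "United States of America",
    "us": "United States of America",
    "united states of america": "United States of America",
    "cold war": "Cold War",
}

def dictionary_candidates(text, surface_forms, max_n=4):
    """Keep only the n-grams that occur as a known surface form (longest first)."""
    tokens = re.findall(r"\w+", text.lower())
    found = []
    for n in range(max_n, 0, -1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in surface_forms:
                found.append((phrase, surface_forms[phrase]))
    return found

matches = dictionary_candidates("The USA was involved in the Cold War", SURFACE_FORMS)
print(matches)
```

Because the real vocabulary also contains very common words, a lookup like this over-generates, which is why a ranking step is still needed.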

Keyword Extraction using Wikipedia

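2. Candidate ranking
• tf.idf
  - Wikipedia articles as document collection
• Chi-squared independence of phrase and text
  - The degree to which it appeared more times than expected by chance
• Keyphraseness

$$P(\text{keyword} \mid W) \approx \frac{\text{count}(D_{key})}{\text{count}(D_W)}$$

where count(D_key) is the number of documents in which the term W is used as a keyword (link) and count(D_W) is the number of documents in which W appears.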
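To illustrate the first ranking option, here is a minimal tf.idf sketch. The four-document "collection" stands in for Wikipedia, and the exact weighting (raw term frequency times log(N/df)) is one common variant chosen for the example, not necessarily the one used in the systems discussed.

```python
import math
from collections import Counter

def tf_idf_rank(candidates, collection):
    """Rank candidate terms of one document by tf * log(N / df).

    candidates: list of candidate terms found in the document (with repeats)
    collection: list of sets, one set of terms per background document
    """
    tf = Counter(candidates)
    n_docs = len(collection)
    scores = {}
    for term, freq in tf.items():
        df = sum(1 for doc in collection if term in doc)
        idf = math.log(n_docs / max(df, 1))  # guard against unseen terms
        scores[term] = freq * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical background collection standing in for Wikipedia articles.
collection = [
    {"the", "giotto", "painter"},
    {"the", "cold", "war"},
    {"the", "georgia"},
    {"the", "padua"},
]
ranking = tf_idf_rank(["the", "the", "giotto", "padua"], collection)
print(ranking)
```

A term such as "the", which occurs in every background document, receives a score of zero however often it appears in the text.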

Our own Approach (cfr. Milne & Witten 2008, 2012; Ratinov et al. 2011)

• Use a Wikipedia dump to compute two statistics
  - KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  - COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
• Two versions of the system
  - UNSUPERVISED: use statistics only
  - SUPERVISED: use distant learning to create training data

KEYPHRASENESS

• the probability that a term t is a link to a Wikipedia article (cfr. Milne & Witten's prior link probability)
• Examples
  - The term "Georgia"
    • is found as a link in 22,631 Wikipedia articles
    • appears in 75,000 Wikipedia articles
    • keyphraseness = 22631 / 75000 = 0.3017466
  - Cfr. the term "the": keyphraseness = 0.0006

$$\text{Keyphraseness}(t) = \frac{\text{count}([\_ \mid t])}{\text{count}(t)}$$
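The Georgia example can be reproduced in a few lines. This is a minimal sketch: the function name and the zero-count guard are assumptions, while the two counts are the ones quoted on the slide.

```python
def keyphraseness(link_count, total_count):
    """Fraction of the articles containing a term in which the term is a link."""
    if total_count == 0:
        return 0.0
    return link_count / total_count

# Counts for the term "Georgia" as given above.
georgia = keyphraseness(22631, 75000)
print(round(georgia, 4))
```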

COMMONNESS

• the probability that a term t is a link to a SPECIFIC Wikipedia article a
• for example, the surface form "Georgia" was found to be linked to
  - a1 = University_of_Georgia 166 times
  - Republic_of_Georgia 18 times
  - Georgia_(United_States) 5 times

  commonness(t, a1) = 166 / (166 + 18 + 5) = 0.8783

$$\text{Commonness}(t, a) = \frac{\text{count}([a \mid t])}{\text{count}(t)}$$
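A minimal sketch of the commonness computation, using the three link counts quoted for "Georgia". Following the worked example, the denominator is taken to be the total number of times the surface form is used as a link; the function and variable names are assumptions.

```python
from collections import Counter

# Link-target counts for the surface form "Georgia", as given above.
link_targets = Counter({
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
})

def commonness(targets):
    """Share of a surface form's links that point to each target article."""
    total = sum(targets.values())
    return {article: count / total for article, count in targets.items()}

scores = commonness(link_targets)
print(round(scores["University_of_Georgia"], 4))
```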

Extracting dictionaries and statistics from a Wikipedia dump

• Parsing
  - In three phases
• Identify articles of relevance
• Extract (among other things)
  - Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
  - Set of LINKS [article|surface_form]
    • e.g. [[Pedanius Dioscorides|Dioscorides]]
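Extracting surface forms and links from wiki markup can be sketched with a regular expression over the `[[article|surface_form]]` syntax. This is a simplified illustration, not the parser used for the dump: it ignores namespaces, section anchors and nested markup, and the sample sentence and helper names are made up for the example.

```python
import re
from collections import Counter, defaultdict

# [[Target]] or [[Target|surface form]]; nested brackets are not handled.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")

def extract_links(wikitext):
    """Return (target_article, surface_form) pairs found in the markup."""
    links = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        surface = (match.group(2) or target).strip()
        links.append((target, surface))
    return links

def link_statistics(pages):
    """Count, for every surface form, how often it links to each target."""
    stats = defaultdict(Counter)
    for text in pages:
        for target, surface in extract_links(text):
            stats[surface][target] += 1
    return stats

# Hypothetical sample sentence in wiki markup.
sample = "The herbal of [[Pedanius Dioscorides|Dioscorides]] was copied in [[Padua]]."
links = extract_links(sample)
stats = link_statistics([sample, sample])
print(links)
```

The per-surface-form counters produced by `link_statistics` are exactly the kind of table from which keyphraseness and commonness are computed.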

The Wikipedia Dump from July 2011

  - 11,459,639 pages in total
  - 12,525,583 links
    • specifying surface word, target, frequency
    • ranked by frequency
• for example, the mention "Georgia" is linked to
  - University_of_Georgia 166 times
  - Republic_of_Georgia 18 times
  - Georgia_(United_States) 5 times

Some statistics (all Wikidumps from July 2011)

Page Type        English       Italian      Polish
Redirected       4,465,652     323,591      134,148
List_of          138,581       836          5,021
Disambiguation   176,721       6,193        4,553
Relevant         4,361,020     917,354      920,486
Total            11,459,639    1,654,258    1,200,313

Surface forms, titles, articles

Dictionary       English       Italian      Polish
Titles           4,361,020     917,354      920,486
Surface forms    8,829,624     2,484,045    2,482,104
Files            745,724       72,126       n/a
Links            10,871,741    2,917,235    2,937,981

Files in Polish are arranged in a repository different from English/Italian.

Some definitions and figures
• surface form: the occurrence of a mention inside an article
• target article: the target Wikipedia article that a surface form is linked to

The Unsupervised Approach

• Use keyphraseness to identify candidate terms
  - Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  - Retain top 10
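The two steps can be sketched as a filter followed by a ranking. This is a sketch under stated assumptions: "top 10" is read here as the ten highest-ranked candidate articles per term, the entry for "the" (including its target name) is invented purely to show the filter, and the function and parameter names are hypothetical.

```python
KEYPHRASENESS_THRESHOLD = 0.01  # threshold quoted above
TOP_K = 10                      # "retain top 10", read as per-term candidates

def wikify_unsupervised(candidates, keyphraseness, link_counts,
                        threshold=KEYPHRASENESS_THRESHOLD, top_k=TOP_K):
    """Filter terms by keyphraseness, then rank their targets by commonness."""
    result = {}
    for term in candidates:
        if keyphraseness.get(term, 0.0) <= threshold:
            continue  # too rarely used as a link: not a candidate
        targets = link_counts.get(term, {})
        total = sum(targets.values())
        if total == 0:
            continue
        ranked = sorted(
            ((article, count / total) for article, count in targets.items()),
            key=lambda pair: pair[1],
            reverse=True,
        )
        result[term] = ranked[:top_k]
    return result

keyphraseness = {"Georgia": 22631 / 75000, "the": 0.0006}
link_counts = {
    "Georgia": {
        "University_of_Georgia": 166,
        "Republic_of_Georgia": 18,
        "Georgia_(United_States)": 5,
    },
    "the": {"Hypothetical_target": 3},  # made-up entry for illustration
}
output = wikify_unsupervised(["Georgia", "the"], keyphraseness, link_counts)
print(output)
```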

The Supervised Approach

• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cfr. Milne & Witten's NORMALIZED LINK DISTANCE)

$$\text{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$

where A_1 and A_2 are the sets of articles that link to a_1 and a_2, and W is the set of all Wikipedia articles.
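The formula translates directly into code. In this sketch the link sets are toy sets of integers and the collection size is invented; the function name and the handling of the degenerate cases (no shared links, or a link set as large as the whole collection) are assumptions. Note that, as written, the quantity behaves like a distance: it is 0 for identical link sets and grows as the overlap shrinks.

```python
import math

def normalized_link_distance(links_a, links_b, total_articles):
    """Link-based distance between two articles, given their incoming-link sets."""
    common = links_a & links_b
    if not common:
        return math.inf  # no shared links: treated as maximally distant
    numerator = math.log(max(len(links_a), len(links_b))) - math.log(len(common))
    denominator = math.log(total_articles) - math.log(min(len(links_a), len(links_b)))
    if denominator == 0:
        return 0.0 if numerator == 0 else math.inf
    return numerator / denominator

# Toy incoming-link sets (article ids) and a hypothetical collection size.
a1_links = {1, 2, 3, 4}
a2_links = {3, 4, 5}
distance = normalized_link_distance(a1_links, a2_links, total_articles=100)
print(distance)
```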

Training a supervised wikifier

• Using WIKIPEDIA ITSELF as source of training materials (see next)

Results on standard datasets

APPROACH               AQUAINT    WIKIPEDIA
Our approach           85.66      84.37
Milne & Witten 2008    83.61      80.31
Ratinov et al. 2011    84.52      90.20

Wikifying queries: the Bridgeman datasets

• BAL data sets
  - 1049 Query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  - 100 Query set
    • 3 annotators, each up to 3 manual annotations

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS    RESULTS
First candidate         64.77
Among first 2           71.59
First 3                 75.42
First 4                 77.18
First 5                 78.32

Accuracy up by 17 points (36%)

Results for the GALATEAS languages and Arabic

LANGUAGE    WIKIPEDIA SIZE    RESULTS (on Wikipedia subset)
English     4M articles       84.37
Italian     1M                79.64
French      1.4M              76-77
German      1.6M              72-73
Dutch       1.6M              70-71
Polish      900K              60.81
Arabic      200K              80.78

The GALATEAS D2W web services

• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves throughput of 600 characters per second
• Integrated with the LangLog tool

Use of the service in LangLog

(See Domoina's demo)

Other applications

• The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …

WIKIPEDIA FOR NER

http://en.wikipedia.org/wiki/FCC

The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

WIKIPEDIA FOR NER

[Figure: the text "Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action", linked to WIKIPEDIA]

WIKIPEDIA FOR NER

• Wikipedia has been used in NER systems
  - As a source of features for normal NER
  - To automatically create training materials (DISTANT LEARNING)
  - To go beyond NE tagging towards proper ENTITY DISAMBIGUATION

Distant learning

• Automatically extract examples
  - Positive examples: a mention paired with the Wikipedia page it links to
  - Negative examples: the same or similar mentions paired with other candidate pages
• Use positive and negative examples to train a model

The Supervised Approach Using Wikipedia links to generate training data

• Example
  - "Giotto was called to work in Padua, and also in Rimini" (sentence taken from Wikipedia text, with links available)
  - Candidates: Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  - +1 Giotto was called to work -- Giotto_di_Bondone
  - -1 Giotto was called to work -- Giotto_Griffiths
  - -1 Giotto was called to work -- Giotto_Bizzarrini
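Generating the dataset above from a linked sentence can be sketched as follows. The function name and the tuple layout (label, context, candidate) are assumptions made for illustration; the labels and candidates are those of the Giotto example.

```python
def make_training_examples(context, gold_article, candidate_articles):
    """Label the linked article +1 and every other candidate -1."""
    return [
        (1 if candidate == gold_article else -1, context, candidate)
        for candidate in candidate_articles
    ]

examples = make_training_examples(
    context="Giotto was called to work",
    gold_article="Giotto_di_Bondone",
    candidate_articles=["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
)
for label, context, candidate in examples:
    print(f"{label:+d} {context} -- {candidate}")
```

Because the gold article comes from a link that Wikipedia editors already placed, no manual annotation is needed to produce such examples.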


http://en.wikipedia.org/wiki/Giotto_di_Bondone

MORE ADVANCED USES OF WIKIPEDIA

• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

• Taxonomic information: category structure
• Attributes: infobox, text

[Figure: Wikipedia category network]

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Start with the category tree
• Induce a subsumption hierarchy

INFOBOXES

• Collaborative content
• Semi-structured data

{{Infobox Writer
| bgcolour    = silver
| name        = Edgar Allan Poe
| image       = Edgar_Allan_Poe_2.jpg
| caption     = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date  = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date  = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation  = Poet, short story writer, editor, literary critic
| movement    = [[Romanticism]], [[Dark romanticism]]
| genre       = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse      = [[Virginia Eliza Clemm Poe]]
}}

DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web

DBPEDIA

The DBpedia Dataset

• 1,600,000 concepts, including
  - 58,000 persons
  - 70,000 places
  - 35,000 music albums
  - 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. The "Developers Guide to Semantic Web Toolkits" lists development toolkits, in a range of programming languages, that can be used to process DBpedia data.

REPRESENTING EXTRACTED INFORMATION

Extracting Infobox Data (RDF Representation)

http://en.wikipedia.org/wiki/Calgary  ->  http://dbpedia.org/resource/Calgary

  dbpedia:native_name        "Calgary"
  dbpedia:altitude           "1048"
  dbpedia:population_city    "988193"
  dbpedia:population_metro   "1079310"
  mayor_name                 dbpedia:Dave_Bronconnier
  governing_body             dbpedia:Calgary_City_Council

SPARQL

• SPARQL is a query language for RDF
  - RDF is a directed, labeled graph data format for representing information in the Web
  - The SPARQL specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

The DBpedia SPARQL Endpoint

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  - Give me all sitcoms that are set in NYC
  - All tennis players from Moscow
  - All films by Quentin Tarantino
  - All German musicians that were born in Berlin in the 19th century
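As an illustration, the "all films by Quentin Tarantino" question might be phrased as the SPARQL query below and packaged into a request URL for the endpoint. This is a hedged sketch: the `dbo:director` property and the `dbo:`/`dbr:` prefixes are assumptions about the current DBpedia vocabulary (they do not appear in the slides), and the code only builds the URL without contacting the server.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:director linking a film to its director.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
"""

def build_request_url(endpoint, query):
    """Encode the query as the 'query' parameter of a SPARQL protocol GET request."""
    return endpoint + "?" + urlencode({"query": query.strip()})

url = build_request_url(ENDPOINT, QUERY)
print(url)
```

Fetching the resulting URL with any HTTP client would return the matching resources, provided the assumed property names are valid for the dataset version in use.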

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  - Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  - Open Mind Commonsense (Singh) (collecting facts)
  - Semantic Wikis

www.phrasedetectives.com

WEB COLLABORATION PROJECTS

• Open Mind Common Sense - Singh
• Crater mapping (results) - Kanefsky
• Learner, Learner2, 1001 Paraphrases - Chklovski
• FACTory - CyCORP
• Hot or Not - 8 Days
• ESP, Phetch, Verbosity, Peekaboom - von Ahn
• Galaxy Zoo - Oxford University

OPEN MIND COMMONSENSE

CONCEPT NET

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52,000 assertions)
  IsA: (IsA "apple" "fruit")
  PartOf: (PartOf "CPU" "computer")
  PropertyOf: (PropertyOf "coffee" "wet")
  MadeOf: (MadeOf "bread" "flour")
  DefinedAs: (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
  PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf: (SubeventOf "play sport" "score goal")
  FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
  CapableOf: (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
  LocationOf: (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
  EffectOf: (EffectOf "view video" "entertainment")
  DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf: (DesireOf "person" "not be depressed")
  MotivationOf: (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
  IsUsedFor: (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine: (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  - ESP
  - Verbosity
  - TagATune
• Other games
  - Peekaboom
  - Phetch

ESP

• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  - To train image search engines
  - To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web

ESP the game

ESP THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  - Hence the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
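The matching rule in the last bullet can be sketched as a set intersection over the two players' guesses. This is only an illustration of the idea: the function name, the case and whitespace normalisation, and the sample guesses are assumptions, and no actual point values are modelled.

```python
def agreed_labels(guesses_a, guesses_b):
    """Return the guesses of player A that player B also typed (case-insensitive)."""
    normalized_b = {guess.strip().lower() for guess in guesses_b}
    return [
        guess
        for guess in (g.strip().lower() for g in guesses_a)
        if guess in normalized_b
    ]

# Hypothetical guesses from two players looking at the same image.
player_a = ["Dog", "grass", "ball"]
player_b = ["puppy", "ball", "Grass"]
labels = agreed_labels(player_a, player_b)
print(labels)
```

Any string returned here is a label that two independent players agreed on, which is what makes it a plausible description of the image.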

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th, 2003 and December 10th, 2003
  - 13,630 players
  - 1.2 million labels for 293,760 images
  - 80% of players played more than once
• By 2008
  - 200,000 players
  - 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  - choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  - 15 participants, 20 images among the 1,000 with more than 5 labels
  - 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  - 15 participants, 20 images
  - 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

• 2 tasks
  - Find The Culprit (Annotation): the user must identify the closest antecedent of a markable if it is anaphoric
  - Detectives Conference (Validation): the user must agree/disagree with a coreference relation entered by another user

PHRASE DETECTIVES THE TASKS

NAME THE CULPRIT

READINGS

• Mihalcea, R. and Csomai, A. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012.

READINGS

• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Intelligent Interactive Systems.

Page 2: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

WIKIPEDIA

bullWikipedia is a free multilingual encyclopedia project supported by the non-profit Wikimedia FoundationbullWikipedias articles have been written collaboratively by volunteers around the worldbullAlmost all of its articles can be edited by anyone who can access the Wikipedia website

The free encyclopedia that anyone can edit

----httpenwikipediaorgwikiWikipeida

WIKIPEDIA

bull Wikipedia is

1 domain independentndash it has a large coverage

2 up-to-datendash to process current information

3 multilingualndash to process information in many languages

bullTitle

bullAbstract

bullInfoboxes

bullGeo-coordinates

bullCategories

bullImages

bullLinks

bullOther languages

bullOther wiki pages

bullTo the web

bullRedirects

bullDisambiguates

WIKIPEDIA FOR TEXT ANALYTICS

bull Wikipedia has proven an extremely useful resource for text analytics being used forndash Text classification clusteringndash Enriching documents through lsquoWikificationrsquondash NERndash Relation extraction ndash hellip

Wikipedia as Thesaurus for text classification clusteringbull Unlike other standard ontologies such as WordNet

and Mesh Wikipedia itself is not a structured thesaurus

bull However it is morehellipndash Comprehensive it contains 12 million articles (28

million in the English Wikipedia) ndash Accurate A study by Giles (2005) found Wikipedia can

compete with Encyclopaeligdia Britannica in accuracyndash Up to date Current and emerging concepts are

absorbed timely

Giles J 2005 Internet encyclopaedias go head to head Nature 438 900ndash901

Wikipedia as Thesaurus

bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed

phrase like a term in a traditional thesaurus

Wikipedia Article that describes the Concept Artificial intelligence

Wikipedia as Thesaurus

bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed

phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by

redirected links

AI is redirected to its equivalent concept Artificial Intelligence

Wikipedia as Thesaurus

bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed

phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by

redirected linksndash It contains a hierarchical categorization system

in which each article belongs to at least one category

The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society

Wikipedia as Thesaurus

bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed

phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by

redirected linksndash It contains a hierarchical categorization system in

which each article belongs to at least one category ndash Polysemous concepts are disambiguated by

Disambiguation Pages

The different meanings that Artificial intelligence may refer to are listed in its disambiguation page

WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING

bull Objective use information in Wikipedia to improve performance of text classifiers clustering systems

bull A number of possibilitiesndash Use similarity between documents and Wikipedia

pages on a given topic as a feature for text classification

ndash Use WIKIFICATION to enrich documentsndash Use Wikipedia category system as category repertoire

Using Wikipedia Categories for text classification

17

WIKIPEDIA FOR TEXT CLASSIFICATIONbull Automatic identification of the topiccategory of a text (eg computer science

psychology)ndash Booksndash Learning objects

ldquoThe United States was involved in the Cold Warrdquo

United States03793

Cold War03111

Vietnam War00023

World War I00023

Communism00027

Ronald Reagan00027

Michail Gorbachev00023

Cat Wars Involvingthe United States000779

Cat Global Conflicts000779

USING WIKIPEDIA FOR TEXT CLASSIFICATION

bull Either directly use Wikipedia categories or map onersquos categories to Wikipedia categories

bull Use the documents associated with those categories as training documents

TEXT WIKIFICATION

Wikification = adding links to Wikipedia pages to documents

bull Text

WIKIFICATION

bull Wikipedia

20May 2012 Truc-Vien T Nguyen

Giotto was called to work in Padua and also in Rimini

Wikification pipeline

Candidate

Extraction

Candidate

Ranking

Extract Sense Definitions

from Sense Inventory

Knowledge- based

Lesk- like Definition

Overlap

Data Driven

Naive Bayes

trained on Wikipedia

Voting

Tex

t w

ith

sel

ecte

d k

eyw

ord

s

Dec

om

po

siti

on

Raw

(h

yper

)tex

t

Cle

an T

ext

Rec

om

posi

tion

(Hyp

er)t

ext

wit

h

linked

key

wo

rds

Annotated Text

Word Sense DisambiguationKeyword Extraction

Keyword Extraction

bull Finding important wordsphrases in raw textbull Two-stage process

ndash Candidate extractionbull Typical methods n-grams noun phrases

ndash Candidate rankingbull Rank the candidates by importancebull Typical methods

ndash Unsupervised information theoretic ndash Supervised machine learning using positional and linguistic

features

Keyword Extraction using Wikipedia

1 Candidate extractionbull Semi-controlled vocabulary

ndash Wikipedia article titles and anchor texts (surface forms)

bull Eg ldquoUSArdquo ldquoUSrdquo = ldquoUnited States of Americardquo

ndash More than 2000000 termsphrasesndash Vocabulary is broad (eg the a are included)

Keyword Extraction using Wikipedia

2 Candidate rankingbull tf idf

ndash Wikipedia articles as document collection

bull Chi-squared independence of phrase and textndash The degree to which it appeared more times than

expected by chance

bull Keyphraseness

)(

)()|(

W

key

Dcount

DcountWkeywordP

Our own Approach(Cfr Milne amp Witten 2008 2012 Ratinov et al 2011)

bull Use Wikipedia dump to compute two statistics

bull KEYPHRASENESS prior probability that a term is used to refer to a Wikipedia article

bull COMMONNESS probability that phrase is used to refer to specific Wikipedia article

bull Two versions of system

bull UNSUPERVISED use statistics only

bull SUPERVISED use distant learning to create training data

KEYPHRASENESS

bull the probability that a term t is a link to a Wikipedia article

(cfr Milne amp Wittenrsquos prior link probability)

bull Examplesbull The term Georgia

ndash Is found as a link in 22631 Wikipedia articlesndash appears in 75000 Wikipedia articles keyphraseness = 2263175000 = 03017466

bull Cfr the term ldquotherdquo keyphraseness = 00006

euro

Keyphraseness(t) =count([_ | t])

count(t)

COMMONNESS

bull the probability that a term t is a link to a SPECIFIC Wikipedia article a

bull for example the surface form Georgia was found to be linked to

ndash a1 = University_of_Georgia 166 times

commonness(t a1) = 166(166+18+5) = 08783

ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times

euro

Commonness(ta) =count([a | t])

count(t)

Extracting dictionaries and statistics from a Wikipedia dump

bull Parsingbull In three phases

bull Identify articles of relevancebull Extract (among other things)

bull Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)

bull Set of LINKS [article|surface_form]

bull [[Pedanius Dioscorides|Dioscorides]]

The Wikipedia Dump from July 2011

ndash 11459639 pages in totalndash 12525583 links

bull specifying surface word target frequency

ndash ranked by frequency bull for example the mention Georgia is linked to

ndash University_of_Georgia 166 times ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times

May 2012 29Truc-Vien T Nguyen

Some statistics (all Wikidumps from July 2011)

Page Type English Italian Polish

Redirected 4465652 323591 134148

List_of 138581 836 5021

Disambiguation 176721 6193 4553

Relevant 4361020 917354 920486

Total 11459639 1654258 1200313

Surface forms titles articles

Dictionary      English      Italian     Polish
Titles          4,361,020    917,354     920,486
Surface forms   8,829,624    2,484,045   2,482,104
Files           745,724      72,126      n/a
Links           10,871,741   2,917,235   2,937,981

Files in Polish are arranged in a repository different from English/Italian.

Some definitions and figures
• surface form: the occurrence of a mention inside an article
• target article: the target Wikipedia article that a surface form is linked to

The Unsupervised Approach

• Use keyphraseness to identify candidate terms
• Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
• Retain top 10
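The unsupervised steps above can be sketched as a small function. This is an illustration under stated assumptions, not the system described in the lecture: the function name and data layout are hypothetical, and reading "top 10" as the ten most common target articles per term is an interpretation.

```python
def wikify_unsupervised(terms, keyphraseness, link_counts, threshold=0.01, top_k=10):
    """Filter terms by keyphraseness, then rank their target articles by commonness."""
    annotations = {}
    for term in terms:
        # Step 1: keep only terms that are linked often enough.
        if keyphraseness.get(term, 0.0) <= threshold:
            continue
        targets = link_counts.get(term, {})
        total = sum(targets.values())
        if total == 0:
            continue
        # Step 2: rank candidate articles by commonness and keep the best ones.
        ranked = sorted(targets.items(), key=lambda item: item[1], reverse=True)
        annotations[term] = [(article, count / total) for article, count in ranked[:top_k]]
    return annotations


# Toy statistics; the "Georgia" and "the" figures follow the slides.
KEYPHRASENESS = {"Georgia": 0.3017, "the": 0.0006}
LINK_COUNTS = {
    "Georgia": {
        "Republic_of_Georgia": 18,
        "University_of_Georgia": 166,
        "Georgia_(United_States)": 5,
    }
}

result = wikify_unsupervised(["the", "Georgia"], KEYPHRASENESS, LINK_COUNTS)
print(result)
```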

The Supervised Approach

• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page

• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)

$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$

where, following Milne & Witten, A_1 and A_2 are the sets of articles that link to a_1 and a_2, and W is the set of all Wikipedia articles.
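A minimal sketch of the formula above, assuming the in-links of each article are available as Python sets. The function name and the handling of the empty-overlap case are assumptions; note that, being a distance, smaller values mean more closely related articles.

```python
import math


def link_distance(links_a1, links_a2, num_articles):
    """Normalised link distance between two articles, given their in-link sets.

    Assumes both sets are non-empty and smaller than `num_articles`.
    """
    overlap = len(links_a1 & links_a2)
    if overlap == 0:
        return math.inf  # log(0) is undefined: the articles share no in-links
    larger = max(len(links_a1), len(links_a2))
    smaller = min(len(links_a1), len(links_a2))
    return (math.log(larger) - math.log(overlap)) / (
        math.log(num_articles) - math.log(smaller)
    )


# Toy in-link sets identified by article ids (illustrative values only).
A1 = {1, 2, 3, 4}
A2 = {3, 4, 5}
distance = link_distance(A1, A2, num_articles=100)
print(distance)
```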

Training a supervised wikifier

• Use WIKIPEDIA ITSELF as the source of training material (see next)

Results on standard datasets

APPROACH              AQUAINT   WIKIPEDIA
Our approach          85.66     84.37
Milne & Witten 2008   83.61     80.31
Ratinov et al. 2011   84.52     90.20

Wikifying queries: the Bridgeman datasets

• BAL data sets
  – 1049-query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  – 100-query set
    • 3 annotators, each up to 3 manual annotations

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS   RESULTS (%)
First candidate        64.77
Among first 2          71.59
First 3                75.42
First 4                77.18
First 5                78.32

Accuracy up by 17 points (36%)

Results for the GALATEAS languages and Arabic

LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78

The GALATEAS D2W web services

• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves throughput of 600 characters per second
• Integrated with the LangLog tool

Use of the service in LangLog

(See Domoina's demo)

Other applications

• The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …

WIKIPEDIA FOR NER

http://en.wikipedia.org/wiki/FCC

The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

WIKIPEDIA FOR NER

Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action

WIKIPEDIA

WIKIPEDIA FOR NER

• Wikipedia has been used in NER systems
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging towards proper ENTITY DISAMBIGUATION

Distant learning

• Automatically extract examples
  – Positive examples: a mention paired with the Wikipedia page it links to
  – Negative examples: the same mention paired with the pages that similar mentions link to

• Use the positive and negative examples to train a model

The Supervised Approach Using Wikipedia links to generate training data

• Example
  – "Giotto was called to work in Padua, and also in Rimini" (sentence taken from Wikipedia text, with links available)
  – Candidates: Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)

• Dataset
  – +1 Giotto was called to work ... -- Giotto_di_Bondone
  – -1 Giotto was called to work ... -- Giotto_Griffiths
  – -1 Giotto was called to work ... -- Giotto_Bizzarrini


http://en.wikipedia.org/wiki/Giotto_di_Bondone
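The construction of the +1/-1 dataset from a linked sentence can be sketched as follows. The function name and tuple layout are assumptions made for illustration; the sentence and the three candidates are those of the Giotto example.

```python
def make_training_examples(context, linked_article, candidates):
    """Label the article actually linked in Wikipedia +1 and every other candidate -1."""
    return [
        (context, candidate, 1 if candidate == linked_article else -1)
        for candidate in candidates
    ]


CANDIDATES = ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"]

examples = make_training_examples(
    context="Giotto was called to work in Padua, and also in Rimini",
    linked_article="Giotto_di_Bondone",
    candidates=CANDIDATES,
)
for context, candidate, label in examples:
    print(f"{label:+d} {candidate}")
```

No manual annotation is needed: the link that a Wikipedia editor chose supplies the positive label, and the other articles reachable from the same surface form supply the negatives.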

MORE ADVANCED USES OF WIKIPEDIA

• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

• Taxonomic information: category structure
• Attributes: infobox, text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Induce a subsumption hierarchy

INFOBOXES

• Collaborative content

• Semi-structured data

{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
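To show why infoboxes count as semi-structured data, here is a minimal sketch that turns an infobox into attribute-value pairs. It assumes the usual wikitext layout with one "| key = value" field per line; the function name is hypothetical, and real infoboxes (multi-line values, nested templates) need a proper wikitext parser.

```python
def parse_infobox(wikitext):
    """Return the '| key = value' fields of an infobox as a dictionary."""
    fields = {}
    for line in wikitext.splitlines():
        line = line.strip()
        if not line.startswith("|") or "=" not in line:
            continue  # skips the opening '{{Infobox ...' and closing '}}' lines
        # Split on the first '=' only, so values such as 'mf=y' stay intact.
        key, _, value = line[1:].partition("=")
        fields[key.strip()] = value.strip()
    return fields


SAMPLE = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_date = {{birth date|1809|1|19|mf=y}}
| magnum_opus = The Raven
}}"""

poe = parse_infobox(SAMPLE)
print(poe["name"], "/", poe["magnum_opus"])
```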

DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web

DBPEDIA

The DBpedia Dataset

• 1,600,000 concepts, including
  – 58,000 persons
  – 70,000 places
  – 35,000 music albums
  – 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. In the Developers Guide to Semantic Web Toolkits you can find a development toolkit in your preferred programming language to process DBpedia data.

REPRESENTING EXTRACTED INFORMATION

Extracting Infobox Data (RDF Representation)

http://en.wikipedia.org/wiki/Calgary

http://dbpedia.org/resource/Calgary
  dbpedia:native_name "Calgary"
  dbpedia:altitude "1048"
  dbpedia:population_city "988193"
  dbpedia:population_metro "1079310"
  mayor_name dbpedia:Dave_Bronconnier
  governing_body dbpedia:Calgary_City_Council
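The RDF representation amounts to a set of (subject, predicate, object) triples. The sketch below holds the Calgary example as plain Python tuples and filters them by pattern; the prefixed names are written as ordinary strings for illustration, and the helper function is an assumption rather than part of any RDF library.

```python
# (subject, predicate, object) triples for the Calgary example
TRIPLES = [
    ("dbpedia:Calgary", "dbpedia:native_name", "Calgary"),
    ("dbpedia:Calgary", "dbpedia:altitude", "1048"),
    ("dbpedia:Calgary", "dbpedia:population_city", "988193"),
    ("dbpedia:Calgary", "dbpedia:population_metro", "1079310"),
    ("dbpedia:Calgary", "mayor_name", "dbpedia:Dave_Bronconnier"),
    ("dbpedia:Calgary", "governing_body", "dbpedia:Calgary_City_Council"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every component that is not None."""
    return [
        triple
        for triple in triples
        if (subject is None or triple[0] == subject)
        and (predicate is None or triple[1] == predicate)
        and (obj is None or triple[2] == obj)
    ]


altitude = match(TRIPLES, predicate="dbpedia:altitude")
print(altitude)
```

Literal values (such as the altitude) and links to other resources (such as the mayor) share the same triple shape, which is what lets the infobox data form a graph.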

SPARQL

• SPARQL is a query language for RDF
  – RDF is a directed, labeled graph data format for representing information in the Web
  – The W3C specification defines the syntax and semantics of the SPARQL query language for RDF

• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

The DBpedia SPARQL Endpoint

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century
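As an illustration of how one of these questions might be sent to the endpoint, the sketch below builds a request URL for "all films by Quentin Tarantino" using only the standard library. The query text is an assumption: it presumes that the `dbo:director` property and the `dbr:Quentin_Tarantino` resource are what DBpedia uses, and that the endpoint accepts `query` and `format` parameters. No request is actually sent.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:director and dbr:Quentin_Tarantino.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
"""


def build_request_url(query, endpoint=ENDPOINT):
    """URL-encode a SPARQL query as an HTTP GET request to the endpoint."""
    params = {
        "query": query.strip(),
        "format": "application/sparql-results+json",  # assumed result format
    }
    return endpoint + "?" + urlencode(params)


request_url = build_request_url(QUERY)
print(request_url)
```

The triple pattern in the WHERE clause has the same subject-predicate-object shape as the extracted infobox data, with `?film` as the unknown.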

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis

www.phrasedetectives.com

WEB COLLABORATION PROJECTS

• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – Cycorp
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University


OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh, 2004)

THINGS (52,000 assertions)
  – IsA: (IsA "apple" "fruit")
  – PartOf: (PartOf "CPU" "computer")
  – PropertyOf: (PropertyOf "coffee" "wet")
  – MadeOf: (MadeOf "bread" "flour")
  – DefinedAs: (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
  – PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  – SubeventOf: (SubeventOf "play sport" "score goal")
  – FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  – LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
  – CapableOf: (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
  – LocationOf: (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
  – EffectOf: (EffectOf "view video" "entertainment")
  – DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  – DesireOf: (DesireOf "person" "not be depressed")
  – MotivationOf: (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
  – UsedFor: (UsedFor "fireplace" "burn wood")
  – CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
  – SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  – ThematicKLine: (ThematicKLine "wedding dress" "veil")
  – ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
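Each assertion above is a relation name followed by two concepts, so a semantic network of this kind can be held as a list of triples and indexed by concept. The sketch below is only an illustration of that data structure; the index layout and function name are assumptions, not ConceptNet's actual storage format.

```python
from collections import defaultdict

# (relation, first concept, second concept), copied from the examples above
ASSERTIONS = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("MadeOf", "bread", "flour"),
    ("CapableOf", "dentist", "pull tooth"),
    ("UsedFor", "fireplace", "burn wood"),
]


def index_by_concept(assertions):
    """Map each first concept to the (relation, second concept) pairs leaving it."""
    index = defaultdict(list)
    for relation, first, second in assertions:
        index[first].append((relation, second))
    return dict(index)


network = index_by_concept(ASSERTIONS)
print(network["dentist"])
```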

CONCEPT NET

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions

• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune

• Other games
  – Peekaboom
  – Phetch

ESP

• The first GWAP, developed by von Ahn and their group (2003, 2004)

• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision

• The goal: label the majority of the images on the Web

ESP the game

ESP THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
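The matching rule can be sketched as a comparison of the two players' guesses. This is a simplified illustration: the function names are hypothetical, and the normalisation (lower-casing, collapsing whitespace) and the choice to return the first agreed guess in the first player's typing order are assumptions.

```python
def normalise(label):
    """Lower-case a guess and collapse its whitespace."""
    return " ".join(label.lower().split())


def agreed_label(guesses_a, guesses_b):
    """Return the first guess of player A that player B also typed, or None."""
    typed_by_b = {normalise(guess) for guess in guesses_b}
    for guess in guesses_a:
        if normalise(guess) in typed_by_b:
            return normalise(guess)  # agreement: this becomes a label for the image
    return None


label = agreed_label(["dog", "Grass", "ball"], ["puppy", "ball", "grass"])
print(label)
```

Since the partners cannot communicate, a string typed independently by both is likely to be a sensible description of the image, which is what makes the agreed label usable as training data.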

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th 2003 and December 10th 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once

• By 2008
  – 200,000 players
  – 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  – choose 10 labels among those produced, and look at which images are returned

• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants

• Manual assessment of labels ('would you use these labels to describe this image?')
  – 15 participants, 20 images
  – 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

• 2 tasks
  – Find The Culprit (Annotation): the user must identify the closest antecedent of a markable, if it is anaphoric
  – Detectives Conference (Validation): the user must agree/disagree with a coreference relation entered by another user


PHRASE DETECTIVES THE TASKS

NAME THE CULPRIT

READINGS

• Mihalcea, R. and Csomai, A. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.

• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.

• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.

• V. Nastase & M. Strube, 2012. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.

READINGS

• L. von Ahn and L. Dabbish, 2008. Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.

• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.


wwwphrasedetectivescom

PHRASE DETECTIVES THE TASKS

NAME THE CULPRIT

READINGS

bull Mihalcea R and Csomai A Wikify linking documents to encyclopedic knowledge Proceedings of CIKMrsquo07 Lisbon Portugal

bull V Nguyen amp M Poesio 2012 Entity disambiguation and linking over queries using Encyclopedic Knowledge Proceedings of 6th workshop on Analytics for Noisy Unstructured Text Data

bull D Lungley M Trevisan V Nguyen M Althobaiti M Poesio 2013 GALATEAS D2W A Multi-lingual Disambiguation to Wikipedia Web Service Proc Of ENRICH

bull V Nastaseamp M Strube Transforming Wikipedia into a large scale multilingual concept network Artificial Intelligence 2012

READINGS

bull L von Ahn and L Dabbish (2008) Designing games with a purpose Communications of the ACM v 51 n8 58-67

bull Poesio Chamberlain Kruschwitz Robaldo amp Ducceschi 2013 Phrase Detectives Utilizing Collective Intelligence for Internet-Scale Language Resource Creation ACM Transactions on Intelligent Interactive Systems


• Title
• Abstract
• Infoboxes
• Geo-coordinates
• Categories
• Images
• Links
• Other languages
• Other wiki pages
• To the web
• Redirects
• Disambiguates

WIKIPEDIA FOR TEXT ANALYTICS

• Wikipedia has proven an extremely useful resource for text analytics, being used for
– Text classification / clustering
– Enriching documents through ‘Wikification’
– NER
– Relation extraction
– …

Wikipedia as Thesaurus for text classification / clustering

• Unlike other standard ontologies, such as WordNet and MeSH, Wikipedia itself is not a structured thesaurus
• However, it is more…
– Comprehensive: it contains 12 million articles (2.8 million in the English Wikipedia)
– Accurate: a study by Giles (2005) found that Wikipedia can compete with Encyclopædia Britannica in accuracy
– Up to date: current and emerging concepts are absorbed in a timely manner

Giles, J. 2005. Internet encyclopaedias go head to head. Nature 438, 900–901

Wikipedia as Thesaurus

• Moreover, Wikipedia has a well-formed structure
– Each article only describes a single concept
– The title of the article is a short and well-formed phrase, like a term in a traditional thesaurus
  (example: the Wikipedia article that describes the concept Artificial intelligence)
– Equivalent concepts are grouped together by redirect links
  (example: AI is redirected to its equivalent concept, Artificial Intelligence)
– It contains a hierarchical categorization system, in which each article belongs to at least one category
  (example: the concept Artificial Intelligence belongs to four categories: Artificial intelligence, Cybernetics, Formal sciences & Technology in society)
– Polysemous concepts are disambiguated by Disambiguation Pages
  (example: the different meanings that Artificial intelligence may refer to are listed in its disambiguation page)
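The thesaurus-like structure described on these slides (redirects, categories, disambiguation pages) can be pictured as three lookup tables. The sketch below is only an illustration: the class `WikiThesaurus` and its method names are hypothetical, and the disambiguation entries are made up for the example, since the slide does not list the actual senses.

```python
from dataclasses import dataclass, field


@dataclass
class WikiThesaurus:
    """Toy in-memory view of the thesaurus-like structure of Wikipedia."""
    redirects: dict = field(default_factory=dict)       # alias -> canonical title
    categories: dict = field(default_factory=dict)      # canonical title -> categories
    disambiguation: dict = field(default_factory=dict)  # ambiguous term -> candidate titles

    def resolve(self, title):
        """Follow redirect links until a canonical article title is reached."""
        seen = set()
        while title in self.redirects and title not in seen:
            seen.add(title)
            title = self.redirects[title]
        return title

    def categories_of(self, title):
        """Categories of the article that `title` resolves to."""
        return self.categories.get(self.resolve(title), [])

    def senses(self, term):
        """Candidate articles from the term's disambiguation page, if it has one."""
        return self.disambiguation.get(term, [self.resolve(term)])


thesaurus = WikiThesaurus(
    redirects={"AI": "Artificial intelligence"},
    categories={
        "Artificial intelligence": [
            "Artificial intelligence", "Cybernetics",
            "Formal sciences", "Technology in society",
        ]
    },
    # Illustrative entries only; the slide does not list the actual senses.
    disambiguation={
        "Artificial intelligence": [
            "Artificial intelligence",
            "A.I. Artificial Intelligence",
        ]
    },
)

print(thesaurus.resolve("AI"))
print(thesaurus.categories_of("AI"))
```

Keeping redirects separate from categories mirrors the slide's point that equivalent terms are grouped first, and only the canonical concept carries category membership.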

WIKIPEDIA FOR TEXT CATEGORIZATION / CLUSTERING

• Objective: use information in Wikipedia to improve performance of text classifiers / clustering systems
• A number of possibilities
– Use similarity between documents and Wikipedia pages on a given topic as a feature for text classification
– Use WIKIFICATION to enrich documents
– Use Wikipedia category system as category repertoire

Using Wikipedia Categories for text classification

WIKIPEDIA FOR TEXT CLASSIFICATION

• Automatic identification of the topic/category of a text (e.g. computer science, psychology)
– Books
– Learning objects

Example text: “The United States was involved in the Cold War”

Concept / category                           Score
United States                                0.3793
Cold War                                     0.3111
Vietnam War                                  0.0023
World War I                                  0.0023
Communism                                    0.0027
Ronald Reagan                                0.0027
Mikhail Gorbachev                            0.0023
Cat: Wars Involving the United States        0.00779
Cat: Global Conflicts                        0.00779

USING WIKIPEDIA FOR TEXT CLASSIFICATION

• Either directly use Wikipedia categories, or map one’s categories to Wikipedia categories
• Use the documents associated with those categories as training documents
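As a rough sketch of the idea above, the texts filed under each category can be pooled into a word-count profile, and a new text assigned to the most similar profile. Everything here is an assumption made for illustration: the function names, the crude tokenizer, the cosine-similarity classifier, and the four short training texts, which stand in for real Wikipedia articles.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-cased word counts; a deliberately crude tokenizer."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def train(category_docs):
    """Sum the word counts of all training documents of each category."""
    profiles = {}
    for category, docs in category_docs.items():
        profile = Counter()
        for doc in docs:
            profile.update(bag_of_words(doc))
        profiles[category] = profile
    return profiles


def classify(text, profiles):
    """Return the category whose profile is most similar to the text."""
    vec = bag_of_words(text)
    return max(profiles, key=lambda c: cosine(vec, profiles[c]))


# Hypothetical stand-ins for article texts filed under two Wikipedia categories.
category_docs = {
    "Computer science": [
        "An algorithm is a procedure executed by a computer program",
        "A compiler translates program source code",
    ],
    "Psychology": [
        "Memory and emotion are studied in cognitive psychology",
        "Behaviour therapy treats anxiety and emotion disorders",
    ],
}
profiles = train(category_docs)
print(classify("The program runs a sorting algorithm", profiles))
```

A real system would weight terms (for example with tf.idf) and use far more training text per category; the point of the sketch is only that the category system supplies both the label set and the training documents.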

TEXT WIKIFICATION

Wikification = adding links to Wikipedia pages to documents

WIKIFICATION

• Text (example: “Giotto was called to work in Padua and also in Rimini”)
• Wikipedia

May 2012, Truc-Vien T. Nguyen

Wikification pipeline

[Figure: Raw (hyper)text → Decomposition → Clean text → Keyword Extraction (Candidate Extraction, Candidate Ranking) → Text with selected keywords → Word Sense Disambiguation (Extract Sense Definitions from Sense Inventory; Knowledge-based: Lesk-like Definition Overlap; Data-driven: Naive Bayes trained on Wikipedia; Voting) → Annotated text → Recomposition → (Hyper)text with linked keywords]

Keyword Extraction

• Finding important words/phrases in raw text
• Two-stage process
– Candidate extraction
  • Typical methods: n-grams, noun phrases
– Candidate ranking
  • Rank the candidates by importance
  • Typical methods
    – Unsupervised: information theoretic
    – Supervised: machine learning using positional and linguistic features

Keyword Extraction using Wikipedia

1. Candidate extraction
• Semi-controlled vocabulary
– Wikipedia article titles and anchor texts (surface forms)
  • E.g. “USA”, “US” = “United States of America”
– More than 2,000,000 terms/phrases
– Vocabulary is broad (e.g. “the” and “a” are included)

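Keyword Extraction using Wikipedia

2. Candidate ranking
• tf.idf
– Wikipedia articles as document collection
• Chi-squared independence of phrase and text
– The degree to which it appeared more times than expected by chance
• Keyphraseness

$$P(\text{keyword} \mid W) \approx \frac{\mathrm{count}(D_{key})}{\mathrm{count}(D_{W})}$$

where count(D_key) is the number of documents in which the term W was selected as a keyword (i.e. used as a link), and count(D_W) is the number of documents in which W appears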

Our own Approach
(Cfr. Milne & Witten 2008, 2012; Ratinov et al. 2011)

• Use Wikipedia dump to compute two statistics
– KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
– COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
• Two versions of the system
– UNSUPERVISED: use statistics only
– SUPERVISED: use distant learning to create training data

KEYPHRASENESS

• The probability that a term t is a link to a Wikipedia article (cfr. Milne & Witten’s prior link probability)

$$\mathrm{Keyphraseness}(t) = \frac{\mathrm{count}([\_ \mid t])}{\mathrm{count}(t)}$$

• Examples
– The term Georgia
  • is found as a link in 22,631 Wikipedia articles
  • appears in 75,000 Wikipedia articles
  • keyphraseness = 22631/75000 ≈ 0.3017
– Cfr. the term “the”: keyphraseness = 0.0006

COMMONNESS

• The probability that a term t is a link to a SPECIFIC Wikipedia article a

$$\mathrm{Commonness}(t, a) = \frac{\mathrm{count}([a \mid t])}{\mathrm{count}(t)}$$

(in the example below, count(t) is taken over the occurrences of t as a link)

• For example, the surface form Georgia was found to be linked to
– a1 = University_of_Georgia 166 times
– Republic_of_Georgia 18 times
– Georgia_(United_States) 5 times
• commonness(t, a1) = 166/(166+18+5) = 0.8783
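The two statistics can be reproduced from the counts quoted on the slides. This is a minimal sketch: the function names are hypothetical, and the commonness denominator is the total number of links made from the surface form, which is what the worked example uses.

```python
def keyphraseness(link_article_count, total_article_count):
    """Fraction of the articles containing the term in which it is a link."""
    if total_article_count == 0:
        return 0.0
    return link_article_count / total_article_count


def commonness(target_counts):
    """Share of each target article among all links made from one surface form."""
    total = sum(target_counts.values())
    if total == 0:
        return {}
    return {article: n / total for article, n in target_counts.items()}


# Counts for the surface form "Georgia", as quoted on the slides.
georgia_targets = {
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
}
kp = keyphraseness(22631, 75000)
cm = commonness(georgia_targets)
print(round(kp, 4), round(cm["University_of_Georgia"], 4))
```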

Extracting dictionaries and statistics from a Wikipedia dump

• Parsing
– In three phases
– Identify articles of relevance
• Extract (among other things)
– Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
– Set of LINKS [article|surface_form]
  • e.g. [[Pedanius Dioscorides|Dioscorides]]
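One way to collect surface forms and links of the form [[article|surface_form]] is a regular expression over the raw wiki markup. The sketch below is an assumption about how such an extractor might look, not the parser used for the figures in this lecture; the two sample sentences are invented, and a real dump would also need namespace links (files, categories) filtered out.

```python
import re
from collections import Counter, defaultdict

# Matches [[target]] or [[target|surface form]]; nested brackets are skipped.
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def extract_links(wikitext):
    """Yield (surface_form, target_article) pairs found in raw wiki markup."""
    for match in LINK.finditer(wikitext):
        target = match.group(1).strip()
        # Without an explicit surface form, the title itself is the anchor text.
        surface = (match.group(2) or target).strip()
        yield surface, target.replace(" ", "_")


def link_dictionary(pages):
    """Map each surface form to a Counter of the articles it links to."""
    table = defaultdict(Counter)
    for text in pages:
        for surface, target in extract_links(text):
            table[surface][target] += 1
    return table


# Invented sample markup, for illustration only.
pages = [
    "The herbal of [[Pedanius Dioscorides|Dioscorides]] was widely copied.",
    "[[Pedanius Dioscorides]] wrote in Greek; see [[Pedanius Dioscorides|Dioscorides]].",
]
table = link_dictionary(pages)
print(dict(table["Dioscorides"]))
```

The per-surface-form counters produced this way are exactly the input needed to compute commonness.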

The Wikipedia Dump from July 2011

– 11,459,639 pages in total
– 12,525,583 links
  • specifying surface word, target, frequency
  • ranked by frequency
• For example, the mention Georgia is linked to
– University_of_Georgia 166 times
– Republic_of_Georgia 18 times
– Georgia_(United_States) 5 times

Some statistics (all Wikidumps from July 2011)

Page Type         English      Italian     Polish
Redirected        4,465,652    323,591     134,148
List_of           138,581      836         5,021
Disambiguation    176,721      6,193       4,553
Relevant          4,361,020    917,354     920,486
Total             11,459,639   1,654,258   1,200,313

Surface forms, titles, articles

Dictionary        English      Italian     Polish
Titles            4,361,020    917,354     920,486
Surface forms     8,829,624    2,484,045   2,482,104
Files             745,724      72,126      n/a
Links             10,871,741   2,917,235   2,937,981

Files in Polish are arranged in a repository different from English/Italian

Some definitions and figures
• surface form: the occurrence of a mention inside an article
• target article: the target Wiki article a surface form is linked to

The Unsupervised Approach

• Use keyphraseness to identify candidate terms
– Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
– Retain top 10
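The two steps of the unsupervised approach can be sketched as a filter followed by a ranking. The function below is hypothetical; in particular, reading "retain top 10" as "keep the ten most common target articles per term" is an assumption, and the statistics are the ones quoted earlier for “Georgia” and “the”.

```python
def wikify_unsupervised(candidates, keyphraseness, commonness,
                        threshold=0.01, top_n=10):
    """Keep terms above the keyphraseness threshold; rank their targets by commonness."""
    result = {}
    for term in candidates:
        if keyphraseness.get(term, 0.0) <= threshold:
            continue  # rarely used as a link, e.g. "the"
        ranked = sorted(commonness.get(term, {}).items(),
                        key=lambda item: item[1], reverse=True)
        result[term] = ranked[:top_n]
    return result


keyphraseness = {"Georgia": 0.3017, "the": 0.0006}
commonness = {
    "Georgia": {
        "Republic_of_Georgia": 18 / 189,
        "University_of_Georgia": 166 / 189,
        "Georgia_(United_States)": 5 / 189,
    }
}
links = wikify_unsupervised(["the", "Georgia"], keyphraseness, commonness)
print(links["Georgia"][0][0])
```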

The Supervised Approach

• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cfr. Milne & Witten’s NORMALIZED LINK DISTANCE)

$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$

where A_1 and A_2 are the sets of articles that link to a_1 and a_2, and W is the set of all Wikipedia articles
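The formula can be evaluated directly on sets of in-linking articles. The sketch below uses a hypothetical function name and made-up article identifiers. As written, the expression is 0 when the two link sets are identical and grows as their overlap shrinks, so smaller values indicate more closely related articles; returning infinity for disjoint sets is a choice made here to avoid log(0).

```python
import math


def link_distance(links_a1, links_a2, total_articles):
    """Evaluate the link-based formula for two sets of in-linking articles."""
    common = len(links_a1 & links_a2)
    if common == 0:
        return float("inf")  # log(0) is undefined: no shared in-links
    larger = max(len(links_a1), len(links_a2))
    smaller = min(len(links_a1), len(links_a2))
    return ((math.log(larger) - math.log(common))
            / (math.log(total_articles) - math.log(smaller)))


# Made-up article ids: four articles link to a1, three to a2, two to both.
a1_inlinks = {1, 2, 3, 4}
a2_inlinks = {3, 4, 5}
value = link_distance(a1_inlinks, a2_inlinks, total_articles=100)
print(round(value, 4))
```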

Training a supervised wikifier

• Using WIKIPEDIA ITSELF as source of training materials (see next)

Results on standard datasets

APPROACH               AQUAINT    WIKIPEDIA
Our approach           85.66      84.37
Milne & Witten 2008    83.61      80.31
Ratinov et al. 2011    84.52      90.20

Wikifying queries: the Bridgeman datasets

• BAL data sets
– 1049 Query set
  • 1 annotator, up to 3 manual annotations
  • 1 automatic annotation
– 100 Query set
  • 3 annotators, each up to 3 manual annotations

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS    RESULTS
First candidate         64.77
Among first 2           71.59
First 3                 75.42
First 4                 77.18
First 5                 78.32

Accuracy up by 17 points (36%)

Results for the GALATEAS languages and Arabic

LANGUAGE    WIKIPEDIA SIZE    RESULTS (on Wikipedia subset)
English     4M articles       84.37
Italian     1M                79.64
French      1.4M              76-77
German      1.6M              72-73
Dutch       1.6M              70-71
Polish      900K              60.81
Arabic      200K              80.78

The GALATEAS D2W web services

• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries: achieves throughput of 600 characters per second
• Integrated with LangLog tool

Use of the service in LangLog

(See Domoina’s demo)

Other applications

• The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …

http://en.wikipedia.org/wiki/FCC

The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action

• Wikipedia has been used in NER systems
– As a source of features for normal NER
– To automatically create training materials (DISTANT LEARNING)
– To go beyond NE tagging towards proper ENTITY DISAMBIGUATION

Distant learning

• Automatically extract examples
– Positive examples: from the mention to the linked Wikipedia page
– Negative examples: from similar mentions with other links
• Use positive and negative examples to train the model

The Supervised Approach: Using Wikipedia links to generate training data

• Example
– Giotto was called to work in Padua and also in Rimini (sentence taken from Wikipedia text, with links available)
– Candidates: Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
– +1 Giotto was called to work -- Giotto_di_Bondone
– -1 Giotto was called to work -- Giotto_Griffiths
– -1 Giotto was called to work -- Giotto_Bizzarrini

http://en.wikipedia.org/wiki/Giotto_di_Bondone
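The dataset above can be generated mechanically: the article that the Wikipedia sentence actually links to gives the positive example, and every other candidate for the same surface form gives a negative one. A minimal sketch, with a hypothetical function name and the candidate list taken from the slide:

```python
def training_examples(context, mention, linked_article, candidates):
    """Label the linked article +1 and every other candidate for the mention -1."""
    examples = []
    for article in candidates[mention]:
        label = 1 if article == linked_article else -1
        examples.append((label, context, article))
    return examples


candidates = {
    "Giotto": ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
}
examples = training_examples(
    context="Giotto was called to work in Padua and also in Rimini",
    mention="Giotto",
    linked_article="Giotto_di_Bondone",
    candidates=candidates,
)
for label, context, article in examples:
    print(f"{label:+d} {context} -- {article}")
```

Because the labels come from links that editors already wrote, no manual annotation is needed, which is what makes this a form of distant learning.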

MORE ADVANCED USES OF WIKIPEDIA

• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

• Taxonomic information: category structure
• Attributes: infobox, text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Start with the category tree
• Induce a subsumption hierarchy

INFOBOXES

• Collaborative content
• Semi-structured data

{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|U.S.]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|U.S.]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
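Because an infobox is semi-structured, its attribute-value pairs can be read off with very little code. The sketch below is hypothetical and assumes the usual layout: the template wrapped in {{ }} with one "| attribute = value" pair per line. It keeps the values as raw markup rather than resolving links or nested templates.

```python
def parse_infobox(wikitext):
    """Split an infobox into a template name and a dict of attribute -> raw value."""
    lines = [line.strip() for line in wikitext.strip().splitlines()]
    name = lines[0].lstrip("{").strip()
    fields = {}
    for line in lines[1:]:
        # Attribute lines start with "|"; split only at the first "=" so that
        # values such as "{{birth date|1809|1|19|mf=y}}" stay intact.
        if line.startswith("|") and "=" in line:
            key, value = line[1:].split("=", 1)
            fields[key.strip()] = value.strip()
    return name, fields


infobox = """
{{Infobox Writer
| name = Edgar Allan Poe
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|U.S.]]
| magnum_opus = The Raven
}}
"""
name, fields = parse_infobox(infobox)
print(name, fields["magnum_opus"])
```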

DBPEDIA

DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web

The DBpedia Dataset

• 1,600,000 concepts, including
– 58,000 persons
– 70,000 places
– 35,000 music albums
– 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data.

REPRESENTING EXTRACTED INFORMATION

Extracting Infobox Data (RDF Representation)

Wikipedia page:   http://en.wikipedia.org/wiki/Calgary
DBpedia resource: http://dbpedia.org/resource/Calgary

dbpedia:native_name         “Calgary”
dbpedia:altitude            “1048”
dbpedia:population_city     “988193”
dbpedia:population_metro    “1079310”
mayor_name                  dbpedia:Dave_Bronconnier
governing_body              dbpedia:Calgary_City_Council

SPARQL

• SPARQL is a query language for RDF
– RDF is a directed, labeled graph data format for representing information in the Web
– This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

The DBpedia SPARQL Endpoint

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
– Give me all sitcoms that are set in NYC
– All tennis players from Moscow
– All films by Quentin Tarantino
– All German musicians that were born in Berlin in the 19th century
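As an illustration, the request "all films by Quentin Tarantino" might be phrased as below. This is a sketch under assumptions: it supposes that films are connected to their director through the DBpedia ontology property dbo:director, and that the endpoint accepts a `format` parameter naming the result type. The helper `sparql_url` is hypothetical and only builds the request URL; it does not contact the server.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# "All films by Quentin Tarantino", assuming films point to their director
# through the DBpedia ontology property dbo:director.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
"""


def sparql_url(query, endpoint=ENDPOINT):
    """Build a GET URL that asks the endpoint for JSON results."""
    params = {
        "query": query.strip(),
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urlencode(params)


url = sparql_url(QUERY)
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client; each binding of ?film in the response is one resource that matches the triple pattern.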

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
– Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
– Open Mind Commonsense (Singh) (collecting facts)
– Semantic Wikis

www.phrasedetectives.com

WEB COLLABORATION PROJECTS

• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – Cycorp
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University

OPEN MIND COMMONSENSE

CONCEPT NET

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52,000 assertions)
– IsA (IsA "apple" "fruit")
– PartOf (PartOf "CPU" "computer")
– PropertyOf (PropertyOf "coffee" "wet")
– MadeOf (MadeOf "bread" "flour")
– DefinedAs (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
– PrerequisiteEventOf (PrerequisiteEventOf "read letter" "open envelope")
– SubeventOf (SubeventOf "play sport" "score goal")
– FirstSubeventOf (FirstSubeventOf "start fire" "light match")
– LastSubeventOf (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
– CapableOf (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
– LocationOf (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
– EffectOf (EffectOf "view video" "entertainment")
– DesirousEffectOf (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
– DesireOf (DesireOf "person" "not be depressed")
– MotivationOf (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
– UsedFor (UsedFor "fireplace" "burn wood")
– CapableOfReceivingAction (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
– SuperThematicKLine (SuperThematicKLine "western civilization" "civilization")
– ThematicKLine (ThematicKLine "wedding dress" "veil")
– ConceptuallyRelatedTo (ConceptuallyRelatedTo "bad breath" "mint")
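Each ConceptNet assertion is a relation name followed by two arguments, so a handful of them can be held as plain triples and indexed by their first argument. The sketch below is illustrative only: the function names are hypothetical, and the split of each assertion into two arguments follows the reading of the examples listed above.

```python
from collections import defaultdict

# (relation, first argument, second argument), taken from the examples above.
ASSERTIONS = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("MadeOf", "bread", "flour"),
    ("CapableOf", "dentist", "pull tooth"),
    ("UsedFor", "fireplace", "burn wood"),
    ("EffectOf", "view video", "entertainment"),
]


def build_index(assertions):
    """Index assertions by their first argument for quick lookup."""
    index = defaultdict(list)
    for relation, first, second in assertions:
        index[first].append((relation, second))
    return index


def related(index, concept, relation=None):
    """Concepts linked from `concept`, optionally restricted to one relation type."""
    return [second for rel, second in index.get(concept, [])
            if relation is None or rel == relation]


index = build_index(ASSERTIONS)
print(related(index, "apple", "IsA"))
```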

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
– ESP
– Verbosity
– TagATune
• Other games
– Peekaboom
– Phetch

ESP

• The first GWAP, developed by von Ahn and their group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
– To train image search engines
– To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web

ESP: THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can’t communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
– Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
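The matching rule in the last bullet can be sketched as follows. The function is hypothetical, not the game's actual implementation: it assumes the two players' guesses are interleaved turn by turn, and that comparison ignores case and surrounding whitespace.

```python
def first_match(guesses_a, guesses_b):
    """Return the first string typed by both players, or None if they never agree.

    Each argument is one player's guesses in the order they were typed.
    Guesses are compared after lower-casing and stripping whitespace.
    """
    seen_a, seen_b = set(), set()
    for i in range(max(len(guesses_a), len(guesses_b))):
        if i < len(guesses_a):
            guess = guesses_a[i].strip().lower()
            if guess in seen_b:
                return guess
            seen_a.add(guess)
        if i < len(guesses_b):
            guess = guesses_b[i].strip().lower()
            if guess in seen_a:
                return guess
            seen_b.add(guess)
    return None


label = first_match(["dog", "grass", "ball"], ["puppy", "ball", "dog"])
print(label)
```

The agreed string becomes a label for the image. Since the partners cannot communicate, agreement is only likely on words that genuinely describe what both of them see.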

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th 2003 and December 10th 2003
– 13,630 players
– 1.2 million labels for 293,760 images
– 80% of players played more than once
• By 2008
– 200,000 players
– 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
– choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
– 15 participants, 20 images among the 1000 with more than 5 labels
– 83% of game labels also produced by participants
• Manual assessment of labels (‘would you use these labels to describe this image?’)
– 15 participants, 20 images
– 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

PHRASE DETECTIVES: THE TASKS

• 2 tasks
– Find The Culprit (Annotation): the user must identify the closest antecedent of a markable, if it is anaphoric
– Detectives Conference (Validation): the user must agree/disagree with a coreference relation entered by another user

NAME THE CULPRIT

READINGS

• R. Mihalcea and A. Csomai. 2007. Wikify! Linking documents to encyclopedic knowledge. Proceedings of CIKM’07, Lisbon, Portugal
• V. Nguyen & M. Poesio. 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio. 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH
• V. Nastase & M. Strube. 2012. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence
• L. von Ahn and L. Dabbish. 2008. Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58–67
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi. 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems

Page 5: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

WIKIPEDIA FOR TEXT ANALYTICS

bull Wikipedia has proven an extremely useful resource for text analytics being used forndash Text classification clusteringndash Enriching documents through lsquoWikificationrsquondash NERndash Relation extraction ndash hellip

Wikipedia as Thesaurus for text classification clusteringbull Unlike other standard ontologies such as WordNet

and Mesh Wikipedia itself is not a structured thesaurus

bull However it is morehellipndash Comprehensive it contains 12 million articles (28

million in the English Wikipedia) ndash Accurate A study by Giles (2005) found Wikipedia can

compete with Encyclopaeligdia Britannica in accuracyndash Up to date Current and emerging concepts are

absorbed timely

Giles J 2005 Internet encyclopaedias go head to head Nature 438 900ndash901

Wikipedia as Thesaurus

bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed

phrase like a term in a traditional thesaurus

Wikipedia Article that describes the Concept Artificial intelligence

Wikipedia as Thesaurus

bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed

phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by

redirected links

AI is redirected to its equivalent concept Artificial Intelligence

Wikipedia as Thesaurus

bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed

phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by

redirected linksndash It contains a hierarchical categorization system

in which each article belongs to at least one category

The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society

Wikipedia as Thesaurus

bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed

phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by

redirected linksndash It contains a hierarchical categorization system in

which each article belongs to at least one category ndash Polysemous concepts are disambiguated by

Disambiguation Pages

The different meanings that Artificial intelligence may refer to are listed in its disambiguation page

WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING

bull Objective use information in Wikipedia to improve performance of text classifiers clustering systems

bull A number of possibilitiesndash Use similarity between documents and Wikipedia

pages on a given topic as a feature for text classification

ndash Use WIKIFICATION to enrich documentsndash Use Wikipedia category system as category repertoire

Using Wikipedia Categories for text classification

17

WIKIPEDIA FOR TEXT CLASSIFICATIONbull Automatic identification of the topiccategory of a text (eg computer science

psychology)ndash Booksndash Learning objects

ldquoThe United States was involved in the Cold Warrdquo

United States03793

Cold War03111

Vietnam War00023

World War I00023

Communism00027

Ronald Reagan00027

Michail Gorbachev00023

Cat Wars Involvingthe United States000779

Cat Global Conflicts000779

USING WIKIPEDIA FOR TEXT CLASSIFICATION

bull Either directly use Wikipedia categories or map onersquos categories to Wikipedia categories

bull Use the documents associated with those categories as training documents

TEXT WIKIFICATION

Wikification = adding links to Wikipedia pages to documents

bull Text

WIKIFICATION

bull Wikipedia

20May 2012 Truc-Vien T Nguyen

Giotto was called to work in Padua and also in Rimini

Wikification pipeline

[Figure: wikification pipeline. Raw (hyper)text → Decomposition → Clean Text → Keyword Extraction (Candidate Extraction, Candidate Ranking) → Text with selected keywords → Word Sense Disambiguation (Extract Sense Definitions from Sense Inventory; Knowledge-based: Lesk-like Definition Overlap; Data Driven: Naive Bayes trained on Wikipedia; Voting) → Annotated Text → Recomposition → (Hyper)text with linked keywords]

Keyword Extraction

• Finding important words/phrases in raw text
• Two-stage process
  - Candidate extraction
    • Typical methods: n-grams, noun phrases
  - Candidate ranking
    • Rank the candidates by importance
    • Typical methods:
      - Unsupervised: information theoretic
      - Supervised: machine learning using positional and linguistic features
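The two stages can be illustrated with a small sketch: n-grams as candidates, then a tf.idf score as one possible unsupervised ranking. The smoothed idf formula, the function names and the toy collection are assumptions made for the example.

```python
import math
from collections import Counter

def ngrams(tokens, max_n=2):
    """Stage 1: every contiguous word sequence of length 1..max_n is a candidate."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def rank_candidates(document, collection, max_n=2):
    """Stage 2: rank candidates by tf * idf, with idf taken over `collection`."""
    doc_counts = Counter(ngrams(document.lower().split(), max_n))
    collection_sets = [set(ngrams(d.lower().split(), max_n)) for d in collection]
    n_docs = len(collection)
    scores = {}
    for candidate, tf in doc_counts.items():
        df = sum(1 for s in collection_sets if candidate in s)
        idf = math.log((1 + n_docs) / (1 + df))  # smoothed idf (an assumption)
        scores[candidate] = tf * idf
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical background collection and input document.
collection = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cold war ended",
]
ranked = rank_candidates("the cold war and the cold peace", collection)
print(ranked[:3], ranked[-1])
```

The function word "the" occurs in every background document, so its idf is zero and it lands at the bottom of the ranking.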

Keyword Extraction using Wikipedia

1. Candidate extraction
• Semi-controlled vocabulary
  - Wikipedia article titles and anchor texts (surface forms)
    • E.g., "USA", "US" = "United States of America"
  - More than 2,000,000 terms/phrases
  - Vocabulary is broad (e.g., "the", "a" are included)

Keyword Extraction using Wikipedia

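2. Candidate ranking
• tf.idf
  - Wikipedia articles as document collection
• Chi-squared independence of phrase and text
  - The degree to which it appeared more times than expected by chance
• Keyphraseness

$$P(\mathrm{keyword} \mid W) \approx \frac{\mathrm{count}(D_{key})}{\mathrm{count}(D_{W})}$$

where count(D_key) is the number of documents in which the term W was selected as a keyword, and count(D_W) is the number of documents in which it appears.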

Our own Approach (cf. Milne & Witten 2008, 2012; Ratinov et al. 2011)

• Use a Wikipedia dump to compute two statistics
  • KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  • COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
• Two versions of the system
  • UNSUPERVISED: use statistics only
  • SUPERVISED: use distant learning to create training data

KEYPHRASENESS

• The probability that a term t is a link to a Wikipedia article (cf. Milne & Witten's prior link probability)

• Examples
  • The term "Georgia"
    - is found as a link in 22631 Wikipedia articles
    - appears in 75000 Wikipedia articles
    - keyphraseness = 22631/75000 = 0.3017466
  • Cf. the term "the": keyphraseness = 0.0006

$$\mathrm{Keyphraseness}(t) = \frac{\mathrm{count}([\_ \mid t])}{\mathrm{count}(t)}$$
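As a quick check of the arithmetic, keyphraseness is just a ratio of two article counts. The sketch below reuses the "Georgia" figures from the slide; the function and argument names are made up for the example.

```python
def keyphraseness(articles_with_term_as_link, articles_with_term):
    """Fraction of the articles mentioning a term in which it is used as a link."""
    if articles_with_term == 0:
        return 0.0
    return articles_with_term_as_link / articles_with_term

# Counts quoted on the slide for the term "Georgia".
georgia = keyphraseness(22631, 75000)
print(round(georgia, 7))
```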

COMMONNESS

• The probability that a term t is a link to a SPECIFIC Wikipedia article a

• For example, the surface form "Georgia" was found to be linked to
  - a1 = University_of_Georgia: 166 times
  - Republic_of_Georgia: 18 times
  - Georgia_(United_States): 5 times

  commonness(t, a1) = 166/(166+18+5) = 0.8783

$$\mathrm{Commonness}(t, a) = \frac{\mathrm{count}([a \mid t])}{\mathrm{count}(t)}$$

In the worked example the denominator is the number of times t occurs as a link (166+18+5), not the number of articles that mention t.
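The worked example can be reproduced in a few lines. This sketch assumes, as the example does, that the denominator is the total number of linked occurrences of the surface form; the function name is hypothetical.

```python
from collections import Counter

def commonness(link_counts):
    """Map each target article to its share of the surface form's links."""
    total = sum(link_counts.values())
    if total == 0:
        return {}
    return {article: count / total for article, count in link_counts.items()}

# Link counts quoted on the slide for the surface form "Georgia".
georgia_links = Counter({
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
})
scores = commonness(georgia_links)
print(round(scores["University_of_Georgia"], 4))
```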

Extracting dictionaries and statistics from a Wikipedia dump

• Parsing
  • In three phases
• Identify articles of relevance
• Extract (among other things)
  • Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
  • Set of LINKS [article|surface_form]
    • [[Pedanius Dioscorides|Dioscorides]]
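A rough sketch of the extraction step: pull [[article|surface_form]] links out of wiki markup with a regular expression and count, for each surface form, how often it points at each article. The regular expression is a simplification (it does not handle nested markup, and it does not filter File: or Category: links), and all names are assumptions rather than the parser used for the dump.

```python
import re
from collections import Counter, defaultdict

# [[target]] or [[target|surface form]]; nested brackets are not handled.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]*))?\]\]")

def extract_links(wikitext):
    """Return (target_article, surface_form) pairs found in the markup."""
    links = []
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        # With no explicit surface form, the article title itself is displayed.
        surface = (match.group(2) or target).strip()
        links.append((target, surface))
    return links

def surface_form_counts(links):
    """surface form -> Counter of the articles it was used to link to."""
    counts = defaultdict(Counter)
    for target, surface in links:
        counts[surface][target] += 1
    return counts

text = "[[Pedanius Dioscorides|Dioscorides]] wrote about [[botany]]."
links = extract_links(text)
counts = surface_form_counts(links)
print(links)
```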

The Wikipedia Dump from July 2011

  - 11,459,639 pages in total
  - 12,525,583 links
    • specifying surface word, target, frequency
  - ranked by frequency
• For example, the mention "Georgia" is linked to
  - University_of_Georgia: 166 times
  - Republic_of_Georgia: 18 times
  - Georgia_(United_States): 5 times


Some statistics (all Wikidumps from July 2011)

Page Type        English      Italian     Polish
Redirected       4,465,652    323,591     134,148
List_of          138,581      836         5,021
Disambiguation   176,721      6,193       4,553
Relevant         4,361,020    917,354     920,486
Total            11,459,639   1,654,258   1,200,313

Surface forms titles articles

Dictionary       English      Italian     Polish
Titles           4,361,020    917,354     920,486
Surface forms    8,829,624    2,484,045   2,482,104
Files            745,724      72,126      n/a
Links            10,871,741   2,917,235   2,937,981

Files in Polish are arranged in a repository different from English/Italian

Some definitions and figures
• surface form: the occurrence of a mention inside an article
• target article: the target Wiki article a surface form is linked to

The Unsupervised Approach

• Use keyphraseness to identify candidate terms
  • Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  • Retain top 10
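Put together, the unsupervised procedure is a filter followed by a ranking. The sketch below assumes that both statistics are already available as dictionaries and that "top 10" means the ten most common target articles per term; the function and variable names are hypothetical.

```python
def wikify_unsupervised(terms, keyphraseness, link_counts, threshold=0.01, top_n=10):
    """Keep terms above the keyphraseness threshold, rank their targets by commonness."""
    results = {}
    for term in terms:
        if keyphraseness.get(term, 0.0) <= threshold:
            continue  # rarely used as a link, so not a candidate
        counts = link_counts.get(term, {})
        total = sum(counts.values())
        if total == 0:
            continue
        ranked = sorted(counts, key=counts.get, reverse=True)[:top_n]
        results[term] = [(article, counts[article] / total) for article in ranked]
    return results

# Statistics quoted on the slides for "Georgia" and "the".
keyphraseness_table = {"Georgia": 0.3017466, "the": 0.0006}
link_counts = {
    "Georgia": {
        "University_of_Georgia": 166,
        "Republic_of_Georgia": 18,
        "Georgia_(United_States)": 5,
    }
}
results = wikify_unsupervised(["the", "Georgia"], keyphraseness_table, link_counts)
print(results["Georgia"][0])
```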

The Supervised Approach

• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page

• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)

$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$

where A_1 and A_2 are the sets of articles that link to a_1 and a_2, and W is the set of all Wikipedia articles.
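The formula can be transcribed directly. In this sketch the two link sets are plain Python sets of article identifiers and the collection size is passed in as a number; these representation choices, and returning None when the sets share no article, are assumptions. As written, the value is 0 when the two link sets coincide and grows as their overlap shrinks, so it behaves like a distance.

```python
import math

def relatedness(links_to_a1, links_to_a2, total_articles):
    """Link-based measure from the formula above; None if the sets do not overlap."""
    common = links_to_a1 & links_to_a2
    if not common:
        return None  # log(0) is undefined
    larger = max(len(links_to_a1), len(links_to_a2))
    smaller = min(len(links_to_a1), len(links_to_a2))
    denominator = math.log(total_articles) - math.log(smaller)
    if denominator == 0:
        return None  # the smaller set already covers the whole collection
    return (math.log(larger) - math.log(len(common))) / denominator

# Hypothetical article ids linking to two candidate articles.
r = relatedness({1, 2, 3, 4}, {3, 4, 5}, total_articles=100)
print(r)
```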

Training a supervised wikifier

• Using WIKIPEDIA ITSELF as a source of training materials (see next)

Results on standard datasets

APPROACH              AQUAINT   WIKIPEDIA
Our approach          85.66     84.37
Milne & Witten 2008   83.61     80.31
Ratinov et al. 2011   84.52     90.20

Wikifying queries: the Bridgeman datasets

• BAL data sets
  - 1049 Query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  - 100 Query set
    • 3 annotators, each up to 3 manual annotations

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS   RESULTS
First candidate        64.77
Among first 2          71.59
First 3                75.42
First 4                77.18
First 5                78.32

Accuracy up by 17 points (36%)

Results for the GALATEAS languages and Arabic

LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78

The GALATEAS D2W web services

• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries: achieves throughput of 600 characters per second
• Integrated with the LangLog tool

Use of the service in LangLog

(See Domoina's demo)

Other applications

• The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …

WIKIPEDIA FOR NER

http://en.wikipedia.org/wiki/FCC

The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154)

WIKIPEDIA FOR NER

Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action

WIKIPEDIA

WIKIPEDIA FOR NER

• Wikipedia has been used in NER systems
  - As a source of features for normal NER
  - To automatically create training materials (DISTANT LEARNING)
  - To go beyond NE tagging towards proper ENTITY DISAMBIGUATION

Distant learning

• Automatically extract examples
  • Positive examples: from the mention to the Wikipedia page it links to
  • Negative examples: from similar mentions with other links
• Use positive and negative examples to train a model

The Supervised Approach: Using Wikipedia links to generate training data

• Example
  - Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
  - Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)

• Dataset
  - +1: Giotto was called to work -- Giotto_di_Bondone
  - -1: Giotto was called to work -- Giotto_Griffiths
  - -1: Giotto was called to work -- Giotto_Bizzarrini
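The dataset construction above can be sketched as a small function: the article that the Wikipedia sentence actually links to yields the positive example, and every other candidate for the same surface form yields a negative one. The function name and the tuple layout are assumptions made for the illustration.

```python
def make_training_examples(context, linked_article, candidate_articles):
    """Label the linked article +1 and all competing candidates -1."""
    examples = []
    for candidate in candidate_articles:
        label = 1 if candidate == linked_article else -1
        examples.append((label, context, candidate))
    return examples

# The Giotto example from the slide.
examples = make_training_examples(
    context="Giotto was called to work",
    linked_article="Giotto_di_Bondone",
    candidate_articles=["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
)
for label, context, candidate in examples:
    print(f"{label:+d} {context} -- {candidate}")
```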


http://en.wikipedia.org/wiki/Giotto_di_Bondone

MORE ADVANCED USES OF WIKIPEDIA

• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

• Taxonomic information: category structure
• Attributes: infobox, text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Induce a subsumption hierarchy

INFOBOXES

• Collaborative content

• Semi-structured data

{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
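To show why infoboxes count as semi-structured data, here is a minimal sketch that turns an infobox into attribute/value pairs. It assumes the usual {{...}} template syntax and splits only on pipes that are not inside a nested [[link]] or {{template}}; it is not the extraction code of any particular project, and the shortened sample is for illustration only.

```python
def split_top_level(text, sep="|"):
    """Split on `sep`, ignoring separators nested inside [[...]] or {{...}}."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ("[[", "{{"):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("]]", "}}"):
            depth -= 1
            current.append(pair)
            i += 2
        elif text[i] == sep and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    parts.append("".join(current))
    return parts

def parse_infobox(wikitext):
    """Return (template_name, {attribute: raw_value}) for one infobox."""
    body = wikitext.strip()
    if body.startswith("{{") and body.endswith("}}"):
        body = body[2:-2]
    parts = split_top_level(body)
    fields = {}
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return parts[0].strip(), fields

sample = (
    "{{Infobox Writer| name = Edgar Allan Poe"
    "| birth_date = {{birth date|1809|1|19|mf=y}}"
    "| birth_place = [[Boston, Massachusetts]], [[United States|US]]"
    "| magnum_opus = The Raven}}"
)
name, fields = parse_infobox(sample)
print(name, fields["magnum_opus"])
```

The values are left as raw wiki markup; resolving links and date templates would be a further step.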

DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web

DBPEDIA

The DBpedia Dataset

• 1,600,000 concepts
• including
  - 58,000 persons
  - 70,000 places
  - 35,000 music albums
  - 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. In the Developers Guide to Semantic Web Toolkits you can find a development toolkit in your preferred programming language to process DBpedia data.

REPRESENTING EXTRACTED INFORMATION

http://en.wikipedia.org/wiki/Calgary

http://dbpedia.org/resource/Calgary

dbpedia:native_name "Calgary"
dbpedia:altitude "1048"
dbpedia:population_city "988193"
dbpedia:population_metro "1079310"
mayor_name dbpedia:Dave_Bronconnier
governing_body dbpedia:Calgary_City_Council

Extracting Infobox Data (RDF Representation)
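The Calgary example can be mimicked with a toy in-memory list of (subject, predicate, object) triples and a wildcard match. This is neither RDF tooling nor SPARQL, just an illustration of the triple data model; writing every predicate with the dbpedia: prefix is an assumption made for uniformity.

```python
# (subject, predicate, object) triples mirroring the Calgary example.
triples = [
    ("dbpedia:Calgary", "dbpedia:native_name", "Calgary"),
    ("dbpedia:Calgary", "dbpedia:altitude", "1048"),
    ("dbpedia:Calgary", "dbpedia:population_city", "988193"),
    ("dbpedia:Calgary", "dbpedia:population_metro", "1079310"),
    ("dbpedia:Calgary", "dbpedia:mayor_name", "dbpedia:Dave_Bronconnier"),
    ("dbpedia:Calgary", "dbpedia:governing_body", "dbpedia:Calgary_City_Council"),
]

def match(pattern, triples):
    """Return the triples that agree with the pattern; None is a wildcard."""
    return [
        triple
        for triple in triples
        if all(p is None or p == value for p, value in zip(pattern, triple))
    ]

# "Who is the mayor of Calgary?" expressed as a single triple pattern.
mayor = match(("dbpedia:Calgary", "dbpedia:mayor_name", None), triples)
print(mayor[0][2])
```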

SPARQL

• SPARQL is a query language for RDF
  • RDF is a directed, labeled graph data format for representing information in the Web
  • This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

The DBpedia SPARQL Endpoint

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  - Give me all sitcoms that are set in NYC
  - All tennis players from Moscow
  - All films by Quentin Tarantino
  - All German musicians that were born in Berlin in the 19th century

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  - Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  - Open Mind Commonsense (Singh) (collecting facts)
  - Semantic Wikis

www.phrasedetectives.com

WEB COLLABORATION PROJECTS

• Open Mind Common Sense - Singh
• Crater mapping (results) - Kanefsky
• Learner, Learner2, 1001 Paraphrases - Chklovski
• FACTory - CyCORP
• Hot or Not - 8 Days
• ESP, Phetch, Verbosity, Peekaboom - von Ahn
• Galaxy Zoo - Oxford University


OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52,000 assertions)
  IsA (IsA "apple" "fruit"); PartOf (PartOf "CPU" "computer"); PropertyOf (PropertyOf "coffee" "wet"); MadeOf (MadeOf "bread" "flour"); DefinedAs (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
  PrerequisiteEventOf (PrerequisiteEventOf "read letter" "open envelope"); SubeventOf (SubeventOf "play sport" "score goal"); FirstSubeventOf (FirstSubeventOf "start fire" "light match"); LastSubeventOf (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
  CapableOf (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
  LocationOf (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
  EffectOf (EffectOf "view video" "entertainment"); DesirousEffectOf (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf (DesireOf "person" "not be depressed"); MotivationOf (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
  IsUsedFor (UsedFor "fireplace" "burn wood"); CapableOfReceivingAction (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine (SuperThematicKLine "western civilization" "civilization"); ThematicKLine (ThematicKLine "wedding dress" "veil"); ConceptuallyRelatedTo (ConceptuallyRelatedTo "bad breath" "mint")

CONCEPT NET

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions

• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  - ESP
  - Verbosity
  - TagATune
• Other games
  - Peekaboom
  - Phetch

ESP

• The first GWAP, developed by von Ahn's group (2003, 2004)

• The problem: obtain accurate descriptions of images, to be used
  - To train image search engines
  - To develop machine learning approaches to vision

• The goal: label the majority of the images on the Web

ESP the game

ESP THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  - Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
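The matching rule in the last bullet can be sketched as follows. The normalisation step (lower-casing and collapsing whitespace) and the function names are assumptions; the slides do not describe how strings are compared or how many points a match is worth.

```python
def normalise(label):
    """Lower-case and collapse whitespace so trivial variants still match."""
    return " ".join(label.lower().split())

def agreed_label(player1_guesses, player2_guesses):
    """Return the first guess of player 1 that player 2 also typed, else None."""
    typed_by_partner = {normalise(guess) for guess in player2_guesses}
    for guess in player1_guesses:
        if normalise(guess) in typed_by_partner:
            return normalise(guess)
    return None

# Hypothetical guesses typed by two partners looking at the same image.
label = agreed_label(["dog", "Grass ", "park"], ["puppy", "grass"])
print(label)
```

The agreed string is what would be kept as a label for the image.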

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th 2003 and December 10th 2003
  - 13,630 players
  - 1.2 million labels for 293,760 images
  - 80% of players played more than once
• By 2008
  - 200,000 players
  - 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  - choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  - 15 participants, 20 images among the 1000 with more than 5 labels
  - 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  - 15 participants, 20 images
  - 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

• 2 tasks
  - Find The Culprit (Annotation): the user must identify the closest antecedent of a markable, if it is anaphoric
  - Detectives Conference (Validation): the user must agree/disagree with a coreference relation entered by another user


PHRASE DETECTIVES THE TASKS

NAME THE CULPRIT

READINGS

• Mihalcea, R. and Csomai, A. Wikify!: Linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.

• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.

• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.

• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012.

READINGS

• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.

• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.


FUNCTIONAL (115000 assertions)

IsUsedFor (UsedFor fireplace burn wood) CapableOfReceivingAction (CapableOfReceivingAction drink serve)

ASSOCIATION K-LINES (125 million assertions)

SuperThematicKLine (SuperThematicKLine western civilization civilization) ThematicKLine (ThematicKLine wedding dress veil) ConceptuallyRelatedTo (ConceptuallyRelatedTo bad breath mint)

CONCEPT NET

GAMES WITH A PURPOSE

bull Luis von Ahn pioneered a new approach to resource creation on the Web GAMES WITH A PURPOSE or GWAP in which people as a side effect of playing perform tasks lsquocomputers are unable to performrsquo (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

bull GWAP do not rely on altruism or financial incentives to entice people to perform certain actions

bull The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

bull Games at wwwgwapcomndash ESPndash Verbosityndash TagATune

bull Other gamesndash Peekaboomndash Phetch

ESP

bull The first GWAP developed by von Ahn and their group (2003 2004)

bull The problem obtain accurate description of images to be usedndash To train image search enginesndash To develop machine learning approaches to vision

bull The goal label the majority of the images on the Web

ESP the game

ESP THE GAMEbull Two partners are picked at random from the

large number of players onlinebull They are not told who their partner is and canrsquot

communicate with thembull They are both shown the same imagebull The goal guess how their partner will describe

the image and type that descriptionndash Hence the ESP game

bull If any of the strings typed by one player matches the string typed by the other player they score points

THE TASK

SCORING BY MATCHING

SOME STATISTICS

bull In the 4 months between August 9th 2003 and December 10th 2003ndash 13630 playersndash 12 million labels for 293760 imagesndash 80 of players played more than once

bull By 2008 ndash 200000 playersndash 50 million labels

QUALITY OF THE LABELSbull For IMAGE SEARCH

ndash choose 10 labels among those produced and look at which images are returned

bull Compare labels produced by players with labels produced by participants in an experimentndash 15 participants 20 images among the 1000 with more

than 5 labelsndash 83 of game labels also produced by participants

bull Manual assessment of labels (lsquowould you use these labels to describe this imagersquo)ndash 15 participants 20 imagesndash 85 of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

wwwphrasedetectivesorg

bull 2 tasks

ndash Find The Culprit (Annotation)User must identify the closest antecedent of a markable if it is anaphoric

ndash Detectives Conference (Validation)User must agreedisagree with a coreference relation entered by another user

wwwphrasedetectivescom

PHRASE DETECTIVES THE TASKS

NAME THE CULPRIT

READINGS

bull Mihalcea R and Csomai A Wikify linking documents to encyclopedic knowledge Proceedings of CIKMrsquo07 Lisbon Portugal

bull V Nguyen amp M Poesio 2012 Entity disambiguation and linking over queries using Encyclopedic Knowledge Proceedings of 6th workshop on Analytics for Noisy Unstructured Text Data

bull D Lungley M Trevisan V Nguyen M Althobaiti M Poesio 2013 GALATEAS D2W A Multi-lingual Disambiguation to Wikipedia Web Service Proc Of ENRICH

bull V Nastaseamp M Strube Transforming Wikipedia into a large scale multilingual concept network Artificial Intelligence 2012

READINGS

bull L von Ahn and L Dabbish (2008) Designing games with a purpose Communications of the ACM v 51 n8 58-67

bull Poesio Chamberlain Kruschwitz Robaldo amp Ducceschi 2013 Phrase Detectives Utilizing Collective Intelligence for Internet-Scale Language Resource Creation ACM Transactions on Intelligent Interactive Systems

Wikipedia as Thesaurus

• Moreover, Wikipedia has a well-formed structure
  - Each article describes only a single concept
  - The title of the article is a short, well-formed phrase, like a term in a traditional thesaurus
  - Equivalent concepts are grouped together by redirect links
  - It contains a hierarchical categorization system in which each article belongs to at least one category
  - Polysemous concepts are disambiguated by Disambiguation Pages

[Figure: Wikipedia article that describes the concept Artificial intelligence]

[Figure: "AI" is redirected to its equivalent concept, Artificial Intelligence]

[Figure: The concept Artificial Intelligence belongs to four categories: Artificial intelligence, Cybernetics, Formal sciences, and Technology in society]

[Figure: The different meanings that Artificial intelligence may refer to are listed in its disambiguation page]
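The thesaurus-like structure described on these slides can be sketched with two plain lookup tables. This is only an illustration: the table names and the toy entries are assumptions, filled in from the slide examples rather than from a real Wikipedia dump.

```python
# Toy data copied from the slide examples; a real system would load these
# tables from a Wikipedia dump. All names here are illustrative.
REDIRECTS = {"AI": "Artificial intelligence"}

CATEGORIES = {
    "Artificial intelligence": [
        "Artificial intelligence",
        "Cybernetics",
        "Formal sciences",
        "Technology in society",
    ],
}


def resolve(term):
    """Follow redirect links until a canonical article title is reached."""
    seen = set()
    while term in REDIRECTS and term not in seen:
        seen.add(term)  # guard against redirect loops
        term = REDIRECTS[term]
    return term


def categories_of(term):
    """Return the categories of the article that the term resolves to."""
    return CATEGORIES.get(resolve(term), [])
```

With these tables, `categories_of("AI")` first resolves the redirect and then returns the four categories listed on the slide.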

WIKIPEDIA FOR TEXT CATEGORIZATION / CLUSTERING

• Objective: use information in Wikipedia to improve the performance of text classifiers and clustering systems
• A number of possibilities
  - Use similarity between documents and Wikipedia pages on a given topic as a feature for text classification
  - Use WIKIFICATION to enrich documents
  - Use the Wikipedia category system as the category repertoire

Using Wikipedia Categories for text classification


WIKIPEDIA FOR TEXT CLASSIFICATION

• Automatic identification of the topic/category of a text (e.g., computer science, psychology)
  - Books
  - Learning objects

Example text: “The United States was involved in the Cold War”

Concept / category                          Score
United States                               0.3793
Cold War                                    0.3111
Vietnam War                                 0.0023
World War I                                 0.0023
Communism                                   0.0027
Ronald Reagan                               0.0027
Mikhail Gorbachev                           0.0023
Cat: Wars Involving the United States       0.00779
Cat: Global Conflicts                       0.00779

USING WIKIPEDIA FOR TEXT CLASSIFICATION

• Either use Wikipedia categories directly, or map one’s own categories to Wikipedia categories
• Use the documents associated with those categories as training documents
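A minimal sketch of the second bullet, assuming a nearest-centroid classifier over bag-of-words vectors. The slides do not name a learning algorithm, so this choice, the function names, and the tiny hand-written "category documents" are all illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical stand-ins for the text of articles filed under each category.
TRAINING = {
    "Computer science": [
        "algorithms and data structures for computer programs",
        "a compiler translates computer programs",
    ],
    "Psychology": [
        "memory and perception in human cognition",
        "behaviour and emotion in human psychology",
    ],
}


def bag(text):
    """Lower-cased bag of words."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# One centroid per category: the summed word counts of its documents.
CENTROIDS = {
    cat: sum((bag(doc) for doc in docs), Counter())
    for cat, docs in TRAINING.items()
}


def classify(text):
    """Assign the category whose centroid is most similar to the text."""
    vec = bag(text)
    return max(CENTROIDS, key=lambda cat: cosine(vec, CENTROIDS[cat]))
```

Any supervised learner could replace the centroid step; the point is only that the category labels come for free with the Wikipedia documents.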

TEXT WIKIFICATION

Wikification = adding links to Wikipedia pages to documents

WIKIFICATION

• Text: “Giotto was called to work in Padua, and also in Rimini”
• Wikipedia

(May 2012, Truc-Vien T. Nguyen)

Wikification pipeline

[Figure: pipeline diagram]
Raw (hyper)text → Decomposition → Clean Text → Keyword Extraction → Text with selected keywords → Word Sense Disambiguation → Annotated Text → Recomposition → (Hyper)text with linked keywords

• Keyword Extraction
  - Candidate Extraction
  - Candidate Ranking
• Word Sense Disambiguation
  - Extract Sense Definitions from Sense Inventory
  - Knowledge-based: Lesk-like Definition Overlap
  - Data Driven: Naive Bayes trained on Wikipedia
  - Voting

Keyword Extraction

• Finding important words/phrases in raw text
• Two-stage process
  - Candidate extraction
    • Typical methods: n-grams, noun phrases
  - Candidate ranking
    • Rank the candidates by importance
    • Typical methods
      - Unsupervised: information-theoretic
      - Supervised: machine learning using positional and linguistic features
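The candidate-extraction stage can be illustrated with plain word n-grams. This is a sketch under simple assumptions (whitespace tokenization, a hypothetical function name), not the extractor used in any particular system.

```python
def ngram_candidates(text, max_n=3):
    """Return every word n-gram of length 1..max_n as a candidate phrase."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]
```

For "the Cold War" with `max_n=2` this yields the three unigrams followed by "the Cold" and "Cold War"; the ranking stage then decides which candidates matter.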

Keyword Extraction using Wikipedia

1. Candidate extraction
• Semi-controlled vocabulary
  - Wikipedia article titles and anchor texts (surface forms)
    • E.g., “USA”, “US” = “United States of America”
  - More than 2,000,000 terms/phrases
  - Vocabulary is broad (e.g., “the” and “a” are included)
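With a controlled vocabulary, candidate extraction reduces to dictionary lookup. The sketch below assumes a greedy longest-match strategy and a three-entry toy dictionary; both are illustrative choices, and a real vocabulary would hold the millions of surface forms mentioned above.

```python
# Toy surface-form dictionary: surface form -> article title.
SURFACE_FORMS = {
    "USA": "United States of America",
    "US": "United States of America",
    "United States of America": "United States of America",
}


def match_candidates(text, max_n=4):
    """Greedy longest-match lookup of surface forms in whitespace-tokenized text."""
    tokens = text.split()
    matches, i = [], 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in SURFACE_FORMS:
                matches.append((phrase, SURFACE_FORMS[phrase]))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no surface form starts at this token
    return matches
```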

Keyword Extraction using Wikipedia

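2. Candidate ranking
• tf.idf
  - Wikipedia articles as document collection
• Chi-squared independence of phrase and text
  - The degree to which it appeared more times than expected by chance
• Keyphraseness

$$P(\text{keyword} \mid W) \approx \frac{\mathrm{count}(D_{key})}{\mathrm{count}(D_W)}$$

where count(D_key) is the number of documents in which the term W was selected as a keyword (i.e., linked), and count(D_W) is the number of documents in which W appears.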
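As an illustration of the first ranking option, the sketch below scores single-word candidates by tf.idf against a small document collection. The smoothed idf formula is one common variant chosen here as an assumption; the slides do not specify which weighting was used.

```python
import math
from collections import Counter


def tf_idf_rank(candidates, doc_tokens, collection):
    """Rank single-token candidates by tf.idf.

    doc_tokens: tokens of the document being processed.
    collection: list of token lists (Wikipedia articles, in the slide's setup).
    """
    tf = Counter(doc_tokens)
    n_docs = len(collection)
    scores = {}
    for term in candidates:
        df = sum(1 for doc in collection if term in doc)
        idf = math.log((1 + n_docs) / (1 + df))  # smoothed idf (assumed variant)
        scores[term] = tf[term] * idf
    return sorted(scores, key=scores.get, reverse=True)
```

A word such as "the", which occurs in every document of the collection, gets an idf of zero and drops to the bottom of the ranking.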

Our own Approach (cf. Milne & Witten 2008, 2012; Ratinov et al. 2011)

• Use a Wikipedia dump to compute two statistics
  - KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  - COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
• Two versions of the system
  - UNSUPERVISED: use statistics only
  - SUPERVISED: use distant learning to create training data

KEYPHRASENESS

• The probability that a term t is a link to a Wikipedia article (cf. Milne & Witten’s prior link probability)

$$\mathrm{Keyphraseness}(t) = \frac{\mathrm{count}([\_ \mid t])}{\mathrm{count}(t)}$$

• Examples
  - The term “Georgia”
    • is found as a link in 22,631 Wikipedia articles
    • appears in 75,000 Wikipedia articles
    • keyphraseness = 22631/75000 ≈ 0.3017
  - Cf. the term “the”: keyphraseness = 0.0006
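The keyphraseness ratio is a single division once the two counts are available. The function name and signature below are illustrative; the counts are the "Georgia" figures from the slide.

```python
def keyphraseness(link_count, total_count):
    """Fraction of the articles containing a term in which the term is a link."""
    return link_count / total_count if total_count else 0.0


# Figures from the slide: "Georgia" is a link in 22,631 of 75,000 articles.
georgia = keyphraseness(22631, 75000)
```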

COMMONNESS

• The probability that a term t is a link to a SPECIFIC Wikipedia article a

$$\mathrm{Commonness}(t, a) = \frac{\mathrm{count}([a \mid t])}{\mathrm{count}(t)}$$

• For example, the surface form “Georgia” was found to be linked to
  - a1 = University_of_Georgia, 166 times
  - Republic_of_Georgia, 18 times
  - Georgia_(United_States), 5 times
• commonness(t, a1) = 166/(166+18+5) = 0.8783

Note that in this example the denominator counts only the linked occurrences of t (166 + 18 + 5 = 189).
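The worked "Georgia" example can be reproduced in a few lines. The table layout (surface form → target article → link frequency) is an assumed representation; the counts are those given on the slide, and the denominator is the total number of linked occurrences, as in the slide's arithmetic.

```python
# Link frequencies per surface form, using the counts from the slide.
LINK_COUNTS = {
    "Georgia": {
        "University_of_Georgia": 166,
        "Republic_of_Georgia": 18,
        "Georgia_(United_States)": 5,
    },
}


def commonness(term, article, link_counts=LINK_COUNTS):
    """Share of the links with anchor text `term` that point to `article`."""
    targets = link_counts.get(term, {})
    total = sum(targets.values())
    return targets.get(article, 0) / total if total else 0.0
```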

Extracting dictionaries and statistics from a Wikipedia dump

• Parsing
  - In three phases
• Identify articles of relevance
• Extract (among other things)
  - Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
  - Set of LINKS [article|surface_form]
    • E.g., [[Pedanius Dioscorides|Dioscorides]]
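Extracting the LINKS set amounts to scanning article wikitext for `[[article|surface_form]]` patterns. The regular expression below is a simplified sketch: it covers plain and piped links only and ignores nested templates, file links, and other wikitext subtleties that a real dump parser has to handle.

```python
import re

# [[Target]] or [[Target|anchor text]]; nested brackets are not handled.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def extract_links(wikitext):
    """Return (article, surface_form) pairs found in a piece of wikitext.

    A bare [[Title]] link is treated as its own surface form.
    """
    return [
        (m.group(1).strip(), (m.group(2) or m.group(1)).strip())
        for m in LINK_RE.finditer(wikitext)
    ]
```

Counting the resulting pairs over a whole dump gives the link frequencies needed for keyphraseness and commonness.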

The Wikipedia Dump from July 2011

• 11,459,639 pages in total
• 12,525,583 links
  - specifying surface word, target, frequency
  - ranked by frequency
• For example, the mention “Georgia” is linked to
  - University_of_Georgia, 166 times
  - Republic_of_Georgia, 18 times
  - Georgia_(United_States), 5 times

Some statistics (all Wikidumps from July 2011)

Page Type         English       Italian      Polish
Redirected        4,465,652     323,591      134,148
List_of           138,581       836          5,021
Disambiguation    176,721       6,193        4,553
Relevant          4,361,020     917,354      920,486
Total             11,459,639    1,654,258    1,200,313

Surface forms, titles, articles

Dictionary        English       Italian      Polish
Titles            4,361,020     917,354      920,486
Surface forms     8,829,624     2,484,045    2,482,104
Files             745,724       72,126       n/a
Links             10,871,741    2,917,235    2,937,981

Files in Polish are arranged in a repository different from English/Italian.

Some definitions and figures
• surface form: the occurrence of a mention inside an article
• target article: the target Wiki article that a surface form is linked to

The Unsupervised Approach

• Use keyphraseness to identify candidate terms
  - Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  - Retain top 10
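The two steps of the unsupervised approach can be sketched as a filter followed by a sort. The input tables, the function name, and the reading of "top 10" as the ten most common target articles per term are assumptions made for illustration.

```python
def wikify_unsupervised(candidates, keyphraseness, link_counts,
                        threshold=0.01, top_k=10):
    """Filter terms by keyphraseness, then rank their targets by commonness.

    keyphraseness: dict mapping term -> keyphraseness score.
    link_counts:   dict mapping term -> {article: link frequency}.
    Returns a dict mapping each retained term to [(article, commonness), ...].
    """
    result = {}
    for term in candidates:
        if keyphraseness.get(term, 0.0) <= threshold:
            continue  # too rarely used as a link: not a candidate
        targets = link_counts.get(term, {})
        total = sum(targets.values())
        ranked = sorted(targets, key=targets.get, reverse=True)[:top_k]
        result[term] = [(article, targets[article] / total) for article in ranked]
    return result
```

With the figures from the earlier slides, "the" (keyphraseness 0.0006) is discarded, while "Georgia" (about 0.30) is kept with University_of_Georgia as its top-ranked target.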

The Supervised Approach

• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten’s NORMALIZED LINK DISTANCE)

$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$

where A_1 and A_2 are the sets of articles that link to a_1 and a_2, and W is the set of all Wikipedia articles.
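The relatedness formula translates directly into code once the inlink sets are available. The sketch below follows the formula as written on the slide, so identical link sets give 0 and smaller values mean more closely related articles; returning infinity when the two sets do not overlap is an assumption made here to avoid log(0). It also assumes the total number of articles exceeds the size of either link set.

```python
import math


def relatedness(links_a1, links_a2, n_articles):
    """Normalized link distance between two articles.

    links_a1, links_a2: sets of articles linking to a1 and a2.
    n_articles:         total number of Wikipedia articles, |W|.
    """
    common = len(links_a1 & links_a2)
    if common == 0:
        return float("inf")  # assumption: no shared inlinks means maximally distant
    larger = max(len(links_a1), len(links_a2))
    smaller = min(len(links_a1), len(links_a2))
    return (math.log(larger) - math.log(common)) / (
        math.log(n_articles) - math.log(smaller)
    )
```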

Training a supervised wikifier

• Using WIKIPEDIA ITSELF as a source of training materials (see next)

Results on standard datasets

APPROACH                AQUAINT    WIKIPEDIA
Our approach            85.66      84.37
Milne & Witten 2008     83.61      80.31
Ratinov et al. 2011     84.52      90.20

Wikifying queries: the Bridgeman datasets

• BAL data sets
  - 1049 Query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  - 100 Query set
    • 3 annotators, each up to 3 manual annotations

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS    RESULTS
First candidate         64.77
Among first 2           71.59
First 3                 75.42
First 4                 77.18
First 5                 78.32

Accuracy up by 17 points (36%)

Results for the GALATEAS languages and Arabic

LANGUAGE    WIKIPEDIA SIZE    RESULTS (on Wikipedia subset)
English     4M articles       84.37
Italian     1M                79.64
French      1.4M              76-77
German      1.6M              72-73
Dutch       1.6M              70-71
Polish      900K              60.81
Arabic      200K              80.78

The GALATEAS D2W web services

• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves a throughput of 600 characters per second
• Integrated with the LangLog tool

Use of the service in LangLog

(See Domoina’s demo)

Other applications

• The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …

WIKIPEDIA FOR NER

http://en.wikipedia.org/wiki/FCC

The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

WIKIPEDIA FOR NER

Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action

WIKIPEDIA

WIKIPEDIA FOR NER

• Wikipedia has been used in NER systems
  - As a source of features for normal NER
  - To automatically create training materials (DISTANT LEARNING)
  - To go beyond NE tagging towards proper ENTITY DISAMBIGUATION

Distant learning

• Automatically extract examples
  - Positive examples: a mention paired with the Wikipedia page it links to
  - Negative examples: the same mention paired with the pages that similar mentions link to
• Use the positive and negative examples to train a model

The Supervised Approach Using Wikipedia links to generate training data

• Example
  - “Giotto was called to work in Padua, and also in Rimini” (sentence taken from Wikipedia text, with links available)
  - Candidate articles: Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  - +1 Giotto was called to work -- Giotto_di_Bondone
  - -1 Giotto was called to work -- Giotto_Griffiths
  - -1 Giotto was called to work -- Giotto_Bizzarrini

http://en.wikipedia.org/wiki/Giotto_di_Bondone
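Generating the labelled dataset from a linked sentence is mechanical: the article the link actually points to becomes the positive example, and every other candidate for the same surface form becomes a negative one. The function name and the tuple layout below are illustrative assumptions.

```python
def make_examples(context, linked_article, candidates):
    """Build (label, context, article) training triples for one linked mention."""
    return [
        (+1 if article == linked_article else -1, context, article)
        for article in candidates
    ]


examples = make_examples(
    "Giotto was called to work",
    "Giotto_di_Bondone",
    ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
)
```

Here `examples` holds one positive triple for Giotto_di_Bondone and two negative ones, mirroring the dataset listed on the slide.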

MORE ADVANCED USES OF WIKIPEDIA

• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

• Taxonomic information: category structure
• Attributes: infobox, text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Start with the category tree
• Induce a subsumption hierarchy

INFOBOXES

• Collaborative content
• Semi-structured data

{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
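Because an infobox is a list of `| key = value` fields, its attributes can be recovered with a small parser. The sketch below is a simplification written for this example: it splits only on top-level `|` characters so that piped links and nested templates stay intact, and it assumes the `{{ ... }}` template delimiters are present in the input.

```python
def parse_infobox(wikitext):
    """Parse '{{Infobox X| key = value| ...}}' into (template_name, attributes)."""
    body = wikitext.strip()
    if body.startswith("{{") and body.endswith("}}"):
        body = body[2:-2]
    fields, current, depth, i = [], "", 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("[[", "{{"):        # entering a link or nested template
            depth += 1
            current += pair
            i += 2
        elif pair in ("]]", "}}"):      # leaving a link or nested template
            depth -= 1
            current += pair
            i += 2
        elif body[i] == "|" and depth == 0:
            fields.append(current)      # top-level field separator
            current = ""
            i += 1
        else:
            current += body[i]
            i += 1
    fields.append(current)
    attributes = {}
    for field in fields[1:]:
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return fields[0].strip(), attributes


SAMPLE = (
    "{{Infobox Writer| name = Edgar Allan Poe"
    "| birth_date = {{birth date|1809|1|19|mf=y}}"
    "| birth_place = [[Boston, Massachusetts]], [[United States|US]]"
    "| magnum_opus = The Raven}}"
)
template, attrs = parse_infobox(SAMPLE)
```

Each extracted key/value pair is a candidate attribute of the entity the article describes, which is the kind of semi-structured data DBpedia builds on.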

DBPEDIA

DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web

The DBpedia Dataset

• 1,600,000 concepts, including
  - 58,000 persons
  - 70,000 places
  - 35,000 music albums
  - 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories

REPRESENTING EXTRACTED INFORMATION

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At “Developers Guide to Semantic Web Toolkits” you can find a development toolkit in your preferred programming language to process DBpedia data.

Extracting Infobox Data (RDF Representation)

Wikipedia page: http://en.wikipedia.org/wiki/Calgary
DBpedia resource: http://dbpedia.org/resource/Calgary

dbpedia:native_name         “Calgary”
dbpedia:altitude            “1048”
dbpedia:population_city     “988193”
dbpedia:population_metro    “1079310”
mayor_name                  dbpedia:Dave_Bronconnier
governing_body              dbpedia:Calgary_City_Council
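The RDF view of the Calgary infobox can be illustrated with plain (subject, predicate, object) tuples and a wildcard pattern matcher. This is a teaching sketch rather than an RDF library; the `dbpedia:` prefix on every name is an assumption about notation lost in the slide text.

```python
# (subject, predicate, object) triples for the Calgary example.
TRIPLES = [
    ("dbpedia:Calgary", "dbpedia:native_name", "Calgary"),
    ("dbpedia:Calgary", "dbpedia:altitude", "1048"),
    ("dbpedia:Calgary", "dbpedia:population_city", "988193"),
    ("dbpedia:Calgary", "dbpedia:population_metro", "1079310"),
    ("dbpedia:Calgary", "dbpedia:mayor_name", "dbpedia:Dave_Bronconnier"),
    ("dbpedia:Calgary", "dbpedia:governing_body", "dbpedia:Calgary_City_Council"),
]


def match(s=None, p=None, o=None, triples=TRIPLES):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]
```

A pattern such as `match(p="dbpedia:altitude")` plays the role of a one-line query over the graph, which is the idea SPARQL generalizes.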

SPARQL

• SPARQL is a query language for RDF
• RDF is a directed, labeled graph data format for representing information in the Web
• The W3C SPARQL specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

The DBpedia SPARQL Endpoint

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  - Give me all sitcoms that are set in NYC
  - All tennis players from Moscow
  - All films by Quentin Tarantino
  - All German musicians that were born in Berlin in the 19th century
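As a hedged illustration, the "all films by Quentin Tarantino" question might be phrased as below and packaged into a request URL for the endpoint. The property `dbo:director`, the resource name, and the `format` parameter are assumptions about the current DBpedia vocabulary and server, not something stated in the slides; the code only builds the URL and does not contact the server.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:director links a film to its director.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE { ?film dbo:director dbr:Quentin_Tarantino }
"""


def build_request_url(query, endpoint=ENDPOINT):
    """Encode a SPARQL query as an HTTP GET URL for the endpoint."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


url = build_request_url(QUERY)
```

The resulting URL can be opened in a browser or fetched with any HTTP client to inspect the result bindings.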

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  - Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  - Open Mind Commonsense (Singh) (collecting facts)
  - Semantic Wikis

www.phrasedetectives.com

WEB COLLABORATION PROJECTS

• Open Mind Common Sense (Singh)
• Crater mapping (results) (Kanefsky)
• Learner, Learner2, 1001 Paraphrases (Chklovski)
• FACTory (Cycorp)
• Hot or Not (8 Days)
• ESP, Phetch, Verbosity, Peekaboom (von Ahn)
• Galaxy Zoo (Oxford University)

OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52,000 assertions)
  - IsA: (IsA “apple” “fruit”)
  - PartOf: (PartOf “CPU” “computer”)
  - PropertyOf: (PropertyOf “coffee” “wet”)
  - MadeOf: (MadeOf “bread” “flour”)
  - DefinedAs: (DefinedAs “meat” “flesh of animal”)

EVENTS (38,000 assertions)
  - PrerequisiteEventOf: (PrerequisiteEventOf “read letter” “open envelope”)
  - SubeventOf: (SubeventOf “play sport” “score goal”)
  - FirstSubeventOf: (FirstSubeventOf “start fire” “light match”)
  - LastSubeventOf: (LastSubeventOf “attend classical concert” “applaud”)

AGENTS (104,000 assertions)
  - CapableOf: (CapableOf “dentist” “pull tooth”)

SPATIAL (36,000 assertions)
  - LocationOf: (LocationOf “army” “in war”)

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
  - EffectOf: (EffectOf “view video” “entertainment”)
  - DesirousEffectOf: (DesirousEffectOf “sweat” “take shower”)

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  - DesireOf: (DesireOf “person” “not be depressed”)
  - MotivationOf: (MotivationOf “play game” “compete”)

FUNCTIONAL (115,000 assertions)
  - UsedFor: (UsedFor “fireplace” “burn wood”)
  - CapableOfReceivingAction: (CapableOfReceivingAction “drink” “serve”)

ASSOCIATION K-LINES (1.25 million assertions)
  - SuperThematicKLine: (SuperThematicKLine “western civilization” “civilization”)
  - ThematicKLine: (ThematicKLine “wedding dress” “veil”)
  - ConceptuallyRelatedTo: (ConceptuallyRelatedTo “bad breath” “mint”)
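ConceptNet assertions have a simple predicate-argument form, so they can be read into tuples with a few lines of code. The sketch assumes each argument is written in straight double quotes, which is a notational assumption where the slide text shows none; the function name is illustrative.

```python
import shlex


def parse_assertion(text):
    """Parse '(Relation "arg one" "arg two")' into (relation, (arg1, arg2))."""
    relation, *args = shlex.split(text.strip().strip("()"))
    return relation, tuple(args)


assertion = parse_assertion('(PrerequisiteEventOf "read letter" "open envelope")')
```

Grouping the parsed tuples by relation name would reproduce the per-type view shown in the table of relation types.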

CONCEPT NET

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  - ESP
  - Verbosity
  - TagATune
• Other games
  - Peekaboom
  - Phetch

ESP

• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  - To train image search engines
  - To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web

ESP the game

ESP THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can’t communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  - Hence, the ESP game
• If any of the strings typed by one player matches a string typed by the other player, they score points
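The matching rule of the game can be sketched as a set intersection over the two players' guesses. Lower-casing and trimming the strings before comparison is an assumption added here; the slides only say that the strings must match.

```python
def esp_matches(labels_a, labels_b):
    """Return the labels on which two players agree, ignoring case and spacing."""
    def normalize(label):
        return label.strip().lower()

    agreed = {normalize(x) for x in labels_a} & {normalize(x) for x in labels_b}
    return sorted(agreed)
```

Each agreed string becomes a label for the image, which is how the game turns play into annotation.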

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th 2003 and December 10th 2003
  - 13,630 players
  - 1.2 million labels for 293,760 images
  - 80% of players played more than once
• By 2008
  - 200,000 players
  - 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  - choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  - 15 participants, 20 images among the 1000 with more than 5 labels
  - 83% of game labels also produced by participants
• Manual assessment of labels (‘would you use these labels to describe this image?’)
  - 15 participants, 20 images
  - 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

PHRASE DETECTIVES: THE TASKS

• 2 tasks
  - Find The Culprit (Annotation): the user must identify the closest antecedent of a markable, if it is anaphoric
  - Detectives Conference (Validation): the user must agree/disagree with a coreference relation entered by another user

NAME THE CULPRIT

READINGS

• Mihalcea, R. and Csomai, A. Wikify!: Linking documents to encyclopedic knowledge. Proceedings of CIKM ’07, Lisbon, Portugal.
• Nguyen, V. and Poesio, M. (2012). Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• Lungley, D., Trevisan, M., Nguyen, V., Althobaiti, M. and Poesio, M. (2013). GALATEAS D2W: A multi-lingual disambiguation to Wikipedia web service. Proc. of ENRICH.
• Nastase, V. and Strube, M. (2012). Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.

READINGS

• von Ahn, L. and Dabbish, L. (2008). Designing games with a purpose. Communications of the ACM, 51(8), 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo and Ducceschi (2013). Phrase Detectives: Utilizing collective intelligence for internet-scale language resource creation. ACM Transactions on Interactive Intelligent Systems.

Wikipedia Article that describes the Concept Artificial intelligence

Wikipedia as Thesaurus

bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed

phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by

redirected links

AI is redirected to its equivalent concept Artificial Intelligence

Wikipedia as Thesaurus

bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed

phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by

redirected linksndash It contains a hierarchical categorization system

in which each article belongs to at least one category

The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society

Wikipedia as Thesaurus

bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed

phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by

redirected linksndash It contains a hierarchical categorization system in

which each article belongs to at least one category ndash Polysemous concepts are disambiguated by

Disambiguation Pages

The different meanings that Artificial intelligence may refer to are listed in its disambiguation page

WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING

bull Objective use information in Wikipedia to improve performance of text classifiers clustering systems

bull A number of possibilitiesndash Use similarity between documents and Wikipedia

pages on a given topic as a feature for text classification

ndash Use WIKIFICATION to enrich documentsndash Use Wikipedia category system as category repertoire

Using Wikipedia Categories for text classification

WIKIPEDIA FOR TEXT CLASSIFICATION

• Automatic identification of the topic/category of a text (e.g. computer science, psychology)
  – Books
  – Learning objects

Example: "The United States was involved in the Cold War"

  United States                            0.3793
  Cold War                                 0.3111
  Vietnam War                              0.0023
  World War I                              0.0023
  Communism                                0.0027
  Ronald Reagan                            0.0027
  Michail Gorbachev                        0.0023
  Cat: Wars Involving the United States    0.00779
  Cat: Global Conflicts                    0.00779

USING WIKIPEDIA FOR TEXT CLASSIFICATION

• Either directly use Wikipedia categories, or map one's categories to Wikipedia categories

• Use the documents associated with those categories as training documents

TEXT WIKIFICATION

Wikification = adding links to Wikipedia pages to documents

WIKIFICATION

• Text
• Wikipedia

Example: "Giotto was called to work in Padua and also in Rimini"

May 2012, Truc-Vien T. Nguyen

Wikification pipeline

[Figure: Raw (hyper)text → Decomposition → Clean Text → Keyword Extraction (Candidate Extraction, Candidate Ranking) → Text with selected keywords → Word Sense Disambiguation (Extract Sense Definitions from Sense Inventory; Knowledge-based: Lesk-like Definition Overlap; Data Driven: Naive Bayes trained on Wikipedia; Voting) → Annotated Text → Recomposition → (Hyper)text with linked keywords]

Keyword Extraction

• Finding important words/phrases in raw text
• Two-stage process
  – Candidate extraction
    • Typical methods: n-grams, noun phrases
  – Candidate ranking
    • Rank the candidates by importance
    • Typical methods
      – Unsupervised: information theoretic
      – Supervised: machine learning using positional and linguistic features

Keyword Extraction using Wikipedia

1. Candidate extraction
• Semi-controlled vocabulary
  – Wikipedia article titles and anchor texts (surface forms)
    • E.g. "USA", "US" = "United States of America"
  – More than 2,000,000 terms/phrases
  – Vocabulary is broad (e.g. "the" and "a" are included)
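A minimal sketch of candidate extraction against a controlled vocabulary, assuming whitespace tokenisation and exact, case-sensitive matching. The function name `extract_candidates`, the `max_len` limit and the three-entry toy vocabulary are illustrative assumptions; a real vocabulary would hold the Wikipedia titles and anchor texts.

```python
def extract_candidates(text, vocabulary, max_len=3):
    """Return every n-gram (n <= max_len) of `text` that occurs in `vocabulary`."""
    tokens = text.split()
    found = []
    for start in range(len(tokens)):
        for n in range(1, max_len + 1):
            if start + n > len(tokens):
                break
            phrase = " ".join(tokens[start:start + n])
            if phrase in vocabulary:
                found.append(phrase)
    return found


# Toy vocabulary (assumption): two article titles plus a function word,
# to show that a broad vocabulary also lets words such as "the" through.
TOY_VOCABULARY = {"United States", "Cold War", "the"}

candidates = extract_candidates(
    "The United States was involved in the Cold War", TOY_VOCABULARY
)
print(candidates)
```

Because function words survive this stage, the ranking step that follows has to do the real filtering.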

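Keyword Extraction using Wikipedia

2. Candidate ranking
• tf.idf
  – Wikipedia articles as document collection
• Chi-squared independence of phrase and text
  – The degree to which it appeared more times than expected by chance
• Keyphraseness

$$P(\text{keyword} \mid W) \approx \frac{\mathrm{count}(D_{key})}{\mathrm{count}(D_W)}$$

where D_key is the set of documents in which the term W was selected as a keyword, and D_W is the set of documents in which W appears.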

Our own Approach (cf. Milne & Witten 2008, 2012; Ratinov et al. 2011)

• Use a Wikipedia dump to compute two statistics
  – KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  – COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
• Two versions of the system
  – UNSUPERVISED: use statistics only
  – SUPERVISED: use distant learning to create training data

KEYPHRASENESS

• The probability that a term t is a link to a Wikipedia article (cf. Milne & Witten's prior link probability)

$$\mathrm{Keyphraseness}(t) = \frac{\mathrm{count}([\_ \mid t])}{\mathrm{count}(t)}$$

• Examples
  – The term "Georgia"
    • is found as a link in 22,631 Wikipedia articles
    • appears in 75,000 Wikipedia articles
    • keyphraseness = 22631/75000 = 0.3017466
  – Cf. the term "the": keyphraseness = 0.0006
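The Georgia example can be checked with a few lines of Python. This is only a sketch: the function name and the choice to return 0.0 for a term that never occurs are assumptions, not part of the original system.

```python
def keyphraseness(link_count: int, occurrence_count: int) -> float:
    """count([_ | t]) / count(t): share of the articles containing t where t is a link."""
    if occurrence_count == 0:
        return 0.0  # assumption: an unseen term gets keyphraseness 0
    return link_count / occurrence_count


# Figures for the term "Georgia" quoted above.
georgia = keyphraseness(22631, 75000)
print(f"keyphraseness(Georgia) = {georgia:.4f}")
```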

COMMONNESS

• The probability that a term t is a link to a SPECIFIC Wikipedia article a

$$\mathrm{Commonness}(t, a) = \frac{\mathrm{count}([a \mid t])}{\mathrm{count}(t)}$$

• For example, the surface form "Georgia" was found to be linked to
  – a1 = University_of_Georgia: 166 times
  – Republic_of_Georgia: 18 times
  – Georgia_(United_States): 5 times
• commonness(t, a1) = 166/(166+18+5) = 0.8783 (in this example the denominator counts the linked occurrences of t)
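A short sketch of the commonness computation, following the worked example in which the denominator is the total number of links that use the surface form. The function name and the `Counter`-based input format are assumptions made for illustration.

```python
from collections import Counter


def commonness(target_counts: Counter) -> dict:
    """Map each target article to its share of the links made with one surface form."""
    total = sum(target_counts.values())
    if total == 0:
        return {}
    return {article: n / total for article, n in target_counts.items()}


# Link counts for the surface form "Georgia" quoted above.
georgia_links = Counter({
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
})
georgia_commonness = commonness(georgia_links)
for article, score in georgia_commonness.items():
    print(f"{article}: {score:.4f}")
```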

Extracting dictionaries and statistics from a Wikipedia dump

• Parsing
  – In three phases
• Identify articles of relevance
• Extract (among other things)
  – Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
  – Set of LINKS [article|surface_form]
    • E.g. [[Pedanius Dioscorides|Dioscorides]]
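As an illustration of the link-extraction step, the sketch below pulls (article, surface form) pairs out of wiki markup with a regular expression. It assumes simple, non-nested links only; the pattern, the function name and the sample sentence are made up for the example and are not the parser used on the dump.

```python
import re
from collections import Counter

# [[article|surface_form]] or [[article]]; nested brackets are not handled.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def extract_links(wikitext):
    """Return (article, surface_form) pairs for every simple wiki link."""
    pairs = []
    for match in LINK_RE.finditer(wikitext):
        article = match.group(1).strip()
        # A link without a pipe uses the article title as its surface form.
        surface = (match.group(2) or article).strip()
        pairs.append((article, surface))
    return pairs


# Invented sample sentence, only to exercise both link shapes.
sample = "The herbal of [[Pedanius Dioscorides|Dioscorides]] was copied in [[Constantinople]]."
links = extract_links(sample)
link_counts = Counter(links)  # frequency of each (article, surface form) pair
print(links)
```

Counting the pairs over the whole dump gives the per-surface-form target frequencies that commonness is computed from.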

The Wikipedia Dump from July 2011

  – 11,459,639 pages in total
  – 12,525,583 links
    • specifying surface word, target, frequency
    • ranked by frequency
• For example, the mention "Georgia" is linked to
  – University_of_Georgia: 166 times
  – Republic_of_Georgia: 18 times
  – Georgia_(United_States): 5 times

Some statistics (all Wikidumps from July 2011)

Page Type          English     Italian      Polish
Redirected       4,465,652     323,591     134,148
List_of            138,581         836       5,021
Disambiguation     176,721       6,193       4,553
Relevant         4,361,020     917,354     920,486
Total           11,459,639   1,654,258   1,200,313

Surface forms, titles, articles

Dictionary         English     Italian      Polish
Titles           4,361,020     917,354     920,486
Surface forms    8,829,624   2,484,045   2,482,104
Files              745,724      72,126         n/a
Links           10,871,741   2,917,235   2,937,981

Files in Polish are arranged in a repository different from English/Italian.

Some definitions and figures
• surface form: the occurrence of a mention inside an article
• target article: the target Wikipedia article that a surface form is linked to

The Unsupervised Approach

• Use keyphraseness to identify candidate terms
  – Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  – Retain top 10
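The two steps above can be sketched as follows, assuming the statistics are already available as dictionaries. The function name, the dictionary layout, and the reading of "top 10" as the ten best target articles per term are assumptions; the toy figures are the ones quoted on the earlier slides.

```python
def wikify_unsupervised(terms, keyphraseness, commonness, threshold=0.01, top_n=10):
    """Filter terms by keyphraseness, then rank their candidate articles by commonness."""
    annotations = {}
    for term in terms:
        if keyphraseness.get(term, 0.0) <= threshold:
            continue  # too rarely used as a link: not a candidate
        ranked = sorted(
            commonness.get(term, {}).items(),
            key=lambda item: item[1],
            reverse=True,
        )
        annotations[term] = ranked[:top_n]
    return annotations


# Toy statistics built from the figures quoted earlier in the lecture.
TOY_KEYPHRASENESS = {"Georgia": 0.3017466, "the": 0.0006}
TOY_COMMONNESS = {
    "Georgia": {
        "Republic_of_Georgia": 18 / 189,
        "University_of_Georgia": 166 / 189,
        "Georgia_(United_States)": 5 / 189,
    }
}

result = wikify_unsupervised(["the", "Georgia"], TOY_KEYPHRASENESS, TOY_COMMONNESS)
print(result)
```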

The Supervised Approach

• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)

$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$

where A_1 and A_2 are the sets of articles that link to a_1 and a_2, and W is the set of all Wikipedia articles.
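A small sketch of the link-based measure exactly as written in the formula above, where lower values mean more closely related articles. The function name, the use of integer ids for the linking articles, and the choice to return infinity when there is no overlap are assumptions; the sketch also assumes the collection is larger than either in-link set.

```python
import math


def link_distance(inlinks_a: set, inlinks_b: set, num_articles: int) -> float:
    """Normalized link distance between two articles, from their in-link sets."""
    common = inlinks_a & inlinks_b
    if not common:
        return float("inf")  # assumption: no shared in-links means maximally distant
    larger = max(len(inlinks_a), len(inlinks_b))
    smaller = min(len(inlinks_a), len(inlinks_b))
    numerator = math.log(larger) - math.log(len(common))
    denominator = math.log(num_articles) - math.log(smaller)
    return numerator / denominator


# Toy in-link sets (article ids are arbitrary) in a collection of 1000 articles.
a1_inlinks = {1, 2, 3, 4}
a2_inlinks = {3, 4, 5}
distance = link_distance(a1_inlinks, a2_inlinks, num_articles=1000)
print(f"{distance:.4f}")
```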

Training a supervised wikifier

• Using WIKIPEDIA ITSELF as a source of training materials (see next)

Results on standard datasets

APPROACH               AQUAINT   WIKIPEDIA
Our approach           85.66     84.37
Milne & Witten 2008    83.61     80.31
Ratinov et al. 2011    84.52     90.20

Wikifying queries: the Bridgeman datasets

• BAL data sets
  – 1049 Query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  – 100 Query set
    • 3 annotators, each up to 3 manual annotations

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS   RESULTS
First candidate        64.77
Among first 2          71.59
First 3                75.42
First 4                77.18
First 5                78.32

Accuracy up by 17 points (36%)

Results for the GALATEAS languages and Arabic

LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78

The GALATEAS D2W web services

• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves a throughput of 600 characters per second
• Integrated with the LangLog tool

Use of the service in LangLog

(See Domoina's demo)

Other applications

• The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …

WIKIPEDIA FOR NER

http://en.wikipedia.org/wiki/FCC

The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

WIKIPEDIA FOR NER

Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action

WIKIPEDIA

WIKIPEDIA FOR NER

• Wikipedia has been used in NER systems
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging towards proper ENTITY DISAMBIGUATION

Distant learning

• Automatically extract examples
  – Positive examples: a mention paired with the Wikipedia page it is linked to
  – Negative examples: the same mention paired with the other pages that similar mentions link to
• Use positive and negative examples to train the model

The Supervised Approach: Using Wikipedia links to generate training data

• Example
  – "Giotto was called to work in Padua and also in Rimini" (sentence taken from Wikipedia text, with links available)
  – Candidates: Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  – +1 Giotto was called to work -- Giotto_di_Bondone
  – -1 Giotto was called to work -- Giotto_Griffiths
  – -1 Giotto was called to work -- Giotto_Bizzarrini
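The Giotto dataset above can be generated mechanically. The sketch below assumes the linked article and the list of candidate articles for the mention are already known; the function name and the tuple layout are illustrative choices.

```python
def make_training_examples(context, linked_article, candidates):
    """Label the article actually linked in Wikipedia +1 and every other candidate -1."""
    examples = []
    for candidate in candidates:
        label = 1 if candidate == linked_article else -1
        examples.append((label, context, candidate))
    return examples


examples = make_training_examples(
    context="Giotto was called to work",
    linked_article="Giotto_di_Bondone",
    candidates=["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
)
for label, context, candidate in examples:
    print(f"{label:+d} {context} -- {candidate}")
```

No manual annotation is needed: the link an editor placed in the Wikipedia text supplies the positive label.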


http://en.wikipedia.org/wiki/Giotto_di_Bondone

MORE ADVANCED USES OF WIKIPEDIA

• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

• Taxonomic information: category structure
• Attributes: infobox, text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Induce a subsumption hierarchy

INFOBOXES

• Collaborative content
• Semi-structured data

{{Infobox Writer
| bgcolour    = silver
| name        = Edgar Allan Poe
| image       = Edgar_Allan_Poe_2.jpg
| caption     = This [[daguerreotype]] of Poe was taken in 1848
| birth_date  = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]] [[United States|US]]
| death_date  = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]] [[United States|US]]
| occupation  = Poet, short story writer, editor, literary critic
| movement    = [[Romanticism]], [[Dark romanticism]]
| genre       = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse      = [[Virginia Eliza Clemm Poe]]
}}
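To show why infoboxes count as semi-structured data, here is a rough sketch that turns one-field-per-line infobox markup into attribute/value pairs and then into subject-property-value triples. It assumes each field sits on its own line starting with "|"; the function names and the shortened sample are illustrative, and real infobox extraction has to cope with much messier markup.

```python
def parse_infobox(wikitext):
    """Collect `| key = value` lines of an infobox into a dictionary."""
    fields = {}
    for line in wikitext.splitlines():
        line = line.strip()
        if not line.startswith("|") or "=" not in line:
            continue
        # Split on the first "=" only, so values such as "mf=y" stay intact.
        key, value = line[1:].split("=", 1)
        fields[key.strip()] = value.strip()
    return fields


def to_triples(subject, fields):
    """Turn the attribute/value pairs into (subject, property, value) triples."""
    return [(subject, key, value) for key, value in fields.items()]


# Shortened version of the Edgar Allan Poe infobox.
SAMPLE_INFOBOX = """{{Infobox Writer
| name        = Edgar Allan Poe
| birth_date  = {{birth date|1809|1|19|mf=y}}
| magnum_opus = The Raven
}}"""

poe_fields = parse_infobox(SAMPLE_INFOBOX)
poe_triples = to_triples("Edgar_Allan_Poe", poe_fields)
for triple in poe_triples:
    print(triple)
```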

DBPEDIA

DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web

The DBpedia Dataset

• 1,600,000 concepts
• including
  – 58,000 persons
  – 70,000 places
  – 35,000 music albums
  – 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories

REPRESENTING EXTRACTED INFORMATION

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At "Developers Guide to Semantic Web Toolkits" you can find a development toolkit in your preferred programming language to process DBpedia data.

Extracting Infobox Data (RDF Representation)

http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary

  dbpedia:native_name        "Calgary"
  dbpedia:altitude           "1048"
  dbpedia:population_city    "988193"
  dbpedia:population_metro   "1079310"
  mayor_name                 dbpedia:Dave_Bronconnier
  governing_body             dbpedia:Calgary_City_Council

SPARQL

• SPARQL is a query language for RDF
  – RDF is a directed, labeled graph data format for representing information in the Web
  – This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

The DBpedia SPARQL Endpoint

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century
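As a hedged illustration, the "all films by Quentin Tarantino" question might be phrased as the query below and encoded into a request URL. The property `dbo:director`, the resource name `dbr:Quentin_Tarantino` and the `format` parameter are assumptions about the current DBpedia vocabulary and the Virtuoso endpoint, not taken from the slides; the code only builds the URL and does not contact the server.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:director linking a film to its director.
FILMS_QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
"""


def build_request_url(query, endpoint=ENDPOINT):
    """Encode a SPARQL query as a GET request URL for the endpoint."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


request_url = build_request_url(FILMS_QUERY)
print(request_url)
```

Fetching `request_url` with any HTTP client would then return the result bindings, provided the endpoint accepts these parameters.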

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis

www.phrasedetectives.com

WEB COLLABORATION PROJECTS

• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – Cycorp
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University

OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52,000 assertions)
  – IsA: (IsA "apple" "fruit")
  – PartOf: (PartOf "CPU" "computer")
  – PropertyOf: (PropertyOf "coffee" "wet")
  – MadeOf: (MadeOf "bread" "flour")
  – DefinedAs: (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
  – PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  – SubeventOf: (SubeventOf "play sport" "score goal")
  – FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  – LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
  – CapableOf: (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
  – LocationOf: (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
  – EffectOf: (EffectOf "view video" "entertainment")
  – DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  – DesireOf: (DesireOf "person" "not be depressed")
  – MotivationOf: (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
  – UsedFor: (UsedFor "fireplace" "burn wood")
  – CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")

ASSOCIATION: K-LINES (1.25 million assertions)
  – SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  – ThematicKLine: (ThematicKLine "wedding dress" "veil")
  – ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")

CONCEPT NET

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch

ESP

• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web

ESP: the game

ESP: THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game
• If any of the strings typed by one player matches a string typed by the other player, they score points
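The matching rule can be sketched in a few lines. This is an illustration only: the normalisation (trimming and lower-casing), the function names and the sample guesses are assumptions, and the real game has further rules not covered on the slide.

```python
def normalise(label):
    """Assumed normalisation: ignore case and surrounding whitespace."""
    return label.strip().lower()


def agreed_labels(player_a, player_b):
    """Return the labels typed by both players, in the order player A typed them."""
    typed_by_b = {normalise(label) for label in player_b}
    agreed = []
    for label in player_a:
        norm = normalise(label)
        if norm in typed_by_b and norm not in agreed:
            agreed.append(norm)
    return agreed


# Invented guesses from two partners looking at the same image.
matches = agreed_labels(["Dog", "grass", "ball"], ["park", "ball ", "dog"])
print(matches)
```

A label that two independent players both produce is unlikely to be noise, which is what makes agreement usable as a quality filter.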

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th 2003 and December 10th 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once
• By 2008
  – 200,000 players
  – 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  – 15 participants, 20 images
  – 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

• 2 tasks
  – Find The Culprit (Annotation): user must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): user must agree/disagree with a coreference relation entered by another user

PHRASE DETECTIVES: THE TASKS

NAME THE CULPRIT

READINGS

• R. Mihalcea and A. Csomai. Wikify! Linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A multi-lingual disambiguation to Wikipedia web service. Proc. of ENRICH.
• V. Nastase & M. Strube, 2012. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.

READINGS

• L. von Ahn and L. Dabbish, 2008. Designing games with a purpose. Communications of the ACM, 51(8), 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing collective intelligence for Internet-scale language resource creation. ACM Transactions on Interactive Intelligent Systems.

Page 9: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

Wikipedia as Thesaurus

bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed

phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by

redirected links

AI is redirected to its equivalent concept Artificial Intelligence

Wikipedia as Thesaurus

bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed

phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by

redirected linksndash It contains a hierarchical categorization system

in which each article belongs to at least one category

The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society

Wikipedia as Thesaurus

bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed

phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by

redirected linksndash It contains a hierarchical categorization system in

which each article belongs to at least one category ndash Polysemous concepts are disambiguated by

Disambiguation Pages

The different meanings that Artificial intelligence may refer to are listed in its disambiguation page

WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING

bull Objective use information in Wikipedia to improve performance of text classifiers clustering systems

bull A number of possibilitiesndash Use similarity between documents and Wikipedia

pages on a given topic as a feature for text classification

ndash Use WIKIFICATION to enrich documentsndash Use Wikipedia category system as category repertoire

Using Wikipedia Categories for text classification

17

WIKIPEDIA FOR TEXT CLASSIFICATIONbull Automatic identification of the topiccategory of a text (eg computer science

psychology)ndash Booksndash Learning objects

ldquoThe United States was involved in the Cold Warrdquo

United States03793

Cold War03111

Vietnam War00023

World War I00023

Communism00027

Ronald Reagan00027

Michail Gorbachev00023

Cat Wars Involvingthe United States000779

Cat Global Conflicts000779

USING WIKIPEDIA FOR TEXT CLASSIFICATION

bull Either directly use Wikipedia categories or map onersquos categories to Wikipedia categories

bull Use the documents associated with those categories as training documents

TEXT WIKIFICATION

Wikification = adding links to Wikipedia pages to documents

bull Text

WIKIFICATION

bull Wikipedia

20May 2012 Truc-Vien T Nguyen

Giotto was called to work in Padua and also in Rimini

Wikification pipeline

Candidate

Extraction

Candidate

Ranking

Extract Sense Definitions

from Sense Inventory

Knowledge- based

Lesk- like Definition

Overlap

Data Driven

Naive Bayes

trained on Wikipedia

Voting

Tex

t w

ith

sel

ecte

d k

eyw

ord

s

Dec

om

po

siti

on

Raw

(h

yper

)tex

t

Cle

an T

ext

Rec

om

posi

tion

(Hyp

er)t

ext

wit

h

linked

key

wo

rds

Annotated Text

Word Sense DisambiguationKeyword Extraction

Keyword Extraction

bull Finding important wordsphrases in raw textbull Two-stage process

ndash Candidate extractionbull Typical methods n-grams noun phrases

ndash Candidate rankingbull Rank the candidates by importancebull Typical methods

ndash Unsupervised information theoretic ndash Supervised machine learning using positional and linguistic

features

Keyword Extraction using Wikipedia

1 Candidate extractionbull Semi-controlled vocabulary

ndash Wikipedia article titles and anchor texts (surface forms)

bull Eg ldquoUSArdquo ldquoUSrdquo = ldquoUnited States of Americardquo

ndash More than 2000000 termsphrasesndash Vocabulary is broad (eg the a are included)

Keyword Extraction using Wikipedia

2 Candidate rankingbull tf idf

ndash Wikipedia articles as document collection

bull Chi-squared independence of phrase and textndash The degree to which it appeared more times than

expected by chance

bull Keyphraseness

)(

)()|(

W

key

Dcount

DcountWkeywordP

Our own Approach(Cfr Milne amp Witten 2008 2012 Ratinov et al 2011)

bull Use Wikipedia dump to compute two statistics

bull KEYPHRASENESS prior probability that a term is used to refer to a Wikipedia article

bull COMMONNESS probability that phrase is used to refer to specific Wikipedia article

bull Two versions of system

bull UNSUPERVISED use statistics only

bull SUPERVISED use distant learning to create training data

KEYPHRASENESS

bull the probability that a term t is a link to a Wikipedia article

(cfr Milne amp Wittenrsquos prior link probability)

bull Examplesbull The term Georgia

ndash Is found as a link in 22631 Wikipedia articlesndash appears in 75000 Wikipedia articles keyphraseness = 2263175000 = 03017466

bull Cfr the term ldquotherdquo keyphraseness = 00006

euro

Keyphraseness(t) =count([_ | t])

count(t)

COMMONNESS

bull the probability that a term t is a link to a SPECIFIC Wikipedia article a

bull for example the surface form Georgia was found to be linked to

ndash a1 = University_of_Georgia 166 times

commonness(t a1) = 166(166+18+5) = 08783

ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times

euro

Commonness(ta) =count([a | t])

count(t)

Extracting dictionaries and statistics from a Wikipedia dump

bull Parsingbull In three phases

bull Identify articles of relevancebull Extract (among other things)

bull Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)

bull Set of LINKS [article|surface_form]

bull [[Pedanius Dioscorides|Dioscorides]]

The Wikipedia Dump from July 2011

ndash 11459639 pages in totalndash 12525583 links

bull specifying surface word target frequency

ndash ranked by frequency bull for example the mention Georgia is linked to

ndash University_of_Georgia 166 times ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times

May 2012 29Truc-Vien T Nguyen

Some statistics (all Wikidumps from July 2011)

Page Type English Italian Polish

Redirected 4465652 323591 134148

List_of 138581 836 5021

Disambiguation 176721 6193 4553

Relevant 4361020 917354 920486

Total 11459639 1654258 1200313

Surface forms titles articles

Dictionary English Italian Polish

Titles 4361020 917354 920486

Surface forms 8829624 2484045 2482104

Files 745724 72126 na

Links 10871741 2917235 2937981

Files in Polish are arranged in a repository different from EnglishItalian

Some definitions and figuressurface form the occurence of a mention inside an articletarget article the target Wiki article a surface form linked to

The Unsupervised Approach

bull Use Keyphraseness to identify candidate termsbull Retain terms whose keyphraseness is

above a certain threshold (currently 001)bull Use commonness to rank

bull Retain top 10

The Supervised Approach

bull Features in addition to commonness use measures of SIMILARITY between text containing the term and the candidate Wikipedia page

bull RELATEDNESS a measure of similarity between the LINKS (cfr MilneampWittenrsquos NORMALIZED LINK DISTANCE)

euro

Re latedness(a1a2) =log(max( A1 A2 )) minus log( A1 cap A2 ))

log(W ) minus log(min( A1 A2 ))

Training a supervised wikifier

bull Using WIKIPEDIA ITSELF as source of training materials (see next)

Results on standard datasets

APPROACH AQUAINT WIKIPEDIA

Our approach 8566 8437

MilneampWitten 2008 8361 8031

Ratinov et al 2011 8452 9020

bull BAL Data setsndash 1049 Query set

bull 1 annotator up to 3 manual annotationsbull 1 automatic annotation

ndash 100 Query setbull 3 annotators each up to 3 manual annotations

Wikifying queries the Bridgeman datasets

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS RESULTS

First candidate 6477

Among first 2 7159

First 3 7542

First 4 7718

First 5 7832

Accuracy up by 17 points (36)

Results for the GALATEAS languages and Arabic

LANGUAGE WIKIPEDIA SIZE RESULTS (on Wikipedia subset)

English 4M articles 8437

Italian 1M 7964

French 14M 76-77

German 16M 72-73

Dutch 16M 70-71

Polish 900K 6081

Arabic 200K 8078

The GALATEAS D2W web services

bull Available as open sourcebull Deployed within LinguaGridbull API based on the Morphosyntactic Annotation

Framework (MAF) an ISO standardbull Tested on 15M queries achieves throughput

of 600 characters per secondbull Integrated with LangLog tool

Use of the service in LangLog

(See Domoinarsquos demo)

Other applications

bull The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [ATampT] By a 4-0 vote it allowed ATampT to continue offering special discount packages to big customers called Tariff 12 rejecting appeals by ATampT competitors that the discounts were illegal hellip

WIKIPEDIA FOR NER

httpenwikipediaorgwikiFCC

The Federal Communications Commission (FCC) is an independent United States government agency created directed and empowered by Congressional statute (see 47 USC sect 151 and 47 USC sect 154)

WIKIPEDIA FOR NER

Numberofglucocorticoidreceptorsinlymphocytesandtheirsensitivitytohormoneaction

WIKIPEDIA

WIKIPEDIA FOR NER

bull Wikipedia has been used in NER systemsndash As a source of features for normal NER ndash To automatically create training materials

(DISTANT LEARNING)ndash To go beyond NE tagging towards proper ENTITY

DISAMBIGUATION

Distant learning

bull Automatically extract examples

bull positive examples from mention-to-link Wikipedia page

bull Negative examples from similar mentions with other links

bull Use positive and negative examples to train model

The Supervised Approach Using Wikipedia links to generate training data

bull Examplendash Giotto was called to work in Padua and also in Rimini (sentence taken from Wikipedia text with links avalable)ndash Giotto_di_Bondone (painter) Giotto_Griffiths (Welsh rugby player)

Giotto_Bizzarrini (automobile engineer)

bull Datasetndash +1 Giotto was called to work -- Giotto_di_Bondonendash -1 Giotto was called to work -- Giotto_Griffithsndash -1 Giotto was called to work -- Giotto_Bizzarrini

May 2012 48Truc-Vien T Nguyen

httpenwikipediaorgwikiGiotto_di_Bondone

MORE ADVANCED USES OF WIKIPEDIA

bull As a source of ONTOLOGICAL KNOWLEDGEbull DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

bull Taxonomic information category structurebull Attributes infobox text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

bull Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

bull Induce a subsumption hierarchy

INFOBOXES

bull Collaborative content

bull Semi-structured data

Infobox Writer| bgcolour = silver| name = Edgar Allan Poe| image = Edgar_Allan_Poe_2jpg| caption = This [[daguerreotype]] of Poe was taken in 1848 | birth_date = birth date|1809|1|19|mf=y| birth_place = [[Boston Massachusetts]] [[United States|US]]| death_date = death date and age|1849|10|07|1809|01|19| death_place = [[Baltimore Maryland]] [[United States|US]]| occupation = Poet short story writer editor literary critic| movement = [[Romanticism]] [[Dark romanticism]]| genre = [[Horror fiction]] [[Crime fiction]] [[Detective fiction]]| magnum_opus = The Raven| spouse = [[Virginia Eliza Clemm Poe]]

DBpediaorg is a effort to bull extract structured information from Wikipediabull make this information available on the Web under an

open licensebull interlink the DBpedia dataset with other datasets on the

Web

DBPEDIA

1048607 1600000 concepts

1048607 including

1048698 58000 persons

1048698 70000 places

1048698 35000 music albums

1048698 12000 films

1048607 described by 91 million triples

1048607 using 8141 different properties

1048607 557000 links to pictures

1048607 1300000 links external web pages

1048607 207000 Wikipedia categories

1048607 75000 YAGO categories

The DBpedia Dataset

The DBpediaorg project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web It uses the SPARQL query language to query this data At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data

REPRESENTING EXTRACTED INFORMATION

httpenwikipediaorgwikiCalgary

httpdbpediaorgresourceCalgary

dbpedianative_name Calgaryrdquo

dbpediaaltitude ldquo1048rdquo

dbpediapopulation_city ldquo988193rdquo

dbpediapopulation_metro ldquo1079310rdquo

mayor_name

dbpediaDave_Bronconnier

governing_body

dbpediaCalgary_City_Council

Extracting Infobox Data (RDF Representation)

SPARQL

bull SPARQL is a query language for RDF

bullRDF is a directed labeled graph data format for representing information in the Web bullThis specification defines the syntax and semantics of the SPARQL query language for RDF

bull SPARQL can be used to express queries across diverse data sources whether the data is stored natively as RDF or viewed as RDF via middleware

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians who were born in Berlin in the 19th century

The DBpedia SPARQL Endpoint
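To make one of the example questions concrete, here is a hedged Python sketch that writes "all films by Quentin Tarantino" as a SPARQL query and builds the request URL for the endpoint named on the slide. The class and property names (`dbo:Film`, `dbo:director`) and the resource name are assumptions about the DBpedia vocabulary and may differ between releases; the sketch only builds the URL and does not contact the server.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:Film and dbo:director
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film a dbo:Film ;
        dbo:director dbr:Quentin_Tarantino .
}
"""

def build_request_url(endpoint, query):
    """Encode the query as the 'query' parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query.strip()})

url = build_request_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL can be fetched with any HTTP client; the result format returned depends on the server configuration.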

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

www.phrasedetectives.com

• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – Cycorp
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University

WEB COLLABORATION PROJECTS


OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh, 2004)

THINGS (52,000 assertions)
  IsA                       (IsA "apple" "fruit")
  PartOf                    (PartOf "CPU" "computer")
  PropertyOf                (PropertyOf "coffee" "wet")
  MadeOf                    (MadeOf "bread" "flour")
  DefinedAs                 (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
  PrerequisiteEventOf       (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf                (SubeventOf "play sport" "score goal")
  FirstSubeventOf           (FirstSubeventOf "start fire" "light match")
  LastSubeventOf            (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
  CapableOf                 (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
  LocationOf                (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
  EffectOf                  (EffectOf "view video" "entertainment")
  DesirousEffectOf          (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf                  (DesireOf "person" "not be depressed")
  MotivationOf              (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
  UsedFor                   (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction  (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine        (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine             (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo     (ConceptuallyRelatedTo "bad breath" "mint")

CONCEPT NET
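The assertions listed for ConceptNet are binary relations between short phrases, so a small sketch can store them as tuples and index them by their first argument. The data structure and the function name `build_index` are assumptions chosen for illustration, not ConceptNet's own storage format.

```python
from collections import defaultdict

# A few of the example assertions from the relation table
ASSERTIONS = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("MadeOf", "bread", "flour"),
    ("CapableOf", "dentist", "pull tooth"),
    ("UsedFor", "fireplace", "burn wood"),
    ("ConceptuallyRelatedTo", "bad breath", "mint"),
]

def build_index(assertions):
    """Index (relation, left, right) assertions by their left concept."""
    index = defaultdict(list)
    for relation, left, right in assertions:
        index[left].append((relation, right))
    return index

index = build_index(ASSERTIONS)
print(index["dentist"])
```

Looking up a concept then returns every relation in which it is the first argument, which is the basic operation behind traversing such a semantic network.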

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions

• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch

ESP

• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web

ESP the game

ESP THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
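The matching rule described for the ESP game can be sketched in a few lines of Python. This is a simplified illustration, assuming both players' guesses are available as lists and that matching ignores case and surrounding whitespace; the function name `first_agreed_label` is hypothetical and the real game compares guesses as they arrive.

```python
def first_agreed_label(guesses_a, guesses_b):
    """Return the first label typed by player A that player B also typed.

    Returns None when the two players never agree.
    """
    def normalise(text):
        return text.strip().lower()

    typed_by_b = {normalise(guess) for guess in guesses_b}
    for guess in guesses_a:
        if normalise(guess) in typed_by_b:
            return normalise(guess)
    return None

label = first_agreed_label(["dog", "grass", "ball"], ["puppy", "Ball", "park"])
print(label)
```

An agreed string is taken as a label for the image; requiring two independent players to agree is what filters out idiosyncratic descriptions.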

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th 2003 and December 10th 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once
• By 2008
  – 200,000 players
  – 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1,000 with more than 5 labels
  – 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  – 15 participants, 20 images
  – 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

• 2 tasks
  – Find The Culprit (Annotation): user must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): user must agree/disagree with a coreference relation entered by another user


PHRASE DETECTIVES THE TASKS

NAME THE CULPRIT

READINGS

• R. Mihalcea & A. Csomai. 2007. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM '07, Lisbon, Portugal.
• V. Nguyen & M. Poesio. 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti & M. Poesio. 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase & M. Strube. 2012. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.

READINGS

• L. von Ahn & L. Dabbish. 2008. Designing games with a purpose. Communications of the ACM, 51(8), 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi. 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.


AI is redirected to its equivalent concept Artificial Intelligence

Wikipedia as Thesaurus

• Moreover, Wikipedia has a well-formed structure
  – Each article only describes a single concept
  – The title of the article is a short and well-formed phrase, like a term in a traditional thesaurus
  – Equivalent concepts are grouped together by redirect links
  – It contains a hierarchical categorization system in which each article belongs to at least one category

The concept Artificial Intelligence belongs to four categories: Artificial intelligence, Cybernetics, Formal sciences & Technology in society

Wikipedia as Thesaurus

• Moreover, Wikipedia has a well-formed structure
  – Each article only describes a single concept
  – The title of the article is a short and well-formed phrase, like a term in a traditional thesaurus
  – Equivalent concepts are grouped together by redirect links
  – It contains a hierarchical categorization system in which each article belongs to at least one category
  – Polysemous concepts are disambiguated by Disambiguation Pages

The different meanings that Artificial intelligence may refer to are listed in its disambiguation page

WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING

• Objective: use information in Wikipedia to improve performance of text classifiers / clustering systems
• A number of possibilities
  – Use similarity between documents and Wikipedia pages on a given topic as a feature for text classification
  – Use WIKIFICATION to enrich documents
  – Use Wikipedia category system as category repertoire

Using Wikipedia Categories for text classification

WIKIPEDIA FOR TEXT CLASSIFICATION

• Automatic identification of the topic/category of a text (e.g. computer science, psychology)
  – Books
  – Learning objects

"The United States was involved in the Cold War"

United States                            0.3793
Cold War                                 0.3111
Vietnam War                              0.0023
World War I                              0.0023
Communism                                0.0027
Ronald Reagan                            0.0027
Michail Gorbachev                        0.0023
Cat: Wars Involving the United States    0.00779
Cat: Global Conflicts                    0.00779

USING WIKIPEDIA FOR TEXT CLASSIFICATION

• Either directly use Wikipedia categories, or map one's categories to Wikipedia categories

• Use the documents associated with those categories as training documents

TEXT WIKIFICATION

Wikification = adding links to Wikipedia pages to documents

• Text

WIKIFICATION

• Wikipedia

May 2012, Truc-Vien T. Nguyen

Giotto was called to work in Padua and also in Rimini

Wikification pipeline

[Figure: Wikification pipeline. Labels shown: Raw (hyper)text; Decomposition; Clean Text; Keyword Extraction (Candidate Extraction, Candidate Ranking); Text with selected keywords; Word Sense Disambiguation (Extract Sense Definitions from Sense Inventory; Knowledge-based: Lesk-like Definition Overlap; Data Driven: Naive Bayes trained on Wikipedia; Voting); Annotated Text; Recomposition; (Hyper)text with linked keywords]

Keyword Extraction

• Finding important words/phrases in raw text
• Two-stage process
  – Candidate extraction
    • Typical methods: n-grams, noun phrases
  – Candidate ranking
    • Rank the candidates by importance
    • Typical methods
      – Unsupervised: information theoretic
      – Supervised: machine learning using positional and linguistic features

Keyword Extraction using Wikipedia

1. Candidate extraction
• Semi-controlled vocabulary
  – Wikipedia article titles and anchor texts (surface forms)
    • E.g. "USA", "US" = "United States of America"
  – More than 2,000,000 terms/phrases
  – Vocabulary is broad (e.g. "the", "a" are included)

Keyword Extraction using Wikipedia

2. Candidate ranking
• tf.idf
  – Wikipedia articles as document collection
• Chi-squared independence of phrase and text
  – The degree to which it appeared more times than expected by chance
• Keyphraseness

$$P(\text{keyword} \mid W) \approx \frac{\mathrm{count}(D_{key})}{\mathrm{count}(D_{W})}$$

Our own Approach (cfr. Milne & Witten 2008, 2012; Ratinov et al. 2011)

• Use Wikipedia dump to compute two statistics
  – KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  – COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
• Two versions of system
  – UNSUPERVISED: use statistics only
  – SUPERVISED: use distant learning to create training data

KEYPHRASENESS

• the probability that a term t is a link to a Wikipedia article (cfr. Milne & Witten's prior link probability)

$$\mathrm{Keyphraseness}(t) = \frac{\mathrm{count}([\_ \mid t])}{\mathrm{count}(t)}$$

• Examples
  – The term Georgia
    • is found as a link in 22,631 Wikipedia articles
    • appears in 75,000 Wikipedia articles
    • keyphraseness = 22631/75000 = 0.3017466
  – Cfr. the term "the": keyphraseness = 0.0006
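A minimal Python sketch of the keyphraseness statistic, computed at the article level as in the Georgia example (articles where the term is a link anchor, divided by articles where the term appears at all). The regular expression, the function name and the toy articles are assumptions for illustration; substring matching is a crude stand-in for proper tokenisation.

```python
import re

# [[target]] or [[target|surface form]]
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def keyphraseness(term, articles):
    """Fraction of articles mentioning `term` in which it is a link anchor."""
    linked = 0
    mentioned = 0
    for text in articles:
        anchors = {m.group(2) or m.group(1) for m in LINK.finditer(text)}
        # Replace each link by its visible text before searching
        plain = LINK.sub(lambda m: m.group(2) or m.group(1), text)
        if term in plain:
            mentioned += 1
            if term in anchors:
                linked += 1
    return linked / mentioned if mentioned else 0.0

# Toy articles, invented for this example
articles = [
    "He studied in [[Georgia (U.S. state)|Georgia]].",
    "Georgia borders the Black Sea.",
    "The [[University of Georgia|Georgia]] team won.",
    "Nothing relevant here.",
]
score = keyphraseness("Georgia", articles)
print(score)
```

With the counts quoted on the slide the same ratio is 22631/75000, which is why a frequent but rarely linked word such as "the" ends up with a very low value.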

COMMONNESS

• the probability that a term t is a link to a SPECIFIC Wikipedia article a

$$\mathrm{Commonness}(t, a) = \frac{\mathrm{count}([a \mid t])}{\mathrm{count}(t)}$$

• for example, the surface form Georgia was found to be linked to
  – a1 = University_of_Georgia 166 times
  – Republic_of_Georgia 18 times
  – Georgia_(United_States) 5 times

  commonness(t, a1) = 166/(166+18+5) = 0.8783

  In this example the denominator count(t) is the number of times t occurs as a link.
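The commonness computation in the Georgia example can be reproduced directly. This sketch assumes the link counts for one surface form are given as a dictionary from target article to frequency, and it normalises by the total number of linked occurrences, as the worked example does; the function name is hypothetical.

```python
def commonness(link_counts):
    """Turn target-article link counts for one surface form into probabilities."""
    total = sum(link_counts.values())
    if total == 0:
        return {}
    return {article: count / total for article, count in link_counts.items()}

# Counts quoted on the slide for the surface form "Georgia"
georgia_links = {
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
}
scores = commonness(georgia_links)
for article, value in scores.items():
    print(article, round(value, 4))
```

Because the values for one surface form sum to 1, they can be read as a prior distribution over the articles that the term may refer to.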

Extracting dictionaries and statistics from a Wikipedia dump

• Parsing
  – In three phases
• Identify articles of relevance
• Extract (among other things)
  – Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
  – Set of LINKS [article|surface_form]
    • [[Pedanius Dioscorides|Dioscorides]]
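The extraction of surface forms and links can be sketched with a regular expression over wiki markup. This is an assumption-laden illustration rather than the parser used for the figures in this lecture: it handles only plain `[[article]]` and `[[article|surface_form]]` links and does not filter out file or category links.

```python
import re
from collections import Counter, defaultdict

# [[article]] or [[article|surface_form]]
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def extract_links(wikitext):
    """Yield (target_article, surface_form) pairs found in wiki markup."""
    for match in LINK.finditer(wikitext):
        target = match.group(1).strip()
        surface = (match.group(2) or target).strip()
        yield target.replace(" ", "_"), surface

def surface_form_dictionary(pages):
    """Count, for every surface form, how often it links to each article."""
    table = defaultdict(Counter)
    for text in pages:
        for target, surface in extract_links(text):
            table[surface][target] += 1
    return table

# Toy pages, invented for this example
pages = [
    "The herbal of [[Pedanius Dioscorides|Dioscorides]] was widely copied.",
    "[[Pedanius Dioscorides]] wrote De Materia Medica.",
    "Later writers cite [[Pedanius Dioscorides|Dioscorides]] often.",
]
table = surface_form_dictionary(pages)
print(dict(table["Dioscorides"]))
```

The resulting table of surface form, target and frequency is the raw material for both the keyphraseness and the commonness statistics.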

The Wikipedia Dump from July 2011

– 11,459,639 pages in total
– 12,525,583 links
  • specifying surface word, target, frequency
  • ranked by frequency
• for example, the mention Georgia is linked to
  – University_of_Georgia 166 times
  – Republic_of_Georgia 18 times
  – Georgia_(United_States) 5 times

Some statistics (all Wikidumps from July 2011)

Page type           English     Italian      Polish
Redirected        4,465,652     323,591     134,148
List_of             138,581         836       5,021
Disambiguation      176,721       6,193       4,553
Relevant          4,361,020     917,354     920,486
Total            11,459,639   1,654,258   1,200,313

Surface forms titles articles

Dictionary          English     Italian      Polish
Titles            4,361,020     917,354     920,486
Surface forms     8,829,624   2,484,045   2,482,104
Files               745,724      72,126         n/a
Links            10,871,741   2,917,235   2,937,981

Files in Polish are arranged in a repository different from English/Italian.

Some definitions and figures
• surface form: the occurrence of a mention inside an article
• target article: the target Wiki article a surface form is linked to

The Unsupervised Approach

• Use keyphraseness to identify candidate terms
  – Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  – Retain top 10
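The two steps of the unsupervised approach can be put together in a short sketch. It assumes precomputed dictionaries of keyphraseness values and link counts, and it reads "retain top 10" as keeping the ten highest-commonness target articles per term; that reading, the function name and the toy data are assumptions.

```python
def wikify_unsupervised(candidates, keyphraseness, link_counts,
                        threshold=0.01, top_n=10):
    """Filter terms by keyphraseness, then rank their targets by commonness."""
    result = {}
    for term in candidates:
        # Step 1: keep only terms that are linked often enough
        if keyphraseness.get(term, 0.0) <= threshold:
            continue
        counts = link_counts.get(term, {})
        total = sum(counts.values())
        if total == 0:
            continue
        # Step 2: rank target articles by commonness, keep the best ones
        ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
        result[term] = [(article, n / total) for article, n in ranked[:top_n]]
    return result

# Toy statistics; the Georgia figures come from the slides, the rest is invented
keyphraseness = {"Georgia": 0.3017, "the": 0.0006}
link_counts = {
    "Georgia": {
        "Republic_of_Georgia": 18,
        "University_of_Georgia": 166,
        "Georgia_(United_States)": 5,
    },
    "the": {"English_articles": 3},
}
links = wikify_unsupervised(["the", "Georgia"], keyphraseness, link_counts)
print(links)
```

No training data is needed here, which is the point of the unsupervised variant; the threshold trades recall for precision.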

The Supervised Approach

• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cfr. Milne & Witten's NORMALIZED LINK DISTANCE)

$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$
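The link-based measure can be computed directly from sets of incoming links. Following Milne & Witten, this sketch takes A1 and A2 to be the sets of articles that link to a1 and a2, and W to be the set of all Wikipedia articles; the function name, the handling of an empty intersection and the toy sets are assumptions. As written the formula behaves like a distance: 0 for identical link sets, larger for less related articles.

```python
import math

def link_distance(inlinks_a, inlinks_b, total_articles):
    """Normalized link distance between two articles, from their in-link sets."""
    common = inlinks_a & inlinks_b
    if not common:
        # No shared in-links: treat the articles as maximally distant
        return float("inf")
    larger = max(len(inlinks_a), len(inlinks_b))
    smaller = min(len(inlinks_a), len(inlinks_b))
    numerator = math.log(larger) - math.log(len(common))
    denominator = math.log(total_articles) - math.log(smaller)
    return numerator / denominator

# Toy in-link sets (article ids), invented for this example
a1_inlinks = {1, 2, 3, 4}
a2_inlinks = {3, 4, 5}
distance = link_distance(a1_inlinks, a2_inlinks, total_articles=100)
print(distance)
```

In a supervised wikifier a value like this would be one feature among several, computed between a candidate article and the articles already chosen for unambiguous terms in the context.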

Training a supervised wikifier

• Using WIKIPEDIA ITSELF as source of training materials (see next)

Results on standard datasets

APPROACH               AQUAINT   WIKIPEDIA
Our approach             85.66       84.37
Milne & Witten 2008      83.61       80.31
Ratinov et al. 2011      84.52       90.20

• BAL data sets
  – 1049 Query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  – 100 Query set
    • 3 annotators, each up to 3 manual annotations

Wikifying queries the Bridgeman datasets

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS   RESULTS
First candidate          64.77
Among first 2            71.59
First 3                  75.42
First 4                  77.18
First 5                  78.32

Accuracy up by 17 points (36%)

Results for the GALATEAS languages and Arabic

LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78

The GALATEAS D2W web services

• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves throughput of 600 characters per second
• Integrated with LangLog tool

Use of the service in LangLog

(See Domoina's demo)

Other applications

• The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …

WIKIPEDIA FOR NER

http://en.wikipedia.org/wiki/FCC

The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

WIKIPEDIA FOR NER

Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action

WIKIPEDIA

WIKIPEDIA FOR NER

• Wikipedia has been used in NER systems
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging towards proper ENTITY DISAMBIGUATION

Distant learning

• Automatically extract examples
  – Positive examples: a mention paired with the Wikipedia page it links to
  – Negative examples: the same mention paired with other pages that similar mentions link to
• Use positive and negative examples to train model

The Supervised Approach Using Wikipedia links to generate training data

• Example
  – Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
  – Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  – +1 Giotto was called to work -- Giotto_di_Bondone
  – -1 Giotto was called to work -- Giotto_Griffiths
  – -1 Giotto was called to work -- Giotto_Bizzarrini

http://en.wikipedia.org/wiki/Giotto_di_Bondone
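The Giotto example can be turned into labelled training instances with a few lines of Python. This is a sketch of the general idea only: the function name and the tuple layout are assumptions, and a real system would attach feature vectors rather than raw context strings.

```python
def make_training_examples(context, gold_article, candidate_articles):
    """Label the linked article +1 and every other candidate -1."""
    examples = []
    for article in candidate_articles:
        label = 1 if article == gold_article else -1
        examples.append((label, context, article))
    return examples

# Candidates for the surface form "Giotto", as listed on the slide
candidates = ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"]
examples = make_training_examples(
    "Giotto was called to work", "Giotto_di_Bondone", candidates)
for label, context, article in examples:
    print(f"{label:+d} {context} -- {article}")
```

Because the gold article comes from an existing Wikipedia link, no manual annotation is required, which is what makes this a distant learning setup.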

MORE ADVANCED USES OF WIKIPEDIA

• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

• Taxonomic information: category structure
• Attributes: infobox, text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Induce a subsumption hierarchy

INFOBOXES

• Collaborative content
• Semi-structured data

Infobox Writer| bgcolour = silver| name = Edgar Allan Poe| image = Edgar_Allan_Poe_2jpg| caption = This [[daguerreotype]] of Poe was taken in 1848 | birth_date = birth date|1809|1|19|mf=y| birth_place = [[Boston Massachusetts]] [[United States|US]]| death_date = death date and age|1849|10|07|1809|01|19| death_place = [[Baltimore Maryland]] [[United States|US]]| occupation = Poet short story writer editor literary critic| movement = [[Romanticism]] [[Dark romanticism]]| genre = [[Horror fiction]] [[Crime fiction]] [[Detective fiction]]| magnum_opus = The Raven| spouse = [[Virginia Eliza Clemm Poe]]

DBpediaorg is a effort to bull extract structured information from Wikipediabull make this information available on the Web under an

open licensebull interlink the DBpedia dataset with other datasets on the

Web

DBPEDIA

1048607 1600000 concepts

1048607 including

1048698 58000 persons

1048698 70000 places

1048698 35000 music albums

1048698 12000 films

1048607 described by 91 million triples

1048607 using 8141 different properties

1048607 557000 links to pictures

1048607 1300000 links external web pages

1048607 207000 Wikipedia categories

1048607 75000 YAGO categories

The DBpedia Dataset

The DBpediaorg project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web It uses the SPARQL query language to query this data At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data

REPRESENTING EXTRACTED INFORMATION

httpenwikipediaorgwikiCalgary

httpdbpediaorgresourceCalgary

dbpedianative_name Calgaryrdquo

dbpediaaltitude ldquo1048rdquo

dbpediapopulation_city ldquo988193rdquo

dbpediapopulation_metro ldquo1079310rdquo

mayor_name

dbpediaDave_Bronconnier

governing_body

dbpediaCalgary_City_Council

Extracting Infobox Data (RDF Representation)

SPARQL

bull SPARQL is a query language for RDF

bullRDF is a directed labeled graph data format for representing information in the Web bullThis specification defines the syntax and semantics of the SPARQL query language for RDF

bull SPARQL can be used to express queries across diverse data sources whether the data is stored natively as RDF or viewed as RDF via middleware

1048607 httpdbpediaorgsparql

1048607 hosted on a OpenLink Virtuoso server

1048607 can answer SPARQL queries like

1048698 Give me all Sitcoms that are set in NYC

1048698 All tennis players from Moscow

1048698 All films by Quentin Tarentino

1048698 All German musicians that were born in Berlin in the 19th century

The DBpedia SPARQL Endpoint

bull Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing effortsndash Other initiatives Citizen Science Cognition and

Language Laboratory hellipbull This has been taken advantage of in AI

ndash Open Mind Commonsense (Singh) (collecting facts)

ndash Semantic Wikis

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

wwwphrasedetectivescom

bull Open Mind Common Sense ndash Singh

bull Crater mapping (results) ndash Kanefsky

bull Learner Learner2 1001 Paraphrases ndash Chklovski

bull FACTory ndash CyCORP

bull Hot or Not ndash 8 Days

bull ESP Phetch Verbosity Peekaboom ndash von Ahn

bull Galaxy Zoo ndash Oxford University

WEB COLLABORATION PROJECTS

wwwphrasedetectivescom

OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52000 assertions)

IsA (IsA apple fruit) Part of (PartOf CPU computer) PropertyOf (PropertyOf coffee wet) MadeOf (MadeOf bread flour) DefinedAs (DefinedAs meat flesh of animal)

EVENTS (38000 assertions)

PrerequisiteeventOf (PrerequisiteEventOf read letter open envelope) SubeventOf (SubeventOf play sport score goal) FirstSubeventOF (FirstSubeventOf start fire light match) LastSubeventOf (LastSubeventOf attend classical concert applaud)

AGENTS (104000 assertions)

CapableOf (CapableOf dentist pull tooth)

SPATIAL (36000 assertions)

LocationOf (LocationOf army in war)

TEMPORAL time amp sequence

CAUSAL (17000 assertions)

EffectOf (EffectOf view video entertainment) DesirousEffectOf (DesirousEffectOf sweat take shower)

AFFECTIONAL (mood feeling emotions) (34000 assertions)

DesireOf (DesireOf person not be depressed) MotivationOf (MotivationOf play game compete)

FUNCTIONAL (115000 assertions)

IsUsedFor (UsedFor fireplace burn wood) CapableOfReceivingAction (CapableOfReceivingAction drink serve)

ASSOCIATION K-LINES (125 million assertions)

SuperThematicKLine (SuperThematicKLine western civilization civilization) ThematicKLine (ThematicKLine wedding dress veil) ConceptuallyRelatedTo (ConceptuallyRelatedTo bad breath mint)

CONCEPT NET

GAMES WITH A PURPOSE

bull Luis von Ahn pioneered a new approach to resource creation on the Web GAMES WITH A PURPOSE or GWAP in which people as a side effect of playing perform tasks lsquocomputers are unable to performrsquo (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

bull GWAP do not rely on altruism or financial incentives to entice people to perform certain actions

bull The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

bull Games at wwwgwapcomndash ESPndash Verbosityndash TagATune

bull Other gamesndash Peekaboomndash Phetch

ESP

bull The first GWAP developed by von Ahn and their group (2003 2004)

bull The problem obtain accurate description of images to be usedndash To train image search enginesndash To develop machine learning approaches to vision

bull The goal label the majority of the images on the Web

ESP the game

ESP THE GAMEbull Two partners are picked at random from the

large number of players onlinebull They are not told who their partner is and canrsquot

communicate with thembull They are both shown the same imagebull The goal guess how their partner will describe

the image and type that descriptionndash Hence the ESP game

bull If any of the strings typed by one player matches the string typed by the other player they score points

THE TASK

SCORING BY MATCHING

SOME STATISTICS

bull In the 4 months between August 9th 2003 and December 10th 2003ndash 13630 playersndash 12 million labels for 293760 imagesndash 80 of players played more than once

bull By 2008 ndash 200000 playersndash 50 million labels

QUALITY OF THE LABELSbull For IMAGE SEARCH

ndash choose 10 labels among those produced and look at which images are returned

bull Compare labels produced by players with labels produced by participants in an experimentndash 15 participants 20 images among the 1000 with more

than 5 labelsndash 83 of game labels also produced by participants

bull Manual assessment of labels (lsquowould you use these labels to describe this imagersquo)ndash 15 participants 20 imagesndash 85 of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

wwwphrasedetectivesorg

bull 2 tasks

ndash Find The Culprit (Annotation)User must identify the closest antecedent of a markable if it is anaphoric

ndash Detectives Conference (Validation)User must agreedisagree with a coreference relation entered by another user

wwwphrasedetectivescom

PHRASE DETECTIVES THE TASKS

NAME THE CULPRIT

READINGS

bull Mihalcea R and Csomai A Wikify linking documents to encyclopedic knowledge Proceedings of CIKMrsquo07 Lisbon Portugal

bull V Nguyen amp M Poesio 2012 Entity disambiguation and linking over queries using Encyclopedic Knowledge Proceedings of 6th workshop on Analytics for Noisy Unstructured Text Data

bull D Lungley M Trevisan V Nguyen M Althobaiti M Poesio 2013 GALATEAS D2W A Multi-lingual Disambiguation to Wikipedia Web Service Proc Of ENRICH

bull V Nastaseamp M Strube Transforming Wikipedia into a large scale multilingual concept network Artificial Intelligence 2012

READINGS

bull L von Ahn and L Dabbish (2008) Designing games with a purpose Communications of the ACM v 51 n8 58-67

bull Poesio Chamberlain Kruschwitz Robaldo amp Ducceschi 2013 Phrase Detectives Utilizing Collective Intelligence for Internet-Scale Language Resource Creation ACM Transactions on Intelligent Interactive Systems

  • 807 - TEXT ANALYTICS
  • WIKIPEDIA
  • Slide 3
  • Slide 4
  • WIKIPEDIA FOR TEXT ANALYTICS
  • Wikipedia as Thesaurus for text classification clustering
  • Wikipedia as Thesaurus
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
  • Slide 13
  • The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
  • WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
  • Using Wikipedia Categories for text classification
  • WIKIPEDIA FOR TEXT CLASSIFICATION
  • USING WIKIPEDIA FOR TEXT CLASSIFICATION
  • TEXT WIKIFICATION
  • WIKIFICATION
  • Wikification pipeline
  • Keyword Extraction
  • Keyword Extraction using Wikipedia
  • Slide 24
  • Slide 25
  • KEYPHRASENESS
  • COMMONNESS
  • Slide 28
  • The Wikipedia Dump from July 2011
  • Some statistics (all Wikidumps from July 2011)
  • Surface forms titles articles
  • Slide 32
  • Slide 33
  • Training a supervised wikifier
  • Results on standard datasets
  • Wikifying queries the Bridgeman datasets
  • Results on Bridgeman 1000 Y3
  • Results for the GALATEAS languages and Arabic
  • The GALATEAS D2W web services
  • Use of the service in LangLog
  • Other applications
  • WIKIPEDIA FOR NER
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • The Supervised Approach Using Wikipedia links to generate training data
  • MORE ADVANCED USES OF WIKIPEDIA
  • SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
  • Wikipedia category network
  • Deriving a taxonomy from Wikipedia (AAAI 2007)
  • Slide 53
  • INFOBOXES
  • Slide 56
  • Slide 57
  • Slide 58
  • SPARQL
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • OPEN MIND COMMONSENSE
  • Slide 65
  • CONCEPT NET
  • GAMES WITH A PURPOSE
  • GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
  • EXAMPLES OF GWAP
  • ESP
  • ESP the game
  • ESP THE GAME
  • THE TASK
  • SCORING BY MATCHING
  • SOME STATISTICS
  • QUALITY OF THE LABELS
  • GOOGLE IMAGE LABELLER
  • Slide 78
  • RESULTS
  • PHRASE DETECTIVES
  • Slide 81
  • NAME THE CULPRIT
  • READINGS
  • Slide 84
Page 11: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

Wikipedia as Thesaurus

bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed

phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by

redirected linksndash It contains a hierarchical categorization system

in which each article belongs to at least one category

The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society

Wikipedia as Thesaurus

bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed

phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by

redirected linksndash It contains a hierarchical categorization system in

which each article belongs to at least one category ndash Polysemous concepts are disambiguated by

Disambiguation Pages

The different meanings that Artificial intelligence may refer to are listed in its disambiguation page

WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING

bull Objective use information in Wikipedia to improve performance of text classifiers clustering systems

bull A number of possibilitiesndash Use similarity between documents and Wikipedia

pages on a given topic as a feature for text classification

ndash Use WIKIFICATION to enrich documentsndash Use Wikipedia category system as category repertoire

Using Wikipedia Categories for text classification

17

WIKIPEDIA FOR TEXT CLASSIFICATIONbull Automatic identification of the topiccategory of a text (eg computer science

psychology)ndash Booksndash Learning objects

ldquoThe United States was involved in the Cold Warrdquo

United States03793

Cold War03111

Vietnam War00023

World War I00023

Communism00027

Ronald Reagan00027

Michail Gorbachev00023

Cat Wars Involvingthe United States000779

Cat Global Conflicts000779

USING WIKIPEDIA FOR TEXT CLASSIFICATION

bull Either directly use Wikipedia categories or map onersquos categories to Wikipedia categories

bull Use the documents associated with those categories as training documents

TEXT WIKIFICATION

Wikification = adding links to Wikipedia pages to documents

bull Text

WIKIFICATION

bull Wikipedia

20May 2012 Truc-Vien T Nguyen

Giotto was called to work in Padua and also in Rimini

Wikification pipeline

Candidate

Extraction

Candidate

Ranking

Extract Sense Definitions

from Sense Inventory

Knowledge- based

Lesk- like Definition

Overlap

Data Driven

Naive Bayes

trained on Wikipedia

Voting

Tex

t w

ith

sel

ecte

d k

eyw

ord

s

Dec

om

po

siti

on

Raw

(h

yper

)tex

t

Cle

an T

ext

Rec

om

posi

tion

(Hyp

er)t

ext

wit

h

linked

key

wo

rds

Annotated Text

Word Sense DisambiguationKeyword Extraction

Keyword Extraction

• Finding important words/phrases in raw text
• Two-stage process
  - Candidate extraction
    • Typical methods: n-grams, noun phrases
  - Candidate ranking
    • Rank the candidates by importance
    • Typical methods:
      - Unsupervised: information theoretic
      - Supervised: machine learning using positional and linguistic features

Keyword Extraction using Wikipedia

1. Candidate extraction
• Semi-controlled vocabulary
  - Wikipedia article titles and anchor texts (surface forms)
    • E.g. "USA", "US" = "United States of America"
  - More than 2,000,000 terms/phrases
  - Vocabulary is broad (e.g. "the", "a" are included)
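A minimal sketch of candidate extraction against a controlled vocabulary, assuming whitespace tokenisation and a tiny hand-made vocabulary (both are illustrative choices, not taken from the slides). It looks up every n-gram of the text in the set of known surface forms; the function name `extract_candidates` and the `max_len` parameter are hypothetical.

```python
def extract_candidates(text, vocabulary, max_len=4):
    """Return every n-gram (1..max_len tokens) of `text` found in `vocabulary`.

    `vocabulary` stands in for the set of Wikipedia titles and anchor texts;
    here it is a plain Python set supplied by the caller.
    """
    tokens = text.split()  # assumption: naive whitespace tokenisation
    candidates = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if start + length > len(tokens):
                break
            phrase = " ".join(tokens[start:start + length])
            if phrase in vocabulary:
                candidates.append(phrase)
    return candidates


# Toy vocabulary; note that a function word such as "the" is also a surface form.
toy_vocabulary = {"United States", "Cold War", "the"}
found = extract_candidates(
    "The United States was involved in the Cold War", toy_vocabulary
)
print(found)
```

Because the vocabulary is broad, a lookup like this over-generates (it also returns "the"), which is why a ranking step is needed afterwards.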

Keyword Extraction using Wikipedia

2. Candidate ranking
• tf.idf
  - Wikipedia articles as document collection
• Chi-squared independence of phrase and text
  - The degree to which it appeared more times than expected by chance
• Keyphraseness

$$P(\text{keyword} \mid W) \approx \frac{\operatorname{count}(D_{key})}{\operatorname{count}(D_W)}$$

Our own Approach (cf. Milne & Witten 2008, 2012; Ratinov et al. 2011)

• Use a Wikipedia dump to compute two statistics
  • KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  • COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
• Two versions of the system
  • UNSUPERVISED: use statistics only
  • SUPERVISED: use distant learning to create training data

KEYPHRASENESS

• The probability that a term t is a link to a Wikipedia article (cf. Milne & Witten's prior link probability)
• Examples
  • The term "Georgia"
    - is found as a link in 22,631 Wikipedia articles
    - appears in 75,000 Wikipedia articles
    - keyphraseness = 22631 / 75000 = 0.3017466
  • Cf. the term "the": keyphraseness = 0.0006

$$\mathrm{Keyphraseness}(t) = \frac{\operatorname{count}([\_ \mid t])}{\operatorname{count}(t)}$$
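The keyphraseness ratio can be checked with a few lines of Python. This is only a sketch of the arithmetic, using the "Georgia" counts quoted on the slide; the function name and argument names are illustrative.

```python
def keyphraseness(articles_with_link, articles_with_term):
    """Share of the articles containing a term in which the term is used as a link."""
    if articles_with_term <= 0:
        raise ValueError("the term must occur in at least one article")
    return articles_with_link / articles_with_term


# Counts for the term "Georgia" as given on the slide.
georgia_keyphraseness = keyphraseness(22631, 75000)
print(georgia_keyphraseness)
```

A frequent function word such as "the" appears in almost every article but is almost never a link, so its ratio is close to zero.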

COMMONNESS

• The probability that a term t is a link to a SPECIFIC Wikipedia article a
• For example, the surface form "Georgia" was found to be linked to
  - a1 = University_of_Georgia: 166 times
  - Republic_of_Georgia: 18 times
  - Georgia_(United_States): 5 times
• commonness(t, a1) = 166 / (166 + 18 + 5) = 0.8783 (in this worked example the denominator counts the linked occurrences of t)

$$\mathrm{Commonness}(t, a) = \frac{\operatorname{count}([a \mid t])}{\operatorname{count}(t)}$$
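The worked example can be reproduced as follows. This sketch assumes, as the example does, that the denominator is the total number of times the surface form occurs as a link; the variable and function names are illustrative.

```python
from collections import Counter


def commonness(link_counts, article):
    """Share of the links with a given surface form that point to `article`."""
    total = sum(link_counts.values())
    if total == 0:
        return 0.0
    return link_counts.get(article, 0) / total


# Link targets of the surface form "Georgia", with the counts from the slide.
georgia_links = Counter({
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
})

# Candidate articles ordered from most to least common sense.
ranked = sorted(georgia_links, key=lambda a: commonness(georgia_links, a), reverse=True)
print(ranked[0], round(commonness(georgia_links, ranked[0]), 4))
```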

Extracting dictionaries and statistics from a Wikipedia dump

• Parsing
  • In three phases
• Identify articles of relevance
• Extract (among other things)
  • Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
  • Set of LINKS [article|surface_form]
    • e.g. [[Pedanius Dioscorides|Dioscorides]]
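A rough sketch of how [article|surface_form] pairs might be pulled out of wiki markup with a regular expression. It is an assumption-laden simplification: it handles only plain `[[article]]` and `[[article|surface form]]` links, skips anything with more than one pipe (for example image links), and does not filter namespaces. All names are hypothetical.

```python
import re
from collections import Counter, defaultdict

# [[article]] or [[article|surface form]]; nested brackets are not handled.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def extract_links(wikitext):
    """Yield (article, surface_form) pairs found in a piece of wiki markup."""
    for match in WIKILINK.finditer(wikitext):
        article = match.group(1).strip()
        # Without a pipe, the article title itself is the surface form.
        surface = (match.group(2) or article).strip()
        yield article, surface


def build_link_table(pages):
    """Map each surface form to a Counter of the articles it links to."""
    table = defaultdict(Counter)
    for wikitext in pages:
        for article, surface in extract_links(wikitext):
            table[surface][article] += 1
    return table


sample = "[[Pedanius Dioscorides|Dioscorides]] wrote about [[botany]]."
link_table = build_link_table([sample])
print(dict(link_table))
```

A table of this shape (surface form -> target -> frequency) is what the commonness statistic is computed from.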

The Wikipedia Dump from July 2011

  - 11,459,639 pages in total
  - 12,525,583 links
    • specifying surface word, target, frequency
    • ranked by frequency
• For example, the mention "Georgia" is linked to
  - University_of_Georgia: 166 times
  - Republic_of_Georgia: 18 times
  - Georgia_(United_States): 5 times


Some statistics (all Wikidumps from July 2011)

Page Type        English     Italian    Polish
Redirected       4465652     323591     134148
List_of          138581      836        5021
Disambiguation   176721      6193       4553
Relevant         4361020     917354     920486
Total            11459639    1654258    1200313

Surface forms, titles, articles

Dictionary       English     Italian    Polish
Titles           4361020     917354     920486
Surface forms    8829624     2484045    2482104
Files            745724      72126      n/a
Links            10871741    2917235    2937981

Files in Polish are arranged in a repository different from English/Italian.

Some definitions and figures:
  surface form: the occurrence of a mention inside an article
  target article: the target Wiki article a surface form is linked to

The Unsupervised Approach

• Use keyphraseness to identify candidate terms
  • Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  • Retain top 10
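The two steps above can be sketched as a small function. This is an illustration under stated assumptions, not the system's actual code: the statistics are passed in as plain dictionaries, commonness is normalised over the linked occurrences of each term, and the names `wikify_unsupervised`, `threshold` and `top_n` are hypothetical.

```python
def wikify_unsupervised(terms, keyphraseness, link_counts, threshold=0.01, top_n=10):
    """Filter terms by keyphraseness, then rank their target articles by commonness.

    keyphraseness: dict mapping term -> keyphraseness score
    link_counts:   dict mapping term -> {article: number of links}
    Returns a dict mapping each retained term to [(article, commonness), ...].
    """
    result = {}
    for term in terms:
        if keyphraseness.get(term, 0.0) <= threshold:
            continue  # too rarely used as a link: not a candidate
        counts = link_counts.get(term, {})
        total = sum(counts.values())
        if total == 0:
            continue
        ranked = sorted(
            ((count / total, article) for article, count in counts.items()),
            reverse=True,
        )
        result[term] = [(article, score) for score, article in ranked[:top_n]]
    return result


# Toy statistics: the "Georgia" figures come from the slides,
# the entry for "the" is a made-up placeholder.
toy_keyphraseness = {"Georgia": 0.3017, "the": 0.0006}
toy_links = {
    "Georgia": {
        "University_of_Georgia": 166,
        "Republic_of_Georgia": 18,
        "Georgia_(United_States)": 5,
    },
    "the": {"Hypothetical_Article": 1},
}
annotations = wikify_unsupervised(["the", "Georgia"], toy_keyphraseness, toy_links)
print(annotations)
```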

The Supervised Approach

• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)

$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$
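A small sketch of the formula above, assuming (following Milne & Witten) that A1 and A2 are the sets of articles linking to a1 and a2, and that |W| is the total number of articles. As written, the quantity behaves like a distance: it is 0 when the two link sets are identical and grows as their overlap shrinks. Returning infinity for disjoint sets is a choice made here for illustration.

```python
import math


def link_distance(links_a, links_b, total_articles):
    """Normalised link distance between two articles, given their in-link sets."""
    common = links_a & links_b
    if not common:
        return math.inf  # no shared in-links: treated as maximally distant
    larger = max(len(links_a), len(links_b))
    smaller = min(len(links_a), len(links_b))
    denominator = math.log(total_articles) - math.log(smaller)
    if denominator <= 0:
        raise ValueError("total_articles must exceed the size of both link sets")
    return (math.log(larger) - math.log(len(common))) / denominator


# Toy in-link sets (article ids are arbitrary integers).
a1_inlinks = {1, 2, 3, 4}
a2_inlinks = {3, 4, 5, 6}
print(link_distance(a1_inlinks, a2_inlinks, total_articles=100))
```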

Training a supervised wikifier

• Using WIKIPEDIA ITSELF as source of training materials (see next)

Results on standard datasets

APPROACH              AQUAINT   WIKIPEDIA
Our approach          85.66     84.37
Milne & Witten 2008   83.61     80.31
Ratinov et al. 2011   84.52     90.20

Wikifying queries: the Bridgeman datasets

• BAL Data sets
  - 1049 Query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  - 100 Query set
    • 3 annotators, each up to 3 manual annotations

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS   RESULTS
First candidate        64.77
Among first 2          71.59
First 3                75.42
First 4                77.18
First 5                78.32

Accuracy up by 17 points (36%)

Results for the GALATEAS languages and Arabic

LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78

The GALATEAS D2W web services

• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves throughput of 600 characters per second
• Integrated with LangLog tool

Use of the service in LangLog

(See Domoina's demo)

Other applications

• The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …

WIKIPEDIA FOR NER

http://en.wikipedia.org/wiki/FCC

The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

WIKIPEDIA FOR NER

Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action

WIKIPEDIA

WIKIPEDIA FOR NER

• Wikipedia has been used in NER systems
  - As a source of features for normal NER
  - To automatically create training materials (DISTANT LEARNING)
  - To go beyond NE tagging towards proper ENTITY DISAMBIGUATION

Distant learning

• Automatically extract examples
  • Positive examples from the mention-to-link Wikipedia page
  • Negative examples from similar mentions with other links
• Use positive and negative examples to train a model

The Supervised Approach: Using Wikipedia links to generate training data

• Example
  - "Giotto was called to work in Padua, and also in Rimini" (sentence taken from Wikipedia text, with links available)
  - Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  - +1 Giotto was called to work -- Giotto_di_Bondone
  - -1 Giotto was called to work -- Giotto_Griffiths
  - -1 Giotto was called to work -- Giotto_Bizzarrini

http://en.wikipedia.org/wiki/Giotto_di_Bondone
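The labelling scheme in the Giotto example can be sketched as below: the article that the Wikipedia sentence actually links to yields the positive example, and every other candidate for the same surface form yields a negative one. The function name and the tuple layout are assumptions made for illustration.

```python
def make_training_examples(context, linked_article, candidates):
    """Return (label, context, candidate) triples for one linked mention.

    label is +1 for the article the mention really links to, -1 otherwise.
    """
    return [
        (1 if candidate == linked_article else -1, context, candidate)
        for candidate in candidates
    ]


giotto_examples = make_training_examples(
    context="Giotto was called to work",
    linked_article="Giotto_di_Bondone",
    candidates=["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
)
for label, context, candidate in giotto_examples:
    print(f"{label:+d} {context} -- {candidate}")
```

No manual annotation is needed: the links already present in Wikipedia supply the labels.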

MORE ADVANCED USES OF WIKIPEDIA

• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

• Taxonomic information: category structure
• Attributes: infobox, text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Induce a subsumption hierarchy

INFOBOXES

• Collaborative content
• Semi-structured data

{{Infobox Writer
| bgcolour    = silver
| name        = Edgar Allan Poe
| image       = Edgar_Allan_Poe_2.jpg
| caption     = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date  = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date  = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation  = Poet, short story writer, editor, literary critic
| movement    = [[Romanticism]], [[Dark romanticism]]
| genre       = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse      = [[Virginia Eliza Clemm Poe]]
}}
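Because an infobox is a list of `key = value` pairs, its attributes can be read off with a small parser. The sketch below is a simplification written for this example (it is not how DBpedia's extractors work): it splits on `|` only outside `[[...]]` and `{{...}}`, and the function name is hypothetical.

```python
def parse_infobox(wikitext):
    """Return (template_name, {field: value}) for a single infobox template."""
    body = wikitext.strip()
    if body.startswith("{{") and body.endswith("}}"):
        body = body[2:-2]
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("[[", "{{"):      # entering a nested link or template
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("]]", "}}"):    # leaving a nested link or template
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))  # top-level field separator
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    fields = {}
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return parts[0].strip(), fields


poe = """{{Infobox Writer
| name        = Edgar Allan Poe
| birth_date  = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| magnum_opus = The Raven
}}"""
template_name, poe_fields = parse_infobox(poe)
print(template_name, poe_fields["birth_place"])
```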

DBPEDIA

DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web

The DBpedia Dataset

• 1,600,000 concepts
• including
  - 58,000 persons
  - 70,000 places
  - 35,000 music albums
  - 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data.

REPRESENTING EXTRACTED INFORMATION

Extracting Infobox Data (RDF Representation)

http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary

dbpedia:native_name       "Calgary"
dbpedia:altitude          "1048"
dbpedia:population_city   "988193"
dbpedia:population_metro  "1079310"
mayor_name                dbpedia:Dave_Bronconnier
governing_body            dbpedia:Calgary_City_Council
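To make the triple view concrete, the Calgary facts can be held as (subject, predicate, object) tuples and filtered by pattern. This is a toy in-memory stand-in for an RDF store, written only for illustration; the identifier spellings follow the slide and the `match` helper is hypothetical.

```python
CALGARY = "dbpedia:Calgary"

# Each fact extracted from the infobox becomes one (subject, predicate, object) triple.
TRIPLES = [
    (CALGARY, "dbpedia:native_name", "Calgary"),
    (CALGARY, "dbpedia:altitude", "1048"),
    (CALGARY, "dbpedia:population_city", "988193"),
    (CALGARY, "dbpedia:population_metro", "1079310"),
    (CALGARY, "mayor_name", "dbpedia:Dave_Bronconnier"),
    (CALGARY, "governing_body", "dbpedia:Calgary_City_Council"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every position that is not None."""
    pattern = (subject, predicate, obj)
    return [
        triple
        for triple in triples
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]


print(match(TRIPLES, predicate="mayor_name"))
```

Leaving a position as `None` plays the role of a variable in a query pattern.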

SPARQL

• SPARQL is a query language for RDF
  • RDF is a directed, labeled graph data format for representing information in the Web
  • This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

The DBpedia SPARQL Endpoint

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  - Give me all sitcoms that are set in NYC
  - All tennis players from Moscow
  - All films by Quentin Tarantino
  - All German musicians that were born in Berlin in the 19th century
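One of the questions above ("all films by Quentin Tarantino") might be phrased as the query below. This is a hedged sketch: the property `dbo:director` and the `dbr:` resource prefix are assumptions about the DBpedia vocabulary, the `format` parameter is assumed to be accepted by the Virtuoso endpoint, and the code only builds the request URL without sending it.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:director links a film to its director.
FILMS_BY_TARANTINO = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
"""


def build_request_url(query, endpoint=ENDPOINT):
    """Encode a SPARQL query as an HTTP GET URL (SPARQL protocol `query` parameter)."""
    params = {
        "query": query.strip(),
        "format": "application/sparql-results+json",  # assumption: Virtuoso-style option
    }
    return endpoint + "?" + urlencode(params)


request_url = build_request_url(FILMS_BY_TARANTINO)
print(request_url)
```

Fetching the URL (for example with `urllib.request.urlopen`) is left out so that the sketch runs without network access.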

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  - Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  - Open Mind Commonsense (Singh) (collecting facts)
  - Semantic Wikis

www.phrasedetectives.com

WEB COLLABORATION PROJECTS

• Open Mind Common Sense - Singh
• Crater mapping (results) - Kanefsky
• Learner, Learner2, 1001 Paraphrases - Chklovski
• FACTory - CyCORP
• Hot or Not - 8 Days
• ESP, Phetch, Verbosity, Peekaboom - von Ahn
• Galaxy Zoo - Oxford University

OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52,000 assertions)
  IsA: (IsA "apple" "fruit")
  PartOf: (PartOf "CPU" "computer")
  PropertyOf: (PropertyOf "coffee" "wet")
  MadeOf: (MadeOf "bread" "flour")
  DefinedAs: (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
  PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf: (SubeventOf "play sport" "score goal")
  FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
  CapableOf: (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
  LocationOf: (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
  EffectOf: (EffectOf "view video" "entertainment")
  DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf: (DesireOf "person" "not be depressed")
  MotivationOf: (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
  UsedFor: (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine: (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")

CONCEPT NET

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  - ESP
  - Verbosity
  - TagATune
• Other games
  - Peekaboom
  - Phetch

ESP

• The first GWAP, developed by von Ahn and their group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  - To train image search engines
  - To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web

ESP the game

ESP: THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  - Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
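The matching rule can be sketched in a few lines. This is an illustration only: lower-casing and whitespace normalisation before comparison are assumptions made here, and the function name is hypothetical.

```python
def matching_labels(guesses_a, guesses_b):
    """Return the labels typed by both players, in the order player A typed them."""

    def normalise(guess):
        # Assumption: comparison ignores case and surrounding/extra whitespace.
        return " ".join(guess.lower().split())

    typed_by_b = {normalise(guess) for guess in guesses_b}
    matches = []
    for guess in guesses_a:
        label = normalise(guess)
        if label in typed_by_b and label not in matches:
            matches.append(label)
    return matches


agreed = matching_labels(["Dog", "grass", "ball"], ["puppy", " ball ", "dog"])
print(agreed)
```

A label agreed on independently by two players who cannot communicate is likely to describe the image, which is what makes the output usable as annotation.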

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th 2003 and December 10th 2003
  - 13,630 players
  - 1.2 million labels for 293,760 images
  - 80% of players played more than once
• By 2008
  - 200,000 players
  - 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  - choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  - 15 participants, 20 images among the 1000 with more than 5 labels
  - 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  - 15 participants, 20 images
  - 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

• 2 tasks
  - Find The Culprit (Annotation): user must identify the closest antecedent of a markable if it is anaphoric
  - Detectives Conference (Validation): user must agree/disagree with a coreference relation entered by another user

www.phrasedetectives.com

PHRASE DETECTIVES THE TASKS

NAME THE CULPRIT

READINGS

• Mihalcea, R. and Csomai, A. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen & M. Poesio. 2012. Entity disambiguation and linking over queries using Encyclopedic Knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio. 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012.

READINGS

• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi. 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.


WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING

bull Objective use information in Wikipedia to improve performance of text classifiers clustering systems

bull A number of possibilitiesndash Use similarity between documents and Wikipedia

pages on a given topic as a feature for text classification

ndash Use WIKIFICATION to enrich documentsndash Use Wikipedia category system as category repertoire

Using Wikipedia Categories for text classification

17

WIKIPEDIA FOR TEXT CLASSIFICATIONbull Automatic identification of the topiccategory of a text (eg computer science

psychology)ndash Booksndash Learning objects

ldquoThe United States was involved in the Cold Warrdquo

United States03793

Cold War03111

Vietnam War00023

World War I00023

Communism00027

Ronald Reagan00027

Michail Gorbachev00023

Cat Wars Involvingthe United States000779

Cat Global Conflicts000779

USING WIKIPEDIA FOR TEXT CLASSIFICATION

bull Either directly use Wikipedia categories or map onersquos categories to Wikipedia categories

bull Use the documents associated with those categories as training documents

TEXT WIKIFICATION

Wikification = adding links to Wikipedia pages to documents

bull Text

WIKIFICATION

bull Wikipedia

20May 2012 Truc-Vien T Nguyen

Giotto was called to work in Padua and also in Rimini

Wikification pipeline

Candidate

Extraction

Candidate

Ranking

Extract Sense Definitions

from Sense Inventory

Knowledge- based

Lesk- like Definition

Overlap

Data Driven

Naive Bayes

trained on Wikipedia

Voting

Tex

t w

ith

sel

ecte

d k

eyw

ord

s

Dec

om

po

siti

on

Raw

(h

yper

)tex

t

Cle

an T

ext

Rec

om

posi

tion

(Hyp

er)t

ext

wit

h

linked

key

wo

rds

Annotated Text

Word Sense DisambiguationKeyword Extraction

Keyword Extraction

bull Finding important wordsphrases in raw textbull Two-stage process

ndash Candidate extractionbull Typical methods n-grams noun phrases

ndash Candidate rankingbull Rank the candidates by importancebull Typical methods

ndash Unsupervised information theoretic ndash Supervised machine learning using positional and linguistic

features

Keyword Extraction using Wikipedia

1 Candidate extractionbull Semi-controlled vocabulary

ndash Wikipedia article titles and anchor texts (surface forms)

bull Eg ldquoUSArdquo ldquoUSrdquo = ldquoUnited States of Americardquo

ndash More than 2000000 termsphrasesndash Vocabulary is broad (eg the a are included)

Keyword Extraction using Wikipedia

2 Candidate rankingbull tf idf

ndash Wikipedia articles as document collection

bull Chi-squared independence of phrase and textndash The degree to which it appeared more times than

expected by chance

bull Keyphraseness

)(

)()|(

W

key

Dcount

DcountWkeywordP

Our own Approach(Cfr Milne amp Witten 2008 2012 Ratinov et al 2011)

bull Use Wikipedia dump to compute two statistics

bull KEYPHRASENESS prior probability that a term is used to refer to a Wikipedia article

bull COMMONNESS probability that phrase is used to refer to specific Wikipedia article

bull Two versions of system

bull UNSUPERVISED use statistics only

bull SUPERVISED use distant learning to create training data

KEYPHRASENESS

bull the probability that a term t is a link to a Wikipedia article

(cfr Milne amp Wittenrsquos prior link probability)

bull Examplesbull The term Georgia

ndash Is found as a link in 22631 Wikipedia articlesndash appears in 75000 Wikipedia articles keyphraseness = 2263175000 = 03017466

bull Cfr the term ldquotherdquo keyphraseness = 00006

euro

Keyphraseness(t) =count([_ | t])

count(t)

COMMONNESS

bull the probability that a term t is a link to a SPECIFIC Wikipedia article a

bull for example the surface form Georgia was found to be linked to

ndash a1 = University_of_Georgia 166 times

commonness(t a1) = 166(166+18+5) = 08783

ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times

euro

Commonness(ta) =count([a | t])

count(t)

Extracting dictionaries and statistics from a Wikipedia dump

bull Parsingbull In three phases

bull Identify articles of relevancebull Extract (among other things)

bull Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)

bull Set of LINKS [article|surface_form]

bull [[Pedanius Dioscorides|Dioscorides]]

The Wikipedia Dump from July 2011

ndash 11459639 pages in totalndash 12525583 links

bull specifying surface word target frequency

ndash ranked by frequency bull for example the mention Georgia is linked to

ndash University_of_Georgia 166 times ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times

May 2012 29Truc-Vien T Nguyen

Some statistics (all Wikidumps from July 2011)

Page Type English Italian Polish

Redirected 4465652 323591 134148

List_of 138581 836 5021

Disambiguation 176721 6193 4553

Relevant 4361020 917354 920486

Total 11459639 1654258 1200313

Surface forms titles articles

Dictionary English Italian Polish

Titles 4361020 917354 920486

Surface forms 8829624 2484045 2482104

Files 745724 72126 na

Links 10871741 2917235 2937981

Files in Polish are arranged in a repository different from EnglishItalian

Some definitions and figuressurface form the occurence of a mention inside an articletarget article the target Wiki article a surface form linked to

The Unsupervised Approach

bull Use Keyphraseness to identify candidate termsbull Retain terms whose keyphraseness is

above a certain threshold (currently 001)bull Use commonness to rank

bull Retain top 10

The Supervised Approach

bull Features in addition to commonness use measures of SIMILARITY between text containing the term and the candidate Wikipedia page

bull RELATEDNESS a measure of similarity between the LINKS (cfr MilneampWittenrsquos NORMALIZED LINK DISTANCE)

euro

Re latedness(a1a2) =log(max( A1 A2 )) minus log( A1 cap A2 ))

log(W ) minus log(min( A1 A2 ))

Training a supervised wikifier

bull Using WIKIPEDIA ITSELF as source of training materials (see next)

Results on standard datasets

APPROACH AQUAINT WIKIPEDIA

Our approach 8566 8437

MilneampWitten 2008 8361 8031

Ratinov et al 2011 8452 9020

bull BAL Data setsndash 1049 Query set

bull 1 annotator up to 3 manual annotationsbull 1 automatic annotation

ndash 100 Query setbull 3 annotators each up to 3 manual annotations

Wikifying queries the Bridgeman datasets

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS RESULTS

First candidate 6477

Among first 2 7159

First 3 7542

First 4 7718

First 5 7832

Accuracy up by 17 points (36)

Results for the GALATEAS languages and Arabic

LANGUAGE WIKIPEDIA SIZE RESULTS (on Wikipedia subset)

English 4M articles 8437

Italian 1M 7964

French 14M 76-77

German 16M 72-73

Dutch 16M 70-71

Polish 900K 6081

Arabic 200K 8078

The GALATEAS D2W web services

bull Available as open sourcebull Deployed within LinguaGridbull API based on the Morphosyntactic Annotation

Framework (MAF) an ISO standardbull Tested on 15M queries achieves throughput

of 600 characters per secondbull Integrated with LangLog tool

Use of the service in LangLog

(See Domoinarsquos demo)

Other applications

bull The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [ATampT] By a 4-0 vote it allowed ATampT to continue offering special discount packages to big customers called Tariff 12 rejecting appeals by ATampT competitors that the discounts were illegal hellip

WIKIPEDIA FOR NER

httpenwikipediaorgwikiFCC

The Federal Communications Commission (FCC) is an independent United States government agency created directed and empowered by Congressional statute (see 47 USC sect 151 and 47 USC sect 154)

WIKIPEDIA FOR NER

Numberofglucocorticoidreceptorsinlymphocytesandtheirsensitivitytohormoneaction

WIKIPEDIA

WIKIPEDIA FOR NER

bull Wikipedia has been used in NER systemsndash As a source of features for normal NER ndash To automatically create training materials

(DISTANT LEARNING)ndash To go beyond NE tagging towards proper ENTITY

DISAMBIGUATION

Distant learning

bull Automatically extract examples

bull positive examples from mention-to-link Wikipedia page

bull Negative examples from similar mentions with other links

bull Use positive and negative examples to train model

The Supervised Approach Using Wikipedia links to generate training data

bull Examplendash Giotto was called to work in Padua and also in Rimini (sentence taken from Wikipedia text with links avalable)ndash Giotto_di_Bondone (painter) Giotto_Griffiths (Welsh rugby player)

Giotto_Bizzarrini (automobile engineer)

bull Datasetndash +1 Giotto was called to work -- Giotto_di_Bondonendash -1 Giotto was called to work -- Giotto_Griffithsndash -1 Giotto was called to work -- Giotto_Bizzarrini

May 2012 48Truc-Vien T Nguyen

httpenwikipediaorgwikiGiotto_di_Bondone

MORE ADVANCED USES OF WIKIPEDIA

bull As a source of ONTOLOGICAL KNOWLEDGEbull DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

bull Taxonomic information category structurebull Attributes infobox text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

bull Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

bull Induce a subsumption hierarchy

INFOBOXES

bull Collaborative content

bull Semi-structured data

Infobox Writer| bgcolour = silver| name = Edgar Allan Poe| image = Edgar_Allan_Poe_2jpg| caption = This [[daguerreotype]] of Poe was taken in 1848 | birth_date = birth date|1809|1|19|mf=y| birth_place = [[Boston Massachusetts]] [[United States|US]]| death_date = death date and age|1849|10|07|1809|01|19| death_place = [[Baltimore Maryland]] [[United States|US]]| occupation = Poet short story writer editor literary critic| movement = [[Romanticism]] [[Dark romanticism]]| genre = [[Horror fiction]] [[Crime fiction]] [[Detective fiction]]| magnum_opus = The Raven| spouse = [[Virginia Eliza Clemm Poe]]

DBpediaorg is a effort to bull extract structured information from Wikipediabull make this information available on the Web under an

open licensebull interlink the DBpedia dataset with other datasets on the

Web

DBPEDIA

1048607 1600000 concepts

1048607 including

1048698 58000 persons

1048698 70000 places

1048698 35000 music albums

1048698 12000 films

1048607 described by 91 million triples

1048607 using 8141 different properties

1048607 557000 links to pictures

1048607 1300000 links external web pages

1048607 207000 Wikipedia categories

1048607 75000 YAGO categories

The DBpedia Dataset

The DBpediaorg project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web It uses the SPARQL query language to query this data At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data

REPRESENTING EXTRACTED INFORMATION

httpenwikipediaorgwikiCalgary

httpdbpediaorgresourceCalgary

dbpedianative_name Calgaryrdquo

dbpediaaltitude ldquo1048rdquo

dbpediapopulation_city ldquo988193rdquo

dbpediapopulation_metro ldquo1079310rdquo

mayor_name

dbpediaDave_Bronconnier

governing_body

dbpediaCalgary_City_Council

Extracting Infobox Data (RDF Representation)

SPARQL

bull SPARQL is a query language for RDF

bullRDF is a directed labeled graph data format for representing information in the Web bullThis specification defines the syntax and semantics of the SPARQL query language for RDF

bull SPARQL can be used to express queries across diverse data sources whether the data is stored natively as RDF or viewed as RDF via middleware

1048607 httpdbpediaorgsparql

1048607 hosted on a OpenLink Virtuoso server

1048607 can answer SPARQL queries like

1048698 Give me all Sitcoms that are set in NYC

1048698 All tennis players from Moscow

1048698 All films by Quentin Tarentino

1048698 All German musicians that were born in Berlin in the 19th century

The DBpedia SPARQL Endpoint

bull Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing effortsndash Other initiatives Citizen Science Cognition and

Language Laboratory hellipbull This has been taken advantage of in AI

ndash Open Mind Commonsense (Singh) (collecting facts)

ndash Semantic Wikis

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

wwwphrasedetectivescom

bull Open Mind Common Sense ndash Singh

bull Crater mapping (results) ndash Kanefsky

bull Learner Learner2 1001 Paraphrases ndash Chklovski

bull FACTory ndash CyCORP

bull Hot or Not ndash 8 Days

bull ESP Phetch Verbosity Peekaboom ndash von Ahn

bull Galaxy Zoo ndash Oxford University

WEB COLLABORATION PROJECTS

wwwphrasedetectivescom

OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52,000 assertions)
  IsA: (IsA "apple" "fruit")
  PartOf: (PartOf "CPU" "computer")
  PropertyOf: (PropertyOf "coffee" "wet")
  MadeOf: (MadeOf "bread" "flour")
  DefinedAs: (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
  PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf: (SubeventOf "play sport" "score goal")
  FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
  CapableOf: (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
  LocationOf: (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
  EffectOf: (EffectOf "view video" "entertainment")
  DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf: (DesireOf "person" "not be depressed")
  MotivationOf: (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
  IsUsedFor: (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine: (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")

CONCEPT NET

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch

ESP

• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web

ESP the game

ESP THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
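The matching rule can be sketched as a small function. This is an illustration only, not the game's actual implementation: the function name `first_agreed_label` is hypothetical, and normalising by trimming and lowercasing is an assumption about what counts as "the same string".

```python
def first_agreed_label(labels_a, labels_b):
    """Return the first label typed by player A that player B also typed.

    Labels are compared after trimming whitespace and lowercasing
    (an assumed normalisation). Returns None if the players never agree.
    """
    typed_by_b = {label.strip().lower() for label in labels_b}
    for label in labels_a:
        normalised = label.strip().lower()
        if normalised in typed_by_b:
            return normalised
    return None


# Two players describing the same image without seeing each other's guesses.
print(first_agreed_label(["Dog", "grass"], ["lawn", " Grass "]))
```

A label that both players produce independently is likely to describe the image, which is why agreement can be used as the scoring signal.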

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th 2003 and December 10th 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once
• By 2008
  – 200,000 players
  – 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants
• Manual assessment of labels (‘would you use these labels to describe this image?’)
  – 15 participants, 20 images
  – 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

• 2 tasks
  – Find The Culprit (Annotation): the user must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): the user must agree/disagree with a coreference relation entered by another user

www.phrasedetectives.com

PHRASE DETECTIVES: THE TASKS

NAME THE CULPRIT

READINGS

• R. Mihalcea and A. Csomai. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen & M. Poesio (2012). Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti & M. Poesio (2013). GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase & M. Strube (2012). Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.

READINGS

• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi (2013). Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.


Wikipedia as Thesaurus

• Moreover, Wikipedia has a well-formed structure
  – Each article only describes a single concept
  – The title of the article is a short and well-formed phrase, like a term in a traditional thesaurus
  – Equivalent concepts are grouped together by redirected links
  – It contains a hierarchical categorization system, in which each article belongs to at least one category
  – Polysemous concepts are disambiguated by Disambiguation Pages

The different meanings that Artificial intelligence may refer to are listed in its disambiguation page

WIKIPEDIA FOR TEXT CATEGORIZATION / CLUSTERING

• Objective: use information in Wikipedia to improve performance of text classifiers / clustering systems
• A number of possibilities
  – Use similarity between documents and Wikipedia pages on a given topic as a feature for text classification
  – Use WIKIFICATION to enrich documents
  – Use Wikipedia category system as category repertoire

Using Wikipedia Categories for text classification

WIKIPEDIA FOR TEXT CLASSIFICATION

• Automatic identification of the topic/category of a text (e.g. computer science, psychology)
  – Books
  – Learning objects

“The United States was involved in the Cold War”

  United States                           0.3793
  Cold War                                0.3111
  Vietnam War                             0.0023
  World War I                             0.0023
  Communism                               0.0027
  Ronald Reagan                           0.0027
  Michail Gorbachev                       0.0023
  Cat: Wars Involving the United States   0.00779
  Cat: Global Conflicts                   0.00779

USING WIKIPEDIA FOR TEXT CLASSIFICATION

• Either directly use Wikipedia categories, or map one's categories to Wikipedia categories
• Use the documents associated with those categories as training documents

TEXT WIKIFICATION

Wikification = adding links to Wikipedia pages to documents

• Text

WIKIFICATION

• Wikipedia

May 2012, Truc-Vien T. Nguyen

Giotto was called to work in Padua and also in Rimini

Wikification pipeline

[Figure: Wikification pipeline. Stages shown: Raw (hyper)text; Decomposition; Clean Text; Keyword Extraction (Candidate Extraction, Candidate Ranking); Text with selected keywords; Word Sense Disambiguation (Extract Sense Definitions from Sense Inventory; Knowledge-based: Lesk-like Definition Overlap; Data Driven: Naive Bayes trained on Wikipedia; Voting); Annotated Text; Recomposition; (Hyper)text with linked keywords]

Keyword Extraction

• Finding important words/phrases in raw text
• Two-stage process
  – Candidate extraction
    • Typical methods: n-grams, noun phrases
  – Candidate ranking
    • Rank the candidates by importance
    • Typical methods:
      – Unsupervised: information theoretic
      – Supervised: machine learning using positional and linguistic features

Keyword Extraction using Wikipedia

1. Candidate extraction
• Semi-controlled vocabulary
  – Wikipedia article titles and anchor texts (surface forms)
    • E.g. “USA”, “US” = “United States of America”
  – More than 2,000,000 terms/phrases
  – Vocabulary is broad (e.g. “the”, “a” are included)

Keyword Extraction using Wikipedia

2. Candidate ranking
• tf.idf
  – Wikipedia articles as document collection
• Chi-squared independence of phrase and text
  – The degree to which it appeared more times than expected by chance
• Keyphraseness

$$P(\mathrm{keyword} \mid W) \approx \frac{\mathrm{count}(D_{key})}{\mathrm{count}(D_{W})}$$

Our own Approach (cf. Milne & Witten 2008, 2012; Ratinov et al. 2011)

• Use Wikipedia dump to compute two statistics
  • KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  • COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
• Two versions of the system
  • UNSUPERVISED: use statistics only
  • SUPERVISED: use distant learning to create training data

KEYPHRASENESS

• the probability that a term t is a link to a Wikipedia article (cf. Milne & Witten's prior link probability)
• Examples
  • The term “Georgia”
    – is found as a link in 22,631 Wikipedia articles
    – appears in 75,000 Wikipedia articles
    – keyphraseness = 22631/75000 = 0.3017466
  • Cf. the term “the”: keyphraseness = 0.0006

$$\mathrm{Keyphraseness}(t) = \frac{\mathrm{count}([\_ \mid t])}{\mathrm{count}(t)}$$
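The Georgia figures above can be reproduced with a few lines of code. This is a minimal sketch; the function name and the choice to return 0.0 for an unseen term are assumptions made for illustration.

```python
def keyphraseness(link_count: int, occurrence_count: int) -> float:
    """Fraction of a term's occurrences in which it is used as a link.

    link_count: number of articles where the term appears as a link anchor.
    occurrence_count: number of articles where the term appears at all.
    Returns 0.0 for a term that never occurs (an assumed convention).
    """
    if occurrence_count == 0:
        return 0.0
    return link_count / occurrence_count


# Counts for the term "Georgia" as given on the slide.
print(round(keyphraseness(22631, 75000), 4))  # 0.3017
```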

COMMONNESS

• the probability that a term t is a link to a SPECIFIC Wikipedia article a
• for example, the surface form “Georgia” was found to be linked to
  – a1 = University_of_Georgia 166 times
  – Republic_of_Georgia 18 times
  – Georgia_(United_States) 5 times
  – commonness(t, a1) = 166/(166+18+5) = 0.8783

$$\mathrm{Commonness}(t, a) = \frac{\mathrm{count}([a \mid t])}{\mathrm{count}(t)}$$

(In the worked example the denominator counts only the occurrences of t that are links, 166+18+5.)
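The commonness example can be checked in the same way. The sketch below follows the worked example, dividing by the total number of times the surface form is used as a link; the function name and the `Counter` layout are assumptions for illustration.

```python
from collections import Counter


def commonness(target_counts, article):
    """Share of a surface form's links that point to `article`.

    target_counts maps each target article to the number of times
    the surface form links to it.
    """
    total = sum(target_counts.values())
    if total == 0:
        return 0.0
    return target_counts.get(article, 0) / total


# Link targets of the surface form "Georgia", as given on the slide.
georgia = Counter({
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
})
print(round(commonness(georgia, "University_of_Georgia"), 4))  # 0.8783
```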

Extracting dictionaries and statistics from a Wikipedia dump

• Parsing
• In three phases
• Identify articles of relevance
• Extract (among other things)
  • Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
  • Set of LINKS [article|surface_form]
    • [[Pedanius Dioscorides|Dioscorides]]
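Extracting [article|surface_form] pairs from wiki markup can be sketched with a regular expression. This is a simplified illustration and not the parser used for the dump: the pattern and the function name `extract_links` are assumptions, and links containing more than one pipe (such as file links) are skipped.

```python
import re

# Matches [[Article]] and [[Article|surface form]].
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]*))?\]\]")


def extract_links(wikitext):
    """Return (article, surface_form) pairs found in a piece of wiki markup.

    When no surface form is given, the article title itself is the
    surface form.
    """
    links = []
    for match in WIKILINK.finditer(wikitext):
        article = match.group(1).strip()
        surface = (match.group(2) or article).strip()
        links.append((article, surface))
    return links


sample = "[[Pedanius Dioscorides|Dioscorides]] is linked from [[Romanticism]]."
print(extract_links(sample))
```

Counting the resulting pairs per surface form gives the link frequencies that keyphraseness and commonness are computed from.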

The Wikipedia Dump from July 2011

  – 11,459,639 pages in total
  – 12,525,583 links
    • specifying surface word, target, frequency
    • ranked by frequency
• for example, the mention “Georgia” is linked to
  – University_of_Georgia 166 times
  – Republic_of_Georgia 18 times
  – Georgia_(United_States) 5 times

Some statistics (all Wikidumps from July 2011)

Page Type        English      Italian     Polish
Redirected       4,465,652    323,591     134,148
List_of          138,581      836         5,021
Disambiguation   176,721      6,193       4,553
Relevant         4,361,020    917,354     920,486
Total            11,459,639   1,654,258   1,200,313

Surface forms titles articles

Dictionary      English      Italian     Polish
Titles          4,361,020    917,354     920,486
Surface forms   8,829,624    2,484,045   2,482,104
Files           745,724      72,126      n/a
Links           10,871,741   2,917,235   2,937,981

Files in Polish are arranged in a repository different from English/Italian.

Some definitions and figures
• surface form: the occurrence of a mention inside an article
• target article: the target Wiki article that a surface form is linked to

The Unsupervised Approach

• Use keyphraseness to identify candidate terms
  • Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  • Retain top 10
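The two steps above can be put together as follows. This is a sketch under stated assumptions, not the system's code: the function name, the dictionary layouts, and the reading of "top 10" as the ten highest-ranked target articles per term are all illustrative choices.

```python
def wikify_unsupervised(terms, keyphraseness, commonness,
                        threshold=0.01, top_n=10):
    """Filter candidate terms by keyphraseness, then rank targets by commonness.

    keyphraseness: dict mapping term -> keyphraseness score.
    commonness: dict mapping term -> {article: commonness score}.
    Returns {term: [(article, score), ...]} with at most top_n targets each.
    """
    result = {}
    for term in terms:
        # Step 1: keep only terms that are links often enough.
        if keyphraseness.get(term, 0.0) <= threshold:
            continue
        # Step 2: rank the possible target articles, most common first.
        ranked = sorted(commonness.get(term, {}).items(),
                        key=lambda pair: pair[1], reverse=True)
        result[term] = ranked[:top_n]
    return result


kp = {"Georgia": 0.3017, "the": 0.0006}
cm = {"Georgia": {"Republic_of_Georgia": 0.0952,
                  "University_of_Georgia": 0.8783,
                  "Georgia_(United_States)": 0.0265}}
print(wikify_unsupervised(["the", "Georgia"], kp, cm))
```

With the slide's figures, "the" is dropped by the threshold and "Georgia" keeps its three targets with University_of_Georgia ranked first.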

The Supervised Approach

• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)

$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$
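The formula can be evaluated directly on sets of links. The sketch below assumes, following Milne & Witten, that A1 and A2 are the sets of articles linking to a1 and a2 and that W is the set of all Wikipedia articles; the slide itself does not define them. The function name and the handling of the two edge cases are also assumptions. Note that, as written, the value is smaller the more links the two articles share.

```python
import math


def relatedness(in_links_a1, in_links_a2, total_articles):
    """Evaluate the link-based measure from the slide on two sets of in-links.

    Returns math.inf when the articles share no in-links (the logarithm of
    zero is undefined; treating this as 'maximally distant' is an assumption).
    """
    shared = len(in_links_a1 & in_links_a2)
    if shared == 0:
        return math.inf
    larger = max(len(in_links_a1), len(in_links_a2))
    smaller = min(len(in_links_a1), len(in_links_a2))
    denominator = math.log(total_articles) - math.log(smaller)
    if denominator == 0:
        # Both sets cover every article, so the numerator is zero as well.
        return 0.0
    return (math.log(larger) - math.log(shared)) / denominator


# Toy example: article ids stand in for the pages that link to a1 and a2.
print(relatedness({1, 2, 3, 4}, {3, 4, 5}, total_articles=100))
```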

Training a supervised wikifier

• Using WIKIPEDIA ITSELF as a source of training materials (see next)

Results on standard datasets

APPROACH               AQUAINT   WIKIPEDIA
Our approach           85.66     84.37
Milne & Witten 2008    83.61     80.31
Ratinov et al. 2011    84.52     90.20

Wikifying queries: the Bridgeman datasets

• BAL data sets
  – 1049 query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  – 100 query set
    • 3 annotators, each up to 3 manual annotations

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS   RESULTS
First candidate        64.77
Among first 2          71.59
First 3                75.42
First 4                77.18
First 5                78.32

Accuracy up by 17 points (36%)

Results for the GALATEAS languages and Arabic

LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78

The GALATEAS D2W web services

• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries, achieves throughput of 600 characters per second
• Integrated with LangLog tool

Use of the service in LangLog

(See Domoina's demo)

Other applications

• The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …

WIKIPEDIA FOR NER

http://en.wikipedia.org/wiki/FCC

The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

WIKIPEDIA FOR NER

Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action

WIKIPEDIA

WIKIPEDIA FOR NER

• Wikipedia has been used in NER systems
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging towards proper ENTITY DISAMBIGUATION

Distant learning

• Automatically extract examples
• Positive examples from mention-to-link Wikipedia page
• Negative examples from similar mentions with other links
• Use positive and negative examples to train model

The Supervised Approach: Using Wikipedia links to generate training data

• Example
  – Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
  – Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  – +1 Giotto was called to work -- Giotto_di_Bondone
  – -1 Giotto was called to work -- Giotto_Griffiths
  – -1 Giotto was called to work -- Giotto_Bizzarrini

http://en.wikipedia.org/wiki/Giotto_di_Bondone
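The Giotto dataset can be generated mechanically from a linked sentence. The following is a minimal sketch: the function name and the tuple layout are hypothetical, and real systems would attach feature vectors rather than raw text.

```python
def make_training_examples(context, linked_article, candidate_articles):
    """Label each candidate sense of a linked mention for distant learning.

    The article the Wikipedia link actually points to is the positive
    example (+1); every other candidate for the same surface form is a
    negative example (-1).
    """
    examples = []
    for article in candidate_articles:
        label = 1 if article == linked_article else -1
        examples.append((label, context, article))
    return examples


giotto_examples = make_training_examples(
    "Giotto was called to work",
    "Giotto_di_Bondone",
    ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
)
for label, context, article in giotto_examples:
    print(f"{label:+d} {context} -- {article}")
```

Because the labels come from links that Wikipedia editors already wrote, no manual annotation is needed to build the training set.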

MORE ADVANCED USES OF WIKIPEDIA

• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

• Taxonomic information: category structure
• Attributes: infobox, text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Induce a subsumption hierarchy

INFOBOXES

• Collaborative content
• Semi-structured data

{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
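Because an infobox is a list of attribute = value pairs, its fields can be read with very little code. The sketch below assumes the markup has one field per line, each starting with a pipe, which is a simplification of real infobox syntax; the function name is hypothetical.

```python
def parse_infobox_fields(wikitext):
    """Read 'key = value' fields from infobox markup, one field per line.

    Only lines that start with '|' and contain '=' are treated as fields.
    Splitting on the first '=' keeps nested templates such as
    {{birth date|1809|1|19|mf=y}} intact inside the value.
    """
    fields = {}
    for line in wikitext.splitlines():
        line = line.strip()
        if not line.startswith("|") or "=" not in line:
            continue
        key, value = line[1:].split("=", 1)
        fields[key.strip()] = value.strip()
    return fields


infobox_sample = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_date = {{birth date|1809|1|19|mf=y}}
| magnum_opus = The Raven
}}"""
print(parse_infobox_fields(infobox_sample))
```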

DBPEDIA

DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web

The DBpedia Dataset

• 1,600,000 concepts
• including
  – 58,000 persons
  – 70,000 places
  – 35,000 music albums
  – 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data.

REPRESENTING EXTRACTED INFORMATION

Extracting Infobox Data (RDF Representation)

http://en.wikipedia.org/wiki/Calgary

http://dbpedia.org/resource/Calgary
  dbpedia:native_name        “Calgary”
  dbpedia:altitude           “1048”
  dbpedia:population_city    “988193”
  dbpedia:population_metro   “1079310”
  mayor_name                 dbpedia:Dave_Bronconnier
  governing_body             dbpedia:Calgary_City_Council
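The Calgary data can be viewed as a set of (subject, predicate, object) triples. The sketch below stores them as plain tuples and filters them by pattern, which is the basic operation a SPARQL query performs. The tuple representation, the `dbpedia:` prefix on the last two predicates, and the function name `match` are assumptions for illustration; a real application would use an RDF library.

```python
CALGARY = "http://dbpedia.org/resource/Calgary"

# The infobox facts from the slide, one triple per attribute.
TRIPLES = [
    (CALGARY, "dbpedia:native_name", "Calgary"),
    (CALGARY, "dbpedia:altitude", "1048"),
    (CALGARY, "dbpedia:population_city", "988193"),
    (CALGARY, "dbpedia:population_metro", "1079310"),
    (CALGARY, "dbpedia:mayor_name", "dbpedia:Dave_Bronconnier"),
    (CALGARY, "dbpedia:governing_body", "dbpedia:Calgary_City_Council"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every given (non-None) position."""
    return [
        (s, p, o) for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# "What is the altitude of Calgary?"
print(match(TRIPLES, subject=CALGARY, predicate="dbpedia:altitude"))
```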

SPARQL

• SPARQL is a query language for RDF



The different meanings that Artificial intelligence may refer to are listed in its disambiguation page

WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING

bull Objective use information in Wikipedia to improve performance of text classifiers clustering systems

bull A number of possibilitiesndash Use similarity between documents and Wikipedia

pages on a given topic as a feature for text classification

ndash Use WIKIFICATION to enrich documentsndash Use Wikipedia category system as category repertoire

Using Wikipedia Categories for text classification

17

WIKIPEDIA FOR TEXT CLASSIFICATIONbull Automatic identification of the topiccategory of a text (eg computer science

psychology)ndash Booksndash Learning objects

ldquoThe United States was involved in the Cold Warrdquo

United States03793

Cold War03111

Vietnam War00023

World War I00023

Communism00027

Ronald Reagan00027

Michail Gorbachev00023

Cat Wars Involvingthe United States000779

Cat Global Conflicts000779

USING WIKIPEDIA FOR TEXT CLASSIFICATION

bull Either directly use Wikipedia categories or map onersquos categories to Wikipedia categories

bull Use the documents associated with those categories as training documents

TEXT WIKIFICATION

Wikification = adding links to Wikipedia pages to documents

bull Text

WIKIFICATION

bull Wikipedia

20May 2012 Truc-Vien T Nguyen

Giotto was called to work in Padua and also in Rimini

Wikification pipeline

Candidate

Extraction

Candidate

Ranking

Extract Sense Definitions

from Sense Inventory

Knowledge- based

Lesk- like Definition

Overlap

Data Driven

Naive Bayes

trained on Wikipedia

Voting

Tex

t w

ith

sel

ecte

d k

eyw

ord

s

Dec

om

po

siti

on

Raw

(h

yper

)tex

t

Cle

an T

ext

Rec

om

posi

tion

(Hyp

er)t

ext

wit

h

linked

key

wo

rds

Annotated Text

Word Sense DisambiguationKeyword Extraction

Keyword Extraction

bull Finding important wordsphrases in raw textbull Two-stage process

ndash Candidate extractionbull Typical methods n-grams noun phrases

ndash Candidate rankingbull Rank the candidates by importancebull Typical methods

ndash Unsupervised information theoretic ndash Supervised machine learning using positional and linguistic

features

Keyword Extraction using Wikipedia

1 Candidate extractionbull Semi-controlled vocabulary

ndash Wikipedia article titles and anchor texts (surface forms)

bull Eg ldquoUSArdquo ldquoUSrdquo = ldquoUnited States of Americardquo

ndash More than 2000000 termsphrasesndash Vocabulary is broad (eg the a are included)

Keyword Extraction using Wikipedia

2 Candidate rankingbull tf idf

ndash Wikipedia articles as document collection

bull Chi-squared independence of phrase and textndash The degree to which it appeared more times than

expected by chance

bull Keyphraseness

)(

)()|(

W

key

Dcount

DcountWkeywordP

Our own Approach(Cfr Milne amp Witten 2008 2012 Ratinov et al 2011)

bull Use Wikipedia dump to compute two statistics

bull KEYPHRASENESS prior probability that a term is used to refer to a Wikipedia article

bull COMMONNESS probability that phrase is used to refer to specific Wikipedia article

bull Two versions of system

bull UNSUPERVISED use statistics only

bull SUPERVISED use distant learning to create training data

KEYPHRASENESS

bull the probability that a term t is a link to a Wikipedia article

(cfr Milne amp Wittenrsquos prior link probability)

bull Examplesbull The term Georgia

ndash Is found as a link in 22631 Wikipedia articlesndash appears in 75000 Wikipedia articles keyphraseness = 2263175000 = 03017466

bull Cfr the term ldquotherdquo keyphraseness = 00006

euro

Keyphraseness(t) =count([_ | t])

count(t)

COMMONNESS

bull the probability that a term t is a link to a SPECIFIC Wikipedia article a

bull for example the surface form Georgia was found to be linked to

ndash a1 = University_of_Georgia 166 times

commonness(t a1) = 166(166+18+5) = 08783

ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times

euro

Commonness(ta) =count([a | t])

count(t)

Extracting dictionaries and statistics from a Wikipedia dump

bull Parsingbull In three phases

bull Identify articles of relevancebull Extract (among other things)

bull Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)

bull Set of LINKS [article|surface_form]

bull [[Pedanius Dioscorides|Dioscorides]]

The Wikipedia Dump from July 2011

ndash 11459639 pages in totalndash 12525583 links

bull specifying surface word target frequency

ndash ranked by frequency bull for example the mention Georgia is linked to

ndash University_of_Georgia 166 times ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times

May 2012 29Truc-Vien T Nguyen

Some statistics (all Wikidumps from July 2011)

Page Type English Italian Polish

Redirected 4465652 323591 134148

List_of 138581 836 5021

Disambiguation 176721 6193 4553

Relevant 4361020 917354 920486

Total 11459639 1654258 1200313

Surface forms titles articles

Dictionary English Italian Polish

Titles 4361020 917354 920486

Surface forms 8829624 2484045 2482104

Files 745724 72126 na

Links 10871741 2917235 2937981

Files in Polish are arranged in a repository different from EnglishItalian

Some definitions and figuressurface form the occurence of a mention inside an articletarget article the target Wiki article a surface form linked to

The Unsupervised Approach

• Use keyphraseness to identify candidate terms
  • Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  • Retain top 10
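The two steps above can be sketched as a small function. This is an illustrative reading of the slide, not the actual system: `wikify_unsupervised`, its parameters and the dictionary layout are assumptions, with the threshold 0.01 and the top-10 cut-off taken from the slide.

```python
def wikify_unsupervised(terms, keyphraseness, link_counts,
                        threshold=0.01, top_n=10):
    """Return {term: [(article, commonness), ...]} for terms that pass the filter.

    keyphraseness -- dict mapping a term to its keyphraseness score
    link_counts   -- dict mapping a term to {article: number of links}
    """
    result = {}
    for term in terms:
        # Step 1: keep only terms whose keyphraseness is above the threshold.
        if keyphraseness.get(term, 0.0) <= threshold:
            continue
        counts = link_counts.get(term, {})
        total = sum(counts.values())
        if total == 0:
            continue
        # Step 2: rank candidate articles by commonness and keep the top N.
        ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
        result[term] = [(article, n / total) for article, n in ranked[:top_n]]
    return result


scores = {"Georgia": 0.3017466, "the": 0.0006}
counts = {"Georgia": {"University_of_Georgia": 166,
                      "Republic_of_Georgia": 18,
                      "Georgia_(United_States)": 5}}
annotations = wikify_unsupervised(["the", "Georgia"], scores, counts)
print(annotations)
```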

The Supervised Approach

• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)

$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$

where A_1 and A_2 are the sets of articles that link to a_1 and a_2, and W is the set of all Wikipedia articles.
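A small sketch of the formula exactly as written on the slide, with link sets represented as Python sets of article identifiers. The function name and the handling of the degenerate cases (no shared links, empty sets) are assumptions; the slide does not say how those cases are treated. Note that, with the formula in this form, two articles with identical incoming links score 0, so smaller values mean more closely related articles.

```python
import math


def relatedness(links_a1: set, links_a2: set, n_articles: int) -> float:
    """Link-based measure from the slide; 0.0 means identical link sets."""
    common = len(links_a1 & links_a2)
    larger = max(len(links_a1), len(links_a2))
    smaller = min(len(links_a1), len(links_a2))
    # Assumption: with no shared links the logarithm is undefined, so the
    # pair is treated as maximally distant.
    if common == 0 or smaller == 0 or n_articles <= smaller:
        return float("inf")
    numerator = math.log(larger) - math.log(common)
    denominator = math.log(n_articles) - math.log(smaller)
    return numerator / denominator


# Toy data: integers stand in for the articles that link to a1 and a2.
a1_links = {1, 2, 3, 4}
a2_links = {3, 4, 5}
value = relatedness(a1_links, a2_links, n_articles=100)
print(round(value, 4))
```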

Training a supervised wikifier

• Using WIKIPEDIA ITSELF as a source of training materials (see next)

Results on standard datasets

APPROACH               AQUAINT   WIKIPEDIA
Our approach             85.66       84.37
Milne & Witten 2008      83.61       80.31
Ratinov et al. 2011      84.52       90.20

Wikifying queries: the Bridgeman datasets

• BAL data sets
  – 1049 Query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  – 100 Query set
    • 3 annotators, each up to 3 manual annotations

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS   RESULTS
First candidate          64.77
Among first 2            71.59
First 3                  75.42
First 4                  77.18
First 5                  78.32

Accuracy up by 17 points (36%)

Results for the GALATEAS languages and Arabic

LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78

The GALATEAS D2W web services

• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves throughput of 600 characters per second
• Integrated with the LangLog tool

Use of the service in LangLog

(See Domoina's demo)

Other applications

• The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …

WIKIPEDIA FOR NER

http://en.wikipedia.org/wiki/FCC

The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

WIKIPEDIA FOR NER

Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action

WIKIPEDIA

WIKIPEDIA FOR NER

• Wikipedia has been used in NER systems
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging towards proper ENTITY DISAMBIGUATION

Distant learning

• Automatically extract examples
  • Positive examples from mention-to-link Wikipedia page
  • Negative examples from similar mentions with other links
• Use positive and negative examples to train model

The Supervised Approach: Using Wikipedia links to generate training data

• Example
  – "Giotto was called to work in Padua, and also in Rimini" (sentence taken from Wikipedia text, with links available)
  – Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  – +1 Giotto was called to work -- Giotto_di_Bondone
  – -1 Giotto was called to work -- Giotto_Griffiths
  – -1 Giotto was called to work -- Giotto_Bizzarrini
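The labelling scheme in the Giotto example can be written down directly. In this minimal sketch the helper `make_training_examples` is hypothetical: the article actually linked in Wikipedia gives the positive example, and every other candidate for the same surface form gives a negative one.

```python
def make_training_examples(context: str, linked_article: str, candidates: list):
    """Return (label, context, candidate) triples: +1 for the linked article, -1 otherwise."""
    return [(+1 if candidate == linked_article else -1, context, candidate)
            for candidate in candidates]


examples = make_training_examples(
    context="Giotto was called to work",
    linked_article="Giotto_di_Bondone",
    candidates=["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
)
for label, context, candidate in examples:
    print(f"{label:+d} {context} -- {candidate}")
```

No manual annotation is needed: the links that Wikipedia editors have already inserted act as the gold standard.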


http://en.wikipedia.org/wiki/Giotto_di_Bondone

MORE ADVANCED USES OF WIKIPEDIA

• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

• Taxonomic information: category structure
• Attributes: infobox, text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Induce a subsumption hierarchy

INFOBOXES

• Collaborative content
• Semi-structured data

{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
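Infobox markup of this kind can be turned into attribute-value pairs with a small scanner. The sketch below is only illustrative: `parse_infobox` is a hypothetical helper, it assumes the infobox is wrapped in `{{ }}`, and it splits fields on `|` only outside nested `{{...}}` and `[[...]]`, so that values such as `[[United States|US]]` stay intact. It is not how DBpedia's extraction framework works.

```python
def parse_infobox(wikitext: str) -> dict:
    """Split an infobox template into {attribute: raw value} pairs."""
    body = wikitext.strip()
    if body.startswith("{{") and body.endswith("}}"):
        body = body[2:-2]

    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):        # entering a nested template or link
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):      # leaving a nested template or link
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))  # top-level field separator
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))

    fields = {}
    for part in parts[1:]:              # parts[0] is the template name
        key, sep, value = part.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


sample = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
}}"""
poe = parse_infobox(sample)
print(poe["name"], "|", poe["birth_place"])
```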

DBPEDIA

DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web

The DBpedia Dataset

• 1,600,000 concepts
• including
  – 58,000 persons
  – 70,000 places
  – 35,000 music albums
  – 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At "Developers Guide to Semantic Web Toolkits" you can find a development toolkit in your preferred programming language to process DBpedia data.

REPRESENTING EXTRACTED INFORMATION

http://en.wikipedia.org/wiki/Calgary

http://dbpedia.org/resource/Calgary

dbpedia:native_name         "Calgary"
dbpedia:altitude            "1048"
dbpedia:population_city     "988193"
dbpedia:population_metro    "1079310"
mayor_name                  dbpedia:Dave_Bronconnier
governing_body              dbpedia:Calgary_City_Council

Extracting Infobox Data (RDF Representation)

SPARQL

• SPARQL is a query language for RDF
  • RDF is a directed, labeled graph data format for representing information in the Web
  • This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

The DBpedia SPARQL Endpoint

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century
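As an illustration, the "all films by Quentin Tarantino" question might be phrased as below. This is a hedged sketch: the `dbo:`/`dbr:` prefixes and the `dbo:Film` and `dbo:director` terms are assumptions about the DBpedia vocabulary and do not come from the slides, and the code only builds the request URL (using the endpoint address given above) without sending it.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint address from the slide

# Assumed vocabulary: dbo:Film and dbo:director are not taken from the lecture.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film a dbo:Film ;
        dbo:director dbr:Quentin_Tarantino .
}
"""


def build_request_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Encode a SPARQL query as an HTTP GET request URL for the endpoint."""
    params = {
        "query": query.strip(),
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urlencode(params)


url = build_request_url(QUERY)
print(url[:60] + "...")
```

The resulting URL can be opened with any HTTP client; the other example questions differ only in the triple patterns inside the `WHERE` clause.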

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis

www.phrasedetectives.com

WEB COLLABORATION PROJECTS

• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – Cycorp
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University

OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52,000 assertions)
  – IsA: (IsA "apple" "fruit")
  – PartOf: (PartOf "CPU" "computer")
  – PropertyOf: (PropertyOf "coffee" "wet")
  – MadeOf: (MadeOf "bread" "flour")
  – DefinedAs: (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
  – PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  – SubeventOf: (SubeventOf "play sport" "score goal")
  – FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  – LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
  – CapableOf: (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
  – LocationOf: (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
  – EffectOf: (EffectOf "view video" "entertainment")
  – DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  – DesireOf: (DesireOf "person" "not be depressed")
  – MotivationOf: (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
  – UsedFor: (UsedFor "fireplace" "burn wood")
  – CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
  – SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  – ThematicKLine: (ThematicKLine "wedding dress" "veil")
  – ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")

CONCEPT NET

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch

ESP

• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web

ESP the game

ESP THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
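The matching rule in the last bullet can be sketched in a few lines. The function name `agreed_labels` and the normalisation step (lower-casing and trimming whitespace) are assumptions for illustration; the slides do not describe how the real game compares strings or how many points a match is worth.

```python
def agreed_labels(labels_a, labels_b):
    """Return the labels typed by both players, in the order player A typed them."""
    def normalise(label: str) -> str:
        return label.strip().lower()

    typed_by_b = {normalise(label) for label in labels_b}
    matches = [normalise(label) for label in labels_a
               if normalise(label) in typed_by_b]
    return list(dict.fromkeys(matches))  # drop repeats, keep first-seen order


player_a = ["Dog", "grass", "ball"]
player_b = ["puppy", "ball ", "dog"]
matches = agreed_labels(player_a, player_b)
print(matches)
```

Because neither player can see what the other types, a label is only accepted when two independent people produce it, which is what makes the collected labels trustworthy.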

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th 2003 and December 10th 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once
• By 2008
  – 200,000 players
  – 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  – 15 participants, 20 images
  – 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

• 2 tasks
  – Find The Culprit (Annotation): the user must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): the user must agree/disagree with a coreference relation entered by another user

www.phrasedetectives.com

PHRASE DETECTIVES THE TASKS

NAME THE CULPRIT

READINGS

• R. Mihalcea and A. Csomai. Wikify!: Linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012.

READINGS

• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.

  • 807 - TEXT ANALYTICS
  • WIKIPEDIA
  • Slide 3
  • Slide 4
  • WIKIPEDIA FOR TEXT ANALYTICS
  • Wikipedia as Thesaurus for text classification clustering
  • Wikipedia as Thesaurus
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
  • Slide 13
  • The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
  • WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
  • Using Wikipedia Categories for text classification
  • WIKIPEDIA FOR TEXT CLASSIFICATION
  • USING WIKIPEDIA FOR TEXT CLASSIFICATION
  • TEXT WIKIFICATION
  • WIKIFICATION
  • Wikification pipeline
  • Keyword Extraction
  • Keyword Extraction using Wikipedia
  • Slide 24
  • Slide 25
  • KEYPHRASENESS
  • COMMONNESS
  • Slide 28
  • The Wikipedia Dump from July 2011
  • Some statistics (all Wikidumps from July 2011)
  • Surface forms titles articles
  • Slide 32
  • Slide 33
  • Training a supervised wikifier
  • Results on standard datasets
  • Wikifying queries the Bridgeman datasets
  • Results on Bridgeman 1000 Y3
  • Results for the GALATEAS languages and Arabic
  • The GALATEAS D2W web services
  • Use of the service in LangLog
  • Other applications
  • WIKIPEDIA FOR NER
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • The Supervised Approach Using Wikipedia links to generate training data
  • MORE ADVANCED USES OF WIKIPEDIA
  • SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
  • Wikipedia category network
  • Deriving a taxonomy from Wikipedia (AAAI 2007)
  • Slide 53
  • INFOBOXES
  • Slide 56
  • Slide 57
  • Slide 58
  • SPARQL
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • OPEN MIND COMMONSENSE
  • Slide 65
  • CONCEPT NET
  • GAMES WITH A PURPOSE
  • GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
  • EXAMPLES OF GWAP
  • ESP
  • ESP the game
  • ESP THE GAME
  • THE TASK
  • SCORING BY MATCHING
  • SOME STATISTICS
  • QUALITY OF THE LABELS
  • GOOGLE IMAGE LABELLER
  • Slide 78
  • RESULTS
  • PHRASE DETECTIVES
  • Slide 81
  • NAME THE CULPRIT
  • READINGS
  • Slide 84
Page 15: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING

• Objective: use information in Wikipedia to improve the performance of text classifiers / clustering systems
• A number of possibilities
  – Use similarity between documents and Wikipedia pages on a given topic as a feature for text classification
  – Use WIKIFICATION to enrich documents
  – Use the Wikipedia category system as category repertoire

Using Wikipedia Categories for text classification


WIKIPEDIA FOR TEXT CLASSIFICATION

• Automatic identification of the topic/category of a text (e.g. computer science, psychology)
  – Books
  – Learning objects

"The United States was involved in the Cold War"

United States                            0.3793
Cold War                                 0.3111
Vietnam War                              0.0023
World War I                              0.0023
Communism                                0.0027
Ronald Reagan                            0.0027
Mikhail Gorbachev                        0.0023
Cat: Wars Involving the United States    0.00779
Cat: Global Conflicts                    0.00779

USING WIKIPEDIA FOR TEXT CLASSIFICATION

• Either directly use Wikipedia categories, or map one's categories to Wikipedia categories
• Use the documents associated with those categories as training documents

TEXT WIKIFICATION

Wikification = adding links to Wikipedia pages to documents

WIKIFICATION

• Text: Giotto was called to work in Padua, and also in Rimini
• Wikipedia

Wikification pipeline

[Figure: Wikification pipeline. Raw (hyper)text → Decomposition → Clean text → Keyword Extraction (Candidate Extraction, Candidate Ranking) → Text with selected keywords → Word Sense Disambiguation (Extract Sense Definitions from Sense Inventory; Knowledge-based: Lesk-like Definition Overlap; Data-driven: Naive Bayes trained on Wikipedia; Voting) → Annotated Text → Recomposition → (Hyper)text with linked keywords]

Keyword Extraction

• Finding important words/phrases in raw text
• Two-stage process
  – Candidate extraction
    • Typical methods: n-grams, noun phrases
  – Candidate ranking
    • Rank the candidates by importance
    • Typical methods
      – Unsupervised: information theoretic
      – Supervised: machine learning using positional and linguistic features

Keyword Extraction using Wikipedia

1. Candidate extraction
• Semi-controlled vocabulary
  – Wikipedia article titles and anchor texts (surface forms)
    • E.g. "USA", "U.S." = "United States of America"
  – More than 2,000,000 terms/phrases
  – Vocabulary is broad (e.g. "the", "a" are included)
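Candidate extraction against a controlled vocabulary can be sketched as follows: generate all word n-grams of the text and keep those that occur in the vocabulary of titles and surface forms. The function names, the tokenisation by `\w+`, the lower-casing and the tiny vocabulary are all assumptions for illustration.

```python
import re


def ngram_candidates(text: str, max_n: int = 3) -> list:
    """Return all word n-grams of the text, for n = 1 .. max_n."""
    tokens = re.findall(r"\w+", text.lower())
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def extract_candidates(text: str, vocabulary: set, max_n: int = 3) -> list:
    """Keep only the n-grams found in the (lower-cased) vocabulary."""
    return [gram for gram in ngram_candidates(text, max_n) if gram in vocabulary]


# Toy vocabulary standing in for Wikipedia titles and anchor texts.
vocabulary = {"united states", "cold war"}
sentence = "The United States was involved in the Cold War"
candidates = extract_candidates(sentence, vocabulary)
print(candidates)
```

With the full Wikipedia vocabulary, words such as "the" would also be returned as candidates, which is why a ranking step is needed afterwards.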

Keyword Extraction using Wikipedia

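2. Candidate ranking
• tf.idf
  – Wikipedia articles as document collection
• Chi-squared independence of phrase and text
  – The degree to which it appeared more times than expected by chance
• Keyphraseness

$$P(\mathrm{keyword} \mid W) \approx \frac{\mathrm{count}(D_{key})}{\mathrm{count}(D_W)}$$

where count(D_key) is the number of documents in which the term W was selected as a keyword and count(D_W) is the number of documents in which it appears.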
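For comparison with keyphraseness, a tf.idf score can be computed as below. This is one common variant (relative term frequency times the natural log of the inverse document frequency), chosen here as an assumption; the slide does not say which weighting was used, and the numbers in the example are made up purely to show the arithmetic.

```python
import math


def tf_idf(term_count: int, doc_length: int,
           n_docs: int, docs_with_term: int) -> float:
    """tf.idf of a term in one document, relative to a document collection."""
    if doc_length <= 0 or docs_with_term <= 0:
        return 0.0
    tf = term_count / doc_length             # how prominent the term is here
    idf = math.log(n_docs / docs_with_term)  # how rare it is in the collection
    return tf * idf


# Hypothetical counts: a term used twice in a 100-word text, found in 10 of 1000 articles.
score = tf_idf(term_count=2, doc_length=100, n_docs=1000, docs_with_term=10)
print(round(score, 4))
```

A term that occurs in every article of the collection gets an idf of zero, so very common words are ranked last.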

Our own Approach (cf. Milne & Witten 2008, 2012; Ratinov et al. 2011)

• Use a Wikipedia dump to compute two statistics
  • KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  • COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
• Two versions of the system
  • UNSUPERVISED: use statistics only
  • SUPERVISED: use distant learning to create training data

KEYPHRASENESS

• The probability that a term t is a link to a Wikipedia article (cf. Milne & Witten's prior link probability)
• Examples
  • The term "Georgia"
    – is found as a link in 22,631 Wikipedia articles
    – appears in 75,000 Wikipedia articles
    – keyphraseness = 22631 / 75000 = 0.3017466
  • Cf. the term "the": keyphraseness = 0.0006

$$\mathrm{Keyphraseness}(t) = \frac{\mathrm{count}([\_ \mid t])}{\mathrm{count}(t)}$$

COMMONNESS

bull the probability that a term t is a link to a SPECIFIC Wikipedia article a

bull for example the surface form Georgia was found to be linked to

ndash a1 = University_of_Georgia 166 times

commonness(t a1) = 166(166+18+5) = 08783

ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times

euro

Commonness(ta) =count([a | t])

count(t)

Extracting dictionaries and statistics from a Wikipedia dump

bull Parsingbull In three phases

bull Identify articles of relevancebull Extract (among other things)

bull Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)

bull Set of LINKS [article|surface_form]

bull [[Pedanius Dioscorides|Dioscorides]]

The Wikipedia Dump from July 2011

ndash 11459639 pages in totalndash 12525583 links

bull specifying surface word target frequency

ndash ranked by frequency bull for example the mention Georgia is linked to

ndash University_of_Georgia 166 times ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times

May 2012 29Truc-Vien T Nguyen

Some statistics (all Wikidumps from July 2011)

Page Type English Italian Polish

Redirected 4465652 323591 134148

List_of 138581 836 5021

Disambiguation 176721 6193 4553

Relevant 4361020 917354 920486

Total 11459639 1654258 1200313

Surface forms titles articles

Dictionary English Italian Polish

Titles 4361020 917354 920486

Surface forms 8829624 2484045 2482104

Files 745724 72126 na

Links 10871741 2917235 2937981

Files in Polish are arranged in a repository different from EnglishItalian

Some definitions and figuressurface form the occurence of a mention inside an articletarget article the target Wiki article a surface form linked to

The Unsupervised Approach

bull Use Keyphraseness to identify candidate termsbull Retain terms whose keyphraseness is

above a certain threshold (currently 001)bull Use commonness to rank

bull Retain top 10

The Supervised Approach

bull Features in addition to commonness use measures of SIMILARITY between text containing the term and the candidate Wikipedia page

bull RELATEDNESS a measure of similarity between the LINKS (cfr MilneampWittenrsquos NORMALIZED LINK DISTANCE)

euro

Re latedness(a1a2) =log(max( A1 A2 )) minus log( A1 cap A2 ))

log(W ) minus log(min( A1 A2 ))

Training a supervised wikifier

bull Using WIKIPEDIA ITSELF as source of training materials (see next)

Results on standard datasets

APPROACH AQUAINT WIKIPEDIA

Our approach 8566 8437

MilneampWitten 2008 8361 8031

Ratinov et al 2011 8452 9020

bull BAL Data setsndash 1049 Query set

bull 1 annotator up to 3 manual annotationsbull 1 automatic annotation

ndash 100 Query setbull 3 annotators each up to 3 manual annotations

Wikifying queries the Bridgeman datasets

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS RESULTS

First candidate 6477

Among first 2 7159

First 3 7542

First 4 7718

First 5 7832

Accuracy up by 17 points (36)

Results for the GALATEAS languages and Arabic

LANGUAGE WIKIPEDIA SIZE RESULTS (on Wikipedia subset)

English 4M articles 8437

Italian 1M 7964

French 14M 76-77

German 16M 72-73

Dutch 16M 70-71

Polish 900K 6081

Arabic 200K 8078

The GALATEAS D2W web services

bull Available as open sourcebull Deployed within LinguaGridbull API based on the Morphosyntactic Annotation

Framework (MAF) an ISO standardbull Tested on 15M queries achieves throughput

of 600 characters per secondbull Integrated with LangLog tool

Use of the service in LangLog

(See Domoinarsquos demo)

Other applications

bull The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [ATampT] By a 4-0 vote it allowed ATampT to continue offering special discount packages to big customers called Tariff 12 rejecting appeals by ATampT competitors that the discounts were illegal hellip

WIKIPEDIA FOR NER

httpenwikipediaorgwikiFCC

The Federal Communications Commission (FCC) is an independent United States government agency created directed and empowered by Congressional statute (see 47 USC sect 151 and 47 USC sect 154)

WIKIPEDIA FOR NER

Numberofglucocorticoidreceptorsinlymphocytesandtheirsensitivitytohormoneaction

WIKIPEDIA

WIKIPEDIA FOR NER

bull Wikipedia has been used in NER systemsndash As a source of features for normal NER ndash To automatically create training materials

(DISTANT LEARNING)ndash To go beyond NE tagging towards proper ENTITY

DISAMBIGUATION

Distant learning

bull Automatically extract examples

bull positive examples from mention-to-link Wikipedia page

bull Negative examples from similar mentions with other links

bull Use positive and negative examples to train model

The Supervised Approach Using Wikipedia links to generate training data

bull Examplendash Giotto was called to work in Padua and also in Rimini (sentence taken from Wikipedia text with links avalable)ndash Giotto_di_Bondone (painter) Giotto_Griffiths (Welsh rugby player)

Giotto_Bizzarrini (automobile engineer)

bull Datasetndash +1 Giotto was called to work -- Giotto_di_Bondonendash -1 Giotto was called to work -- Giotto_Griffithsndash -1 Giotto was called to work -- Giotto_Bizzarrini

May 2012 48Truc-Vien T Nguyen

httpenwikipediaorgwikiGiotto_di_Bondone

MORE ADVANCED USES OF WIKIPEDIA

bull As a source of ONTOLOGICAL KNOWLEDGEbull DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

bull Taxonomic information category structurebull Attributes infobox text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

bull Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

bull Induce a subsumption hierarchy

INFOBOXES

bull Collaborative content

bull Semi-structured data

Infobox Writer| bgcolour = silver| name = Edgar Allan Poe| image = Edgar_Allan_Poe_2jpg| caption = This [[daguerreotype]] of Poe was taken in 1848 | birth_date = birth date|1809|1|19|mf=y| birth_place = [[Boston Massachusetts]] [[United States|US]]| death_date = death date and age|1849|10|07|1809|01|19| death_place = [[Baltimore Maryland]] [[United States|US]]| occupation = Poet short story writer editor literary critic| movement = [[Romanticism]] [[Dark romanticism]]| genre = [[Horror fiction]] [[Crime fiction]] [[Detective fiction]]| magnum_opus = The Raven| spouse = [[Virginia Eliza Clemm Poe]]

DBpediaorg is a effort to bull extract structured information from Wikipediabull make this information available on the Web under an

open licensebull interlink the DBpedia dataset with other datasets on the

Web

DBPEDIA

1048607 1600000 concepts

1048607 including

1048698 58000 persons

1048698 70000 places

1048698 35000 music albums

1048698 12000 films

1048607 described by 91 million triples

1048607 using 8141 different properties

1048607 557000 links to pictures

1048607 1300000 links external web pages

1048607 207000 Wikipedia categories

1048607 75000 YAGO categories

The DBpedia Dataset

The DBpediaorg project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web It uses the SPARQL query language to query this data At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data

REPRESENTING EXTRACTED INFORMATION

httpenwikipediaorgwikiCalgary

httpdbpediaorgresourceCalgary

dbpedianative_name Calgaryrdquo

dbpediaaltitude ldquo1048rdquo

dbpediapopulation_city ldquo988193rdquo

dbpediapopulation_metro ldquo1079310rdquo

mayor_name

dbpediaDave_Bronconnier

governing_body

dbpediaCalgary_City_Council

Extracting Infobox Data (RDF Representation)

SPARQL

bull SPARQL is a query language for RDF

bullRDF is a directed labeled graph data format for representing information in the Web bullThis specification defines the syntax and semantics of the SPARQL query language for RDF

bull SPARQL can be used to express queries across diverse data sources whether the data is stored natively as RDF or viewed as RDF via middleware

1048607 httpdbpediaorgsparql

1048607 hosted on a OpenLink Virtuoso server

1048607 can answer SPARQL queries like

1048698 Give me all Sitcoms that are set in NYC

1048698 All tennis players from Moscow

1048698 All films by Quentin Tarentino

1048698 All German musicians that were born in Berlin in the 19th century

The DBpedia SPARQL Endpoint

bull Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing effortsndash Other initiatives Citizen Science Cognition and

Language Laboratory hellipbull This has been taken advantage of in AI

ndash Open Mind Commonsense (Singh) (collecting facts)

ndash Semantic Wikis

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

wwwphrasedetectivescom

bull Open Mind Common Sense ndash Singh

bull Crater mapping (results) ndash Kanefsky

bull Learner Learner2 1001 Paraphrases ndash Chklovski

bull FACTory ndash CyCORP

bull Hot or Not ndash 8 Days

bull ESP Phetch Verbosity Peekaboom ndash von Ahn

bull Galaxy Zoo ndash Oxford University

WEB COLLABORATION PROJECTS

wwwphrasedetectivescom

OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52000 assertions)

IsA (IsA apple fruit) Part of (PartOf CPU computer) PropertyOf (PropertyOf coffee wet) MadeOf (MadeOf bread flour) DefinedAs (DefinedAs meat flesh of animal)

EVENTS (38000 assertions)

PrerequisiteeventOf (PrerequisiteEventOf read letter open envelope) SubeventOf (SubeventOf play sport score goal) FirstSubeventOF (FirstSubeventOf start fire light match) LastSubeventOf (LastSubeventOf attend classical concert applaud)

AGENTS (104000 assertions)

CapableOf (CapableOf dentist pull tooth)

SPATIAL (36000 assertions)

LocationOf (LocationOf army in war)

TEMPORAL time amp sequence

CAUSAL (17000 assertions)

EffectOf (EffectOf view video entertainment) DesirousEffectOf (DesirousEffectOf sweat take shower)

AFFECTIONAL (mood feeling emotions) (34000 assertions)

DesireOf (DesireOf person not be depressed) MotivationOf (MotivationOf play game compete)

FUNCTIONAL (115000 assertions)

IsUsedFor (UsedFor fireplace burn wood) CapableOfReceivingAction (CapableOfReceivingAction drink serve)

ASSOCIATION K-LINES (125 million assertions)

SuperThematicKLine (SuperThematicKLine western civilization civilization) ThematicKLine (ThematicKLine wedding dress veil) ConceptuallyRelatedTo (ConceptuallyRelatedTo bad breath mint)

CONCEPT NET

GAMES WITH A PURPOSE

bull Luis von Ahn pioneered a new approach to resource creation on the Web GAMES WITH A PURPOSE or GWAP in which people as a side effect of playing perform tasks lsquocomputers are unable to performrsquo (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

bull GWAP do not rely on altruism or financial incentives to entice people to perform certain actions

bull The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

bull Games at wwwgwapcomndash ESPndash Verbosityndash TagATune

bull Other gamesndash Peekaboomndash Phetch

ESP

bull The first GWAP developed by von Ahn and their group (2003 2004)

bull The problem obtain accurate description of images to be usedndash To train image search enginesndash To develop machine learning approaches to vision

bull The goal label the majority of the images on the Web

ESP the game

ESP THE GAMEbull Two partners are picked at random from the

large number of players onlinebull They are not told who their partner is and canrsquot

communicate with thembull They are both shown the same imagebull The goal guess how their partner will describe

the image and type that descriptionndash Hence the ESP game

bull If any of the strings typed by one player matches the string typed by the other player they score points

THE TASK

SCORING BY MATCHING

SOME STATISTICS

bull In the 4 months between August 9th 2003 and December 10th 2003ndash 13630 playersndash 12 million labels for 293760 imagesndash 80 of players played more than once

bull By 2008 ndash 200000 playersndash 50 million labels

QUALITY OF THE LABELSbull For IMAGE SEARCH

ndash choose 10 labels among those produced and look at which images are returned

bull Compare labels produced by players with labels produced by participants in an experimentndash 15 participants 20 images among the 1000 with more

than 5 labelsndash 83 of game labels also produced by participants

bull Manual assessment of labels (lsquowould you use these labels to describe this imagersquo)ndash 15 participants 20 imagesndash 85 of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

wwwphrasedetectivesorg

bull 2 tasks

ndash Find The Culprit (Annotation)User must identify the closest antecedent of a markable if it is anaphoric

ndash Detectives Conference (Validation)User must agreedisagree with a coreference relation entered by another user

wwwphrasedetectivescom

PHRASE DETECTIVES THE TASKS

NAME THE CULPRIT

READINGS

bull Mihalcea R and Csomai A Wikify linking documents to encyclopedic knowledge Proceedings of CIKMrsquo07 Lisbon Portugal

bull V Nguyen amp M Poesio 2012 Entity disambiguation and linking over queries using Encyclopedic Knowledge Proceedings of 6th workshop on Analytics for Noisy Unstructured Text Data

bull D Lungley M Trevisan V Nguyen M Althobaiti M Poesio 2013 GALATEAS D2W A Multi-lingual Disambiguation to Wikipedia Web Service Proc Of ENRICH

bull V Nastaseamp M Strube Transforming Wikipedia into a large scale multilingual concept network Artificial Intelligence 2012

READINGS

bull L von Ahn and L Dabbish (2008) Designing games with a purpose Communications of the ACM v 51 n8 58-67

bull Poesio Chamberlain Kruschwitz Robaldo amp Ducceschi 2013 Phrase Detectives Utilizing Collective Intelligence for Internet-Scale Language Resource Creation ACM Transactions on Intelligent Interactive Systems


Using Wikipedia Categories for text classification


WIKIPEDIA FOR TEXT CLASSIFICATION

• Automatic identification of the topic/category of a text (e.g. computer science, psychology)
  – Books
  – Learning objects

"The United States was involved in the Cold War"

United States                            0.3793
Cold War                                 0.3111
Vietnam War                              0.0023
World War I                              0.0023
Communism                                0.0027
Ronald Reagan                            0.0027
Michail Gorbachev                        0.0023
Cat: Wars Involving the United States    0.00779
Cat: Global Conflicts                    0.00779

USING WIKIPEDIA FOR TEXT CLASSIFICATION

• Either directly use Wikipedia categories, or map one's categories to Wikipedia categories

• Use the documents associated with those categories as training documents
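The idea of using the documents filed under a category as training documents can be sketched as follows. This is a minimal illustration, not the lecture's system: the classifier is a plain multinomial Naive Bayes with add-one smoothing, and the two categories and their "articles" in `TOY_CATEGORY_DOCS` are made-up stand-ins for real Wikipedia category members.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption made for this sketch)."""
    return text.lower().split()


def train(category_docs):
    """category_docs maps a category name to the texts filed under it."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for category, docs in category_docs.items():
        for doc in docs:
            tokens = tokenize(doc)
            word_counts[category].update(tokens)
            vocab.update(tokens)
            doc_counts[category] += 1
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the category with the highest Naive Bayes log score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, float("-inf")
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            # add-one (Laplace) smoothing
            score += math.log(
                (word_counts[category][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best, best_score = category, score
    return best


# Hypothetical toy data standing in for articles grouped by category.
TOY_CATEGORY_DOCS = {
    "Computer science": [
        "algorithms and data structures for computers",
        "programming languages and compilers",
    ],
    "Psychology": [
        "memory perception and human behaviour",
        "cognitive development in children",
    ],
}

model = train(TOY_CATEGORY_DOCS)
print(classify("compilers and algorithms", model))
```

With real data, the training texts would be the articles reachable from each (mapped) Wikipedia category rather than hand-written strings.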

TEXT WIKIFICATION

Wikification = adding links to Wikipedia pages to documents

• Text

WIKIFICATION

• Wikipedia

May 2012, Truc-Vien T. Nguyen

Giotto was called to work in Padua and also in Rimini

Wikification pipeline

[Figure: Wikification pipeline. Raw (hyper)text → Decomposition → Clean Text → Keyword Extraction (Candidate Extraction, Candidate Ranking) → Text with selected keywords → Word Sense Disambiguation (Extract Sense Definitions from Sense Inventory; Knowledge-based: Lesk-like Definition Overlap; Data Driven: Naive Bayes trained on Wikipedia; Voting) → Annotated Text → Recomposition → (Hyper)text with linked keywords]

Keyword Extraction

• Finding important words/phrases in raw text
• Two-stage process
  – Candidate extraction
    • Typical methods: n-grams, noun phrases
  – Candidate ranking
    • Rank the candidates by importance
    • Typical methods:
      – Unsupervised: information theoretic
      – Supervised: machine learning using positional and linguistic features

Keyword Extraction using Wikipedia

1. Candidate extraction

• Semi-controlled vocabulary
  – Wikipedia article titles and anchor texts (surface forms)
    • E.g. "USA", "U.S." = "United States of America"
  – More than 2,000,000 terms/phrases
  – Vocabulary is broad (e.g. "the", "a" are included)
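Candidate extraction against a semi-controlled vocabulary can be sketched as an n-gram lookup. The sketch below assumes whitespace tokenization, exact string matching, and a maximum n-gram length of 3; `SURFACE_FORMS` is a tiny hypothetical stand-in for the dictionary of titles and anchor texts.

```python
def extract_candidates(text, surface_forms, max_len=3):
    """Return (token_position, n-gram) pairs whose n-gram is a known surface form."""
    tokens = text.split()
    candidates = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(tokens):
                break
            ngram = " ".join(tokens[start:end])
            if ngram in surface_forms:
                candidates.append((start, ngram))
    return candidates


# Hypothetical vocabulary; note that it contains uninformative words such as "in".
SURFACE_FORMS = {"Giotto", "Padua", "Rimini", "work", "in"}

sentence = "Giotto was called to work in Padua and also in Rimini"
for position, candidate in extract_candidates(sentence, SURFACE_FORMS):
    print(position, candidate)
```

Because the vocabulary is broad, words like "in" and "work" come back as candidates too, which is why a ranking step is needed afterwards.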

Keyword Extraction using Wikipedia

2. Candidate ranking

• tf.idf
  – Wikipedia articles as document collection
• Chi-squared independence of phrase and text
  – The degree to which it appeared more times than expected by chance
• Keyphraseness

$$P(\mathrm{keyword} \mid W) \approx \frac{\mathrm{count}(D_{key})}{\mathrm{count}(D_W)}$$

Our own Approach (cf. Milne & Witten 2008, 2012; Ratinov et al. 2011)

• Use Wikipedia dump to compute two statistics
  • KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  • COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
• Two versions of system
  • UNSUPERVISED: use statistics only
  • SUPERVISED: use distant learning to create training data

KEYPHRASENESS

• the probability that a term t is a link to a Wikipedia article (cf. Milne & Witten's prior link probability)
• Examples
  • The term "Georgia"
    – is found as a link in 22,631 Wikipedia articles
    – appears in 75,000 Wikipedia articles
    – keyphraseness = 22631/75000 = 0.3017466
  • Cf. the term "the": keyphraseness = 0.0006

$$\mathrm{Keyphraseness}(t) = \frac{\mathrm{count}([\_ \mid t])}{\mathrm{count}(t)}$$
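The keyphraseness ratio is simple enough to check numerically. The function below is an illustrative sketch (its name and the zero-count convention are assumptions); the counts are the "Georgia" figures from the slide.

```python
def keyphraseness(link_count, total_count):
    """Fraction of the articles containing a term in which it occurs as a link.

    link_count  -- number of articles where the term appears as link text
    total_count -- number of articles where the term appears at all
    Returns 0.0 for unseen terms (a convention chosen for this sketch).
    """
    if total_count == 0:
        return 0.0
    return link_count / total_count


georgia = keyphraseness(22631, 75000)
print(f"{georgia:.4f}")
```

A very frequent word such as "the" appears in almost every article but is almost never linked, so its ratio stays close to zero.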

COMMONNESS

• the probability that a term t is a link to a SPECIFIC Wikipedia article a
• for example, the surface form "Georgia" was found to be linked to
  – a1 = University_of_Georgia: 166 times
  – Republic_of_Georgia: 18 times
  – Georgia_(United_States): 5 times

  commonness(t, a1) = 166/(166+18+5) = 0.8783

$$\mathrm{Commonness}(t, a) = \frac{\mathrm{count}([a \mid t])}{\mathrm{count}(t)}$$
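The worked "Georgia" example can be reproduced in a few lines. As in that example, this sketch normalizes by the total number of times the surface form occurs as a link; the function name and data layout are assumptions made for illustration.

```python
from collections import Counter


def commonness(target_counts):
    """Map each target article to its share of the links made from one surface form."""
    total = sum(target_counts.values())
    if total == 0:
        return {}
    return {article: count / total for article, count in target_counts.items()}


# Link counts for the surface form "Georgia", as given on the slide.
georgia_links = Counter({
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
})

scores = commonness(georgia_links)
for article, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{article}: {score:.4f}")
```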

Extracting dictionaries and statistics from a Wikipedia dump

• Parsing
  • In three phases
• Identify articles of relevance
• Extract (among other things)
  • Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
  • Set of LINKS [article|surface_form]
    • [[Pedanius Dioscorides|Dioscorides]]
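Extracting [article|surface_form] pairs from wikitext can be approximated with a regular expression. This is a simplified sketch: the pattern only handles plain `[[target]]` and `[[target|label]]` links and ignores the complications of a real dump (file links, nested templates, section anchors), and the sample sentence is invented.

```python
import re
from collections import Counter

# [[article]] or [[article|surface form]]; no nested brackets or pipes.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def extract_links(wikitext):
    """Return a list of (article, surface_form) pairs found in the wikitext."""
    links = []
    for match in LINK_RE.finditer(wikitext):
        article = match.group(1).strip()
        # Without an explicit label, the surface form is the article title itself.
        surface = (match.group(2) or article).strip()
        links.append((article, surface))
    return links


sample = "The physician [[Pedanius Dioscorides|Dioscorides]] wrote about [[botany]]."
pairs = extract_links(sample)
print(pairs)

# Counting (surface form, target) pairs gives the frequency table used for commonness.
link_counts = Counter((surface, article) for article, surface in pairs)
```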

The Wikipedia Dump from July 2011

– 11,459,639 pages in total
– 12,525,583 links
  • specifying surface word, target, frequency
  – ranked by frequency
• for example, the mention "Georgia" is linked to
  – University_of_Georgia: 166 times
  – Republic_of_Georgia: 18 times
  – Georgia_(United_States): 5 times

Some statistics (all Wikidumps from July 2011)

Page Type        English      Italian     Polish
Redirected       4,465,652    323,591     134,148
List_of          138,581      836         5,021
Disambiguation   176,721      6,193       4,553
Relevant         4,361,020    917,354     920,486
Total            11,459,639   1,654,258   1,200,313

Surface forms titles articles

Dictionary       English      Italian     Polish
Titles           4,361,020    917,354     920,486
Surface forms    8,829,624    2,484,045   2,482,104
Files            745,724      72,126      n/a
Links            10,871,741   2,917,235   2,937,981

Files in Polish are arranged in a repository different from English/Italian.

Some definitions and figures
• surface form: the occurrence of a mention inside an article
• target article: the target Wiki article a surface form is linked to

The Unsupervised Approach

• Use keyphraseness to identify candidate terms
  • Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  • Retain top 10
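The two steps of the unsupervised approach can be put together in a short sketch. The function name, the dictionary layout, and the toy statistics are assumptions for illustration; the threshold (0.01) and the top-10 cut-off are the values quoted on the slide.

```python
def wikify_unsupervised(candidates, keyphraseness, commonness,
                        threshold=0.01, top_n=10):
    """Filter candidate terms by keyphraseness, then rank targets by commonness.

    keyphraseness -- dict: term -> keyphraseness score
    commonness    -- dict: term -> {target article: commonness score}
    """
    result = {}
    for term in candidates:
        # Keep only terms whose keyphraseness is above the threshold.
        if keyphraseness.get(term, 0.0) <= threshold:
            continue
        ranked = sorted(commonness.get(term, {}).items(),
                        key=lambda kv: kv[1], reverse=True)
        result[term] = ranked[:top_n]
    return result


# Toy statistics derived from the "Georgia" and "the" examples on the slides.
KEYPHRASENESS = {"Georgia": 0.3017, "the": 0.0006}
COMMONNESS = {
    "Georgia": {
        "Republic_of_Georgia": 18 / 189,
        "University_of_Georgia": 166 / 189,
        "Georgia_(United_States)": 5 / 189,
    },
}

links = wikify_unsupervised(["the", "Georgia"], KEYPHRASENESS, COMMONNESS)
print(links)
```

Here "the" is discarded by the threshold, and the targets for "Georgia" come back ordered from most to least common.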

The Supervised Approach

• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page

• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)

$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$
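The link-based measure can be computed directly from sets of in-links. The sketch below assumes, following Milne & Witten, that A1 and A2 are the sets of articles linking to a1 and a2 and that W is the set of all Wikipedia articles; the article ids are made up, and returning infinity for articles with no shared in-links is a convention chosen here. As the formula is written, smaller values mean more shared links.

```python
import math


def normalized_link_distance(links_a1, links_a2, total_articles):
    """Evaluate the slide's formula on two sets of in-linking article ids."""
    shared = links_a1 & links_a2
    if not shared:
        return math.inf  # log(0) is undefined; treat as maximally distant
    larger = max(len(links_a1), len(links_a2))
    smaller = min(len(links_a1), len(links_a2))
    numerator = math.log(larger) - math.log(len(shared))
    denominator = math.log(total_articles) - math.log(smaller)
    return numerator / denominator


# Hypothetical in-link sets in a toy "Wikipedia" of 100 articles.
A1 = {1, 2, 3, 4}
A2 = {3, 4, 5}
print(normalized_link_distance(A1, A2, 100))
```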

Training a supervised wikifier

• Using WIKIPEDIA ITSELF as source of training materials (see next)

Results on standard datasets

APPROACH               AQUAINT   WIKIPEDIA
Our approach           85.66     84.37
Milne & Witten 2008    83.61     80.31
Ratinov et al. 2011    84.52     90.20

Wikifying queries: the Bridgeman datasets

• BAL data sets
  – 1049-query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  – 100-query set
    • 3 annotators, each up to 3 manual annotations

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS   RESULTS
First candidate        64.77
Among first 2          71.59
First 3                75.42
First 4                77.18
First 5                78.32

Accuracy up by 17 points (36%)

Results for the GALATEAS languages and Arabic

LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78

The GALATEAS D2W web services

• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves throughput of 600 characters per second
• Integrated with LangLog tool

Use of the service in LangLog

(See Domoina's demo)

Other applications

• The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …

WIKIPEDIA FOR NER

http://en.wikipedia.org/wiki/FCC

The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

WIKIPEDIA FOR NER

Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action

WIKIPEDIA

WIKIPEDIA FOR NER

• Wikipedia has been used in NER systems
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging towards proper ENTITY DISAMBIGUATION

Distant learning

• Automatically extract examples
  • Positive examples from mention-to-link Wikipedia page
  • Negative examples from similar mentions with other links
• Use positive and negative examples to train model

The Supervised Approach Using Wikipedia links to generate training data

• Example
  – Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
  – Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)

• Dataset
  – +1 Giotto was called to work -- Giotto_di_Bondone
  – -1 Giotto was called to work -- Giotto_Griffiths
  – -1 Giotto was called to work -- Giotto_Bizzarrini
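Generating the +1/-1 dataset from a linked Wikipedia sentence can be sketched as below. The function name and the candidate dictionary are illustrative assumptions; in practice the candidates for a surface form would come from the surface-form dictionary extracted from the dump.

```python
def make_training_examples(context, surface_form, linked_article, candidates_for):
    """Label the article actually linked in the text +1 and every other candidate -1."""
    examples = []
    for candidate in candidates_for.get(surface_form, []):
        label = 1 if candidate == linked_article else -1
        examples.append((label, context, candidate))
    return examples


# Candidate targets for the surface form "Giotto", as listed on the slide.
CANDIDATES = {
    "Giotto": ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
}

dataset = make_training_examples(
    context="Giotto was called to work",
    surface_form="Giotto",
    linked_article="Giotto_di_Bondone",
    candidates_for=CANDIDATES,
)
for label, context, candidate in dataset:
    print(f"{label:+d} {context} -- {candidate}")
```

No manual annotation is involved: the label comes entirely from the link a Wikipedia editor already placed in the sentence.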

http://en.wikipedia.org/wiki/Giotto_di_Bondone

MORE ADVANCED USES OF WIKIPEDIA

• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

• Taxonomic information: category structure
• Attributes: infobox, text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Induce a subsumption hierarchy

INFOBOXES

• Collaborative content

• Semi-structured data

{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|U.S.]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|U.S.]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}

DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web

DBPEDIA

• 1,600,000 concepts
• including
  – 58,000 persons
  – 70,000 places
  – 35,000 music albums
  – 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories

The DBpedia Dataset

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At "Developers Guide to Semantic Web Toolkits" you can find a development toolkit in your preferred programming language to process DBpedia data.

REPRESENTING EXTRACTED INFORMATION

http://en.wikipedia.org/wiki/Calgary

http://dbpedia.org/resource/Calgary
  dbpedia:native_name        "Calgary"
  dbpedia:altitude           "1048"
  dbpedia:population_city    "988193"
  dbpedia:population_metro   "1079310"
  mayor_name                 dbpedia:Dave_Bronconnier
  governing_body             dbpedia:Calgary_City_Council

Extracting Infobox Data (RDF Representation)
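The RDF view of the Calgary infobox amounts to a list of (subject, predicate, object) triples. The sketch below stores them as plain Python tuples to show the data model only; the `dbpedia:` prefix on every predicate and the helper function are assumptions of this illustration, not DBpedia's actual serialization.

```python
SUBJECT = "http://dbpedia.org/resource/Calgary"

# (subject, predicate, object) triples read off the slide.
# Quoted values are literals; dbpedia:... objects are links to other resources.
TRIPLES = [
    (SUBJECT, "dbpedia:native_name", "Calgary"),
    (SUBJECT, "dbpedia:altitude", "1048"),
    (SUBJECT, "dbpedia:population_city", "988193"),
    (SUBJECT, "dbpedia:population_metro", "1079310"),
    (SUBJECT, "dbpedia:mayor_name", "dbpedia:Dave_Bronconnier"),
    (SUBJECT, "dbpedia:governing_body", "dbpedia:Calgary_City_Council"),
]


def objects(triples, subject, predicate):
    """Return every object attached to the subject via the given predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


print(objects(TRIPLES, SUBJECT, "dbpedia:mayor_name"))
```

Each infobox attribute becomes one predicate, which is what makes the data queryable as a graph.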

SPARQL

• SPARQL is a query language for RDF
  • RDF is a directed, labeled graph data format for representing information in the Web
  • This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century

The DBpedia SPARQL Endpoint
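One of the example questions, "All films by Quentin Tarantino", might be phrased for the endpoint as in the sketch below. The class and property names (`dbo:Film`, `dbo:director`) and the `format` parameter are assumptions about the present-day DBpedia ontology and Virtuoso endpoint, and may differ from the dataset described on the slides. The code only builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:Film and dbo:director from the DBpedia ontology.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film a dbo:Film ;
        dbo:director dbr:Quentin_Tarantino .
}
"""


def build_request_url(endpoint, query):
    """Encode a SPARQL query as an HTTP GET URL for the endpoint."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

Fetching this URL with any HTTP client would return the bindings for `?film`, provided the vocabulary assumptions hold.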

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

www.phrasedetectives.com

• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – CyCORP
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University

WEB COLLABORATION PROJECTS

OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52,000 assertions)
  IsA: (IsA "apple" "fruit")
  PartOf: (PartOf "CPU" "computer")
  PropertyOf: (PropertyOf "coffee" "wet")
  MadeOf: (MadeOf "bread" "flour")
  DefinedAs: (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
  PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf: (SubeventOf "play sport" "score goal")
  FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
  CapableOf: (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
  LocationOf: (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
  EffectOf: (EffectOf "view video" "entertainment")
  DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf: (DesireOf "person" "not be depressed")
  MotivationOf: (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
  IsUsedFor: (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine: (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")

CONCEPT NET

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions

• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch

ESP

• The first GWAP, developed by von Ahn and his group (2003, 2004)

• The problem: obtain accurate descriptions of images to be used
  – To train image search engines
  – To develop machine learning approaches to vision

• The goal: label the majority of the images on the Web

ESP the game

ESP THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
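The matching rule can be simulated in a few lines. This sketch assumes the guesses of both players arrive as one time-ordered stream and that strings are compared case-insensitively after trimming whitespace; both are assumptions of the illustration, not details taken from the game.

```python
def play_round(guesses):
    """Return the first string typed by both players, or None if they never agree.

    guesses -- time-ordered list of (player, string) pairs, player is "A" or "B"
    """
    seen = {"A": set(), "B": set()}
    for player, guess in guesses:
        word = guess.strip().lower()
        partner = "B" if player == "A" else "A"
        if word in seen[partner]:
            return word  # agreement reached: this string becomes the image label
        seen[player].add(word)
    return None


# Hypothetical round: both players are looking at the same picture of a dog.
stream = [("A", "dog"), ("B", "puppy"), ("A", "grass"), ("B", "Dog")]
print(play_round(stream))
```

Since the partners cannot communicate, the only way to score is to type what the image most obviously shows, which is why agreed strings tend to be good labels.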

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th 2003 and December 10th 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once

• By 2008
  – 200,000 players
  – 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned

• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1,000 with more than 5 labels
  – 83% of game labels also produced by participants

• Manual assessment of labels ('would you use these labels to describe this image?')
  – 15 participants, 20 images
  – 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

• 2 tasks
  – Find The Culprit (Annotation): User must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): User must agree/disagree with a coreference relation entered by another user

PHRASE DETECTIVES THE TASKS

NAME THE CULPRIT

READINGS

• R. Mihalcea and A. Csomai. 2007. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.

• V. Nguyen & M. Poesio. 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.

• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio. 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.

• V. Nastase & M. Strube. 2012. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.

READINGS

• L. von Ahn and L. Dabbish. 2008. Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.

• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi. 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.

OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52000 assertions)

IsA (IsA apple fruit) Part of (PartOf CPU computer) PropertyOf (PropertyOf coffee wet) MadeOf (MadeOf bread flour) DefinedAs (DefinedAs meat flesh of animal)

EVENTS (38000 assertions)

PrerequisiteeventOf (PrerequisiteEventOf read letter open envelope) SubeventOf (SubeventOf play sport score goal) FirstSubeventOF (FirstSubeventOf start fire light match) LastSubeventOf (LastSubeventOf attend classical concert applaud)

AGENTS (104000 assertions)

CapableOf (CapableOf dentist pull tooth)

SPATIAL (36000 assertions)

LocationOf (LocationOf army in war)

TEMPORAL time amp sequence

CAUSAL (17000 assertions)

EffectOf (EffectOf view video entertainment) DesirousEffectOf (DesirousEffectOf sweat take shower)

AFFECTIONAL (mood feeling emotions) (34000 assertions)

DesireOf (DesireOf person not be depressed) MotivationOf (MotivationOf play game compete)

FUNCTIONAL (115000 assertions)

IsUsedFor (UsedFor fireplace burn wood) CapableOfReceivingAction (CapableOfReceivingAction drink serve)

ASSOCIATION K-LINES (125 million assertions)

SuperThematicKLine (SuperThematicKLine western civilization civilization) ThematicKLine (ThematicKLine wedding dress veil) ConceptuallyRelatedTo (ConceptuallyRelatedTo bad breath mint)

CONCEPT NET

GAMES WITH A PURPOSE

bull Luis von Ahn pioneered a new approach to resource creation on the Web GAMES WITH A PURPOSE or GWAP in which people as a side effect of playing perform tasks lsquocomputers are unable to performrsquo (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

bull GWAP do not rely on altruism or financial incentives to entice people to perform certain actions

bull The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

bull Games at wwwgwapcomndash ESPndash Verbosityndash TagATune

bull Other gamesndash Peekaboomndash Phetch

ESP

bull The first GWAP developed by von Ahn and their group (2003 2004)

bull The problem obtain accurate description of images to be usedndash To train image search enginesndash To develop machine learning approaches to vision

bull The goal label the majority of the images on the Web

ESP the game

ESP THE GAMEbull Two partners are picked at random from the

large number of players onlinebull They are not told who their partner is and canrsquot

communicate with thembull They are both shown the same imagebull The goal guess how their partner will describe

the image and type that descriptionndash Hence the ESP game

bull If any of the strings typed by one player matches the string typed by the other player they score points

THE TASK

SCORING BY MATCHING

SOME STATISTICS

bull In the 4 months between August 9th 2003 and December 10th 2003ndash 13630 playersndash 12 million labels for 293760 imagesndash 80 of players played more than once

bull By 2008 ndash 200000 playersndash 50 million labels

QUALITY OF THE LABELSbull For IMAGE SEARCH

ndash choose 10 labels among those produced and look at which images are returned

bull Compare labels produced by players with labels produced by participants in an experimentndash 15 participants 20 images among the 1000 with more

than 5 labelsndash 83 of game labels also produced by participants

bull Manual assessment of labels (lsquowould you use these labels to describe this imagersquo)ndash 15 participants 20 imagesndash 85 of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

wwwphrasedetectivesorg

bull 2 tasks

ndash Find The Culprit (Annotation)User must identify the closest antecedent of a markable if it is anaphoric

ndash Detectives Conference (Validation)User must agreedisagree with a coreference relation entered by another user

wwwphrasedetectivescom

PHRASE DETECTIVES THE TASKS

NAME THE CULPRIT

READINGS

bull Mihalcea R and Csomai A Wikify linking documents to encyclopedic knowledge Proceedings of CIKMrsquo07 Lisbon Portugal

bull V Nguyen amp M Poesio 2012 Entity disambiguation and linking over queries using Encyclopedic Knowledge Proceedings of 6th workshop on Analytics for Noisy Unstructured Text Data

bull D Lungley M Trevisan V Nguyen M Althobaiti M Poesio 2013 GALATEAS D2W A Multi-lingual Disambiguation to Wikipedia Web Service Proc Of ENRICH

bull V Nastaseamp M Strube Transforming Wikipedia into a large scale multilingual concept network Artificial Intelligence 2012

READINGS

bull L von Ahn and L Dabbish (2008) Designing games with a purpose Communications of the ACM v 51 n8 58-67

bull Poesio Chamberlain Kruschwitz Robaldo amp Ducceschi 2013 Phrase Detectives Utilizing Collective Intelligence for Internet-Scale Language Resource Creation ACM Transactions on Intelligent Interactive Systems

Page 18: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

USING WIKIPEDIA FOR TEXT CLASSIFICATION

• Either use Wikipedia categories directly, or map one's own categories to Wikipedia categories
• Use the documents associated with those categories as training documents

TEXT WIKIFICATION

Wikification = adding links to Wikipedia pages to documents

WIKIFICATION

• Text: Giotto was called to work in Padua and also in Rimini
• Wikipedia: the pages that the mentions in the text are linked to

May 2012, Truc-Vien T Nguyen

Wikification pipeline

[Figure: pipeline diagram. Raw (hyper)text → Decomposition → Clean Text → Keyword Extraction (Candidate Extraction, Candidate Ranking) → Text with selected keywords → Word Sense Disambiguation (Extract Sense Definitions from Sense Inventory; Knowledge-based: Lesk-like Definition Overlap; Data Driven: Naive Bayes trained on Wikipedia; Voting) → Annotated Text → Recomposition → (Hyper)text with linked keywords]

Keyword Extraction

• Finding important words/phrases in raw text
• Two-stage process
  – Candidate extraction
    • Typical methods: n-grams, noun phrases
  – Candidate ranking
    • Rank the candidates by importance
    • Typical methods:
      – Unsupervised: information theoretic
      – Supervised: machine learning using positional and linguistic features

Keyword Extraction using Wikipedia

1. Candidate extraction
• Semi-controlled vocabulary
  – Wikipedia article titles and anchor texts (surface forms)
    • E.g. "USA", "US" = "United States of America"
  – More than 2,000,000 terms/phrases
  – Vocabulary is broad (e.g. "the" and "a" are included)
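A minimal sketch of dictionary-based candidate extraction, assuming whitespace tokenisation and a tiny hand-made vocabulary in place of the real set of Wikipedia titles and anchor texts. The function name `extract_candidates` and the `max_len` parameter are illustrative, not taken from the lecture.

```python
def extract_candidates(text, vocabulary, max_len=3):
    """Return (token_position, phrase) for every n-gram found in the vocabulary."""
    tokens = text.split()
    found = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if start + length > len(tokens):
                break
            phrase = " ".join(tokens[start:start + length])
            if phrase in vocabulary:
                found.append((start, phrase))
    return found


# Toy stand-in for the surface-form dictionary (hypothetical contents).
vocabulary = {"Giotto", "Padua", "Rimini", "in"}
sentence = "Giotto was called to work in Padua and also in Rimini"
candidates = extract_candidates(sentence, vocabulary)
print(candidates)
```

Because the vocabulary is broad, function words such as "in" are returned as candidates too, which is why a ranking step has to follow.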

Keyword Extraction using Wikipedia

2. Candidate ranking
• tf.idf
  – Wikipedia articles as document collection
• Chi-squared independence of phrase and text
  – The degree to which it appeared more times than expected by chance
• Keyphraseness

$$P(\mathrm{keyword} \mid W) \approx \frac{\mathrm{count}(D_{key})}{\mathrm{count}(D_{W})}$$

Our own Approach (cf. Milne & Witten 2008, 2012; Ratinov et al. 2011)

• Use Wikipedia dump to compute two statistics
  • KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  • COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
• Two versions of the system
  • UNSUPERVISED: use statistics only
  • SUPERVISED: use distant learning to create training data

KEYPHRASENESS

• the probability that a term t is a link to a Wikipedia article (cf. Milne & Witten's prior link probability)
• Examples
  • The term "Georgia"
    – is found as a link in 22,631 Wikipedia articles
    – appears in 75,000 Wikipedia articles
    – keyphraseness = 22631 / 75000 = 0.3017466
  • Cf. the term "the": keyphraseness = 0.0006

$$\mathrm{Keyphraseness}(t) = \frac{\mathrm{count}([\_ \mid t])}{\mathrm{count}(t)}$$
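The keyphraseness ratio can be checked with a few lines of Python. This is only a sketch: the function name and argument names are illustrative, and the two counts are the Georgia figures quoted on the slide.

```python
def keyphraseness(link_count: int, total_count: int) -> float:
    """Share of the articles containing a term in which the term occurs as a link."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return link_count / total_count


# Counts for the term "Georgia" as given on the slide.
georgia = keyphraseness(22631, 75000)
print(georgia)
```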

COMMONNESS

• the probability that a term t is a link to a SPECIFIC Wikipedia article a
• for example, the surface form "Georgia" was found to be linked to
  – a1 = University_of_Georgia: 166 times
  – Republic_of_Georgia: 18 times
  – Georgia_(United_States): 5 times
• commonness(t, a1) = 166 / (166 + 18 + 5) = 0.8783, where the denominator in this example is the number of times t occurs as a link

$$\mathrm{Commonness}(t, a) = \frac{\mathrm{count}([a \mid t])}{\mathrm{count}(t)}$$
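A small sketch of the commonness computation for the Georgia example. It assumes, as the worked example does, that the denominator is the total number of links made from the surface form; the `Counter` layout and the function name are illustrative choices.

```python
from collections import Counter


def commonness(link_targets: Counter, article: str) -> float:
    """Fraction of the links made from one surface form that point to `article`."""
    total = sum(link_targets.values())
    if total == 0:
        return 0.0
    return link_targets[article] / total


# Link counts for the surface form "Georgia" as quoted on the slide.
georgia_links = Counter({
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
})
print(commonness(georgia_links, "University_of_Georgia"))
```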

Extracting dictionaries and statistics from a Wikipedia dump

• Parsing
  • In three phases
• Identify articles of relevance
• Extract (among other things)
  • Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
  • Set of LINKS [article|surface_form]
    • e.g. [[Pedanius Dioscorides|Dioscorides]]
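As a rough illustration of the extraction step, the sketch below pulls (surface form, target) pairs out of wikitext with a regular expression and counts them. It is a simplification under stated assumptions: it ignores nested templates, treats File: and Category: links like any other link, and the names `extract_links` and `build_link_table` are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Matches [[target]] and [[target|surface form]]; a "#section" suffix is dropped.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]+))?\]\]")


def extract_links(wikitext):
    """Return a list of (surface_form, target_title) pairs found in the text."""
    links = []
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        surface = (match.group(2) or target).strip()
        links.append((surface, target.replace(" ", "_")))
    return links


def build_link_table(pages):
    """Map each surface form to a Counter of the targets it links to."""
    table = defaultdict(Counter)
    for text in pages:
        for surface, target in extract_links(text):
            table[surface][target] += 1
    return table


sample = "The physician [[Pedanius Dioscorides|Dioscorides]] wrote about [[botany]]."
links = extract_links(sample)
table = build_link_table([sample, sample])
print(links)
```

A table of this shape (surface form, target, frequency) is what the keyphraseness and commonness statistics are computed from.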

The Wikipedia Dump from July 2011

– 11,459,639 pages in total
– 12,525,583 links
  • specifying surface word, target, frequency
  – ranked by frequency
• for example, the mention "Georgia" is linked to
  – University_of_Georgia: 166 times
  – Republic_of_Georgia: 18 times
  – Georgia_(United_States): 5 times

Some statistics (all Wikidumps from July 2011)

Page Type          English      Italian      Polish
Redirected       4,465,652      323,591     134,148
List_of            138,581          836       5,021
Disambiguation     176,721        6,193       4,553
Relevant         4,361,020      917,354     920,486
Total           11,459,639    1,654,258   1,200,313

Surface forms titles articles

Dictionary         English      Italian      Polish
Titles           4,361,020      917,354     920,486
Surface forms    8,829,624    2,484,045   2,482,104
Files              745,724       72,126        n/a
Links           10,871,741    2,917,235   2,937,981

Files in Polish are arranged in a repository different from English/Italian.

Some definitions and figures
• surface form: the occurrence of a mention inside an article
• target article: the target Wikipedia article a surface form is linked to

The Unsupervised Approach

• Use keyphraseness to identify candidate terms
  • Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  • Retain top 10
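The two steps of the unsupervised approach can be sketched as a filter followed by a sort. This is an illustrative outline, not the system's actual code: the dictionaries are toy stand-ins for statistics computed from a dump, and the function name and parameters are assumptions.

```python
def wikify_unsupervised(terms, keyphraseness, commonness, threshold=0.01, top_n=10):
    """Keep terms above the keyphraseness threshold and rank their targets by commonness."""
    result = {}
    for term in terms:
        if keyphraseness.get(term, 0.0) <= threshold:
            continue  # unlikely to be a link at all
        ranked = sorted(
            commonness.get(term, {}).items(),
            key=lambda item: item[1],
            reverse=True,
        )
        result[term] = ranked[:top_n]
    return result


# Toy statistics derived from the Georgia example on the slides.
keyphraseness_table = {"Georgia": 0.3017, "the": 0.0006}
commonness_table = {
    "Georgia": {
        "Republic_of_Georgia": 18 / 189,
        "University_of_Georgia": 166 / 189,
        "Georgia_(United_States)": 5 / 189,
    }
}
output = wikify_unsupervised(["the", "Georgia"], keyphraseness_table, commonness_table)
print(output)
```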

The Supervised Approach

• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)

$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$

where A_1 and A_2 are the sets of articles that link to a_1 and a_2, and W is the set of all Wikipedia articles.
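A sketch of the link-based measure, taking A1 and A2 to be the sets of articles that link to a1 and a2 and W the set of all Wikipedia articles. The function name and the handling of the edge cases (no shared links, degenerate sizes) are assumptions made for the example. As written, the formula gives 0 for identical link sets, so smaller values mean more closely related articles.

```python
import math


def relatedness(links_a: set, links_b: set, n_articles: int) -> float:
    """Normalized link distance between two articles, from their incoming-link sets."""
    overlap = len(links_a & links_b)
    if overlap == 0:
        return math.inf  # assumption: no shared links means maximally distant
    larger = max(len(links_a), len(links_b))
    smaller = min(len(links_a), len(links_b))
    if n_articles <= smaller:
        raise ValueError("n_articles must exceed the size of the smaller link set")
    return (math.log(larger) - math.log(overlap)) / (
        math.log(n_articles) - math.log(smaller)
    )


# Toy link sets; the integers stand for article ids (hypothetical data).
a1_in = {1, 2, 3, 4}
a2_in = {3, 4, 5, 6}
print(relatedness(a1_in, a2_in, n_articles=1000))
```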

Training a supervised wikifier

• Using WIKIPEDIA ITSELF as source of training materials (see next)

Results on standard datasets

APPROACH              AQUAINT   WIKIPEDIA
Our approach          85.66     84.37
Milne & Witten 2008   83.61     80.31
Ratinov et al. 2011   84.52     90.20

Wikifying queries: the Bridgeman datasets

• BAL data sets
  – 1049-query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  – 100-query set
    • 3 annotators, each up to 3 manual annotations

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS   RESULTS
First candidate        64.77
Among first 2          71.59
First 3                75.42
First 4                77.18
First 5                78.32

Accuracy up by 17 points (36%)

Results for the GALATEAS languages and Arabic

LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78

The GALATEAS D2W web services

• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves throughput of 600 characters per second
• Integrated with the LangLog tool

Use of the service in LangLog

(See Domoina's demo)

Other applications

• The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …

WIKIPEDIA FOR NER

http://en.wikipedia.org/wiki/FCC

The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

WIKIPEDIA FOR NER

Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action

WIKIPEDIA

WIKIPEDIA FOR NER

• Wikipedia has been used in NER systems
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging towards proper ENTITY DISAMBIGUATION

Distant learning

• Automatically extract examples
  • Positive examples from the Wikipedia page a mention links to
  • Negative examples from similar mentions with other links
• Use positive and negative examples to train a model

The Supervised Approach Using Wikipedia links to generate training data

• Example
  – Giotto was called to work in Padua and also in Rimini (sentence taken from Wikipedia text, with links available)
  – Candidates: Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  – +1 Giotto was called to work -- Giotto_di_Bondone
  – -1 Giotto was called to work -- Giotto_Griffiths
  – -1 Giotto was called to work -- Giotto_Bizzarrini

http://en.wikipedia.org/wiki/Giotto_di_Bondone
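The construction of the Giotto dataset can be sketched as follows: the article actually linked in Wikipedia yields the positive example, and every other candidate for the same surface form yields a negative one. The function name and the tuple layout are illustrative, not the authors' format.

```python
def make_training_examples(context, linked_target, candidates):
    """Label the linked article +1 and all other candidate articles -1."""
    return [
        (+1 if candidate == linked_target else -1, context, candidate)
        for candidate in candidates
    ]


candidates = ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"]
examples = make_training_examples(
    "Giotto was called to work", "Giotto_di_Bondone", candidates
)
for label, context, article in examples:
    print(label, context, "--", article)
```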

MORE ADVANCED USES OF WIKIPEDIA

• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

• Taxonomic information: category structure
• Attributes: infobox, text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Induce a subsumption hierarchy

INFOBOXES

• Collaborative content
• Semi-structured data

{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]] [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]] [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
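To show why infoboxes count as semi-structured data, here is a minimal sketch of turning one into attribute-value pairs. It assumes the usual wikitext form in which the infobox is wrapped in `{{ }}` and fields are separated by `|`; the helper names are hypothetical, and real dumps need a more robust parser.

```python
def split_top_level(body: str) -> list:
    """Split on '|' but not inside nested [[...]] links or {{...}} templates."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        two = body[i:i + 2]
        if two in ("[[", "{{"):
            depth += 1
            current.append(two)
            i += 2
        elif two in ("]]", "}}"):
            depth -= 1
            current.append(two)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


def parse_infobox(wikitext: str) -> dict:
    """Return the infobox fields as a dict of attribute -> raw wikitext value."""
    body = wikitext.strip()
    if body.startswith("{{") and body.endswith("}}"):
        body = body[2:-2]
    fields = {}
    for part in split_top_level(body)[1:]:  # part 0 is the template name
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


sample = (
    "{{Infobox Writer| name = Edgar Allan Poe"
    "| birth_date = {{birth date|1809|1|19|mf=y}}"
    "| birth_place = [[Boston, Massachusetts]] [[United States|US]]"
    "| magnum_opus = The Raven}}"
)
poe = parse_infobox(sample)
print(poe["name"], "/", poe["magnum_opus"])
```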

DBPEDIA

DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web

The DBpedia Dataset

• 1,600,000 concepts
• including
  – 58,000 persons
  – 70,000 places
  – 35,000 music albums
  – 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories

REPRESENTING EXTRACTED INFORMATION

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data.

Extracting Infobox Data (RDF Representation)

http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
  dbpedia:native_name "Calgary"
  dbpedia:altitude "1048"
  dbpedia:population_city "988193"
  dbpedia:population_metro "1079310"
  mayor_name → dbpedia:Dave_Bronconnier
  governing_body → dbpedia:Calgary_City_Council
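The Calgary example can be written down as subject-predicate-object triples. The sketch below uses plain Python tuples rather than an RDF library; the expansion of the `dbpedia:` prefix into separate resource and property namespaces is an assumption, and `objects` is a hypothetical helper.

```python
# Assumed namespace expansions for the slide's "dbpedia:" prefix.
RESOURCE = "http://dbpedia.org/resource/"
PROPERTY = "http://dbpedia.org/property/"

calgary = RESOURCE + "Calgary"

# Each infobox attribute becomes one (subject, predicate, object) triple.
# Literal values stay strings; links become resource URIs.
triples = [
    (calgary, PROPERTY + "native_name", "Calgary"),
    (calgary, PROPERTY + "altitude", "1048"),
    (calgary, PROPERTY + "population_city", "988193"),
    (calgary, PROPERTY + "population_metro", "1079310"),
    (calgary, PROPERTY + "mayor_name", RESOURCE + "Dave_Bronconnier"),
    (calgary, PROPERTY + "governing_body", RESOURCE + "Calgary_City_Council"),
]


def objects(triples, subject, predicate):
    """Return all objects for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


print(objects(triples, calgary, PROPERTY + "mayor_name"))
```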

SPARQL

• SPARQL is a query language for RDF
• RDF is a directed, labeled graph data format for representing information in the Web
• This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

The DBpedia SPARQL Endpoint

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century
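As an illustration, "all films by Quentin Tarantino" might be phrased as the SPARQL query below. This is a sketch under assumptions: the property `dbo:director` and the resource name are guesses about DBpedia's vocabulary and may differ between releases. The code only builds the request URL, using the standard `query` parameter of the SPARQL protocol, and does not contact the endpoint.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:director linking a film to its director.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
"""


def build_request_url(query: str) -> str:
    """Encode a SPARQL query as a GET request URL for the endpoint."""
    return ENDPOINT + "?" + urlencode({"query": query.strip()})


url = build_request_url(QUERY)
print(url)
```

The resulting URL can be fetched with any HTTP client; the result format then depends on the content negotiation the server supports.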

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis

www.phrasedetectives.com

WEB COLLABORATION PROJECTS

• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – CyCORP
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University

OPEN MIND COMMONSENSE

CONCEPT NET

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52,000 assertions)
  IsA (IsA "apple" "fruit")
  PartOf (PartOf "CPU" "computer")
  PropertyOf (PropertyOf "coffee" "wet")
  MadeOf (MadeOf "bread" "flour")
  DefinedAs (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
  PrerequisiteEventOf (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf (SubeventOf "play sport" "score goal")
  FirstSubeventOf (FirstSubeventOf "start fire" "light match")
  LastSubeventOf (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
  CapableOf (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
  LocationOf (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
  EffectOf (EffectOf "view video" "entertainment")
  DesirousEffectOf (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf (DesireOf "person" "not be depressed")
  MotivationOf (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
  UsedFor (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo (ConceptuallyRelatedTo "bad breath" "mint")

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch

ESP

• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web

ESP the game

ESP THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
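The matching rule can be sketched in a few lines: a label counts only if both players typed it. The case and whitespace normalisation is an assumption made for the example, as are the function names; the real game's matching and scoring details are not given on the slide.

```python
def normalize(label: str) -> str:
    """Assumed normalisation: ignore case and surrounding whitespace."""
    return label.strip().lower()


def agreed_labels(guesses_a, guesses_b):
    """Return the labels typed by both players, in player A's typing order."""
    typed_by_b = {normalize(guess) for guess in guesses_b}
    agreed = []
    for guess in guesses_a:
        label = normalize(guess)
        if label in typed_by_b and label not in agreed:
            agreed.append(label)
    return agreed


player_a = ["dog", "Grass", "ball"]
player_b = ["puppy", "grass", "park"]
print(agreed_labels(player_a, player_b))
```

Since the players cannot communicate, a shared string is strong evidence that the label really describes the image.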

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th, 2003 and December 10th, 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once
• By 2008
  – 200,000 players
  – 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  – 15 participants, 20 images
  – 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

PHRASE DETECTIVES: THE TASKS

• 2 tasks
  – Find The Culprit (Annotation): the user must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): the user must agree/disagree with a coreference relation entered by another user

NAME THE CULPRIT

READINGS

• R. Mihalcea and A. Csomai. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012.

READINGS

• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.

Page 19: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

TEXT WIKIFICATION

Wikification = adding links to Wikipedia pages to documents

bull Text

WIKIFICATION

bull Wikipedia

20May 2012 Truc-Vien T Nguyen

Giotto was called to work in Padua and also in Rimini

Wikification pipeline

Candidate

Extraction

Candidate

Ranking

Extract Sense Definitions

from Sense Inventory

Knowledge- based

Lesk- like Definition

Overlap

Data Driven

Naive Bayes

trained on Wikipedia

Voting

Tex

t w

ith

sel

ecte

d k

eyw

ord

s

Dec

om

po

siti

on

Raw

(h

yper

)tex

t

Cle

an T

ext

Rec

om

posi

tion

(Hyp

er)t

ext

wit

h

linked

key

wo

rds

Annotated Text

Word Sense DisambiguationKeyword Extraction

Keyword Extraction

bull Finding important wordsphrases in raw textbull Two-stage process

ndash Candidate extractionbull Typical methods n-grams noun phrases

ndash Candidate rankingbull Rank the candidates by importancebull Typical methods

ndash Unsupervised information theoretic ndash Supervised machine learning using positional and linguistic

features

Keyword Extraction using Wikipedia

1 Candidate extractionbull Semi-controlled vocabulary

ndash Wikipedia article titles and anchor texts (surface forms)

bull Eg ldquoUSArdquo ldquoUSrdquo = ldquoUnited States of Americardquo

ndash More than 2000000 termsphrasesndash Vocabulary is broad (eg the a are included)

Keyword Extraction using Wikipedia

2 Candidate rankingbull tf idf

ndash Wikipedia articles as document collection

bull Chi-squared independence of phrase and textndash The degree to which it appeared more times than

expected by chance

bull Keyphraseness

)(

)()|(

W

key

Dcount

DcountWkeywordP

Our own Approach(Cfr Milne amp Witten 2008 2012 Ratinov et al 2011)

bull Use Wikipedia dump to compute two statistics

bull KEYPHRASENESS prior probability that a term is used to refer to a Wikipedia article

bull COMMONNESS probability that phrase is used to refer to specific Wikipedia article

bull Two versions of system

bull UNSUPERVISED use statistics only

bull SUPERVISED use distant learning to create training data

KEYPHRASENESS

bull the probability that a term t is a link to a Wikipedia article

(cfr Milne amp Wittenrsquos prior link probability)

bull Examplesbull The term Georgia

ndash Is found as a link in 22631 Wikipedia articlesndash appears in 75000 Wikipedia articles keyphraseness = 2263175000 = 03017466

bull Cfr the term ldquotherdquo keyphraseness = 00006

euro

Keyphraseness(t) =count([_ | t])

count(t)

COMMONNESS

bull the probability that a term t is a link to a SPECIFIC Wikipedia article a

bull for example the surface form Georgia was found to be linked to

ndash a1 = University_of_Georgia 166 times

commonness(t a1) = 166(166+18+5) = 08783

ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times

euro

Commonness(ta) =count([a | t])

count(t)

Extracting dictionaries and statistics from a Wikipedia dump

bull Parsingbull In three phases

bull Identify articles of relevancebull Extract (among other things)

bull Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)

bull Set of LINKS [article|surface_form]

bull [[Pedanius Dioscorides|Dioscorides]]

The Wikipedia Dump from July 2011

ndash 11459639 pages in totalndash 12525583 links

bull specifying surface word target frequency

ndash ranked by frequency bull for example the mention Georgia is linked to

ndash University_of_Georgia 166 times ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times

May 2012 29Truc-Vien T Nguyen

Some statistics (all Wikidumps from July 2011)

Page Type English Italian Polish

Redirected 4465652 323591 134148

List_of 138581 836 5021

Disambiguation 176721 6193 4553

Relevant 4361020 917354 920486

Total 11459639 1654258 1200313

Surface forms titles articles

Dictionary English Italian Polish

Titles 4361020 917354 920486

Surface forms 8829624 2484045 2482104

Files 745724 72126 na

Links 10871741 2917235 2937981

Files in Polish are arranged in a repository different from EnglishItalian

Some definitions and figuressurface form the occurence of a mention inside an articletarget article the target Wiki article a surface form linked to

The Unsupervised Approach

• Use keyphraseness to identify candidate terms
  – Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  – Retain top 10
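The two steps above can be sketched as a filter followed by a sort. This is an illustrative reading rather than the system's code: the dictionaries, the function name, and the interpretation of "top 10" as the ten best candidate articles per term are all assumptions.

```python
KEYPHRASENESS_THRESHOLD = 0.01  # threshold quoted on the slide
TOP_K = 10                      # "retain top 10", read here as per-term candidates


def wikify_unsupervised(terms, keyphraseness, commonness,
                        threshold=KEYPHRASENESS_THRESHOLD, top_k=TOP_K):
    """Keep terms above the keyphraseness threshold; rank their targets by commonness.

    keyphraseness: dict mapping term -> float
    commonness: dict mapping term -> {article: float}
    """
    result = {}
    for term in terms:
        if keyphraseness.get(term, 0.0) <= threshold:
            continue  # unlikely to be a link at all
        ranked = sorted(commonness.get(term, {}).items(),
                        key=lambda item: item[1], reverse=True)
        result[term] = ranked[:top_k]
    return result


# Toy statistics built from the figures quoted in the lecture
toy_keyphraseness = {"Georgia": 0.3017, "the": 0.0006}
toy_commonness = {
    "Georgia": {
        "Republic_of_Georgia": 18 / 189,
        "University_of_Georgia": 166 / 189,
        "Georgia_(United_States)": 5 / 189,
    }
}
result = wikify_unsupervised(["the", "Georgia"], toy_keyphraseness, toy_commonness)
print(result)
```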

The Supervised Approach

• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page

• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten’s NORMALIZED LINK DISTANCE)

$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$

where A1 and A2 are the sets of articles that link to a1 and a2, and W is the set of all Wikipedia articles.
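A small sketch of the link-based measure, assuming A1 and A2 are the sets of articles linking to each candidate and W is the total number of Wikipedia articles (the reading used by Milne & Witten). The function name and the toy sets are hypothetical. As the formula is written, smaller values mean the two articles share more of their incoming links.

```python
import math


def relatedness(links_a1: set, links_a2: set, total_articles: int) -> float:
    """Normalized link distance between two articles, from their incoming links."""
    overlap = len(links_a1 & links_a2)
    if overlap == 0:
        return float("inf")  # no shared incoming links: log(0) is undefined
    larger = max(len(links_a1), len(links_a2))
    smaller = min(len(links_a1), len(links_a2))
    numerator = math.log(larger) - math.log(overlap)
    denominator = math.log(total_articles) - math.log(smaller)
    return numerator / denominator


# Toy example: article ids stand in for the pages that link to each candidate
incoming_a1 = {1, 2, 3, 4}
incoming_a2 = {3, 4, 5}
distance = relatedness(incoming_a1, incoming_a2, total_articles=1000)
print(distance)
```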

Training a supervised wikifier

bull Using WIKIPEDIA ITSELF as source of training materials (see next)

Results on standard datasets

APPROACH               AQUAINT   WIKIPEDIA
Our approach           85.66     84.37
Milne & Witten 2008    83.61     80.31
Ratinov et al. 2011    84.52     90.20

Wikifying queries: the Bridgeman datasets

• BAL data sets
  – 1049-query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  – 100-query set
    • 3 annotators, each up to 3 manual annotations

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS   RESULTS
First candidate        64.77
Among first 2          71.59
First 3                75.42
First 4                77.18
First 5                78.32

Accuracy up by 17 points (36%)

Results for the GALATEAS languages and Arabic

LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78

The GALATEAS D2W web services

• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves throughput of 600 characters per second
• Integrated with the LangLog tool

Use of the service in LangLog

(See Domoina’s demo)

Other applications

• The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …

WIKIPEDIA FOR NER

http://en.wikipedia.org/wiki/FCC

The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

WIKIPEDIA FOR NER

Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action

WIKIPEDIA

WIKIPEDIA FOR NER

• Wikipedia has been used in NER systems
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging towards proper ENTITY DISAMBIGUATION

Distant learning

• Automatically extract examples
  – Positive examples: a mention paired with the Wikipedia page it links to
  – Negative examples: the mention paired with the pages that similar mentions link to elsewhere

• Use positive and negative examples to train a model

The Supervised Approach: Using Wikipedia links to generate training data

• Example
  – “Giotto was called to work in Padua, and also in Rimini” (sentence taken from Wikipedia text, with links available)
  – Candidates: Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)

• Dataset
  – +1 Giotto was called to work -- Giotto_di_Bondone
  – -1 Giotto was called to work -- Giotto_Griffiths
  – -1 Giotto was called to work -- Giotto_Bizzarrini

http://en.wikipedia.org/wiki/Giotto_di_Bondone
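The labelling scheme in the Giotto example can be sketched as below. The function name and the tuple layout are hypothetical; the point is only that the page actually linked in Wikipedia yields the positive example and every other candidate for the same mention yields a negative one.

```python
def make_training_examples(context: str, linked_article: str, candidates: list) -> list:
    """Return (label, context, article) triples for one linked mention."""
    examples = []
    for article in candidates:
        label = 1 if article == linked_article else -1
        examples.append((label, context, article))
    return examples


examples = make_training_examples(
    context="Giotto was called to work in Padua, and also in Rimini",
    linked_article="Giotto_di_Bondone",
    candidates=["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
)
for label, _, article in examples:
    print(f"{label:+d} {article}")
```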

MORE ADVANCED USES OF WIKIPEDIA

• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

• Taxonomic information: category structure
• Attributes: infobox, text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Induce a subsumption hierarchy

INFOBOXES

• Collaborative content

• Semi-structured data

{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
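Because an infobox is a list of "| key = value" fields, its attributes can be read with very little code. The sketch below is illustrative only and assumes one field per line, which real infoboxes do not guarantee; the function name is hypothetical and this is not how DBpedia's extractors are implemented.

```python
def parse_infobox(wikitext: str) -> dict:
    """Collect '| key = value' fields from an infobox laid out one field per line."""
    fields = {}
    for line in wikitext.splitlines():
        line = line.strip()
        if not line.startswith("|") or "=" not in line:
            continue  # skip the template name, the closing braces, and blank lines
        # Split at the first '=' only, so values such as 'mf=y' stay intact
        key, _, value = line[1:].partition("=")
        fields[key.strip()] = value.strip()
    return fields


sample = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| magnum_opus = The Raven
}}"""

fields = parse_infobox(sample)
print(fields["name"], "/", fields["magnum_opus"])
```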

DBPEDIA

DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web

The DBpedia Dataset

• 1,600,000 concepts
• including
  – 58,000 persons
  – 70,000 places
  – 35,000 music albums
  – 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories

REPRESENTING EXTRACTED INFORMATION

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. The Developers Guide to Semantic Web Toolkits lists development toolkits for processing DBpedia data in your preferred programming language.

Extracting Infobox Data (RDF Representation)

http://en.wikipedia.org/wiki/Calgary

http://dbpedia.org/resource/Calgary

dbpedia:native_name “Calgary”
dbpedia:altitude “1048”
dbpedia:population_city “988193”
dbpedia:population_metro “1079310”
mayor_name → dbpedia:Dave_Bronconnier
governing_body → dbpedia:Calgary_City_Council

SPARQL

• SPARQL is a query language for RDF
  – RDF is a directed, labeled graph data format for representing information in the Web
  – This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

The DBpedia SPARQL Endpoint

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century
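As an illustration, the "tennis players from Moscow" question might be phrased as below. The class and property names (dbo:TennisPlayer, dbo:birthPlace) and the format parameter are assumptions about the current DBpedia ontology and the Virtuoso endpoint, not something stated in the lecture, so check them against the live endpoint. The sketch only builds the request URL and makes no network call.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint named on the slide

# Assumed vocabulary: dbo:TennisPlayer and dbo:birthPlace
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?player WHERE {
  ?player a dbo:TennisPlayer ;
          dbo:birthPlace dbr:Moscow .
}
LIMIT 20
"""


def build_request_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query as an HTTP GET URL (the SPARQL protocol's 'query' parameter)."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


url = build_request_url(ENDPOINT, QUERY)
print(url[:60] + "...")
```

Fetching the URL, for example with urllib.request.urlopen, would be the next step, and is left out so that the sketch runs offline.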

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis

WEB COLLABORATION PROJECTS

• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – CyCORP
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University

www.phrasedetectives.com

OPEN MIND COMMONSENSE

CONCEPT NET

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52,000 assertions)
  IsA: (IsA "apple" "fruit")
  PartOf: (PartOf "CPU" "computer")
  PropertyOf: (PropertyOf "coffee" "wet")
  MadeOf: (MadeOf "bread" "flour")
  DefinedAs: (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
  PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf: (SubeventOf "play sport" "score goal")
  FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
  CapableOf: (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
  LocationOf: (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
  EffectOf: (EffectOf "view video" "entertainment")
  DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf: (DesireOf "person" "not be depressed")
  MotivationOf: (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
  IsUsedFor: (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine: (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions

• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch

ESP

• The first GWAP, developed by von Ahn and his group (2003, 2004)

• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision

• The goal: label the majority of the images on the Web

ESP the game

ESP THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can’t communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
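The matching rule can be sketched as follows. This is a simplification with hypothetical names: it ignores timing and any taboo-word or scoring rules, and the case and whitespace normalisation is an assumption rather than something stated here.

```python
def normalise(label: str) -> str:
    """Assumed normalisation: ignore case and surrounding whitespace."""
    return label.strip().lower()


def agreed_label(guesses_a: list, guesses_b: list):
    """Return the first of player A's guesses that player B also typed, or None."""
    typed_by_b = {normalise(guess) for guess in guesses_b}
    for guess in guesses_a:
        if normalise(guess) in typed_by_b:
            return normalise(guess)
    return None


label = agreed_label(["dog", "Grass", "ball"], ["park", "grass ", "puppy"])
print(label)
```

The agreed string becomes a label for the image; since the partners cannot communicate, a match is evidence that the label describes what both of them see.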

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th, 2003 and December 10th, 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once
• By 2008
  – 200,000 players
  – 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1,000 with more than 5 labels
  – 83% of game labels also produced by participants
• Manual assessment of labels (‘would you use these labels to describe this image?’)
  – 15 participants, 20 images
  – 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

PHRASE DETECTIVES: THE TASKS

• 2 tasks
  – Find The Culprit (Annotation): user must identify the closest antecedent of a markable, if it is anaphoric
  – Detectives Conference (Validation): user must agree/disagree with a coreference relation entered by another user

NAME THE CULPRIT

READINGS

• Mihalcea, R. and Csomai, A. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM’07, Lisbon, Portugal.

• V. Nguyen & M. Poesio. 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.

• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio. 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.

• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012.

READINGS

• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.

• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi. 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.

Page 20: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

WIKIFICATION

• Text
  – Giotto was called to work in Padua, and also in Rimini
• Wikipedia

Wikification pipeline

[Figure: Raw (hyper)text → Decomposition → Clean Text → Keyword Extraction (Candidate Extraction, Candidate Ranking) → Text with selected keywords → Word Sense Disambiguation (Extract Sense Definitions from Sense Inventory; Knowledge-based: Lesk-like Definition Overlap; Data Driven: Naive Bayes trained on Wikipedia; Voting) → Annotated Text → Recomposition → (Hyper)text with linked keywords]

Keyword Extraction

• Finding important words/phrases in raw text
• Two-stage process
  – Candidate extraction
    • Typical methods: n-grams, noun phrases
  – Candidate ranking
    • Rank the candidates by importance
    • Typical methods
      – Unsupervised: information theoretic
      – Supervised: machine learning using positional and linguistic features

Keyword Extraction using Wikipedia

1. Candidate extraction
• Semi-controlled vocabulary
  – Wikipedia article titles and anchor texts (surface forms)
    • E.g. “USA”, “US” = “United States of America”
  – More than 2,000,000 terms/phrases
  – Vocabulary is broad (e.g. “the”, “a” are included)
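Candidate extraction against a controlled vocabulary amounts to looking up every n-gram of the text. A minimal sketch, assuming whitespace tokenisation and a tiny hand-made vocabulary (both hypothetical stand-ins for the Wikipedia titles and surface forms):

```python
def extract_candidates(text: str, vocabulary: set, max_len: int = 3) -> list:
    """Return every n-gram (up to max_len tokens) that occurs in the vocabulary."""
    tokens = text.split()
    found = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if start + length > len(tokens):
                break
            phrase = " ".join(tokens[start:start + length])
            if phrase in vocabulary:
                found.append(phrase)
    return found


# Toy vocabulary; a real one would hold millions of titles and surface forms
vocabulary = {"Giotto", "Padua", "Rimini", "in", "the"}
candidates = extract_candidates(
    "Giotto was called to work in Padua and also in Rimini", vocabulary)
print(candidates)
```

Because the vocabulary is broad, function words such as "in" come back as candidates too, which is why a ranking step is needed afterwards.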

Keyword Extraction using Wikipedia

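2. Candidate ranking
• tf-idf
  – Wikipedia articles as document collection
• Chi-squared independence of phrase and text
  – The degree to which it appeared more times than expected by chance
• Keyphraseness

$$P(\mathrm{keyword} \mid W) \approx \frac{\mathrm{count}(D_{key})}{\mathrm{count}(D_W)}$$

where count(D_key) is the number of documents in which the term W is selected as a keyword (that is, linked), and count(D_W) is the number of documents in which W appears.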
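Of the ranking measures listed above, tf-idf is the simplest to sketch. The variant below (raw term frequency times the natural log of inverse document frequency) is one common formulation chosen for illustration; the slides do not say which variant was used, and the toy collection stands in for the Wikipedia articles.

```python
import math
from collections import Counter


def tf_idf(term: str, document_tokens: list, collection: list) -> float:
    """Raw term frequency times log inverse document frequency.

    collection: list of token sets, one per document in the reference collection
    """
    tf = Counter(document_tokens)[term]
    df = sum(1 for doc in collection if term in doc)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(collection) / df)


collection = [
    {"the", "Giotto", "painter"},
    {"the", "rugby", "player"},
    {"the", "automobile", "engineer"},
]
document = ["the", "painter", "Giotto", "and", "the", "chapel"]

score_the = tf_idf("the", document, collection)
score_giotto = tf_idf("Giotto", document, collection)
print(score_the, score_giotto)
```

A term that occurs in every document of the collection gets a score of zero however often it appears in the text, while a rarer term is ranked higher.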

Our own Approach (cf. Milne & Witten 2008, 2012; Ratinov et al. 2011)

• Use Wikipedia dump to compute two statistics
  – KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  – COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article

• Two versions of the system
  – UNSUPERVISED: use statistics only
  – SUPERVISED: use distant learning to create training data

KEYPHRASENESS

• the probability that a term t is a link to a Wikipedia article (cf. Milne & Witten’s prior link probability)

• Examples
• The term “Georgia”
  – Is found as a link in 22,631 Wikipedia articles
  – Appears in 75,000 Wikipedia articles
  – keyphraseness = 22,631 / 75,000 = 0.3017466

• Cf. the term “the”: keyphraseness = 0.0006

$$\mathrm{Keyphraseness}(t) = \frac{\mathrm{count}([\,\_ \mid t\,])}{\mathrm{count}(t)}$$

Page 21: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

Wikification pipeline

[Figure: Wikification pipeline. Raw (hyper)text → Decomposition → Clean Text → Keyword Extraction (Candidate Extraction → Candidate Ranking) → Text with selected keywords → Word Sense Disambiguation (Extract Sense Definitions from Sense Inventory; Knowledge-based: Lesk-like Definition Overlap; Data Driven: Naive Bayes trained on Wikipedia; Voting) → Annotated Text → Recomposition → (Hyper)text with linked keywords]

Keyword Extraction

• Finding important words/phrases in raw text
• Two-stage process
  – Candidate extraction
    • Typical methods: n-grams, noun phrases
  – Candidate ranking
    • Rank the candidates by importance
    • Typical methods:
      – Unsupervised: information theoretic
      – Supervised: machine learning using positional and linguistic features
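The candidate-extraction stage can be illustrated with a minimal n-gram sketch. The function name `ngram_candidates`, the `max_n` parameter, and the lowercasing and tokenisation choices are assumptions made for this illustration; the slides do not prescribe them.

```python
import re

def ngram_candidates(text, max_n=3):
    """Return every word n-gram (n = 1..max_n) of `text` as a candidate phrase."""
    # Assumption: simple lowercase word tokenisation is good enough for a sketch.
    tokens = re.findall(r"\w+", text.lower())
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            candidates.append(" ".join(tokens[i:i + n]))
    return candidates

cands = ngram_candidates("Giotto worked in Padua", max_n=2)
print(cands)
```

Every candidate produced this way would then be passed to the ranking stage, which decides which ones are worth keeping.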

Keyword Extraction using Wikipedia

1. Candidate extraction
• Semi-controlled vocabulary
  – Wikipedia article titles and anchor texts (surface forms)
    • E.g. “USA”, “US” = “United States of America”
  – More than 2,000,000 terms/phrases
  – Vocabulary is broad (e.g. “the”, “a” are included)

Keyword Extraction using Wikipedia

2. Candidate ranking
• tf.idf
  – Wikipedia articles as document collection
• Chi-squared independence of phrase and text
  – The degree to which it appeared more times than expected by chance
• Keyphraseness

$$P(\mathrm{keyword} \mid W) \approx \frac{\mathrm{count}(D_{key})}{\mathrm{count}(D_W)}$$
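The first ranking option listed above, tf.idf with Wikipedia articles as the document collection, might be sketched as follows. This is one common tf.idf variant (raw term frequency times log inverse document frequency), chosen here as an assumption; the slides do not say which weighting was used, and the three-document collection is a toy stand-in for Wikipedia.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, collection):
    """Raw term frequency in the document times log(N / document frequency)."""
    tf = Counter(doc_tokens)[term]
    df = sum(1 for other in collection if term in other)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(collection) / df)

# Toy collection standing in for Wikipedia articles (hypothetical data).
collection = [
    ["the", "painter", "giotto"],
    ["the", "city"],
    ["the", "fresco"],
]
score_the = tf_idf("the", collection[0], collection)
score_giotto = tf_idf("giotto", collection[0], collection)
print(score_the, score_giotto)
```

A term that occurs in every document, such as "the", gets a score of zero, while a term concentrated in few documents scores higher.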

Our own Approach (cf. Milne & Witten 2008, 2012; Ratinov et al. 2011)

• Use Wikipedia dump to compute two statistics
  • KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  • COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article

• Two versions of the system
  • UNSUPERVISED: use statistics only
  • SUPERVISED: use distant learning to create training data

KEYPHRASENESS

• The probability that a term t is a link to a Wikipedia article (cf. Milne & Witten’s prior link probability)

• Examples
  • The term “Georgia”
    – is found as a link in 22,631 Wikipedia articles
    – appears in 75,000 Wikipedia articles
    – keyphraseness = 22631/75000 = 0.3017466
  • Cf. the term “the”: keyphraseness = 0.0006

$$\mathrm{Keyphraseness}(t) = \frac{\mathrm{count}([\_ \mid t])}{\mathrm{count}(t)}$$
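The keyphraseness ratio can be checked against the Georgia example with a few lines of Python. The function and argument names are illustrative assumptions; only the two counts come from the slide.

```python
def keyphraseness(link_count, total_count):
    """count([_ | t]) / count(t): share of articles containing t in which t is a link."""
    if total_count == 0:
        return 0.0
    return link_count / total_count

# Counts for the term "Georgia" as given on the slide.
georgia = keyphraseness(22631, 75000)
print(georgia)
```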

COMMONNESS

• The probability that a term t is a link to a SPECIFIC Wikipedia article a

• For example, the surface form “Georgia” was found to be linked to
  – a1 = University_of_Georgia: 166 times
  – Republic_of_Georgia: 18 times
  – Georgia_(United_States): 5 times
  so commonness(t, a1) = 166/(166+18+5) = 0.8783

$$\mathrm{Commonness}(t, a) = \frac{\mathrm{count}([a \mid t])}{\mathrm{count}(t)}$$

(In the worked example the denominator counts only the occurrences of t that are links.)
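The commonness figure for Georgia can be reproduced as below. This is a sketch under the assumption, taken from the worked example, that the denominator is the total number of times the surface form occurs as a link; the names `link_targets` and `commonness` are made up for the illustration.

```python
from collections import Counter

# Link counts for the surface form "Georgia" as given on the slide.
link_targets = Counter({
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
})

def commonness(targets, article):
    """Share of the surface form's links that point to `article`."""
    total = sum(targets.values())
    return targets[article] / total if total else 0.0

score = commonness(link_targets, "University_of_Georgia")
print(round(score, 4))
```

Because the scores are shares of the same total, they sum to one over all target articles of a surface form.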

Extracting dictionaries and statistics from a Wikipedia dump

• Parsing
  • In three phases
• Identify articles of relevance
• Extract (among other things)
  • Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
  • Set of LINKS [article|surface_form]
    • [[Pedanius Dioscorides|Dioscorides]]
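Extracting surface forms and links from wiki markup can be sketched with a regular expression. This is a simplified illustration, not the parser used for the lecture's system: the pattern and the helper names `extract_links` and `build_dictionary` are assumptions, and namespace links such as files or categories are not filtered out.

```python
import re
from collections import Counter, defaultdict

# Matches [[article]] and [[article|surface form]].
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")

def extract_links(wikitext):
    """Return (article, surface_form) pairs found in the markup."""
    pairs = []
    for match in WIKILINK.finditer(wikitext):
        article = match.group(1).strip()
        # Without a pipe, the article title itself is the surface form.
        surface = (match.group(2) or article).strip()
        pairs.append((article, surface))
    return pairs

def build_dictionary(pages):
    """Map each surface form to a Counter of the articles it links to."""
    targets = defaultdict(Counter)
    for text in pages:
        for article, surface in extract_links(text):
            targets[surface][article.replace(" ", "_")] += 1
    return targets

sample = "[[Pedanius Dioscorides|Dioscorides]] wrote on [[botany]]."
pairs = extract_links(sample)
dictionary = build_dictionary([sample])
print(pairs)
```

The resulting per-surface-form counters are exactly the raw material needed for the commonness statistic.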

The Wikipedia Dump from July 2011

– 11,459,639 pages in total
– 12,525,583 links
  • specifying surface word, target, frequency
  – ranked by frequency
• For example, the mention “Georgia” is linked to
  – University_of_Georgia: 166 times
  – Republic_of_Georgia: 18 times
  – Georgia_(United_States): 5 times

Truc-Vien T. Nguyen, May 2012

Some statistics (all Wikidumps from July 2011)

Page Type        English      Italian     Polish
Redirected       4,465,652    323,591     134,148
List_of          138,581      836         5,021
Disambiguation   176,721      6,193       4,553
Relevant         4,361,020    917,354     920,486
Total            11,459,639   1,654,258   1,200,313

Surface forms, titles, articles

Dictionary       English      Italian     Polish
Titles           4,361,020    917,354     920,486
Surface forms    8,829,624    2,484,045   2,482,104
Files            745,724      72,126      n/a
Links            10,871,741   2,917,235   2,937,981

Files in Polish are arranged in a repository different from English/Italian.

Some definitions and figures
• surface form: the occurrence of a mention inside an article
• target article: the target Wiki article a surface form is linked to

The Unsupervised Approach

• Use keyphraseness to identify candidate terms
  • Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  • Retain top 10
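The two steps of the unsupervised approach can be combined in a short sketch. It assumes that "top 10" means the ten highest-ranked target articles per term, which the slide does not spell out; the function name, the dictionary layout, and the toy statistics are also assumptions.

```python
def wikify_unsupervised(candidates, keyphraseness, link_targets,
                        threshold=0.01, top_k=10):
    """Filter terms by keyphraseness, then rank their targets by commonness."""
    result = {}
    for term in candidates:
        # Step 1: drop terms that are rarely used as links.
        if keyphraseness.get(term, 0.0) <= threshold:
            continue
        targets = link_targets.get(term, {})
        total = sum(targets.values())
        if total == 0:
            continue
        # Step 2: rank target articles by commonness and keep the best ones.
        ranked = sorted(((count / total, article)
                         for article, count in targets.items()), reverse=True)
        result[term] = [(article, score) for score, article in ranked[:top_k]]
    return result

# Toy statistics; the Georgia counts are taken from the slides.
kp = {"georgia": 0.3017, "the": 0.0006}
lt = {
    "georgia": {"University_of_Georgia": 166,
                "Republic_of_Georgia": 18,
                "Georgia_(United_States)": 5},
    "the": {"The_The": 1},
}
links = wikify_unsupervised(["the", "georgia"], kp, lt)
print(links)
```

With these numbers, "the" is filtered out by the threshold and "georgia" keeps its three candidate articles in order of commonness.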

The Supervised Approach

• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page

• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten’s NORMALIZED LINK DISTANCE)

$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$

where, following Milne & Witten, A_1 and A_2 are the sets of articles that link to a_1 and a_2, and W is the set of all Wikipedia articles.
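The link-based measure above can be computed directly from sets of incoming links. The sketch below assumes, as in Milne & Witten's formulation, that each article is represented by the set of articles linking to it; the function name and the toy sets are hypothetical. Note that, as written, the formula behaves like a distance: it is 0 for identical link sets and grows as the overlap shrinks.

```python
import math

def link_distance(inlinks_a, inlinks_b, total_articles):
    """Normalized link distance between two articles, given their in-link sets."""
    common = inlinks_a & inlinks_b
    if not common:
        # No shared in-links: treat the articles as maximally distant.
        return float("inf")
    larger = max(len(inlinks_a), len(inlinks_b))
    smaller = min(len(inlinks_a), len(inlinks_b))
    return ((math.log(larger) - math.log(len(common)))
            / (math.log(total_articles) - math.log(smaller)))

# Toy in-link sets: article ids are arbitrary integers.
a1 = {1, 2, 3, 4}
a2 = {3, 4, 5}
d = link_distance(a1, a2, total_articles=1000)
print(d)
```

A supervised wikifier could use this value, or a similarity derived from it, as one feature alongside commonness.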

Training a supervised wikifier

• Using WIKIPEDIA ITSELF as a source of training materials (see next)

Results on standard datasets

APPROACH               AQUAINT   WIKIPEDIA
Our approach           85.66     84.37
Milne & Witten 2008    83.61     80.31
Ratinov et al. 2011    84.52     90.20

Wikifying queries: the Bridgeman datasets

• BAL data sets
  – 1049 Query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  – 100 Query set
    • 3 annotators, each up to 3 manual annotations

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS   RESULTS
First candidate        64.77
Among first 2          71.59
First 3                75.42
First 4                77.18
First 5                78.32

Accuracy up by 17 points (36%)

Results for the GALATEAS languages and Arabic

LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78

The GALATEAS D2W web services

• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves throughput of 600 characters per second
• Integrated with LangLog tool

Use of the service in LangLog

(See Domoina’s demo)

Other applications

• The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …

WIKIPEDIA FOR NER

http://en.wikipedia.org/wiki/FCC

The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

WIKIPEDIA FOR NER

Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action

WIKIPEDIA

WIKIPEDIA FOR NER

• Wikipedia has been used in NER systems
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging towards proper ENTITY DISAMBIGUATION

Distant learning

• Automatically extract examples
  • Positive examples: a mention paired with the Wikipedia page it links to
  • Negative examples: the same mention paired with the pages that similar mentions link to
• Use positive and negative examples to train the model

The Supervised Approach Using Wikipedia links to generate training data

• Example
  – “Giotto was called to work in Padua, and also in Rimini” (sentence taken from Wikipedia text, with links available)
  – Candidates: Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)

• Dataset
  – +1 Giotto was called to work -- Giotto_di_Bondone
  – -1 Giotto was called to work -- Giotto_Griffiths
  – -1 Giotto was called to work -- Giotto_Bizzarrini

http://en.wikipedia.org/wiki/Giotto_di_Bondone
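The Giotto dataset can be generated mechanically from a linked sentence and a surface-form dictionary. The following is a minimal sketch of that idea; the function name, the tuple layout, and the `link_targets` dictionary are assumptions for the illustration, and a real system would attach feature vectors rather than raw text.

```python
def make_training_examples(context, surface, gold_article, link_targets):
    """Label the linked article +1 and every other candidate for the mention -1."""
    examples = []
    for candidate in link_targets.get(surface, []):
        label = +1 if candidate == gold_article else -1
        examples.append((label, context, candidate))
    return examples

# Candidate articles for the surface form "Giotto", as listed on the slide.
link_targets = {
    "Giotto": ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
}
examples = make_training_examples(
    "Giotto was called to work in Padua, and also in Rimini",
    "Giotto",
    "Giotto_di_Bondone",
    link_targets,
)
for label, context, candidate in examples:
    print(label, candidate)
```

Because the gold link comes from Wikipedia's own markup, no manual annotation is needed to obtain the labels.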

MORE ADVANCED USES OF WIKIPEDIA

• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

• Taxonomic information: category structure
• Attributes: infobox, text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Induce a subsumption hierarchy

INFOBOXES

• Collaborative content

• Semi-structured data

{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]] [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]] [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
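Because infobox markup is semi-structured, its fields can be pulled out with very little code. The sketch below assumes one `| key = value` field per line and ignores nested templates; `parse_infobox` and `strip_links` are hypothetical helper names, not part of any Wikipedia or DBpedia tool.

```python
import re

FIELD = re.compile(r"^\|\s*(\w+)\s*=\s*(.*)$")
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")

def parse_infobox(lines):
    """Collect `| key = value` lines into a dictionary."""
    fields = {}
    for line in lines:
        match = FIELD.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields

def strip_links(value):
    """Replace [[target|label]] and [[target]] with their visible text."""
    return LINK.sub(r"\1", value)

sample = [
    "| name = Edgar Allan Poe",
    "| birth_place = [[Boston, Massachusetts]] [[United States|US]]",
    "| movement = [[Romanticism]], [[Dark romanticism]]",
    "| magnum_opus = The Raven",
]
fields = parse_infobox(sample)
print(fields["name"], "|", strip_links(fields["birth_place"]))
```

Attribute-value pairs obtained this way are the kind of data that projects such as DBpedia turn into structured statements.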

DBPEDIA

DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web

The DBpedia Dataset

• 1,600,000 concepts
• including
  – 58,000 persons
  – 70,000 places
  – 35,000 music albums
  – 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories

REPRESENTING EXTRACTED INFORMATION

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data.

Extracting Infobox Data (RDF Representation)

http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
  dbpedia:native_name        “Calgary”
  dbpedia:altitude           “1048”
  dbpedia:population_city    “988193”
  dbpedia:population_metro   “1079310”
  mayor_name       → dbpedia:Dave_Bronconnier
  governing_body   → dbpedia:Calgary_City_Council

SPARQL

• SPARQL is a query language for RDF
  • RDF is a directed, labeled graph data format for representing information in the Web
  • This specification defines the syntax and semantics of the SPARQL query language for RDF

• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

The DBpedia SPARQL Endpoint

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century
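One of the example questions, "all films by Quentin Tarantino", might be phrased as a SPARQL query roughly as follows. This is a sketch only: the `dbo:director` property and the `dbo:`/`dbr:` prefixes reflect the DBpedia vocabulary as commonly used and are assumptions not taken from the slides, as are the `format` request parameter and the helper name `build_request_url`. The code only builds the request URL and does not contact the endpoint.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:director links a film to its director.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
"""

def build_request_url(query, endpoint=ENDPOINT):
    """Encode the query as an HTTP GET request against a SPARQL endpoint."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)

url = build_request_url(QUERY)
print(url)
```

The resulting URL could be fetched with any HTTP client, for example `urllib.request.urlopen`, and the JSON response parsed to list the matching film resources.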

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis

www.phrasedetectives.com

WEB COLLABORATION PROJECTS

• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – CyCORP
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University

OPEN MIND COMMONSENSE

CONCEPT NET

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52,000 assertions)
  IsA: (IsA "apple" "fruit")
  PartOf: (PartOf "CPU" "computer")
  PropertyOf: (PropertyOf "coffee" "wet")
  MadeOf: (MadeOf "bread" "flour")
  DefinedAs: (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
  PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf: (SubeventOf "play sport" "score goal")
  FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
  CapableOf: (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
  LocationOf: (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
  EffectOf: (EffectOf "view video" "entertainment")
  DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf: (DesireOf "person" "not be depressed")
  MotivationOf: (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
  IsUsedFor: (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine: (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
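Assertions of the form (Relation concept1 concept2) map naturally onto triples. The sketch below stores a handful of the example assertions from the table and indexes them by their first concept; the data structure and the names `ASSERTIONS` and `build_index` are illustrative assumptions, not ConceptNet's actual storage format or API.

```python
from collections import defaultdict

# A few of the example assertions from the table, as (relation, left, right).
ASSERTIONS = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("MadeOf", "bread", "flour"),
    ("CapableOf", "dentist", "pull tooth"),
    ("UsedFor", "fireplace", "burn wood"),
]

def build_index(assertions):
    """Index assertions by their first concept for quick lookup."""
    index = defaultdict(list)
    for relation, left, right in assertions:
        index[left].append((relation, right))
    return index

index = build_index(ASSERTIONS)
print(index["dentist"])
```

Looking up a concept then returns every relation it participates in as the first argument, which is the basic operation behind traversing such a semantic network.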

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions

• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch

ESP

• The first GWAP, developed by von Ahn and his group (2003, 2004)

• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision

• The goal: label the majority of the images on the Web

ESP the game

ESP THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can’t communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
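The matching rule described above is simple enough to sketch in a few lines. This is an illustration of agreement-based scoring only, not the game's actual implementation: the normalisation step and the function names are assumptions.

```python
def normalise(label):
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(label.lower().split())

def matching_labels(guesses_a, guesses_b):
    """Return the labels typed by player A that player B also typed."""
    seen_b = {normalise(guess) for guess in guesses_b}
    return [normalise(guess) for guess in guesses_a if normalise(guess) in seen_b]

agreed = matching_labels(["dog", "Puppy", "grass"], ["animal", "puppy"])
print(agreed)
```

A label on which two independent players agree is likely to describe the image well, which is why agreement can serve as both the scoring rule and a quality filter.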

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th 2003 and December 10th 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once

• By 2008
  – 200,000 players
  – 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  – choose 10 labels among those produced, and look at which images are returned

• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants

• Manual assessment of labels (‘would you use these labels to describe this image?’)
  – 15 participants, 20 images
  – 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

PHRASE DETECTIVES: THE TASKS

• 2 tasks
  – Find The Culprit (Annotation): user must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): user must agree/disagree with a coreference relation entered by another user

NAME THE CULPRIT

READINGS

• R. Mihalcea and A. Csomai (2007). Wikify!: Linking documents to encyclopedic knowledge. Proceedings of CIKM’07, Lisbon, Portugal.

• V. Nguyen & M. Poesio (2012). Entity disambiguation and linking over queries using Encyclopedic Knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.

• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti & M. Poesio (2013). GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.

• V. Nastase & M. Strube (2012). Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.

READINGS

• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.

• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi (2013). Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.

READINGS

bull Mihalcea R and Csomai A Wikify linking documents to encyclopedic knowledge Proceedings of CIKMrsquo07 Lisbon Portugal

bull V Nguyen amp M Poesio 2012 Entity disambiguation and linking over queries using Encyclopedic Knowledge Proceedings of 6th workshop on Analytics for Noisy Unstructured Text Data

bull D Lungley M Trevisan V Nguyen M Althobaiti M Poesio 2013 GALATEAS D2W A Multi-lingual Disambiguation to Wikipedia Web Service Proc Of ENRICH

bull V Nastaseamp M Strube Transforming Wikipedia into a large scale multilingual concept network Artificial Intelligence 2012

READINGS

bull L von Ahn and L Dabbish (2008) Designing games with a purpose Communications of the ACM v 51 n8 58-67

bull Poesio Chamberlain Kruschwitz Robaldo amp Ducceschi 2013 Phrase Detectives Utilizing Collective Intelligence for Internet-Scale Language Resource Creation ACM Transactions on Intelligent Interactive Systems


Keyword Extraction using Wikipedia

1. Candidate extraction
• Semi-controlled vocabulary
  - Wikipedia article titles and anchor texts (surface forms)
    • E.g. “USA”, “US” = “United States of America”
  - More than 2,000,000 terms/phrases
  - Vocabulary is broad (e.g. “the” and “a” are included)

Keyword Extraction using Wikipedia

2. Candidate ranking
• tf-idf
  - Wikipedia articles as document collection
• Chi-squared independence of phrase and text
  - The degree to which it appeared more times than expected by chance
• Keyphraseness

$$P(\mathrm{keyword} \mid W) \approx \frac{\mathrm{count}(D_{\mathrm{key}})}{\mathrm{count}(D_W)}$$

Our own Approach (cf. Milne & Witten 2008, 2012; Ratinov et al. 2011)

• Use Wikipedia dump to compute two statistics
  - KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  - COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
• Two versions of the system
  - UNSUPERVISED: use statistics only
  - SUPERVISED: use distant learning to create training data

KEYPHRASENESS

• the probability that a term t is a link to a Wikipedia article (cf. Milne & Witten’s prior link probability)

$$\mathrm{Keyphraseness}(t) = \frac{\mathrm{count}([\_ \mid t])}{\mathrm{count}(t)}$$

• Examples
  - The term Georgia
    • is found as a link in 22,631 Wikipedia articles
    • appears in 75,000 Wikipedia articles
    • keyphraseness = 22631/75000 = 0.3017466
  - Cf. the term “the”: keyphraseness = 0.0006
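The keyphraseness ratio can be checked with a few lines of Python. This is only a sketch: the function name and its two arguments are illustrative, and the counts are the ones quoted on the slide for the term Georgia.

```python
def keyphraseness(link_count: int, occurrence_count: int) -> float:
    """Share of the articles containing a term in which the term appears as a link."""
    if occurrence_count == 0:
        # A term that never occurs cannot be a link; avoid dividing by zero.
        return 0.0
    return link_count / occurrence_count


# Counts quoted on the slide for the term "Georgia".
georgia = keyphraseness(link_count=22631, occurrence_count=75000)
print(round(georgia, 4))  # 0.3017
```

A very frequent word such as “the” is almost never used as a link, which is why its keyphraseness is close to zero.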

COMMONNESS

• the probability that a term t is a link to a SPECIFIC Wikipedia article a

$$\mathrm{Commonness}(t, a) = \frac{\mathrm{count}([a \mid t])}{\mathrm{count}(t)}$$

• for example, the surface form Georgia was found to be linked to
  - a1 = University_of_Georgia 166 times
  - Republic_of_Georgia 18 times
  - Georgia_(United_States) 5 times
• commonness(t, a1) = 166/(166+18+5) = 0.8783
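The worked example divides by the total number of times the surface form occurs as a link (166 + 18 + 5 = 189), so the sketch below takes that reading of count(t). The function and variable names are illustrative; the counts are those given for Georgia.

```python
from collections import Counter


def commonness(link_counts: Counter, article: str) -> float:
    """Share of a surface form's links that point to the given article."""
    total = sum(link_counts.values())
    if total == 0:
        return 0.0
    return link_counts[article] / total


# Link targets observed for the surface form "Georgia" (counts from the slide).
georgia_links = Counter({
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
})

print(round(commonness(georgia_links, "University_of_Georgia"), 4))  # 0.8783
```

Computed this way, the commonness values of all targets of one surface form sum to 1, so they can be used directly to rank candidate articles.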

Extracting dictionaries and statistics from a Wikipedia dump

• Parsing
  - In three phases
• Identify articles of relevance
• Extract (among other things)
  - Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
  - Set of LINKS [article|surface_form]
    • [[Pedanius Dioscorides|Dioscorides]]
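A minimal sketch of the link-extraction step, assuming links use the [[article|surface_form]] wikitext notation shown above. The regular expression, function names, and sample sentence are illustrative, not the parser used for the lecture; a real dump also needs handling for file, category, and interlanguage links.

```python
import re
from collections import Counter, defaultdict

# [[target]] or [[target|surface form]]; nested brackets are not handled.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def extract_links(wikitext: str) -> list[tuple[str, str]]:
    """Return (target article, surface form) pairs found in a piece of wikitext."""
    links = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        # Without a pipe, the surface form is the article title itself.
        surface = (match.group(2) or target).strip()
        links.append((target, surface))
    return links


def build_dictionary(links: list[tuple[str, str]]) -> dict[str, Counter]:
    """Map each surface form to a counter of the articles it links to."""
    table: dict[str, Counter] = defaultdict(Counter)
    for target, surface in links:
        table[surface][target] += 1
    return table


# Made-up sentence for illustration only.
sample = ("The herbalist [[Pedanius Dioscorides|Dioscorides]] wrote in "
          "[[Greek language|Greek]] about [[medicine]].")
links = extract_links(sample)
dictionary = build_dictionary(links)
print(links)
```

The per-surface-form counters are the raw material for the commonness statistic.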

The Wikipedia Dump from July 2011

• 11,459,639 pages in total
• 12,525,583 links
  - specifying surface word, target, frequency
  - ranked by frequency
• for example, the mention Georgia is linked to
  - University_of_Georgia 166 times
  - Republic_of_Georgia 18 times
  - Georgia_(United_States) 5 times

Truc-Vien T. Nguyen, May 2012

Some statistics (all Wikidumps from July 2011)

Page Type        English      Italian     Polish
Redirected       4,465,652    323,591     134,148
List_of          138,581      836         5,021
Disambiguation   176,721      6,193       4,553
Relevant         4,361,020    917,354     920,486
Total            11,459,639   1,654,258   1,200,313

Surface forms, titles, articles

Dictionary      English      Italian     Polish
Titles          4,361,020    917,354     920,486
Surface forms   8,829,624    2,484,045   2,482,104
Files           745,724      72,126      n/a
Links           10,871,741   2,917,235   2,937,981

Note: files in Polish are arranged in a repository different from English/Italian.

Some definitions and figures
• surface form: the occurrence of a mention inside an article
• target article: the target Wiki article a surface form is linked to

The Unsupervised Approach

• Use keyphraseness to identify candidate terms
  - Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  - Retain top 10
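The two steps can be sketched as a filter followed by a ranking. The sketch assumes that keyphraseness and commonness have already been computed and are available as plain dictionaries, and it reads “retain top 10” as keeping the ten best candidate articles per term; both are assumptions, as are all names used here.

```python
def wikify_unsupervised(candidates, keyphraseness, commonness,
                        threshold=0.01, top_n=10):
    """Filter terms by keyphraseness, then rank their target articles by commonness."""
    results = {}
    for term in candidates:
        # Step 1: drop terms that are rarely used as links.
        if keyphraseness.get(term, 0.0) <= threshold:
            continue
        # Step 2: rank candidate articles by commonness, highest first.
        ranked = sorted(commonness.get(term, {}).items(),
                        key=lambda pair: pair[1], reverse=True)
        results[term] = ranked[:top_n]
    return results


# Toy statistics based on the figures quoted for "Georgia" and "the".
keyphraseness_table = {"Georgia": 0.3017, "the": 0.0006}
commonness_table = {
    "Georgia": {
        "Republic_of_Georgia": 18 / 189,
        "University_of_Georgia": 166 / 189,
        "Georgia_(United_States)": 5 / 189,
    },
}

output = wikify_unsupervised(["the", "Georgia"], keyphraseness_table, commonness_table)
print(list(output))            # ['Georgia']
print(output["Georgia"][0][0])  # University_of_Georgia
```

No training data is needed here: both tables come straight from link counts in the dump.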

The Supervised Approach

• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten’s NORMALIZED LINK DISTANCE)

$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$
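A small sketch of the relatedness formula. The slide does not define its symbols, so this assumes, following Milne & Witten's measure, that A1 and A2 are the sets of articles linking to a1 and a2 and that W is the set of all Wikipedia articles. The function name and the toy sets are illustrative.

```python
import math


def relatedness(inlinks_a1, inlinks_a2, total_articles):
    """Normalized link distance between two articles, computed from their in-link sets."""
    a1, a2 = set(inlinks_a1), set(inlinks_a2)
    common = a1 & a2
    if not common:
        # log(0) is undefined: the articles share no in-links.
        return None
    numerator = math.log(max(len(a1), len(a2))) - math.log(len(common))
    denominator = math.log(total_articles) - math.log(min(len(a1), len(a2)))
    if denominator == 0:
        return None
    return numerator / denominator


# Toy example: 8 and 4 in-links, 2 of them shared, in a 1024-article collection.
links_to_a1 = {1, 2, 3, 4, 5, 6, 7, 8}
links_to_a2 = {7, 8, 9, 10}
value = relatedness(links_to_a1, links_to_a2, total_articles=1024)
```

As written, the value shrinks as the two articles share more in-links, so smaller values indicate more closely related articles.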

Training a supervised wikifier

• Using WIKIPEDIA ITSELF as source of training materials (see next)

Results on standard datasets

APPROACH              AQUAINT   WIKIPEDIA
Our approach          85.66     84.37
Milne & Witten 2008   83.61     80.31
Ratinov et al. 2011   84.52     90.20

Wikifying queries: the Bridgeman datasets

• BAL data sets
  - 1049 Query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  - 100 Query set
    • 3 annotators, each up to 3 manual annotations

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS   RESULTS
First candidate        64.77
Among first 2          71.59
First 3                75.42
First 4                77.18
First 5                78.32

Accuracy up by 17 points (36%)

Results for the GALATEAS languages and Arabic

LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78

The GALATEAS D2W web services

• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves throughput of 600 characters per second
• Integrated with LangLog tool

Use of the service in LangLog

(See Domoina’s demo)

Other applications

• The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …

WIKIPEDIA FOR NER

http://en.wikipedia.org/wiki/FCC

The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

WIKIPEDIA FOR NER

Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action

WIKIPEDIA

WIKIPEDIA FOR NER

• Wikipedia has been used in NER systems
  - As a source of features for normal NER
  - To automatically create training materials (DISTANT LEARNING)
  - To go beyond NE tagging towards proper ENTITY DISAMBIGUATION

Distant learning

• Automatically extract examples
  - Positive examples from mention-to-link Wikipedia page
  - Negative examples from similar mentions with other links
• Use positive and negative examples to train model

The Supervised Approach Using Wikipedia links to generate training data

• Example
  - Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
  - Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  - +1 Giotto was called to work -- Giotto_di_Bondone
  - -1 Giotto was called to work -- Giotto_Griffiths
  - -1 Giotto was called to work -- Giotto_Bizzarrini

http://en.wikipedia.org/wiki/Giotto_di_Bondone
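The Giotto dataset can be generated mechanically from a linked mention and its candidate articles. The sketch below assumes the candidate list for a surface form is already known (for instance from the surface-form dictionary); the function name and tuple layout are illustrative.

```python
def make_training_examples(context, linked_article, candidates):
    """Label the article actually linked in Wikipedia +1 and every other candidate -1."""
    examples = []
    for candidate in candidates:
        label = 1 if candidate == linked_article else -1
        examples.append((label, context, candidate))
    return examples


giotto_candidates = ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"]
examples = make_training_examples(
    context="Giotto was called to work",
    linked_article="Giotto_di_Bondone",
    candidates=giotto_candidates,
)
for label, context, article in examples:
    print(f"{label:+d} {context} -- {article}")
```

Every link in Wikipedia yields one positive example and as many negatives as there are competing candidates, so no manual annotation is required.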

MORE ADVANCED USES OF WIKIPEDIA

• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

• Taxonomic information: category structure
• Attributes: infobox, text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Induce a subsumption hierarchy

INFOBOXES

• Collaborative content
• Semi-structured data

{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
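Because an infobox is a list of `| key = value` fields, its attributes can be read with very little code. The sketch below assumes one field per line, as in the usual wikitext layout, and does not interpret nested templates or links inside the values; the function name is illustrative.

```python
def parse_infobox(wikitext: str) -> dict[str, str]:
    """Collect the '| key = value' fields of an infobox into a dictionary."""
    fields = {}
    for line in wikitext.splitlines():
        line = line.strip()
        # Field lines start with a pipe and contain an equals sign.
        if not line.startswith("|") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '=' themselves.
        key, _, value = line[1:].partition("=")
        fields[key.strip()] = value.strip()
    return fields


poe = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| magnum_opus = The Raven
}}"""

fields = parse_infobox(poe)
print(fields["name"])         # Edgar Allan Poe
print(fields["magnum_opus"])  # The Raven
```

Each key/value pair obtained this way corresponds to one attribute of the entity the article describes.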

DBPEDIA

DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web

The DBpedia Dataset

• 1,600,000 concepts
• including
  - 58,000 persons
  - 70,000 places
  - 35,000 music albums
  - 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories

REPRESENTING EXTRACTED INFORMATION

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data.

Extracting Infobox Data (RDF Representation)

Wikipedia page: http://en.wikipedia.org/wiki/Calgary
DBpedia resource: http://dbpedia.org/resource/Calgary
  dbpedia:native_name “Calgary”
  dbpedia:altitude “1048”
  dbpedia:population_city “988193”
  dbpedia:population_metro “1079310”
  mayor_name → dbpedia:Dave_Bronconnier
  governing_body → dbpedia:Calgary_City_Council
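The Calgary example can be written down as subject/predicate/object triples. This sketch uses plain Python tuples rather than an RDF library; the `dbpedia:` prefix on `mayor_name` and `governing_body` and the `match` helper are assumptions made for illustration.

```python
# Each triple is (subject, predicate, object); values are those shown for Calgary.
TRIPLES = [
    ("dbpedia:Calgary", "dbpedia:native_name", "Calgary"),
    ("dbpedia:Calgary", "dbpedia:altitude", "1048"),
    ("dbpedia:Calgary", "dbpedia:population_city", "988193"),
    ("dbpedia:Calgary", "dbpedia:population_metro", "1079310"),
    ("dbpedia:Calgary", "dbpedia:mayor_name", "dbpedia:Dave_Bronconnier"),
    ("dbpedia:Calgary", "dbpedia:governing_body", "dbpedia:Calgary_City_Council"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every argument that is not None."""
    return [
        (s, p, o)
        for s, p, o in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


mayor = match(TRIPLES, subject="dbpedia:Calgary", predicate="dbpedia:mayor_name")
print(mayor[0][2])  # dbpedia:Dave_Bronconnier
```

Literal values such as the population stay strings, while objects such as the mayor are themselves resources that can be the subject of further triples.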

SPARQL

• SPARQL is a query language for RDF
  - RDF is a directed, labeled graph data format for representing information in the Web
  - This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

The DBpedia SPARQL Endpoint

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  - Give me all sitcoms that are set in NYC
  - All tennis players from Moscow
  - All films by Quentin Tarantino
  - All German musicians that were born in Berlin in the 19th century
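One of the example questions, “all films by Quentin Tarantino”, might be phrased as the query below. This is a sketch only: the property `dbo:director`, the resource name `dbr:Quentin_Tarantino`, and the `format` parameter are assumptions about the current DBpedia data and endpoint, not taken from the slides. The code only builds the request URL and sends nothing over the network.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:director links a film to its director.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
"""


def build_request_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query as an HTTP GET request URL for the endpoint."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

Fetching the resulting URL with any HTTP client would return the variable bindings for `?film`, provided the assumed property and resource names match the live dataset.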

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  - Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  - Open Mind Commonsense (Singh) (collecting facts)
  - Semantic Wikis

www.phrasedetectives.com

WEB COLLABORATION PROJECTS

• Open Mind Common Sense (Singh)
• Crater mapping (results) (Kanefsky)
• Learner, Learner2, 1001 Paraphrases (Chklovski)
• FACTory (Cycorp)
• Hot or Not (8 Days)
• ESP, Phetch, Verbosity, Peekaboom (von Ahn)
• Galaxy Zoo (Oxford University)

OPEN MIND COMMONSENSE

CONCEPT NET

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

• THINGS (52,000 assertions)
  - IsA (IsA "apple" "fruit")
  - PartOf (PartOf "CPU" "computer")
  - PropertyOf (PropertyOf "coffee" "wet")
  - MadeOf (MadeOf "bread" "flour")
  - DefinedAs (DefinedAs "meat" "flesh of animal")
• EVENTS (38,000 assertions)
  - PrerequisiteEventOf (PrerequisiteEventOf "read letter" "open envelope")
  - SubeventOf (SubeventOf "play sport" "score goal")
  - FirstSubeventOf (FirstSubeventOf "start fire" "light match")
  - LastSubeventOf (LastSubeventOf "attend classical concert" "applaud")
• AGENTS (104,000 assertions)
  - CapableOf (CapableOf "dentist" "pull tooth")
• SPATIAL (36,000 assertions)
  - LocationOf (LocationOf "army" "in war")
• TEMPORAL (time & sequence)
• CAUSAL (17,000 assertions)
  - EffectOf (EffectOf "view video" "entertainment")
  - DesirousEffectOf (DesirousEffectOf "sweat" "take shower")
• AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  - DesireOf (DesireOf "person" "not be depressed")
  - MotivationOf (MotivationOf "play game" "compete")
• FUNCTIONAL (115,000 assertions)
  - UsedFor (UsedFor "fireplace" "burn wood")
  - CapableOfReceivingAction (CapableOfReceivingAction "drink" "serve")
• ASSOCIATION K-LINES (1.25 million assertions)
  - SuperThematicKLine (SuperThematicKLine "western civilization" "civilization")
  - ThematicKLine (ThematicKLine "wedding dress" "veil")
  - ConceptuallyRelatedTo (ConceptuallyRelatedTo "bad breath" "mint")

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  - ESP
  - Verbosity
  - TagATune
• Other games
  - Peekaboom
  - Phetch

ESP

• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  - To train image search engines
  - To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web

ESP the game

ESP THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can’t communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  - Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
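The matching rule can be illustrated with a short function that compares the two players' guesses. This is a simplification under stated assumptions: it ignores timing and scoring, and the lower-casing of guesses and the function name are illustrative choices not described in the slides.

```python
def first_agreed_label(guesses_a, guesses_b):
    """Return the first guess of player A that player B also typed, or None."""
    typed_by_b = {guess.strip().lower() for guess in guesses_b}
    for guess in guesses_a:
        normalized = guess.strip().lower()
        if normalized in typed_by_b:
            # Both players independently produced this string: it becomes a label.
            return normalized
    return None


player_a = ["dog", "grass", "puppy"]
player_b = ["animal", "Puppy", "ball"]
print(first_agreed_label(player_a, player_b))  # puppy
```

Since the partners cannot communicate, a string both of them type is likely to be a sensible description of the image, which is what makes the agreed label usable as annotation.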

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th 2003 and December 10th 2003
  - 13,630 players
  - 1.2 million labels for 293,760 images
  - 80% of players played more than once
• By 2008
  - 200,000 players
  - 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  - choose 10 labels among those produced, and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  - 15 participants, 20 images among the 1000 with more than 5 labels
  - 83% of game labels also produced by participants
• Manual assessment of labels (‘would you use these labels to describe this image?’)
  - 15 participants, 20 images
  - 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

PHRASE DETECTIVES: THE TASKS

www.phrasedetectives.org

• 2 tasks
  - Find The Culprit (Annotation): user must identify the closest antecedent of a markable if it is anaphoric
  - Detectives Conference (Validation): user must agree/disagree with a coreference relation entered by another user

NAME THE CULPRIT

READINGS

• R. Mihalcea & A. Csomai (2007). Wikify!: Linking documents to encyclopedic knowledge. Proceedings of CIKM’07, Lisbon, Portugal.
• V. Nguyen & M. Poesio (2012). Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti & M. Poesio (2013). GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase & M. Strube (2012). Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.

READINGS

• L. von Ahn & L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi (2013). Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.

Page 24: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

Keyword Extraction using Wikipedia

2 Candidate rankingbull tf idf

ndash Wikipedia articles as document collection

bull Chi-squared independence of phrase and textndash The degree to which it appeared more times than

expected by chance

bull Keyphraseness

)(

)()|(

W

key

Dcount

DcountWkeywordP

Our own Approach(Cfr Milne amp Witten 2008 2012 Ratinov et al 2011)

bull Use Wikipedia dump to compute two statistics

bull KEYPHRASENESS prior probability that a term is used to refer to a Wikipedia article

bull COMMONNESS probability that phrase is used to refer to specific Wikipedia article

bull Two versions of system

bull UNSUPERVISED use statistics only

bull SUPERVISED use distant learning to create training data

KEYPHRASENESS

bull the probability that a term t is a link to a Wikipedia article

(cfr Milne amp Wittenrsquos prior link probability)

bull Examplesbull The term Georgia

ndash Is found as a link in 22631 Wikipedia articlesndash appears in 75000 Wikipedia articles keyphraseness = 2263175000 = 03017466

bull Cfr the term ldquotherdquo keyphraseness = 00006

euro

Keyphraseness(t) =count([_ | t])

count(t)

COMMONNESS

bull the probability that a term t is a link to a SPECIFIC Wikipedia article a

bull for example the surface form Georgia was found to be linked to

ndash a1 = University_of_Georgia 166 times

commonness(t a1) = 166(166+18+5) = 08783

ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times

euro

Commonness(ta) =count([a | t])

count(t)

Extracting dictionaries and statistics from a Wikipedia dump

bull Parsingbull In three phases

bull Identify articles of relevancebull Extract (among other things)

bull Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)

bull Set of LINKS [article|surface_form]

bull [[Pedanius Dioscorides|Dioscorides]]

The Wikipedia Dump from July 2011

ndash 11459639 pages in totalndash 12525583 links

bull specifying surface word target frequency

ndash ranked by frequency bull for example the mention Georgia is linked to

ndash University_of_Georgia 166 times ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times

May 2012 29Truc-Vien T Nguyen

Some statistics (all Wikidumps from July 2011)

Page Type English Italian Polish

Redirected 4465652 323591 134148

List_of 138581 836 5021

Disambiguation 176721 6193 4553

Relevant 4361020 917354 920486

Total 11459639 1654258 1200313

Surface forms titles articles

Dictionary English Italian Polish

Titles 4361020 917354 920486

Surface forms 8829624 2484045 2482104

Files 745724 72126 na

Links 10871741 2917235 2937981

Files in Polish are arranged in a repository different from EnglishItalian

Some definitions and figuressurface form the occurence of a mention inside an articletarget article the target Wiki article a surface form linked to

The Unsupervised Approach

bull Use Keyphraseness to identify candidate termsbull Retain terms whose keyphraseness is

above a certain threshold (currently 001)bull Use commonness to rank

bull Retain top 10

The Supervised Approach

bull Features in addition to commonness use measures of SIMILARITY between text containing the term and the candidate Wikipedia page

bull RELATEDNESS a measure of similarity between the LINKS (cfr MilneampWittenrsquos NORMALIZED LINK DISTANCE)

euro

Re latedness(a1a2) =log(max( A1 A2 )) minus log( A1 cap A2 ))

log(W ) minus log(min( A1 A2 ))

Training a supervised wikifier

bull Using WIKIPEDIA ITSELF as source of training materials (see next)

Results on standard datasets

APPROACH AQUAINT WIKIPEDIA

Our approach 8566 8437

MilneampWitten 2008 8361 8031

Ratinov et al 2011 8452 9020

bull BAL Data setsndash 1049 Query set

bull 1 annotator up to 3 manual annotationsbull 1 automatic annotation

ndash 100 Query setbull 3 annotators each up to 3 manual annotations

Wikifying queries the Bridgeman datasets

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS RESULTS

First candidate 6477

Among first 2 7159

First 3 7542

First 4 7718

First 5 7832

Accuracy up by 17 points (36)

Results for the GALATEAS languages and Arabic

LANGUAGE WIKIPEDIA SIZE RESULTS (on Wikipedia subset)

English 4M articles 8437

Italian 1M 7964

French 14M 76-77

German 16M 72-73

Dutch 16M 70-71

Polish 900K 6081

Arabic 200K 8078

The GALATEAS D2W web services

bull Available as open sourcebull Deployed within LinguaGridbull API based on the Morphosyntactic Annotation

Framework (MAF) an ISO standardbull Tested on 15M queries achieves throughput

of 600 characters per secondbull Integrated with LangLog tool

Use of the service in LangLog

(See Domoinarsquos demo)

Other applications

bull The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [ATampT] By a 4-0 vote it allowed ATampT to continue offering special discount packages to big customers called Tariff 12 rejecting appeals by ATampT competitors that the discounts were illegal hellip

WIKIPEDIA FOR NER

httpenwikipediaorgwikiFCC

The Federal Communications Commission (FCC) is an independent United States government agency created directed and empowered by Congressional statute (see 47 USC sect 151 and 47 USC sect 154)

WIKIPEDIA FOR NER

Numberofglucocorticoidreceptorsinlymphocytesandtheirsensitivitytohormoneaction

WIKIPEDIA

WIKIPEDIA FOR NER

bull Wikipedia has been used in NER systemsndash As a source of features for normal NER ndash To automatically create training materials

(DISTANT LEARNING)ndash To go beyond NE tagging towards proper ENTITY

DISAMBIGUATION

Distant learning

bull Automatically extract examples

bull positive examples from mention-to-link Wikipedia page

bull Negative examples from similar mentions with other links

bull Use positive and negative examples to train model

The Supervised Approach Using Wikipedia links to generate training data

bull Examplendash Giotto was called to work in Padua and also in Rimini (sentence taken from Wikipedia text with links avalable)ndash Giotto_di_Bondone (painter) Giotto_Griffiths (Welsh rugby player)

Giotto_Bizzarrini (automobile engineer)

bull Datasetndash +1 Giotto was called to work -- Giotto_di_Bondonendash -1 Giotto was called to work -- Giotto_Griffithsndash -1 Giotto was called to work -- Giotto_Bizzarrini

May 2012 48Truc-Vien T Nguyen

httpenwikipediaorgwikiGiotto_di_Bondone

MORE ADVANCED USES OF WIKIPEDIA

bull As a source of ONTOLOGICAL KNOWLEDGEbull DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

bull Taxonomic information category structurebull Attributes infobox text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

bull Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

bull Induce a subsumption hierarchy

INFOBOXES

bull Collaborative content

bull Semi-structured data

Infobox Writer| bgcolour = silver| name = Edgar Allan Poe| image = Edgar_Allan_Poe_2jpg| caption = This [[daguerreotype]] of Poe was taken in 1848 | birth_date = birth date|1809|1|19|mf=y| birth_place = [[Boston Massachusetts]] [[United States|US]]| death_date = death date and age|1849|10|07|1809|01|19| death_place = [[Baltimore Maryland]] [[United States|US]]| occupation = Poet short story writer editor literary critic| movement = [[Romanticism]] [[Dark romanticism]]| genre = [[Horror fiction]] [[Crime fiction]] [[Detective fiction]]| magnum_opus = The Raven| spouse = [[Virginia Eliza Clemm Poe]]

DBpediaorg is a effort to bull extract structured information from Wikipediabull make this information available on the Web under an

open licensebull interlink the DBpedia dataset with other datasets on the

Web

DBPEDIA

1048607 1600000 concepts

1048607 including

1048698 58000 persons

1048698 70000 places

1048698 35000 music albums

1048698 12000 films

1048607 described by 91 million triples

1048607 using 8141 different properties

1048607 557000 links to pictures

1048607 1300000 links external web pages

1048607 207000 Wikipedia categories

1048607 75000 YAGO categories

The DBpedia Dataset

The DBpediaorg project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web It uses the SPARQL query language to query this data At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data

REPRESENTING EXTRACTED INFORMATION

httpenwikipediaorgwikiCalgary

httpdbpediaorgresourceCalgary

dbpedianative_name Calgaryrdquo

dbpediaaltitude ldquo1048rdquo

dbpediapopulation_city ldquo988193rdquo

dbpediapopulation_metro ldquo1079310rdquo

mayor_name

dbpediaDave_Bronconnier

governing_body

dbpediaCalgary_City_Council

Extracting Infobox Data (RDF Representation)

SPARQL

bull SPARQL is a query language for RDF

bullRDF is a directed labeled graph data format for representing information in the Web bullThis specification defines the syntax and semantics of the SPARQL query language for RDF

bull SPARQL can be used to express queries across diverse data sources whether the data is stored natively as RDF or viewed as RDF via middleware

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century

The DBpedia SPARQL Endpoint
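As an illustration of one of the example questions, the sketch below builds (but does not send) a request URL for "all films by Quentin Tarantino". Several things are assumptions rather than content of the lecture: the `dbo:director` property and the `dbr:Quentin_Tarantino` resource name, and the `format` parameter, which is a Virtuoso convention. Only the `query` parameter comes from the SPARQL protocol itself.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:director and dbr:Quentin_Tarantino.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
"""

def build_request_url(endpoint, query):
    """Encode a SPARQL query as a GET request URL for the endpoint."""
    params = {
        "query": query.strip(),
        # Assumption: the server accepts a 'format' parameter.
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urlencode(params)

url = build_request_url(ENDPOINT, QUERY)
print(url)
```

Fetching this URL with any HTTP client should return the bindings for `?film`, provided the endpoint is reachable and the assumed vocabulary matches the current dataset.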

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

www.phrasedetectives.com

• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – Cycorp
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University

WEB COLLABORATION PROJECTS


OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52,000 assertions)
  IsA: (IsA "apple" "fruit")
  PartOf: (PartOf "CPU" "computer")
  PropertyOf: (PropertyOf "coffee" "wet")
  MadeOf: (MadeOf "bread" "flour")
  DefinedAs: (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
  PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf: (SubeventOf "play sport" "score goal")
  FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
  CapableOf: (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
  LocationOf: (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
  EffectOf: (EffectOf "view video" "entertainment")
  DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf: (DesireOf "person" "not be depressed")
  MotivationOf: (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
  UsedFor: (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine: (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")

CONCEPT NET
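ConceptNet assertions of this kind are relation/concept/concept triples, so a small index is enough to look up what is known about a concept. This is a minimal sketch over a handful of the examples above; `build_index` is a hypothetical helper, not part of any ConceptNet toolkit.

```python
from collections import defaultdict

# A few of the example assertions, written as (relation, concept, concept).
ASSERTIONS = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("MadeOf", "bread", "flour"),
    ("CapableOf", "dentist", "pull tooth"),
    ("UsedFor", "fireplace", "burn wood"),
    ("EffectOf", "view video", "entertainment"),
]

def build_index(assertions):
    """Index assertions by their first concept for quick lookup."""
    index = defaultdict(list)
    for relation, left, right in assertions:
        index[left].append((relation, right))
    return index

index = build_index(ASSERTIONS)
print(index["apple"])
```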

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch

ESP

• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web

ESP the game

ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
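The matching rule can be sketched in a few lines. This is a simplified illustration, not the game's actual implementation: it ignores timing, and the normalisation (lower-casing and collapsing whitespace) is an assumption made here.

```python
def normalize(label):
    """Lower-case a label and collapse surrounding or repeated whitespace."""
    return " ".join(label.lower().split())

def agreed_labels(guesses_a, guesses_b):
    """Return the labels typed by both players, in player A's order."""
    seen_b = {normalize(guess) for guess in guesses_b}
    agreed = []
    for guess in guesses_a:
        label = normalize(guess)
        if label in seen_b and label not in agreed:
            agreed.append(label)
    return agreed

player_a = ["Dog", "grass", "ball"]
player_b = ["puppy", "ball ", "dog"]
print(agreed_labels(player_a, player_b))
```

Any label returned here is one both players produced independently, which is what makes it a plausible description of the image.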

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th 2003 and December 10th 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once
• By 2008
  – 200,000 players
  – 50 million labels

QUALITY OF THE LABELS
• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  – 15 participants, 20 images
  – 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

• 2 tasks
  – Find The Culprit (Annotation): user must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): user must agree/disagree with a coreference relation entered by another user


PHRASE DETECTIVES THE TASKS

NAME THE CULPRIT

READINGS

• Mihalcea, R. and Csomai, A. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal
• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH
• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012

READINGS

• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems


Our own Approach (cf. Milne & Witten 2008, 2012; Ratinov et al. 2011)

• Use Wikipedia dump to compute two statistics
  • KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  • COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
• Two versions of the system
  • UNSUPERVISED: use statistics only
  • SUPERVISED: use distant learning to create training data

KEYPHRASENESS

• the probability that a term t is a link to a Wikipedia article (cf. Milne & Witten's prior link probability)

$$\mathrm{Keyphraseness}(t) = \frac{\mathrm{count}([\_ \mid t])}{\mathrm{count}(t)}$$

• Examples
  • The term Georgia
    – is found as a link in 22,631 Wikipedia articles
    – appears in 75,000 Wikipedia articles
    – keyphraseness = 22631/75000 = 0.3017466
  • Cf. the term "the": keyphraseness = 0.0006
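Keyphraseness can be computed at article level from raw wikitext: count the articles in which the term occurs as link text, and divide by the articles in which it occurs at all. The sketch below does this on a four-article toy corpus invented for illustration. The substring test for "occurs at all" is a deliberate simplification, and the helper names are hypothetical.

```python
import re

# Matches [[target]] and [[target|surface form]].
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]*))?\]\]")

def anchors(wikitext):
    """Surface forms used as link text in one article."""
    return {m.group(2) or m.group(1) for m in LINK.finditer(wikitext)}

def plain_text(wikitext):
    """Replace every link by its surface form."""
    return LINK.sub(lambda m: m.group(2) or m.group(1), wikitext)

def keyphraseness(term, articles):
    """Articles where the term is a link, over articles where it appears."""
    linked = sum(1 for text in articles if term in anchors(text))
    present = sum(1 for text in articles if term in plain_text(text))
    return linked / present if present else 0.0

# Toy corpus (not real Wikipedia text).
ARTICLES = [
    "[[Georgia (country)|Georgia]] borders the Black Sea.",
    "He moved to Georgia in 1990.",
    "The [[University of Georgia|Georgia]] team won.",
    "Tbilisi is a city.",
]
print(keyphraseness("Georgia", ARTICLES))
```

In the toy corpus "Georgia" is linked in two of the three articles that mention it. On the full dump the same ratio gives the 22631/75000 figure quoted above.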

COMMONNESS

• the probability that a term t is a link to a SPECIFIC Wikipedia article a

$$\mathrm{Commonness}(t, a) = \frac{\mathrm{count}([a \mid t])}{\mathrm{count}(t)}$$

  (in the example below, count(t) is taken over the linked occurrences of t)

• for example, the surface form Georgia was found to be linked to
  – a1 = University_of_Georgia 166 times
  – Republic_of_Georgia 18 times
  – Georgia_(United_States) 5 times
• commonness(t, a1) = 166/(166+18+5) = 0.8783
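The Georgia example can be reproduced directly from the link counts. This is a minimal sketch; the nested-dictionary layout and the function name `commonness` are choices made here for illustration.

```python
# Link counts for the surface form "Georgia", as given on the slide.
LINK_COUNTS = {
    "Georgia": {
        "University_of_Georgia": 166,
        "Republic_of_Georgia": 18,
        "Georgia_(United_States)": 5,
    }
}

def commonness(surface_form, article, link_counts):
    """Share of the surface form's links that point to the given article."""
    targets = link_counts.get(surface_form, {})
    total = sum(targets.values())
    return targets.get(article, 0) / total if total else 0.0

score = commonness("Georgia", "University_of_Georgia", LINK_COUNTS)
print(round(score, 4))
```

The commonness values of all targets of one surface form sum to 1, so they can be used directly to rank candidate articles.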

Extracting dictionaries and statistics from a Wikipedia dump

• Parsing
  • In three phases
• Identify articles of relevance
• Extract (among other things)
  • Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
  • Set of LINKS [article|surface_form]
    • [[Pedanius Dioscorides|Dioscorides]]
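The link-extraction step can be sketched with a regular expression over the wikitext. This is an illustration under simplifying assumptions: it handles only plain `[[article]]` and `[[article|surface_form]]` links, the sample sentence is invented, and turning spaces into underscores in article names is a convention assumed here.

```python
import re
from collections import Counter

# Matches [[article]] and [[article|surface form]].
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def extract_links(wikitext):
    """Return (article, surface_form) pairs for every link in the text."""
    pairs = []
    for match in WIKILINK.finditer(wikitext):
        article = match.group(1).strip()
        surface = (match.group(2) or article).strip()
        pairs.append((article.replace(" ", "_"), surface))
    return pairs

# Invented sample sentence built around the slide's example link.
TEXT = (
    "[[Pedanius Dioscorides|Dioscorides]] described the plant; "
    "see also [[Pedanius Dioscorides]] and [[Materia medica]]."
)
links = extract_links(TEXT)
frequency = Counter(links)          # (article, surface form) -> count
surface_forms = {surface for _, surface in links}
print(links)
```

Counting the pairs over a whole dump gives the surface form, target, and frequency records described on the next slide.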

The Wikipedia Dump from July 2011

– 11,459,639 pages in total
– 12,525,583 links
  • specifying surface word, target, frequency
  • ranked by frequency
• for example, the mention Georgia is linked to
  – University_of_Georgia 166 times
  – Republic_of_Georgia 18 times
  – Georgia_(United_States) 5 times

May 2012, Truc-Vien T. Nguyen

Some statistics (all Wikidumps from July 2011)

Page Type        English      Italian     Polish
Redirected       4,465,652    323,591     134,148
List_of          138,581      836         5,021
Disambiguation   176,721      6,193       4,553
Relevant         4,361,020    917,354     920,486
Total            11,459,639   1,654,258   1,200,313

Surface forms titles articles

Dictionary      English      Italian     Polish
Titles          4,361,020    917,354     920,486
Surface forms   8,829,624    2,484,045   2,482,104
Files           745,724      72,126      n/a
Links           10,871,741   2,917,235   2,937,981

Files in Polish are arranged in a repository different from English/Italian

Some definitions and figures
• surface form: the occurrence of a mention inside an article
• target article: the Wiki article that a surface form links to

The Unsupervised Approach

• Use keyphraseness to identify candidate terms
  • Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  • Retain top 10
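The two steps of the unsupervised approach can be put together as follows. This is a sketch under stated assumptions, not the authors' system: the statistics are passed in as precomputed dictionaries, the entry for "the" is hypothetical, and `wikify_unsupervised` is a name chosen here.

```python
def wikify_unsupervised(terms, keyphraseness, link_counts, threshold=0.01, top_n=10):
    """Map each retained term to its top candidate articles with commonness."""
    output = {}
    for term in terms:
        # Step 1: keep only terms whose keyphraseness is above the threshold.
        if keyphraseness.get(term, 0.0) <= threshold:
            continue
        targets = link_counts.get(term, {})
        total = sum(targets.values())
        if total == 0:
            continue
        # Step 2: rank candidate articles by commonness, keep the top ones.
        ranked = sorted(targets.items(), key=lambda item: item[1], reverse=True)
        output[term] = [(article, count / total) for article, count in ranked[:top_n]]
    return output

KEYPHRASENESS = {"Georgia": 0.3017466, "the": 0.0006}
LINK_COUNTS = {
    "Georgia": {
        "University_of_Georgia": 166,
        "Republic_of_Georgia": 18,
        "Georgia_(United_States)": 5,
    },
    "the": {"The": 1},  # hypothetical entry, for illustration only
}
result = wikify_unsupervised(["the", "Georgia"], KEYPHRASENESS, LINK_COUNTS)
print(result)
```

With these numbers "the" is filtered out by the threshold, and "Georgia" keeps its three candidates with University_of_Georgia ranked first.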

The Supervised Approach

• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)

$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$
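The formula can be evaluated directly on sets of links. In Milne & Witten's formulation, A1 and A2 are the sets of articles that link to a1 and a2, and W is the set of all Wikipedia articles; the sketch below assumes that reading. As written, the value is 0 when the two link sets coincide and grows as they share fewer links. Returning `None` when the sets do not overlap is a choice made here, since log(0) is undefined.

```python
import math

def relatedness(links_a, links_b, total_articles):
    """Evaluate the slide's formula on two sets of linking articles."""
    common = links_a & links_b
    if not common:
        return None  # log(0) is undefined: no shared links
    larger = max(len(links_a), len(links_b))
    smaller = min(len(links_a), len(links_b))
    denominator = math.log(total_articles) - math.log(smaller)
    if denominator == 0:
        return None  # degenerate case: the smaller set covers all articles
    return (math.log(larger) - math.log(len(common))) / denominator

# Toy link sets over a hypothetical collection of 100 articles.
A1 = {1, 2, 3, 4}
A2 = {3, 4, 5}
value = relatedness(A1, A2, 100)
print(value)
```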

Training a supervised wikifier

• Using WIKIPEDIA ITSELF as source of training materials (see next)

Results on standard datasets

APPROACH              AQUAINT   WIKIPEDIA
Our approach          85.66     84.37
Milne & Witten 2008   83.61     80.31
Ratinov et al. 2011   84.52     90.20

• BAL data sets
  – 1049 Query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  – 100 Query set
    • 3 annotators, each up to 3 manual annotations

Wikifying queries the Bridgeman datasets

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS   RESULTS
First candidate        64.77
Among first 2          71.59
First 3                75.42
First 4                77.18
First 5                78.32

Accuracy up by 17 points (36%)
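The rows of this table are accuracy-at-k figures: the share of queries whose correct article appears among the first k candidates returned. A minimal sketch of that measure follows. The rankings and gold answers are hypothetical, not the Bridgeman data, and the lecture does not spell out its exact evaluation procedure.

```python
def accuracy_at_k(ranked_lists, gold_articles, k):
    """Share of queries whose gold article is among the first k candidates."""
    if not gold_articles:
        return 0.0
    hits = sum(
        1
        for ranked, gold in zip(ranked_lists, gold_articles)
        if gold in ranked[:k]
    )
    return hits / len(gold_articles)

# Hypothetical candidate rankings for three queries.
RANKED = [
    ["Giotto_di_Bondone", "Giotto_Griffiths"],
    ["Republic_of_Georgia", "University_of_Georgia", "Georgia_(United_States)"],
    ["Calgary_City_Council", "Calgary"],
]
GOLD = ["Giotto_di_Bondone", "University_of_Georgia", "Calgary_Flames"]

for k in (1, 2, 5):
    print(k, accuracy_at_k(RANKED, GOLD, k))
```

Accuracy can only stay equal or rise as k grows, which matches the pattern in the table.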

Results for the GALATEAS languages and Arabic

LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78

The GALATEAS D2W web services

• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves throughput of 600 characters per second
• Integrated with LangLog tool

Use of the service in LangLog

(See Domoina's demo)

Other applications

• The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …

WIKIPEDIA FOR NER

http://en.wikipedia.org/wiki/FCC

The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

WIKIPEDIA FOR NER

Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action

WIKIPEDIA

WIKIPEDIA FOR NER

• Wikipedia has been used in NER systems
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging towards proper ENTITY DISAMBIGUATION

Distant learning

• Automatically extract examples
  • Positive examples from mention-to-link Wikipedia page
  • Negative examples from similar mentions with other links
• Use positive and negative examples to train model

The Supervised Approach Using Wikipedia links to generate training data

• Example
  – Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
  – Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  – +1 Giotto was called to work -- Giotto_di_Bondone
  – -1 Giotto was called to work -- Giotto_Griffiths
  – -1 Giotto was called to work -- Giotto_Bizzarrini
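The Giotto dataset can be generated mechanically from one linked mention and its candidate list. A minimal sketch: the function name `make_examples` and the tuple layout are assumptions made here, and a real system would also attach features to each example.

```python
def make_examples(context, linked_article, candidates):
    """Label the article the mention links to +1 and every other candidate -1."""
    return [
        (1 if candidate == linked_article else -1, context, candidate)
        for candidate in candidates
    ]

CONTEXT = "Giotto was called to work"
CANDIDATES = ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"]

examples = make_examples(CONTEXT, "Giotto_di_Bondone", CANDIDATES)
for label, text, article in examples:
    print(f"{label:+d} {text} -- {article}")
```

Running this over every linked mention in the dump yields training data without any manual annotation, which is the point of distant learning.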


http://en.wikipedia.org/wiki/Giotto_di_Bondone

MORE ADVANCED USES OF WIKIPEDIA

• As a source of ONTOLOGICAL KNOWLEDGE
  • DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

• Taxonomic information: category structure
• Attributes: infobox, text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Start with the category tree
• Induce a subsumption hierarchy

INFOBOXES

• Collaborative content
• Semi-structured data




ndash Open Mind Commonsense (Singh) (collecting facts)

ndash Semantic Wikis

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

wwwphrasedetectivescom

bull Open Mind Common Sense ndash Singh

bull Crater mapping (results) ndash Kanefsky

bull Learner Learner2 1001 Paraphrases ndash Chklovski

bull FACTory ndash CyCORP

bull Hot or Not ndash 8 Days

bull ESP Phetch Verbosity Peekaboom ndash von Ahn

bull Galaxy Zoo ndash Oxford University

WEB COLLABORATION PROJECTS

wwwphrasedetectivescom

OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52000 assertions)

IsA (IsA apple fruit) Part of (PartOf CPU computer) PropertyOf (PropertyOf coffee wet) MadeOf (MadeOf bread flour) DefinedAs (DefinedAs meat flesh of animal)

EVENTS (38000 assertions)

PrerequisiteeventOf (PrerequisiteEventOf read letter open envelope) SubeventOf (SubeventOf play sport score goal) FirstSubeventOF (FirstSubeventOf start fire light match) LastSubeventOf (LastSubeventOf attend classical concert applaud)

AGENTS (104000 assertions)

CapableOf (CapableOf dentist pull tooth)

SPATIAL (36000 assertions)

LocationOf (LocationOf army in war)

TEMPORAL time amp sequence

CAUSAL (17000 assertions)

EffectOf (EffectOf view video entertainment) DesirousEffectOf (DesirousEffectOf sweat take shower)

AFFECTIONAL (mood feeling emotions) (34000 assertions)

DesireOf (DesireOf person not be depressed) MotivationOf (MotivationOf play game compete)

FUNCTIONAL (115000 assertions)

IsUsedFor (UsedFor fireplace burn wood) CapableOfReceivingAction (CapableOfReceivingAction drink serve)

ASSOCIATION K-LINES (125 million assertions)

SuperThematicKLine (SuperThematicKLine western civilization civilization) ThematicKLine (ThematicKLine wedding dress veil) ConceptuallyRelatedTo (ConceptuallyRelatedTo bad breath mint)

CONCEPT NET

GAMES WITH A PURPOSE

bull Luis von Ahn pioneered a new approach to resource creation on the Web GAMES WITH A PURPOSE or GWAP in which people as a side effect of playing perform tasks lsquocomputers are unable to performrsquo (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

bull GWAP do not rely on altruism or financial incentives to entice people to perform certain actions

bull The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

bull Games at wwwgwapcomndash ESPndash Verbosityndash TagATune

bull Other gamesndash Peekaboomndash Phetch

ESP

bull The first GWAP developed by von Ahn and their group (2003 2004)

bull The problem obtain accurate description of images to be usedndash To train image search enginesndash To develop machine learning approaches to vision

bull The goal label the majority of the images on the Web

ESP the game

ESP THE GAMEbull Two partners are picked at random from the

large number of players onlinebull They are not told who their partner is and canrsquot

communicate with thembull They are both shown the same imagebull The goal guess how their partner will describe

the image and type that descriptionndash Hence the ESP game

bull If any of the strings typed by one player matches the string typed by the other player they score points

THE TASK

SCORING BY MATCHING

SOME STATISTICS

bull In the 4 months between August 9th 2003 and December 10th 2003ndash 13630 playersndash 12 million labels for 293760 imagesndash 80 of players played more than once

bull By 2008 ndash 200000 playersndash 50 million labels

QUALITY OF THE LABELSbull For IMAGE SEARCH

ndash choose 10 labels among those produced and look at which images are returned

bull Compare labels produced by players with labels produced by participants in an experimentndash 15 participants 20 images among the 1000 with more

than 5 labelsndash 83 of game labels also produced by participants

bull Manual assessment of labels (lsquowould you use these labels to describe this imagersquo)ndash 15 participants 20 imagesndash 85 of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

wwwphrasedetectivesorg

bull 2 tasks

ndash Find The Culprit (Annotation)User must identify the closest antecedent of a markable if it is anaphoric

ndash Detectives Conference (Validation)User must agreedisagree with a coreference relation entered by another user

wwwphrasedetectivescom

PHRASE DETECTIVES THE TASKS

NAME THE CULPRIT

READINGS

bull Mihalcea R and Csomai A Wikify linking documents to encyclopedic knowledge Proceedings of CIKMrsquo07 Lisbon Portugal

bull V Nguyen amp M Poesio 2012 Entity disambiguation and linking over queries using Encyclopedic Knowledge Proceedings of 6th workshop on Analytics for Noisy Unstructured Text Data

bull D Lungley M Trevisan V Nguyen M Althobaiti M Poesio 2013 GALATEAS D2W A Multi-lingual Disambiguation to Wikipedia Web Service Proc Of ENRICH

bull V Nastaseamp M Strube Transforming Wikipedia into a large scale multilingual concept network Artificial Intelligence 2012

READINGS

bull L von Ahn and L Dabbish (2008) Designing games with a purpose Communications of the ACM v 51 n8 58-67

bull Poesio Chamberlain Kruschwitz Robaldo amp Ducceschi 2013 Phrase Detectives Utilizing Collective Intelligence for Internet-Scale Language Resource Creation ACM Transactions on Intelligent Interactive Systems

  • 807 - TEXT ANALYTICS
  • WIKIPEDIA
  • Slide 3
  • Slide 4
  • WIKIPEDIA FOR TEXT ANALYTICS
  • Wikipedia as Thesaurus for text classification clustering
  • Wikipedia as Thesaurus
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
  • Slide 13
  • The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
  • WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
  • Using Wikipedia Categories for text classification
  • WIKIPEDIA FOR TEXT CLASSIFICATION
  • USING WIKIPEDIA FOR TEXT CLASSIFICATION
  • TEXT WIKIFICATION
  • WIKIFICATION
  • Wikification pipeline
  • Keyword Extraction
  • Keyword Extraction using Wikipedia
  • Slide 24
  • Slide 25
  • KEYPHRASENESS
  • COMMONNESS
  • Slide 28
  • The Wikipedia Dump from July 2011
  • Some statistics (all Wikidumps from July 2011)
  • Surface forms titles articles
  • Slide 32
  • Slide 33
  • Training a supervised wikifier
  • Results on standard datasets
  • Wikifying queries the Bridgeman datasets
  • Results on Bridgeman 1000 Y3
  • Results for the GALATEAS languages and Arabic
  • The GALATEAS D2W web services
  • Use of the service in LangLog
  • Other applications
  • WIKIPEDIA FOR NER
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • The Supervised Approach Using Wikipedia links to generate training data
  • MORE ADVANCED USES OF WIKIPEDIA
  • SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
  • Wikipedia category network
  • Deriving a taxonomy from Wikipedia (AAAI 2007)
  • Slide 53
  • INFOBOXES
  • Slide 56
  • Slide 57
  • Slide 58
  • SPARQL
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • OPEN MIND COMMONSENSE
  • Slide 65
  • CONCEPT NET
  • GAMES WITH A PURPOSE
  • GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
  • EXAMPLES OF GWAP
  • ESP
  • ESP the game
  • ESP THE GAME
  • THE TASK
  • SCORING BY MATCHING
  • SOME STATISTICS
  • QUALITY OF THE LABELS
  • GOOGLE IMAGE LABELLER
  • Slide 78
  • RESULTS
  • PHRASE DETECTIVES
  • Slide 81
  • NAME THE CULPRIT
  • READINGS
  • Slide 84
Page 27: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

COMMONNESS

• the probability that a term t is a link to a SPECIFIC Wikipedia article a

$$\mathrm{Commonness}(t, a) = \frac{\mathrm{count}([a \mid t])}{\mathrm{count}(t)}$$

• for example, the surface form Georgia was found to be linked to
  - a1 = University_of_Georgia 166 times
  - Republic_of_Georgia 18 times
  - Georgia_(United_States) 5 times

commonness(t, a1) = 166 / (166 + 18 + 5) = 0.8783
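As a minimal sketch of the commonness computation, the following Python snippet recomputes the Georgia example from raw link counts. The `link_counts` dictionary layout and the `commonness` function name are illustrative assumptions, not part of the lecture's own tooling.

```python
from collections import Counter

# Hypothetical link statistics: surface form -> Counter of target articles.
# The counts are the ones quoted for "Georgia" in the lecture.
link_counts = {
    "Georgia": Counter({
        "University_of_Georgia": 166,
        "Republic_of_Georgia": 18,
        "Georgia_(United_States)": 5,
    }),
}


def commonness(link_counts, term, article):
    """Fraction of the links with surface form `term` that point to `article`."""
    targets = link_counts.get(term, {})
    total = sum(targets.values())
    if total == 0:
        return 0.0  # term never used as a link: treat commonness as 0
    return targets.get(article, 0) / total


print(round(commonness(link_counts, "Georgia", "University_of_Georgia"), 4))  # 0.8783
```

By construction, the commonness values of all targets of one surface form sum to 1.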

Extracting dictionaries and statistics from a Wikipedia dump

• Parsing
• In three phases
• Identify articles of relevance
• Extract (among other things)
  - Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
  - Set of LINKS [article|surface_form]
    • [[Pedanius Dioscorides|Dioscorides]]
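The extraction step can be sketched with a regular expression over wikitext. This is a simplified illustration, not the parser used for the lecture's statistics: the names `LINK_RE`, `extract_links` and `count_links` are hypothetical, and the pattern ignores nested links, templates and `File:` links, which a real dump parser would have to handle.

```python
import re
from collections import Counter, defaultdict

# Matches [[Article]] and [[Article|surface form]]; an optional #section is skipped.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]*))?\]\]")


def extract_links(wikitext):
    """Return (article, surface_form) pairs for the internal links in `wikitext`."""
    pairs = []
    for match in LINK_RE.finditer(wikitext):
        title = match.group(1).strip()
        # Without a pipe, the surface form is the article title itself.
        surface = (match.group(2) or title).strip()
        pairs.append((title.replace(" ", "_"), surface))
    return pairs


def count_links(pairs):
    """Build surface form -> Counter(article) statistics from link pairs."""
    counts = defaultdict(Counter)
    for article, surface in pairs:
        counts[surface][article] += 1
    return counts


text = "See [[Pedanius Dioscorides|Dioscorides]] and [[Romanticism]]."
print(extract_links(text))
```

The resulting per-surface-form counters are the kind of table from which commonness can be read off directly.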

The Wikipedia Dump from July 2011

- 11,459,639 pages in total
- 12,525,583 links
  • specifying surface word, target, frequency
  • ranked by frequency
• for example, the mention Georgia is linked to
  - University_of_Georgia 166 times
  - Republic_of_Georgia 18 times
  - Georgia_(United_States) 5 times

Truc-Vien T. Nguyen, May 2012

Some statistics (all Wikidumps from July 2011)

Page type          English     Italian      Polish
Redirected       4,465,652     323,591     134,148
List_of            138,581         836       5,021
Disambiguation     176,721       6,193       4,553
Relevant         4,361,020     917,354     920,486
Total           11,459,639   1,654,258   1,200,313

Surface forms titles articles

Dictionary         English     Italian      Polish
Titles           4,361,020     917,354     920,486
Surface forms    8,829,624   2,484,045   2,482,104
Files              745,724      72,126         n/a
Links           10,871,741   2,917,235   2,937,981

Files in Polish are arranged in a repository different from English/Italian.

Some definitions and figures
• surface form: the occurrence of a mention inside an article
• target article: the target Wiki article a surface form is linked to

The Unsupervised Approach

• Use Keyphraseness to identify candidate terms
  - Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  - Retain top 10
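The two steps of the unsupervised approach can be sketched as a filter followed by a ranking. The sketch assumes precomputed `keyphraseness` and `link_counts` dictionaries, reads the threshold on the slide as 0.01, and interprets "top 10" as the ten highest-ranked target articles per term; all three are assumptions, and the function name, the toy keyphraseness scores and the `Calling_(telephony)` title are made up for illustration.

```python
def unsupervised_wikify(terms, keyphraseness, link_counts, threshold=0.01, top_n=10):
    """Keep terms above the keyphraseness threshold, then rank their targets by commonness."""
    result = {}
    for term in terms:
        if keyphraseness.get(term, 0.0) <= threshold:
            continue  # too rarely used as a link to be a candidate
        targets = link_counts.get(term, {})
        total = sum(targets.values())
        if total == 0:
            continue
        ranked = sorted(targets.items(), key=lambda item: item[1], reverse=True)
        # Commonness of each retained target = its share of the term's links.
        result[term] = [(article, count / total) for article, count in ranked[:top_n]]
    return result


# Toy inputs: the keyphraseness scores are invented for this example.
keyphraseness = {"Georgia": 0.2, "called": 0.0005}
link_counts = {
    "Georgia": {"University_of_Georgia": 166, "Republic_of_Georgia": 18,
                "Georgia_(United_States)": 5},
    "called": {"Calling_(telephony)": 1},
}
print(unsupervised_wikify(["Georgia", "called"], keyphraseness, link_counts))
```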

The Supervised Approach

• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)

$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$
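A small sketch of this measure, assuming (following Milne & Witten) that A1 and A2 are the sets of articles linking to a1 and a2 and that W is the set of all Wikipedia articles. As written, the formula behaves like a distance: it is 0 for identical link sets and grows as the overlap shrinks. Returning infinity when the sets share no links is a choice made here, not something the slide specifies; the function name is hypothetical.

```python
import math


def link_distance(links_a1, links_a2, n_articles):
    """Evaluate the relatedness formula on two sets of incoming links.

    n_articles is |W|, the total number of articles in Wikipedia.
    """
    a1, a2 = set(links_a1), set(links_a2)
    common = a1 & a2
    if not common:
        return math.inf  # log(0) is undefined; treat as maximally distant
    numerator = math.log(max(len(a1), len(a2))) - math.log(len(common))
    denominator = math.log(n_articles) - math.log(min(len(a1), len(a2)))
    if denominator == 0:
        return 0.0  # both sets cover all of W
    return numerator / denominator


print(link_distance({1, 2, 3, 4}, {3, 4}, 100))
```

Systems built on this measure often use one minus the value so that larger numbers mean more related; whether the lecture's system does so is not stated.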

Training a supervised wikifier

• Using WIKIPEDIA ITSELF as source of training materials (see next)

Results on standard datasets

APPROACH               AQUAINT   WIKIPEDIA
Our approach             85.66       84.37
Milne & Witten 2008      83.61       80.31
Ratinov et al. 2011      84.52       90.20

• BAL data sets
  - 1049 query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  - 100 query set
    • 3 annotators, each up to 3 manual annotations

Wikifying queries the Bridgeman datasets

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS   RESULTS
First candidate          64.77
Among first 2            71.59
First 3                  75.42
First 4                  77.18
First 5                  78.32

Accuracy up by 17 points (36%)
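The "correct candidate is among the first k" figures correspond to what is commonly called accuracy at k. The following sketch shows how such a figure could be computed from ranked system output; the function name and the toy data are illustrative assumptions and are unrelated to the Bridgeman queries.

```python
def accuracy_at_k(ranked_candidates, gold, k):
    """Share of queries whose gold article appears among the first k candidates."""
    hits = sum(1 for candidates, answer in zip(ranked_candidates, gold)
               if answer in candidates[:k])
    return hits / len(gold)


# Toy data: three queries, each with a ranked candidate list and a gold article.
ranked = [["a", "b"], ["c", "d"], ["e", "f"]]
gold = ["a", "d", "x"]
print(accuracy_at_k(ranked, gold, 1), accuracy_at_k(ranked, gold, 2))
```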

Results for the GALATEAS languages and Arabic

LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78

The GALATEAS D2W web services

• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries, achieves throughput of 600 characters per second
• Integrated with LangLog tool

Use of the service in LangLog

(See Domoina's demo)

Other applications

• The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …

WIKIPEDIA FOR NER

http://en.wikipedia.org/wiki/FCC

The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

WIKIPEDIA FOR NER

Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action

WIKIPEDIA

WIKIPEDIA FOR NER

• Wikipedia has been used in NER systems
  - As a source of features for normal NER
  - To automatically create training materials (DISTANT LEARNING)
  - To go beyond NE tagging towards proper ENTITY DISAMBIGUATION

Distant learning

• Automatically extract examples
  - Positive examples: a mention paired with the Wikipedia page it links to
  - Negative examples: the same mention paired with other pages that similar mentions link to
• Use positive and negative examples to train model

The Supervised Approach Using Wikipedia links to generate training data

• Example
  - Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
  - Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  - +1 Giotto was called to work -- Giotto_di_Bondone
  - -1 Giotto was called to work -- Giotto_Griffiths
  - -1 Giotto was called to work -- Giotto_Bizzarrini

http://en.wikipedia.org/wiki/Giotto_di_Bondone
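Generating the training set from a linked sentence can be sketched in a few lines. The function name `make_training_examples` is hypothetical; the candidate list is assumed to come from the surface-form dictionary for "Giotto", and the classifier that would consume these examples is left out.

```python
def make_training_examples(context, linked_article, candidates):
    """Label the article actually linked from `context` +1 and all other candidates -1."""
    return [
        (1 if candidate == linked_article else -1, context, candidate)
        for candidate in candidates
    ]


candidates = ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"]
for label, context, article in make_training_examples(
        "Giotto was called to work", "Giotto_di_Bondone", candidates):
    print(f"{label:+d} {context} -- {article}")
```

Because the gold label comes from an existing Wikipedia link, no manual annotation is needed to produce these examples.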

MORE ADVANCED USES OF WIKIPEDIA

• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

• Taxonomic information: category structure
• Attributes: infobox, text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Induce a subsumption hierarchy

INFOBOXES

• Collaborative content
• Semi-structured data

{{Infobox Writer
| bgcolour    = silver
| name        = Edgar Allan Poe
| image       = Edgar_Allan_Poe_2.jpg
| caption     = This [[daguerreotype]] of Poe was taken in 1848
| birth_date  = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date  = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation  = Poet, short story writer, editor, literary critic
| movement    = [[Romanticism]], [[Dark romanticism]]
| genre       = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse      = [[Virginia Eliza Clemm Poe]]
}}
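To illustrate why infoboxes count as semi-structured data, here is a minimal sketch that turns the attribute lines of an infobox into a Python dictionary. It assumes the usual wikitext layout with one `| key = value` field per line; multi-line values and nested templates are not handled, and the names `FIELD_RE` and `parse_infobox` are hypothetical.

```python
import re

# One infobox field per line: "| key = value".
FIELD_RE = re.compile(r"\s*\|\s*([^=|]+?)\s*=\s*(.*)")


def parse_infobox(wikitext):
    """Return a dict of attribute name -> raw wikitext value."""
    fields = {}
    for line in wikitext.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


sample = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| magnum_opus = The Raven
}}"""
print(parse_infobox(sample)["name"])  # Edgar Allan Poe
```

The values remain raw wikitext; resolving links and templates inside them is the harder part of infobox extraction.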

DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web

DBPEDIA

• 1,600,000 concepts
• including
  - 58,000 persons
  - 70,000 places
  - 35,000 music albums
  - 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories

The DBpedia Dataset

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At the Developers Guide to Semantic Web Toolkits you can find a development toolkit in your preferred programming language to process DBpedia data.

REPRESENTING EXTRACTED INFORMATION

http://en.wikipedia.org/wiki/Calgary

http://dbpedia.org/resource/Calgary

dbpedia:native_name "Calgary"
dbpedia:altitude "1048"
dbpedia:population_city "988193"
dbpedia:population_metro "1079310"
mayor_name dbpedia:Dave_Bronconnier
governing_body dbpedia:Calgary_City_Council

Extracting Infobox Data (RDF Representation)
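The RDF representation of the Calgary infobox can be mimicked with plain (subject, predicate, object) tuples. This is only a sketch of the data model, not an RDF library: the `dbpedia:` prefix on `mayor_name` and `governing_body` is assumed here for uniformity, and the `match` helper is a hypothetical stand-in for a real triple-pattern query.

```python
CALGARY = "http://dbpedia.org/resource/Calgary"

# (subject, predicate, object) triples mirroring the Calgary example.
triples = [
    (CALGARY, "dbpedia:native_name", "Calgary"),
    (CALGARY, "dbpedia:altitude", "1048"),
    (CALGARY, "dbpedia:population_city", "988193"),
    (CALGARY, "dbpedia:population_metro", "1079310"),
    (CALGARY, "dbpedia:mayor_name", "dbpedia:Dave_Bronconnier"),
    (CALGARY, "dbpedia:governing_body", "dbpedia:Calgary_City_Council"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every component that is not None."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


print(match(triples, predicate="dbpedia:altitude"))
```

Leaving a component as `None` plays the role of a variable in a query pattern.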

SPARQL

• SPARQL is a query language for RDF
• RDF is a directed, labeled graph data format for representing information in the Web
• This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  - Give me all sitcoms that are set in NYC
  - All tennis players from Moscow
  - All films by Quentin Tarantino
  - All German musicians that were born in Berlin in the 19th century

The DBpedia SPARQL Endpoint
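As an illustration, the "all films by Quentin Tarantino" question might be phrased as below. The sketch only builds the request URL and sends nothing. The `dbo:director` property and the `dbo:`/`dbr:` prefixes are assumptions based on the present-day DBpedia vocabulary, which may differ from the properties available when the slides were written; the `format` parameter is assumed to be accepted by the Virtuoso endpoint, and `build_request_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:director links a film to its director.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
LIMIT 20
"""


def build_request_url(query, endpoint=ENDPOINT):
    """Encode a SPARQL query as an HTTP GET URL for the endpoint."""
    params = {"query": query.strip(), "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


print(build_request_url(QUERY))
```

Fetching the URL with any HTTP client should return the variable bindings for `?film`, provided the endpoint and vocabulary match these assumptions.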

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  - Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  - Open Mind Commonsense (Singh) (collecting facts)
  - Semantic Wikis

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

www.phrasedetectives.com

• Open Mind Common Sense - Singh
• Crater mapping (results) - Kanefsky
• Learner, Learner2, 1001 Paraphrases - Chklovski
• FACTory - CyCORP
• Hot or Not - 8 Days
• ESP, Phetch, Verbosity, Peekaboom - von Ahn
• Galaxy Zoo - Oxford University

WEB COLLABORATION PROJECTS

OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52,000 assertions)
  IsA (IsA "apple" "fruit")
  PartOf (PartOf "CPU" "computer")
  PropertyOf (PropertyOf "coffee" "wet")
  MadeOf (MadeOf "bread" "flour")
  DefinedAs (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
  PrerequisiteEventOf (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf (SubeventOf "play sport" "score goal")
  FirstSubeventOf (FirstSubeventOf "start fire" "light match")
  LastSubeventOf (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
  CapableOf (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
  LocationOf (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
  EffectOf (EffectOf "view video" "entertainment")
  DesirousEffectOf (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf (DesireOf "person" "not be depressed")
  MotivationOf (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
  IsUsedFor (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo (ConceptuallyRelatedTo "bad breath" "mint")

CONCEPT NET

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  - ESP
  - Verbosity
  - TagATune
• Other games
  - Peekaboom
  - Phetch

ESP

• The first GWAP, developed by von Ahn and their group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  - To train image search engines
  - To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web

ESP the game

ESP THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  - Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
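The matching rule can be sketched as a search for the first guess the two players have in common. This is a simplification: the real game is played live with guesses arriving over time, and the lowercasing and whitespace trimming used here are assumptions rather than documented behaviour. The function name is hypothetical.

```python
def first_agreement(guesses_a, guesses_b):
    """Return the first label typed by player A that player B also typed, else None."""
    seen_b = {guess.strip().lower() for guess in guesses_b}
    for guess in guesses_a:
        label = guess.strip().lower()
        if label in seen_b:
            return label  # the agreed string becomes a label for the image
    return None


print(first_agreement(["dog", "Grass", "ball"], ["puppy", "ball", "grass"]))  # grass
```

Agreement between two independent players who cannot communicate is what makes the resulting label likely to describe the image.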

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th, 2003 and December 10th, 2003
  - 13,630 players
  - 1.2 million labels for 293,760 images
  - 80% of players played more than once
• By 2008
  - 200,000 players
  - 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  - choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  - 15 participants, 20 images among the 1000 with more than 5 labels
  - 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  - 15 participants, 20 images
  - 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

• 2 tasks
  - Find The Culprit (Annotation): User must identify the closest antecedent of a markable if it is anaphoric
  - Detectives Conference (Validation): User must agree/disagree with a coreference relation entered by another user

PHRASE DETECTIVES THE TASKS

NAME THE CULPRIT

READINGS

• Mihalcea, R. and Csomai, A. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using Encyclopedic Knowledge. Proceedings of 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012.

READINGS

• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.

  • 807 - TEXT ANALYTICS
  • WIKIPEDIA
  • Slide 3
  • Slide 4
  • WIKIPEDIA FOR TEXT ANALYTICS
  • Wikipedia as Thesaurus for text classification clustering
  • Wikipedia as Thesaurus
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • The concept Artificial Intelligence belongs to four categories: Artificial intelligence, Cybernetics, Formal sciences & Technology in society
  • Slide 13
  • The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
  • WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
  • Using Wikipedia Categories for text classification
  • WIKIPEDIA FOR TEXT CLASSIFICATION
  • USING WIKIPEDIA FOR TEXT CLASSIFICATION
  • TEXT WIKIFICATION
  • WIKIFICATION
  • Wikification pipeline
  • Keyword Extraction
  • Keyword Extraction using Wikipedia
  • Slide 24
  • Slide 25
  • KEYPHRASENESS
  • COMMONNESS
  • Slide 28
  • The Wikipedia Dump from July 2011
  • Some statistics (all Wikidumps from July 2011)
  • Surface forms titles articles
  • Slide 32
  • Slide 33
  • Training a supervised wikifier
  • Results on standard datasets
  • Wikifying queries the Bridgeman datasets
  • Results on Bridgeman 1000 Y3
  • Results for the GALATEAS languages and Arabic
  • The GALATEAS D2W web services
  • Use of the service in LangLog
  • Other applications
  • WIKIPEDIA FOR NER
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • The Supervised Approach Using Wikipedia links to generate training data
  • MORE ADVANCED USES OF WIKIPEDIA
  • SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
  • Wikipedia category network
  • Deriving a taxonomy from Wikipedia (AAAI 2007)
  • Slide 53
  • INFOBOXES
  • Slide 56
  • Slide 57
  • Slide 58
  • SPARQL
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • OPEN MIND COMMONSENSE
  • Slide 65
  • CONCEPT NET
  • GAMES WITH A PURPOSE
  • GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
  • EXAMPLES OF GWAP
  • ESP
  • ESP the game
  • ESP THE GAME
  • THE TASK
  • SCORING BY MATCHING
  • SOME STATISTICS
  • QUALITY OF THE LABELS
  • GOOGLE IMAGE LABELLER
  • Slide 78
  • RESULTS
  • PHRASE DETECTIVES
  • Slide 81
  • NAME THE CULPRIT
  • READINGS
  • Slide 84
Page 28: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

Extracting dictionaries and statistics from a Wikipedia dump

bull Parsingbull In three phases

bull Identify articles of relevancebull Extract (among other things)

bull Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)

bull Set of LINKS [article|surface_form]

bull [[Pedanius Dioscorides|Dioscorides]]

The Wikipedia Dump from July 2011

ndash 11459639 pages in totalndash 12525583 links

bull specifying surface word target frequency

ndash ranked by frequency bull for example the mention Georgia is linked to

ndash University_of_Georgia 166 times ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times

May 2012 29Truc-Vien T Nguyen

Some statistics (all Wikidumps from July 2011)

Page Type English Italian Polish

Redirected 4465652 323591 134148

List_of 138581 836 5021

Disambiguation 176721 6193 4553

Relevant 4361020 917354 920486

Total 11459639 1654258 1200313

Surface forms titles articles

Dictionary English Italian Polish

Titles 4361020 917354 920486

Surface forms 8829624 2484045 2482104

Files 745724 72126 na

Links 10871741 2917235 2937981

Files in Polish are arranged in a repository different from EnglishItalian

Some definitions and figuressurface form the occurence of a mention inside an articletarget article the target Wiki article a surface form linked to

The Unsupervised Approach

bull Use Keyphraseness to identify candidate termsbull Retain terms whose keyphraseness is

above a certain threshold (currently 001)bull Use commonness to rank

bull Retain top 10

The Supervised Approach

bull Features in addition to commonness use measures of SIMILARITY between text containing the term and the candidate Wikipedia page

bull RELATEDNESS a measure of similarity between the LINKS (cfr MilneampWittenrsquos NORMALIZED LINK DISTANCE)

euro

Re latedness(a1a2) =log(max( A1 A2 )) minus log( A1 cap A2 ))

log(W ) minus log(min( A1 A2 ))

Training a supervised wikifier

bull Using WIKIPEDIA ITSELF as source of training materials (see next)

Results on standard datasets

APPROACH AQUAINT WIKIPEDIA

Our approach 8566 8437

MilneampWitten 2008 8361 8031

Ratinov et al 2011 8452 9020

bull BAL Data setsndash 1049 Query set

bull 1 annotator up to 3 manual annotationsbull 1 automatic annotation

ndash 100 Query setbull 3 annotators each up to 3 manual annotations

Wikifying queries the Bridgeman datasets

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS RESULTS

First candidate 6477

Among first 2 7159

First 3 7542

First 4 7718

First 5 7832

Accuracy up by 17 points (36)

Results for the GALATEAS languages and Arabic

LANGUAGE WIKIPEDIA SIZE RESULTS (on Wikipedia subset)

English 4M articles 8437

Italian 1M 7964

French 14M 76-77

German 16M 72-73

Dutch 16M 70-71

Polish 900K 6081

Arabic 200K 8078

The GALATEAS D2W web services

bull Available as open sourcebull Deployed within LinguaGridbull API based on the Morphosyntactic Annotation

Framework (MAF) an ISO standardbull Tested on 15M queries achieves throughput

of 600 characters per secondbull Integrated with LangLog tool

Use of the service in LangLog

(See Domoinarsquos demo)

Other applications

bull The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [ATampT] By a 4-0 vote it allowed ATampT to continue offering special discount packages to big customers called Tariff 12 rejecting appeals by ATampT competitors that the discounts were illegal hellip

WIKIPEDIA FOR NER

httpenwikipediaorgwikiFCC

The Federal Communications Commission (FCC) is an independent United States government agency created directed and empowered by Congressional statute (see 47 USC sect 151 and 47 USC sect 154)

WIKIPEDIA FOR NER

Numberofglucocorticoidreceptorsinlymphocytesandtheirsensitivitytohormoneaction

WIKIPEDIA

WIKIPEDIA FOR NER

bull Wikipedia has been used in NER systemsndash As a source of features for normal NER ndash To automatically create training materials

(DISTANT LEARNING)ndash To go beyond NE tagging towards proper ENTITY

DISAMBIGUATION

Distant learning

bull Automatically extract examples

bull positive examples from mention-to-link Wikipedia page

bull Negative examples from similar mentions with other links

bull Use positive and negative examples to train model

The Supervised Approach Using Wikipedia links to generate training data

bull Examplendash Giotto was called to work in Padua and also in Rimini (sentence taken from Wikipedia text with links avalable)ndash Giotto_di_Bondone (painter) Giotto_Griffiths (Welsh rugby player)

Giotto_Bizzarrini (automobile engineer)

bull Datasetndash +1 Giotto was called to work -- Giotto_di_Bondonendash -1 Giotto was called to work -- Giotto_Griffithsndash -1 Giotto was called to work -- Giotto_Bizzarrini

May 2012 48Truc-Vien T Nguyen

httpenwikipediaorgwikiGiotto_di_Bondone

MORE ADVANCED USES OF WIKIPEDIA

bull As a source of ONTOLOGICAL KNOWLEDGEbull DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

bull Taxonomic information category structurebull Attributes infobox text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

bull Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

bull Induce a subsumption hierarchy

INFOBOXES

bull Collaborative content

bull Semi-structured data

Infobox Writer| bgcolour = silver| name = Edgar Allan Poe| image = Edgar_Allan_Poe_2jpg| caption = This [[daguerreotype]] of Poe was taken in 1848 | birth_date = birth date|1809|1|19|mf=y| birth_place = [[Boston Massachusetts]] [[United States|US]]| death_date = death date and age|1849|10|07|1809|01|19| death_place = [[Baltimore Maryland]] [[United States|US]]| occupation = Poet short story writer editor literary critic| movement = [[Romanticism]] [[Dark romanticism]]| genre = [[Horror fiction]] [[Crime fiction]] [[Detective fiction]]| magnum_opus = The Raven| spouse = [[Virginia Eliza Clemm Poe]]

DBpediaorg is a effort to bull extract structured information from Wikipediabull make this information available on the Web under an

open licensebull interlink the DBpedia dataset with other datasets on the

Web

DBPEDIA

• 1,600,000 concepts
• including
  - 58,000 persons
  - 70,000 places
  - 35,000 music albums
  - 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories

The DBpedia Dataset

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. The Developers Guide to Semantic Web Toolkits lists development toolkits, for most programming languages, that can process DBpedia data.

REPRESENTING EXTRACTED INFORMATION

http://en.wikipedia.org/wiki/Calgary

http://dbpedia.org/resource/Calgary

dbpedia:native_name "Calgary"
dbpedia:altitude "1048"
dbpedia:population_city "988193"
dbpedia:population_metro "1079310"
mayor_name dbpedia:Dave_Bronconnier
governing_body dbpedia:Calgary_City_Council

Extracting Infobox Data (RDF Representation)
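
The Calgary example can be read as a set of subject-predicate-object triples. The sketch below holds them as plain Python tuples to show the shape of the data; it does not use an RDF library, and the prefixed names are copied from the slide rather than checked against the current DBpedia vocabulary.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

CALGARY = "http://dbpedia.org/resource/Calgary"

# Literal values are kept as strings, as on the slide;
# the last two objects are other resources, not literals.
TRIPLES: List[Triple] = [
    (CALGARY, "dbpedia:native_name", "Calgary"),
    (CALGARY, "dbpedia:altitude", "1048"),
    (CALGARY, "dbpedia:population_city", "988193"),
    (CALGARY, "dbpedia:population_metro", "1079310"),
    (CALGARY, "mayor_name", "dbpedia:Dave_Bronconnier"),
    (CALGARY, "governing_body", "dbpedia:Calgary_City_Council"),
]


def objects(triples: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return all objects for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


print(objects(TRIPLES, CALGARY, "dbpedia:population_city"))
```

Because every fact has the same three-part shape, facts extracted from different infoboxes can be merged into one graph and queried uniformly.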

SPARQL

• SPARQL is a query language for RDF

• RDF is a directed, labeled graph data format for representing information in the Web
• This specification defines the syntax and semantics of the SPARQL query language for RDF

• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  - Give me all sitcoms that are set in NYC
  - All tennis players from Moscow
  - All films by Quentin Tarantino
  - All German musicians that were born in Berlin in the 19th century

The DBpedia SPARQL Endpoint
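
As an illustration, the query "all tennis players from Moscow" might be written roughly as below and sent to the endpoint as an HTTP GET request. The class and property names (`dbo:TennisPlayer`, `dbo:birthPlace`), the `format` parameter, and the helper function are assumptions for this sketch; check them against the live endpoint before relying on them. The code only builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:TennisPlayer and dbo:birthPlace.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?player WHERE {
  ?player a dbo:TennisPlayer ;
          dbo:birthPlace dbr:Moscow .
}
LIMIT 20
"""


def build_request_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query as a GET request URL.

    `query` is the standard SPARQL protocol parameter; `format` is an
    assumption about what the server accepts for choosing JSON results.
    """
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL can be fetched with any HTTP client; the response lists one binding of `?player` per matching resource.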

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  - Other initiatives: Citizen Science, Cognition and Language Laboratory, ...
• This has been taken advantage of in AI
  - Open Mind Commonsense (Singh) (collecting facts)
  - Semantic Wikis

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

www.phrasedetectives.com

• Open Mind Common Sense - Singh
• Crater mapping (results) - Kanefsky
• Learner, Learner2, 1001 Paraphrases - Chklovski
• FACTory - CyCORP
• Hot or Not - 8 Days
• ESP, Phetch, Verbosity, Peekaboom - von Ahn
• Galaxy Zoo - Oxford University

WEB COLLABORATION PROJECTS

www.phrasedetectives.com

OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh, 2004)

THINGS (52,000 assertions)

IsA (IsA "apple" "fruit"), PartOf (PartOf "CPU" "computer"), PropertyOf (PropertyOf "coffee" "wet"), MadeOf (MadeOf "bread" "flour"), DefinedAs (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)

PrerequisiteEventOf (PrerequisiteEventOf "read letter" "open envelope"), SubeventOf (SubeventOf "play sport" "score goal"), FirstSubeventOf (FirstSubeventOf "start fire" "light match"), LastSubeventOf (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)

CapableOf (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)

LocationOf (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)

EffectOf (EffectOf "view video" "entertainment"), DesirousEffectOf (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)

DesireOf (DesireOf "person" "not be depressed"), MotivationOf (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)

UsedFor (UsedFor "fireplace" "burn wood"), CapableOfReceivingAction (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)

SuperThematicKLine (SuperThematicKLine "western civilization" "civilization"), ThematicKLine (ThematicKLine "wedding dress" "veil"), ConceptuallyRelatedTo (ConceptuallyRelatedTo "bad breath" "mint")

CONCEPT NET
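
A ConceptNet-style assertion is a relation name plus two concept strings, so a small collection of assertions can be held and indexed as below. This is a sketch of the data shape only; the class and function names are invented for the example and do not reflect the ConceptNet toolkit's own API.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Assertion(NamedTuple):
    relation: str
    left: str
    right: str


# A few assertions copied from the relation table above.
ASSERTIONS = [
    Assertion("IsA", "apple", "fruit"),
    Assertion("PartOf", "CPU", "computer"),
    Assertion("MadeOf", "bread", "flour"),
    Assertion("CapableOf", "dentist", "pull tooth"),
    Assertion("EffectOf", "view video", "entertainment"),
]


def index_by_concept(assertions: List[Assertion]) -> Dict[str, List[Assertion]]:
    """Map every concept to the assertions it takes part in (either side)."""
    index: Dict[str, List[Assertion]] = defaultdict(list)
    for assertion in assertions:
        index[assertion.left].append(assertion)
        index[assertion.right].append(assertion)
    return index


concept_index = index_by_concept(ASSERTIONS)
print(concept_index["apple"])
```

Indexing by concept turns the flat list of assertions into a graph in which concepts are nodes and relations are labelled edges, which is the sense in which ConceptNet is a semantic network.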

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions

• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  - ESP
  - Verbosity
  - TagATune

• Other games
  - Peekaboom
  - Phetch

ESP

• The first GWAP, developed by von Ahn and his group (2003, 2004)

• The problem: obtain accurate descriptions of images, to be used
  - To train image search engines
  - To develop machine learning approaches to vision

• The goal: label the majority of the images on the Web

ESP the game

ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  - Hence, the ESP game

• If any of the strings typed by one player matches the string typed by the other player, they score points
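
The matching rule can be sketched as follows. This is a simplification that ignores timing and the turn-by-turn flow of the real game; the lowercasing and whitespace trimming are assumptions, since the slides do not say how strings are normalized.

```python
from typing import Iterable, Optional


def first_agreed_label(guesses_a: Iterable[str], guesses_b: Iterable[str]) -> Optional[str]:
    """Return the first guess of player A that player B also typed, if any.

    Guesses are compared after trimming whitespace and lowercasing
    (an assumption made for this sketch).
    """
    typed_by_b = {guess.strip().lower() for guess in guesses_b}
    for guess in guesses_a:
        normalized = guess.strip().lower()
        if normalized in typed_by_b:
            return normalized
    return None


label = first_agreed_label(["dog", "Grass ", "ball"], ["park", "grass", "frisbee"])
print(label)
```

The agreed string becomes a label for the image: two players who cannot communicate are unlikely to type the same word unless it really describes what they both see.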

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th 2003 and December 10th 2003
  - 13,630 players
  - 1.2 million labels for 293,760 images
  - 80% of players played more than once

• By 2008
  - 200,000 players
  - 50 million labels

QUALITY OF THE LABELS
• For IMAGE SEARCH
  - choose 10 labels among those produced and look at which images are returned

• Compare labels produced by players with labels produced by participants in an experiment
  - 15 participants, 20 images among the 1,000 with more than 5 labels
  - 83% of game labels also produced by participants

• Manual assessment of labels ('would you use these labels to describe this image?')
  - 15 participants, 20 images
  - 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

• 2 tasks
  - Find The Culprit (Annotation): User must identify the closest antecedent of a markable if it is anaphoric
  - Detectives Conference (Validation): User must agree/disagree with a coreference relation entered by another user

www.phrasedetectives.com

PHRASE DETECTIVES THE TASKS

NAME THE CULPRIT

READINGS

• Mihalcea, R. and Csomai, A. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.

• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.

• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.

• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012.

READINGS

• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.

• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.

  • 807 - TEXT ANALYTICS
  • WIKIPEDIA
  • Slide 3
  • Slide 4
  • WIKIPEDIA FOR TEXT ANALYTICS
  • Wikipedia as Thesaurus for text classification clustering
  • Wikipedia as Thesaurus
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
  • Slide 13
  • The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
  • WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
  • Using Wikipedia Categories for text classification
  • WIKIPEDIA FOR TEXT CLASSIFICATION
  • USING WIKIPEDIA FOR TEXT CLASSIFICATION
  • TEXT WIKIFICATION
  • WIKIFICATION
  • Wikification pipeline
  • Keyword Extraction
  • Keyword Extraction using Wikipedia
  • Slide 24
  • Slide 25
  • KEYPHRASENESS
  • COMMONNESS
  • Slide 28
  • The Wikipedia Dump from July 2011
  • Some statistics (all Wikidumps from July 2011)
  • Surface forms titles articles
  • Slide 32
  • Slide 33
  • Training a supervised wikifier
  • Results on standard datasets
  • Wikifying queries the Bridgeman datasets
  • Results on Bridgeman 1000 Y3
  • Results for the GALATEAS languages and Arabic
  • The GALATEAS D2W web services
  • Use of the service in LangLog
  • Other applications
  • WIKIPEDIA FOR NER
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • The Supervised Approach Using Wikipedia links to generate training data
  • MORE ADVANCED USES OF WIKIPEDIA
  • SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
  • Wikipedia category network
  • Deriving a taxonomy from Wikipedia (AAAI 2007)
  • Slide 53
  • INFOBOXES
  • Slide 56
  • Slide 57
  • Slide 58
  • SPARQL
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • OPEN MIND COMMONSENSE
  • Slide 65
  • CONCEPT NET
  • GAMES WITH A PURPOSE
  • GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
  • EXAMPLES OF GWAP
  • ESP
  • ESP the game
  • ESP THE GAME
  • THE TASK
  • SCORING BY MATCHING
  • SOME STATISTICS
  • QUALITY OF THE LABELS
  • GOOGLE IMAGE LABELLER
  • Slide 78
  • RESULTS
  • PHRASE DETECTIVES
  • Slide 81
  • NAME THE CULPRIT
  • READINGS
  • Slide 84
Page 29: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

The Wikipedia Dump from July 2011

- 11,459,639 pages in total
- 12,525,583 links
  • specifying surface word, target, frequency
- ranked by frequency
  • for example, the mention Georgia is linked to
    - University_of_Georgia 166 times
    - Republic_of_Georgia 18 times
    - Georgia_(United_States) 5 times
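
Link frequencies like these are the raw material for commonness, i.e. the share of a surface form's links that point to each target. A minimal sketch follows, assuming the three targets listed are the only ones for "Georgia" (the slide may show only a sample, so the resulting proportions are illustrative).

```python
from typing import Dict

# Link counts for the surface form "Georgia", as given above.
GEORGIA_LINKS: Dict[str, int] = {
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
}


def commonness(link_counts: Dict[str, int]) -> Dict[str, float]:
    """Return, for each target, its share of all links from one surface form."""
    total = sum(link_counts.values())
    if total == 0:
        return {}
    return {target: count / total for target, count in link_counts.items()}


georgia_commonness = commonness(GEORGIA_LINKS)
for target, score in sorted(georgia_commonness.items(), key=lambda kv: -kv[1]):
    print(f"{target}: {score:.3f}")
```

With these counts, University_of_Georgia receives 166 of the 189 links, so a system that always picks the most common target would choose it regardless of context.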


Some statistics (all Wikidumps from July 2011)

Page Type        English      Italian     Polish
Redirected       4,465,652    323,591     134,148
List_of          138,581      836         5,021
Disambiguation   176,721      6,193       4,553
Relevant         4,361,020    917,354     920,486
Total            11,459,639   1,654,258   1,200,313

Surface forms, titles, articles

Dictionary       English      Italian     Polish
Titles           4,361,020    917,354     920,486
Surface forms    8,829,624    2,484,045   2,482,104
Files            745,724      72,126      n/a
Links            10,871,741   2,917,235   2,937,981

Files in Polish are arranged in a repository different from English/Italian

Some definitions and figures
surface form: the occurrence of a mention inside an article
target article: the target Wiki article a surface form is linked to

The Unsupervised Approach

• Use keyphraseness to identify candidate terms
• Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
• Retain top 10
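
The four steps can be sketched as below. The sketch takes keyphraseness to be the fraction of a term's occurrences in which it appears as a link, and reads "retain top 10" as keeping the ten most common targets per term; both are assumptions about the intended pipeline, and the occurrence counts in the toy data are invented.

```python
from typing import Dict, List, Tuple


def keyphraseness(times_linked: int, times_seen: int) -> float:
    """Fraction of a term's occurrences in which it appears as a link."""
    return times_linked / times_seen if times_seen else 0.0


def wikify_unsupervised(
    term_counts: Dict[str, Tuple[int, int]],
    link_counts: Dict[str, Dict[str, int]],
    threshold: float = 0.01,
    top_n: int = 10,
) -> Dict[str, List[str]]:
    """Keep terms above the keyphraseness threshold, rank their targets.

    Ranking by raw link count gives the same order as ranking by
    commonness, since all targets of a term share the same denominator.
    """
    selected: Dict[str, List[str]] = {}
    for term, (times_linked, times_seen) in term_counts.items():
        if keyphraseness(times_linked, times_seen) <= threshold:
            continue
        targets = link_counts.get(term, {})
        ranked = sorted(targets, key=targets.get, reverse=True)
        selected[term] = ranked[:top_n]
    return selected


# Toy data: (times linked, times seen) per term; the numbers are invented.
TERM_COUNTS = {"Georgia": (120, 1000), "the": (3, 100000)}
LINK_COUNTS = {
    "Georgia": {
        "Republic_of_Georgia": 18,
        "University_of_Georgia": 166,
        "Georgia_(United_States)": 5,
    },
    "the": {"The": 3},
}

selected = wikify_unsupervised(TERM_COUNTS, LINK_COUNTS)
print(selected)
```

No training is involved: both statistics are read directly off the link structure of the Wikipedia dump.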

The Supervised Approach

• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page

• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)

$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$
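
A small Python sketch of this measure follows, assuming (as in Milne & Witten's formulation) that A1 and A2 are the sets of articles linking to a1 and a2 and that W is the set of all Wikipedia articles. The function name and the handling of edge cases are choices made for this example. Note that the quantity as written behaves like a distance: it is 0 when the two link sets coincide and grows as they share fewer links, and some implementations report 1 minus this value instead.

```python
import math
from typing import Set


def normalized_link_distance(links_to_a: Set[str], links_to_b: Set[str], total_articles: int) -> float:
    """Compute the link-based measure from the formula above.

    links_to_a / links_to_b: articles that link to each of the two pages.
    total_articles: number of articles in Wikipedia (|W|).
    Returns math.inf when the two pages share no incoming links.
    """
    overlap = len(links_to_a & links_to_b)
    if overlap == 0:
        return math.inf
    larger = max(len(links_to_a), len(links_to_b))
    smaller = min(len(links_to_a), len(links_to_b))
    if total_articles <= smaller:
        raise ValueError("total_articles must exceed the size of both link sets")
    return (math.log(larger) - math.log(overlap)) / (
        math.log(total_articles) - math.log(smaller)
    )


# Toy example: 4 and 2 incoming links, 2 shared, 16 articles overall.
distance = normalized_link_distance({"p", "q", "r", "s"}, {"r", "s"}, 16)
print(distance)
```

In the toy example the numerator is log 4 - log 2 = log 2 and the denominator is log 16 - log 2 = log 8, so the value is 1/3.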

Training a supervised wikifier

• Using WIKIPEDIA ITSELF as a source of training materials (see next)

Results on standard datasets

APPROACH              AQUAINT   WIKIPEDIA
Our approach          85.66     84.37
Milne & Witten 2008   83.61     80.31
Ratinov et al. 2011   84.52     90.20

• BAL data sets
  - 1049 Query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  - 100 Query set
    • 3 annotators, each up to 3 manual annotations

Wikifying queries: the Bridgeman datasets

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS   RESULTS
First candidate        64.77
Among first 2          71.59
First 3                75.42
First 4                77.18
First 5                78.32

Accuracy up by 17 points (36%)

Results for the GALATEAS languages and Arabic

LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78

The GALATEAS D2W web services

• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries: achieves throughput of 600 characters per second
• Integrated with LangLog tool

Use of the service in LangLog

(See Domoina's demo)

Other applications

• The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. ...

WIKIPEDIA FOR NER

http://en.wikipedia.org/wiki/FCC

The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

WIKIPEDIA FOR NER

Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action

WIKIPEDIA

WIKIPEDIA FOR NER

• Wikipedia has been used in NER systems
  - As a source of features for normal NER
  - To automatically create training materials (DISTANT LEARNING)
  - To go beyond NE tagging towards proper ENTITY DISAMBIGUATION

Distant learning

• Automatically extract examples

• Positive examples from mention-to-link Wikipedia page

• Negative examples from similar mentions with other links

• Use positive and negative examples to train model


Page 30: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

Some statistics (all Wikidumps from July 2011)

Page Type English Italian Polish

Redirected 4465652 323591 134148

List_of 138581 836 5021

Disambiguation 176721 6193 4553

Relevant 4361020 917354 920486

Total 11459639 1654258 1200313

Surface forms titles articles

Dictionary English Italian Polish

Titles 4361020 917354 920486

Surface forms 8829624 2484045 2482104

Files 745724 72126 na

Links 10871741 2917235 2937981

Files in Polish are arranged in a repository different from EnglishItalian

Some definitions and figuressurface form the occurence of a mention inside an articletarget article the target Wiki article a surface form linked to

The Unsupervised Approach

bull Use Keyphraseness to identify candidate termsbull Retain terms whose keyphraseness is

above a certain threshold (currently 001)bull Use commonness to rank

bull Retain top 10

The Supervised Approach

bull Features in addition to commonness use measures of SIMILARITY between text containing the term and the candidate Wikipedia page

bull RELATEDNESS a measure of similarity between the LINKS (cfr MilneampWittenrsquos NORMALIZED LINK DISTANCE)

euro

Re latedness(a1a2) =log(max( A1 A2 )) minus log( A1 cap A2 ))

log(W ) minus log(min( A1 A2 ))

Training a supervised wikifier

bull Using WIKIPEDIA ITSELF as source of training materials (see next)

Results on standard datasets

APPROACH AQUAINT WIKIPEDIA

Our approach 8566 8437

MilneampWitten 2008 8361 8031

Ratinov et al 2011 8452 9020

bull BAL Data setsndash 1049 Query set

bull 1 annotator up to 3 manual annotationsbull 1 automatic annotation

ndash 100 Query setbull 3 annotators each up to 3 manual annotations

Wikifying queries the Bridgeman datasets

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS RESULTS

First candidate 6477

Among first 2 7159

First 3 7542

First 4 7718

First 5 7832

Accuracy up by 17 points (36)

Results for the GALATEAS languages and Arabic

LANGUAGE WIKIPEDIA SIZE RESULTS (on Wikipedia subset)

English 4M articles 8437

Italian 1M 7964

French 14M 76-77

German 16M 72-73

Dutch 16M 70-71

Polish 900K 6081

Arabic 200K 8078

The GALATEAS D2W web services

bull Available as open sourcebull Deployed within LinguaGridbull API based on the Morphosyntactic Annotation

Framework (MAF) an ISO standardbull Tested on 15M queries achieves throughput

of 600 characters per secondbull Integrated with LangLog tool

Use of the service in LangLog

(See Domoinarsquos demo)

Other applications

bull The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [ATampT] By a 4-0 vote it allowed ATampT to continue offering special discount packages to big customers called Tariff 12 rejecting appeals by ATampT competitors that the discounts were illegal hellip

WIKIPEDIA FOR NER

httpenwikipediaorgwikiFCC

The Federal Communications Commission (FCC) is an independent United States government agency created directed and empowered by Congressional statute (see 47 USC sect 151 and 47 USC sect 154)

WIKIPEDIA FOR NER

Numberofglucocorticoidreceptorsinlymphocytesandtheirsensitivitytohormoneaction

WIKIPEDIA

WIKIPEDIA FOR NER

bull Wikipedia has been used in NER systemsndash As a source of features for normal NER ndash To automatically create training materials

(DISTANT LEARNING)ndash To go beyond NE tagging towards proper ENTITY

DISAMBIGUATION

Distant learning

bull Automatically extract examples

bull positive examples from mention-to-link Wikipedia page

bull Negative examples from similar mentions with other links

bull Use positive and negative examples to train model

The Supervised Approach Using Wikipedia links to generate training data

bull Examplendash Giotto was called to work in Padua and also in Rimini (sentence taken from Wikipedia text with links avalable)ndash Giotto_di_Bondone (painter) Giotto_Griffiths (Welsh rugby player)

Giotto_Bizzarrini (automobile engineer)

bull Datasetndash +1 Giotto was called to work -- Giotto_di_Bondonendash -1 Giotto was called to work -- Giotto_Griffithsndash -1 Giotto was called to work -- Giotto_Bizzarrini

May 2012 48Truc-Vien T Nguyen

httpenwikipediaorgwikiGiotto_di_Bondone

MORE ADVANCED USES OF WIKIPEDIA

bull As a source of ONTOLOGICAL KNOWLEDGEbull DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

bull Taxonomic information category structurebull Attributes infobox text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

bull Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

bull Induce a subsumption hierarchy

INFOBOXES

bull Collaborative content

bull Semi-structured data

Infobox Writer| bgcolour = silver| name = Edgar Allan Poe| image = Edgar_Allan_Poe_2jpg| caption = This [[daguerreotype]] of Poe was taken in 1848 | birth_date = birth date|1809|1|19|mf=y| birth_place = [[Boston Massachusetts]] [[United States|US]]| death_date = death date and age|1849|10|07|1809|01|19| death_place = [[Baltimore Maryland]] [[United States|US]]| occupation = Poet short story writer editor literary critic| movement = [[Romanticism]] [[Dark romanticism]]| genre = [[Horror fiction]] [[Crime fiction]] [[Detective fiction]]| magnum_opus = The Raven| spouse = [[Virginia Eliza Clemm Poe]]

DBpediaorg is a effort to bull extract structured information from Wikipediabull make this information available on the Web under an

open licensebull interlink the DBpedia dataset with other datasets on the

Web

DBPEDIA

1048607 1600000 concepts

1048607 including

1048698 58000 persons

1048698 70000 places

1048698 35000 music albums

1048698 12000 films

1048607 described by 91 million triples

1048607 using 8141 different properties

1048607 557000 links to pictures

1048607 1300000 links external web pages

1048607 207000 Wikipedia categories

1048607 75000 YAGO categories

The DBpedia Dataset

The DBpediaorg project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web It uses the SPARQL query language to query this data At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data

REPRESENTING EXTRACTED INFORMATION

httpenwikipediaorgwikiCalgary

httpdbpediaorgresourceCalgary

dbpedianative_name Calgaryrdquo

dbpediaaltitude ldquo1048rdquo

dbpediapopulation_city ldquo988193rdquo

dbpediapopulation_metro ldquo1079310rdquo

mayor_name

dbpediaDave_Bronconnier

governing_body

dbpediaCalgary_City_Council

Extracting Infobox Data (RDF Representation)

SPARQL

bull SPARQL is a query language for RDF

bullRDF is a directed labeled graph data format for representing information in the Web bullThis specification defines the syntax and semantics of the SPARQL query language for RDF

bull SPARQL can be used to express queries across diverse data sources whether the data is stored natively as RDF or viewed as RDF via middleware

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century

The DBpedia SPARQL Endpoint
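As an illustration, the "all films by Quentin Tarantino" question might be phrased as below. This is a sketch under assumptions: the `dbo:director` property and the `dbr:` resource prefix come from the current DBpedia ontology and may not match the dataset version described in the lecture, and the `format` parameter is a convention of the Virtuoso endpoint. The code only builds the request URL; it does not contact the server.

```python
from urllib.parse import urlencode, urlparse, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:director links a film to its director.
QUERY = """\
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
"""


def build_request_url(endpoint, query):
    """Encode a SPARQL query as an HTTP GET request URL."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

Fetching that URL (for example with `urllib.request.urlopen`) should return the variable bindings for `?film`, provided the endpoint is reachable and the vocabulary assumption holds.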

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

www.phrasedetectives.com

• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – CyCORP
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University

WEB COLLABORATION PROJECTS


OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh, 2004)

THINGS (52,000 assertions)
  IsA: (IsA "apple" "fruit")
  PartOf: (PartOf "CPU" "computer")
  PropertyOf: (PropertyOf "coffee" "wet")
  MadeOf: (MadeOf "bread" "flour")
  DefinedAs: (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
  PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf: (SubeventOf "play sport" "score goal")
  FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
  CapableOf: (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
  LocationOf: (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
  EffectOf: (EffectOf "view video" "entertainment")
  DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf: (DesireOf "person" "not be depressed")
  MotivationOf: (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
  UsedFor: (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine: (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")

CONCEPT NET
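Each ConceptNet assertion is a relation between two short natural-language phrases, so a handful of them can be held as plain tuples and indexed by concept. A minimal sketch, assuming the two-argument reading of the examples in the table; `build_index` is a hypothetical helper, not part of any ConceptNet toolkit.

```python
from collections import defaultdict

# (relation, first concept, second concept), copied from the relation table.
ASSERTIONS = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("CapableOf", "dentist", "pull tooth"),
    ("UsedFor", "fireplace", "burn wood"),
    ("EffectOf", "view video", "entertainment"),
    ("ConceptuallyRelatedTo", "bad breath", "mint"),
]


def build_index(assertions):
    """Map each first concept to the (relation, second concept) pairs it has."""
    index = defaultdict(list)
    for relation, first, second in assertions:
        index[first].append((relation, second))
    return index


index = build_index(ASSERTIONS)
print(index["dentist"])
print(index["bad breath"])
```

Indexing by concept is what makes the collection usable as a network: starting from a phrase found in a text, one can follow typed edges to related concepts.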

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch

ESP

• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web

ESP the game

ESP: THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points

THE TASK

SCORING BY MATCHING
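The matching rule can be sketched in a few lines. This is an illustration only, not the game's actual implementation: the case and whitespace normalization and the `points_per_match` value are assumptions, and the real game may stop a round at the first agreement.

```python
def normalize(label):
    """Lower-case a guess and collapse whitespace before comparing."""
    return " ".join(label.lower().split())


def agreed_labels(player_a, player_b):
    """Labels typed by both players, in the order player A typed them."""
    typed_by_b = {normalize(guess) for guess in player_b}
    agreed = []
    for guess in player_a:
        label = normalize(guess)
        if label in typed_by_b and label not in agreed:
            agreed.append(label)
    return agreed


def score(player_a, player_b, points_per_match=100):
    """Hypothetical scoring: a fixed number of points per agreed label."""
    return points_per_match * len(agreed_labels(player_a, player_b))


a = ["Dog", "grass", "ball"]
b = ["puppy", "ball", "dog "]
print(agreed_labels(a, b), score(a, b))
```

Agreement between two players who cannot communicate is what makes a label trustworthy: a string both typed independently is likely to describe the image rather than one player's whim.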

SOME STATISTICS

• In the 4 months between August 9th, 2003 and December 10th, 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once
• By 2008
  – 200,000 players
  – 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  – 15 participants, 20 images
  – 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

• 2 tasks
  – Find The Culprit (Annotation): the user must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): the user must agree or disagree with a coreference relation entered by another user


PHRASE DETECTIVES THE TASKS

NAME THE CULPRIT

READINGS

• R. Mihalcea and A. Csomai. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM '07, Lisbon, Portugal.
• V. Nguyen and M. Poesio (2012). Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti and M. Poesio (2013). GALATEAS D2W: a multi-lingual disambiguation to Wikipedia web service. Proc. of ENRICH.
• V. Nastase and M. Strube (2012). Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.

READINGS

• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, 51(8), 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo and Ducceschi (2013). Phrase Detectives: utilizing collective intelligence for internet-scale language resource creation. ACM Transactions on Interactive Intelligent Systems.

  • 807 - TEXT ANALYTICS
  • WIKIPEDIA
  • Slide 3
  • Slide 4
  • WIKIPEDIA FOR TEXT ANALYTICS
  • Wikipedia as Thesaurus for text classification clustering
  • Wikipedia as Thesaurus
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
  • Slide 13
  • The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
  • WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
  • Using Wikipedia Categories for text classification
  • WIKIPEDIA FOR TEXT CLASSIFICATION
  • USING WIKIPEDIA FOR TEXT CLASSIFICATION
  • TEXT WIKIFICATION
  • WIKIFICATION
  • Wikification pipeline
  • Keyword Extraction
  • Keyword Extraction using Wikipedia
  • Slide 24
  • Slide 25
  • KEYPHRASENESS
  • COMMONNESS
  • Slide 28
  • The Wikipedia Dump from July 2011
  • Some statistics (all Wikidumps from July 2011)
  • Surface forms titles articles
  • Slide 32
  • Slide 33
  • Training a supervised wikifier
  • Results on standard datasets
  • Wikifying queries the Bridgeman datasets
  • Results on Bridgeman 1000 Y3
  • Results for the GALATEAS languages and Arabic
  • The GALATEAS D2W web services
  • Use of the service in LangLog
  • Other applications
  • WIKIPEDIA FOR NER
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • The Supervised Approach Using Wikipedia links to generate training data
  • MORE ADVANCED USES OF WIKIPEDIA
  • SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
  • Wikipedia category network
  • Deriving a taxonomy from Wikipedia (AAAI 2007)
  • Slide 53
  • INFOBOXES
  • Slide 56
  • Slide 57
  • Slide 58
  • SPARQL
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • OPEN MIND COMMONSENSE
  • Slide 65
  • CONCEPT NET
  • GAMES WITH A PURPOSE
  • GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
  • EXAMPLES OF GWAP
  • ESP
  • ESP the game
  • ESP THE GAME
  • THE TASK
  • SCORING BY MATCHING
  • SOME STATISTICS
  • QUALITY OF THE LABELS
  • GOOGLE IMAGE LABELLER
  • Slide 78
  • RESULTS
  • PHRASE DETECTIVES
  • Slide 81
  • NAME THE CULPRIT
  • READINGS
  • Slide 84

Surface forms, titles, articles

Dictionary       English      Italian     Polish
Titles           4,361,020    917,354     920,486
Surface forms    8,829,624    2,484,045   2,482,104
Files            745,724      72,126      n/a
Links            10,871,741   2,917,235   2,937,981

Files in Polish are arranged in a repository different from English/Italian.

Some definitions and figures
• surface form: the occurrence of a mention inside an article
• target article: the target Wiki article a surface form is linked to

The Unsupervised Approach

• Use keyphraseness to identify candidate terms
• Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
• Retain top 10
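The four steps above can be sketched as follows. This is a minimal illustration, not the system described in the lecture. It assumes the usual definitions: keyphraseness is the fraction of articles containing a term in which the term is a link anchor, and commonness is the fraction of a term's links that point to a given article. The threshold default reads the slide's value as 0.01, and all counts in the toy data are invented.

```python
def keyphraseness(term, link_doc_count, doc_count):
    """Articles where the term is a link anchor / articles where it occurs."""
    total = doc_count.get(term, 0)
    return link_doc_count.get(term, 0) / total if total else 0.0


def commonness(term, target_counts):
    """For each target article, the share of the term's links pointing to it."""
    counts = target_counts.get(term, {})
    total = sum(counts.values())
    return {target: n / total for target, n in counts.items()} if total else {}


def wikify_unsupervised(terms, link_doc_count, doc_count, target_counts,
                        threshold=0.01, top_n=10):
    """Filter terms by keyphraseness, then rank their targets by commonness."""
    result = {}
    for term in terms:
        if keyphraseness(term, link_doc_count, doc_count) <= threshold:
            continue
        ranked = sorted(commonness(term, target_counts).items(),
                        key=lambda item: item[1], reverse=True)
        result[term] = ranked[:top_n]
    return result


# Invented counts, for illustration only.
link_doc_count = {"giotto": 40, "work": 1}
doc_count = {"giotto": 50, "work": 10000}
target_counts = {
    "giotto": {"Giotto_di_Bondone": 36, "Giotto_Griffiths": 3, "Giotto_Bizzarrini": 1},
    "work": {"Work_(physics)": 1},
}
result = wikify_unsupervised(["giotto", "work"], link_doc_count, doc_count, target_counts)
print(result)
```

Nothing here is learned from labeled data: both statistics are counted directly from Wikipedia's own link structure, which is why the approach is called unsupervised.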

The Supervised Approach

• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)

$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$

where $A_1$ and $A_2$ are the sets of articles that link to $a_1$ and $a_2$, and $W$ is the set of all Wikipedia articles.
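The link-based measure translates directly into code. A minimal sketch, assuming each article is represented by the set of identifiers of the articles that link to it; `link_distance` is a hypothetical name, and returning infinity when no in-links are shared is a choice made here because the logarithm of zero is undefined.

```python
import math


def link_distance(in_links_1, in_links_2, total_articles):
    """Normalized link distance between two articles.

    in_links_1, in_links_2: collections of article ids linking to each article.
    total_articles: number of articles in the whole Wikipedia (|W|).
    Smaller values mean the two articles share more of their in-links.
    """
    a1, a2 = set(in_links_1), set(in_links_2)
    shared = len(a1 & a2)
    if shared == 0:
        return math.inf  # no common in-links: log(0) is undefined
    numerator = math.log(max(len(a1), len(a2))) - math.log(shared)
    denominator = math.log(total_articles) - math.log(min(len(a1), len(a2)))
    return numerator / denominator


d = link_distance({1, 2, 3, 4}, {3, 4, 5}, total_articles=100)
print(d)
```

As written, the formula behaves like a distance: two articles with identical in-link sets get 0, and the value grows as the overlap shrinks.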

Training a supervised wikifier

• Using WIKIPEDIA ITSELF as a source of training materials (see next)

Results on standard datasets

APPROACH               AQUAINT   WIKIPEDIA
Our approach           85.66     84.37
Milne & Witten 2008    83.61     80.31
Ratinov et al. 2011    84.52     90.20

• BAL data sets
  – 1049-query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  – 100-query set
    • 3 annotators, each up to 3 manual annotations

Wikifying queries: the Bridgeman datasets

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS   RESULTS
First candidate        64.77
Among first 2          71.59
First 3                75.42
First 4                77.18
First 5                78.32

Accuracy up by 17 points (36%)
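Figures of the "correct candidate among the first k" kind can be computed as below. This is a sketch on invented toy data, not the Bridgeman evaluation itself. The last lines check the arithmetic behind "up by 17 points (36%)", reading the first-candidate figure as 64.77 and assuming the 17 points are measured against it; the baseline itself is not stated on the slide.

```python
def accuracy_at_k(ranked_candidates, gold, k):
    """Percentage of queries whose gold article is among the first k candidates."""
    hits = sum(1 for candidates, answer in zip(ranked_candidates, gold)
               if answer in candidates[:k])
    return 100.0 * hits / len(gold)


# Hypothetical toy data: four queries, two ranked candidates each.
ranked = [["A", "B"], ["C", "D"], ["E", "F"], ["G", "H"]]
gold = ["A", "D", "X", "G"]
print(accuracy_at_k(ranked, gold, 1), accuracy_at_k(ranked, gold, 2))

# Relative gain implied by a 17-point improvement ending at 64.77.
new_accuracy = 64.77
baseline = new_accuracy - 17
relative_gain = 100 * 17 / baseline
print(round(baseline, 2), round(relative_gain))
```

Under that assumption the implied baseline is about 47.77, and 17 points on top of it is a relative gain of roughly 36%, which agrees with the figure in parentheses.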

Results for the GALATEAS languages and Arabic

LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78

The GALATEAS D2W web services

• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries, achieves throughput of 600 characters per second
• Integrated with the LangLog tool

Use of the service in LangLog

(See Domoina's demo)

Other applications

• The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …

WIKIPEDIA FOR NER

http://en.wikipedia.org/wiki/FCC

The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

WIKIPEDIA FOR NER

Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action

WIKIPEDIA

WIKIPEDIA FOR NER

• Wikipedia has been used in NER systems
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging towards proper ENTITY DISAMBIGUATION

Distant learning

• Automatically extract examples
• Positive examples: from the Wikipedia page that a mention links to
• Negative examples: from similar mentions with other links
• Use positive and negative examples to train a model

The Supervised Approach: Using Wikipedia links to generate training data

• Example
  – Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
  – Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  – +1 Giotto was called to work -- Giotto_di_Bondone
  – -1 Giotto was called to work -- Giotto_Griffiths
  – -1 Giotto was called to work -- Giotto_Bizzarrini

May 2012, Truc-Vien T. Nguyen

http://en.wikipedia.org/wiki/Giotto_di_Bondone
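The Giotto example can be generated mechanically from a linked sentence. A minimal sketch, assuming the candidate list for a surface form is already available (for instance from a surface-form dictionary); `make_training_examples` is a hypothetical helper.

```python
def make_training_examples(context, linked_target, candidates):
    """Label the article the mention actually links to +1, every other candidate -1."""
    return [(1 if candidate == linked_target else -1, context, candidate)
            for candidate in candidates]


context = "Giotto was called to work"
candidates = ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"]
examples = make_training_examples(context, "Giotto_di_Bondone", candidates)
for label, text, target in examples:
    print(f"{label:+d} {text} -- {target}")
```

No human annotation is involved: the link an editor placed in the Wikipedia sentence supplies the positive label, and the other senses of the same surface form supply the negatives.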

MORE ADVANCED USES OF WIKIPEDIA

• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

• Taxonomic information: category structure
• Attributes: infobox, text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Induce a subsumption hierarchy

INFOBOXES

• Collaborative content
• Semi-structured data



wwwphrasedetectivescom

PHRASE DETECTIVES THE TASKS

NAME THE CULPRIT

READINGS

bull Mihalcea R and Csomai A Wikify linking documents to encyclopedic knowledge Proceedings of CIKMrsquo07 Lisbon Portugal

bull V Nguyen amp M Poesio 2012 Entity disambiguation and linking over queries using Encyclopedic Knowledge Proceedings of 6th workshop on Analytics for Noisy Unstructured Text Data

bull D Lungley M Trevisan V Nguyen M Althobaiti M Poesio 2013 GALATEAS D2W A Multi-lingual Disambiguation to Wikipedia Web Service Proc Of ENRICH

bull V Nastaseamp M Strube Transforming Wikipedia into a large scale multilingual concept network Artificial Intelligence 2012

READINGS

bull L von Ahn and L Dabbish (2008) Designing games with a purpose Communications of the ACM v 51 n8 58-67

bull Poesio Chamberlain Kruschwitz Robaldo amp Ducceschi 2013 Phrase Detectives Utilizing Collective Intelligence for Internet-Scale Language Resource Creation ACM Transactions on Intelligent Interactive Systems

Page 33: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

The Supervised Approach

• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page

• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)

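$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$

Here A_1 and A_2 are the sets of Wikipedia articles that link to a_1 and a_2, and W is the set of all Wikipedia articles.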
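The link-based measure can be computed directly from the sets of incoming links. The following is a minimal sketch, assuming each article is represented by the set of ids of the articles that link to it; the function name and the handling of the no-overlap case are choices made for this illustration, not part of the slide.

```python
import math


def link_distance(links_a, links_b, total_articles):
    """Normalized link distance between two articles (Milne & Witten style).

    links_a, links_b: sets of ids of the articles linking to each article.
    total_articles:   number of articles in the collection (|W|).
    Returns None when the two articles share no in-links, because
    log(0) is undefined (an assumption of this sketch).
    """
    shared = len(links_a & links_b)
    if shared == 0:
        return None
    larger = max(len(links_a), len(links_b))
    smaller = min(len(links_a), len(links_b))
    if total_articles <= smaller:
        raise ValueError("total_articles must exceed the smaller in-link set")
    numerator = math.log(larger) - math.log(shared)
    denominator = math.log(total_articles) - math.log(smaller)
    return numerator / denominator
```

A value of 0 means the two articles are linked to by exactly the same pages; larger values mean less overlap between the link sets.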

Training a supervised wikifier

• Using WIKIPEDIA ITSELF as a source of training materials (see next)

Results on standard datasets

APPROACH               AQUAINT   WIKIPEDIA
Our approach           85.66     84.37
Milne & Witten 2008    83.61     80.31
Ratinov et al. 2011    84.52     90.20

Wikifying queries: the Bridgeman datasets

• BAL data sets
  – 1049-query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  – 100-query set
    • 3 annotators, each up to 3 manual annotations

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS   RESULTS
First candidate        64.77
Among first 2          71.59
Among first 3          75.42
Among first 4          77.18
Among first 5          78.32

Accuracy up by 17 points (36%)
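The rows of this table are accuracy-at-k figures: a query counts as correct if the gold Wikipedia page appears among the first k candidates returned. A minimal sketch of that computation follows; the function name and input layout are assumptions made for illustration.

```python
def accuracy_at_k(ranked_candidates, gold_pages, k):
    """Fraction of queries whose gold page is among the first k candidates.

    ranked_candidates: one ranked list of page titles per query.
    gold_pages:        the correct page title for each query, same order.
    """
    if len(ranked_candidates) != len(gold_pages):
        raise ValueError("need exactly one gold page per query")
    hits = sum(
        1
        for candidates, gold in zip(ranked_candidates, gold_pages)
        if gold in candidates[:k]
    )
    return hits / len(gold_pages)
```

Calling it with k = 1 through 5 on the same ranked output gives a non-decreasing series of the kind shown in the table.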

Results for the GALATEAS languages and Arabic

LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78

The GALATEAS D2W web services

• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves throughput of 600 characters per second
• Integrated with the LangLog tool

Use of the service in LangLog

(See Domoina's demo)

Other applications

• The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …

WIKIPEDIA FOR NER

http://en.wikipedia.org/wiki/FCC

The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

WIKIPEDIA FOR NER

Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action

WIKIPEDIA

WIKIPEDIA FOR NER

• Wikipedia has been used in NER systems
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging towards proper ENTITY DISAMBIGUATION

Distant learning

• Automatically extract examples

• Positive examples: a mention paired with the Wikipedia page it links to

• Negative examples: similar mentions paired with other linked pages

• Use positive and negative examples to train a model

The Supervised Approach Using Wikipedia links to generate training data

• Example
  – Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
  – Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)

• Dataset
  – +1 Giotto was called to work -- Giotto_di_Bondone
  – -1 Giotto was called to work -- Giotto_Griffiths
  – -1 Giotto was called to work -- Giotto_Bizzarrini

Truc-Vien T. Nguyen, May 2012

http://en.wikipedia.org/wiki/Giotto_di_Bondone
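The Giotto example can be turned into labeled training instances mechanically: the page the Wikipedia editor actually linked to is the positive example, and every other candidate page for the same surface form is a negative one. The sketch below assumes the candidate list is already available (for instance from a surface-form dictionary); the function name and tuple layout are hypothetical.

```python
def make_training_examples(context, linked_page, candidate_pages):
    """Build (label, context, page) triples from one linked mention.

    The page the mention links to gets label +1; all other
    candidates for the same surface form get label -1.
    """
    examples = []
    for page in candidate_pages:
        label = 1 if page == linked_page else -1
        examples.append((label, context, page))
    return examples


# Usage with the example from the slide.
giotto_examples = make_training_examples(
    "Giotto was called to work",
    "Giotto_di_Bondone",
    ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
)
```

Each triple would then be converted into feature vectors (commonness, similarity, relatedness) before training a classifier.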

MORE ADVANCED USES OF WIKIPEDIA

• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

• Taxonomic information: category structure
• Attributes: infobox, text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Induce a subsumption hierarchy

INFOBOXES

• Collaborative content

• Semi-structured data

{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
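Because an infobox is a list of key = value fields, its attributes can be pulled out with very little machinery. The following is a rough sketch, assuming a flat infobox with one field per line; real infoboxes contain nested templates and multi-line values that this does not handle, and the function names are hypothetical.

```python
import re

INFOBOX = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
}}"""


def strip_links(value):
    """Replace [[target|label]] with label and [[target]] with target."""
    return re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", value)


def parse_infobox(wikitext):
    """Return (template name, {field: value}) for a flat infobox."""
    lines = wikitext.strip().splitlines()
    name = lines[0].lstrip("{").strip()
    fields = {}
    for line in lines[1:]:
        line = line.strip()
        # Field lines look like '| key = value'; the closing '}}' is skipped.
        if line.startswith("|") and "=" in line:
            key, value = line[1:].split("=", 1)
            fields[key.strip()] = strip_links(value.strip())
    return name, fields


template_name, attributes = parse_infobox(INFOBOX)
```

The resulting attribute dictionary is the kind of raw material that projects such as DBpedia turn into structured statements.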

DBPEDIA

DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web

The DBpedia Dataset

• 1,600,000 concepts, including
  – 58,000 persons
  – 70,000 places
  – 35,000 music albums
  – 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories

REPRESENTING EXTRACTED INFORMATION

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. The "Developers Guide to Semantic Web Toolkits" lists development toolkits for processing DBpedia data in your preferred programming language.

Extracting Infobox Data (RDF Representation)

Wikipedia page: http://en.wikipedia.org/wiki/Calgary
DBpedia resource: http://dbpedia.org/resource/Calgary

dbpedia:native_name "Calgary"
dbpedia:altitude "1048"
dbpedia:population_city "988193"
dbpedia:population_metro "1079310"
mayor_name dbpedia:Dave_Bronconnier
governing_body dbpedia:Calgary_City_Council
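The RDF view of the Calgary infobox is a set of subject-predicate-object triples. As a minimal sketch, the triples can be held as plain Python tuples and filtered with a wildcard pattern; the predicate spellings below are illustrative, and a real application would use an RDF library rather than this hand-rolled matcher.

```python
CALGARY = "http://dbpedia.org/resource/Calgary"

# Each triple is (subject, predicate, object); spellings are illustrative.
TRIPLES = [
    (CALGARY, "dbpedia:native_name", "Calgary"),
    (CALGARY, "dbpedia:altitude", "1048"),
    (CALGARY, "dbpedia:population_city", "988193"),
    (CALGARY, "dbpedia:population_metro", "1079310"),
    (CALGARY, "dbpedia:mayor_name", "dbpedia:Dave_Bronconnier"),
    (CALGARY, "dbpedia:governing_body", "dbpedia:Calgary_City_Council"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]
```

For example, `match(TRIPLES, predicate="dbpedia:altitude")` returns the single altitude triple, which is the simplest form of the pattern matching that a graph query language performs.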

SPARQL

• SPARQL is a query language for RDF
  – RDF is a directed, labeled graph data format for representing information in the Web
  – This specification defines the syntax and semantics of the SPARQL query language for RDF

• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

The DBpedia SPARQL Endpoint

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century
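One of the example questions, "all films by Quentin Tarantino", might be phrased in SPARQL roughly as follows. This is a sketch under assumptions: the class `dbo:Film`, the property `dbo:director`, and the resource name are taken to be the relevant DBpedia vocabulary, and the `format` parameter is assumed to be accepted by the Virtuoso endpoint. The code only builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint named on the slide

# Assumed vocabulary: dbo:Film, dbo:director, dbr:Quentin_Tarantino.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film a dbo:Film ;
        dbo:director dbr:Quentin_Tarantino .
}
"""


def build_request_url(endpoint, query,
                      result_format="application/sparql-results+json"):
    """Encode a SPARQL query as an HTTP GET URL.

    'query' is the parameter name defined by the SPARQL protocol;
    'format' is an assumption about what this endpoint accepts.
    """
    params = urlencode({"query": query, "format": result_format})
    return endpoint + "?" + params


request_url = build_request_url(ENDPOINT, QUERY)
```

Fetching `request_url` with any HTTP client would then return the variable bindings for `?film`, provided the assumptions about vocabulary and result format hold.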

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis

www.phrasedetectives.com

WEB COLLABORATION PROJECTS

• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – Cycorp
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University

OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52,000 assertions)
  – IsA: (IsA "apple" "fruit")
  – PartOf: (PartOf "CPU" "computer")
  – PropertyOf: (PropertyOf "coffee" "wet")
  – MadeOf: (MadeOf "bread" "flour")
  – DefinedAs: (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
  – PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  – SubeventOf: (SubeventOf "play sport" "score goal")
  – FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  – LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
  – CapableOf: (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
  – LocationOf: (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
  – EffectOf: (EffectOf "view video" "entertainment")
  – DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  – DesireOf: (DesireOf "person" "not be depressed")
  – MotivationOf: (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
  – UsedFor: (UsedFor "fireplace" "burn wood")
  – CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
  – SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  – ThematicKLine: (ThematicKLine "wedding dress" "veil")
  – ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
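Assertions of this form are binary relations between two concepts, so they fit naturally into (relation, concept, concept) tuples indexed by concept. The sketch below uses a handful of the example assertions listed above; the index layout and function name are hypothetical and are not ConceptNet's own API.

```python
from collections import defaultdict

# A few of the example assertions, as (relation, left, right) tuples.
ASSERTIONS = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("CapableOf", "dentist", "pull tooth"),
    ("UsedFor", "fireplace", "burn wood"),
    ("ConceptuallyRelatedTo", "bad breath", "mint"),
]


def build_index(assertions):
    """Map every concept to the assertions in which it appears."""
    index = defaultdict(list)
    for relation, left, right in assertions:
        index[left].append((relation, left, right))
        index[right].append((relation, left, right))
    return index


concept_index = build_index(ASSERTIONS)
```

Looking up `concept_index["apple"]` returns every assertion that mentions the concept, which is the basic operation behind walking a semantic network from node to node.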

CONCEPT NET

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions

• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune

• Other games
  – Peekaboom
  – Phetch

ESP

• The first GWAP, developed by von Ahn and his group (2003, 2004)

• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision

• The goal: label the majority of the images on the Web

ESP the game

ESP THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
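The matching rule can be sketched in a few lines: a label is accepted as soon as one player's guess equals any guess made by the partner. The version below works on the complete guess lists rather than on a live stream of keystrokes, and the case and whitespace normalization is an assumption of this sketch, not something stated on the slide.

```python
def first_agreed_label(guesses_a, guesses_b):
    """Return the first of A's guesses that B also typed, or None.

    Guesses are compared after lowercasing and trimming whitespace
    (an assumption made for this illustration).
    """
    typed_by_b = {guess.strip().lower() for guess in guesses_b}
    for guess in guesses_a:
        label = guess.strip().lower()
        if label in typed_by_b:
            return label
    return None
```

When the function returns a label, both players would score points and that label would be recorded for the image; a return value of None corresponds to a round with no agreement.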

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th 2003 and December 10th 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once

• By 2008
  – 200,000 players
  – 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned

• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants

• Manual assessment of labels ('would you use these labels to describe this image?')
  – 15 participants, 20 images
  – 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

PHRASE DETECTIVES: THE TASKS

• 2 tasks
  – Find The Culprit (Annotation): User must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): User must agree/disagree with a coreference relation entered by another user

NAME THE CULPRIT

READINGS

• Mihalcea, R. and Csomai, A. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.

• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using Encyclopedic Knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.

• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.

• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012.

READINGS

• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.

• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.

Page 35: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

Results on standard datasets

APPROACH               AQUAINT   WIKIPEDIA
Our approach           85.66     84.37
Milne & Witten 2008    83.61     80.31
Ratinov et al. 2011    84.52     90.20

Wikifying queries: the Bridgeman datasets

• BAL data sets
  – 1049-query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  – 100-query set
    • 3 annotators, each up to 3 manual annotations

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS   RESULTS
First candidate        64.77
Among first 2          71.59
First 3                75.42
First 4                77.18
First 5                78.32

Accuracy up by 17 points (36%)
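The "first candidate / among first k" figures are accuracy-at-k scores: a query counts as correct if the gold Wikipedia page appears among the top k candidates returned by the wikifier. The sketch below shows the computation on made-up data; the candidate lists, gold labels and the function name `accuracy_at_k` are illustrative assumptions, not the Bridgeman data or the authors' evaluation code.

```python
from typing import List, Sequence


def accuracy_at_k(ranked: Sequence[List[str]], gold: Sequence[str], k: int) -> float:
    """Percentage of queries whose gold page is among the first k candidates."""
    hits = sum(1 for candidates, answer in zip(ranked, gold) if answer in candidates[:k])
    return 100.0 * hits / len(gold)


# Hypothetical ranked candidate lists for four queries (best candidate first).
ranked_candidates = [
    ["A", "B", "C"],
    ["B", "A", "C"],
    ["C", "B", "A"],
    ["A", "C", "B"],
]
# Hypothetical gold pages for the same four queries.
gold_pages = ["A", "A", "A", "B"]

for k in (1, 2, 3):
    print(f"correct candidate among first {k}: {accuracy_at_k(ranked_candidates, gold_pages, k):.2f}")
```

By construction the score can only grow with k, which matches the pattern in the table.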

Results for the GALATEAS languages and Arabic

LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78

The GALATEAS D2W web services

• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves throughput of 600 characters per second
• Integrated with the LangLog tool

Use of the service in LangLog

(See Domoina's demo)

Other applications

• The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal …

http://en.wikipedia.org/wiki/FCC

The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action

WIKIPEDIA FOR NER

• Wikipedia has been used in NER systems
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging towards proper ENTITY DISAMBIGUATION

Distant learning

• Automatically extract examples
• Positive examples: a mention paired with the Wikipedia page it links to
• Negative examples: similar mentions paired with other links
• Use positive and negative examples to train a model

The Supervised Approach: Using Wikipedia links to generate training data

• Example
  – "Giotto was called to work in Padua, and also in Rimini" (sentence taken from Wikipedia text, with links available)
  – Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  – +1 Giotto was called to work -- Giotto_di_Bondone
  – -1 Giotto was called to work -- Giotto_Griffiths
  – -1 Giotto was called to work -- Giotto_Bizzarrini

Truc-Vien T. Nguyen, May 2012

http://en.wikipedia.org/wiki/Giotto_di_Bondone
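The Giotto example can be sketched in a few lines: for a linked mention, the page the link points to yields a positive example, and every other candidate page for the same surface form yields a negative one. The candidate table `CANDIDATES` and the function name `make_examples` are assumptions made for illustration; a real system would build the candidate table from a Wikipedia dump and add features before training.

```python
from typing import Dict, List, Tuple

# Hypothetical candidate table: surface form -> Wikipedia titles it may refer to.
CANDIDATES: Dict[str, List[str]] = {
    "Giotto": ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
}


def make_examples(context: str, mention: str, linked_title: str) -> List[Tuple[int, str, str]]:
    """Return (label, context, candidate) triples for one linked mention.

    The candidate that the Wikipedia link points to is labelled +1,
    every other candidate for the same surface form is labelled -1.
    """
    examples = []
    for candidate in CANDIDATES.get(mention, []):
        label = 1 if candidate == linked_title else -1
        examples.append((label, context, candidate))
    return examples


examples = make_examples(
    context="Giotto was called to work in Padua, and also in Rimini",
    mention="Giotto",
    linked_title="Giotto_di_Bondone",
)
for label, context, candidate in examples:
    print(f"{label:+d} {context} -- {candidate}")
```

No manual annotation is involved: the labels come entirely from links that Wikipedia editors have already made.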

MORE ADVANCED USES OF WIKIPEDIA

• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

• Taxonomic information: category structure
• Attributes: infobox, text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Start with the category tree
• Induce a subsumption hierarchy

INFOBOXES

• Collaborative content
• Semi-structured data

{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
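Because an infobox is semi-structured, a very small parser is enough to turn it into attribute-value pairs. The sketch below assumes the template is laid out with one `| key = value` field per line and only simplifies `[[target|label]]` links; the function name `parse_infobox` and the shortened sample are illustrative, and real extraction frameworks handle many more cases (nested templates, multi-line values, units).

```python
import re
from typing import Dict

# Shortened sample in the usual one-field-per-line layout (an assumption).
INFOBOX = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| magnum_opus = The Raven
}}"""

# [[target]] -> target, [[target|label]] -> label
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def parse_infobox(wikitext: str) -> Dict[str, str]:
    """Collect 'key = value' fields from lines that start with a pipe."""
    fields: Dict[str, str] = {}
    for line in wikitext.splitlines():
        line = line.strip()
        if not line.startswith("|") or "=" not in line:
            continue  # skip the template name and the closing braces
        key, value = line[1:].split("=", 1)
        fields[key.strip()] = LINK.sub(r"\1", value.strip())
    return fields


fields = parse_infobox(INFOBOX)
for key, value in fields.items():
    print(f"{key}: {value}")
```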

DBPEDIA

DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web

The DBpedia Dataset

• 1,600,000 concepts
• including
  – 58,000 persons
  – 70,000 places
  – 35,000 music albums
  – 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories

REPRESENTING EXTRACTED INFORMATION

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At the Developers Guide to Semantic Web Toolkits you can find a development toolkit in your preferred programming language to process DBpedia data.

Extracting Infobox Data (RDF Representation)

http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary

dbpedia:native_name        "Calgary"
dbpedia:altitude           "1048"
dbpedia:population_city    "988193"
dbpedia:population_metro   "1079310"
mayor_name                 dbpedia:Dave_Bronconnier
governing_body             dbpedia:Calgary_City_Council
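The Calgary example is a set of RDF triples (subject, predicate, object): literal values such as the altitude are objects in quotes, while the mayor and the governing body are objects that are themselves resources. A minimal sketch using plain Python tuples follows; the `dbpedia:` prefix on every predicate and the helper name `match` are assumptions for illustration, and a real application would use an RDF library rather than tuples.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

CALGARY = "dbpedia:Calgary"

# Triples for the Calgary infobox; quoted objects are literals,
# unquoted objects are links to other resources.
TRIPLES: List[Triple] = [
    (CALGARY, "dbpedia:native_name", '"Calgary"'),
    (CALGARY, "dbpedia:altitude", '"1048"'),
    (CALGARY, "dbpedia:population_city", '"988193"'),
    (CALGARY, "dbpedia:population_metro", '"1079310"'),
    (CALGARY, "dbpedia:mayor_name", "dbpedia:Dave_Bronconnier"),
    (CALGARY, "dbpedia:governing_body", "dbpedia:Calgary_City_Council"),
]


def match(triples: List[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


print(match(TRIPLES, p="dbpedia:altitude"))
print([t[1] for t in match(TRIPLES, s=CALGARY) if not t[2].startswith('"')])
```

Pattern matching with wildcards is the basic operation that a query language for RDF builds on.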

SPARQL

• SPARQL is a query language for RDF
  – RDF is a directed, labeled graph data format for representing information in the Web
  – This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

The DBpedia SPARQL Endpoint

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century
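As an illustration, "All films by Quentin Tarantino" could be written as the SPARQL query below and sent to the endpoint as the `query` parameter of an HTTP GET request. This is a sketch under assumptions: the property `dbo:director` and the resource name `dbr:Quentin_Tarantino` reflect the current DBpedia ontology and may differ from the vocabulary in use when the slides were written, and the helper names are invented. The code only builds the request URL; it does not contact the server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:director links a film to its director.
QUERY = """PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}"""


def build_request_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Encode the query as the 'query' parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


def decode_query(url: str) -> str:
    """Recover the SPARQL text from a request URL (useful for checking)."""
    return parse_qs(urlsplit(url).query)["query"][0]


url = build_request_url(QUERY)
print(url)
```

The triple pattern in the WHERE clause has the same subject-predicate-object shape as the extracted infobox data, with `?film` as the variable to be bound.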

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis

WEB COLLABORATION PROJECTS

• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – Cycorp
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University

OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52,000 assertions)
  – IsA: (IsA "apple" "fruit")
  – PartOf: (PartOf "CPU" "computer")
  – PropertyOf: (PropertyOf "coffee" "wet")
  – MadeOf: (MadeOf "bread" "flour")
  – DefinedAs: (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
  – PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  – SubeventOf: (SubeventOf "play sport" "score goal")
  – FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  – LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
  – CapableOf: (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
  – LocationOf: (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
  – EffectOf: (EffectOf "view video" "entertainment")
  – DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  – DesireOf: (DesireOf "person" "not be depressed")
  – MotivationOf: (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
  – UsedFor: (UsedFor "fireplace" "burn wood")
  – CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
  – SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  – ThematicKLine: (ThematicKLine "wedding dress" "veil")
  – ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
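Each ConceptNet assertion is a relation name applied to two concepts, so a handful of them can be stored as tuples and indexed by concept to form a small semantic network. The sketch below uses a few of the example assertions from the table; the index layout and the function name `related` are illustrative assumptions and not the ConceptNet toolkit's API.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

# (relation, first concept, second concept), taken from the examples above.
ASSERTIONS: List[Tuple[str, str, str]] = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("CapableOf", "dentist", "pull tooth"),
    ("UsedFor", "fireplace", "burn wood"),
    ("EffectOf", "view video", "entertainment"),
    ("ConceptuallyRelatedTo", "bad breath", "mint"),
]

# Index outgoing edges by their first concept.
outgoing: DefaultDict[str, List[Tuple[str, str]]] = defaultdict(list)
for relation, first, second in ASSERTIONS:
    outgoing[first].append((relation, second))


def related(concept: str) -> List[Tuple[str, str]]:
    """Return (relation, other concept) pairs leaving the given concept."""
    return list(outgoing.get(concept, []))


print(related("apple"))
print(related("dentist"))
```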

CONCEPT NET

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions

• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch

ESP

• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web

ESP the game

ESP THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
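The matching rule can be sketched as follows: the strings typed by the two partners are compared, and any string both of them typed becomes an agreed label for the image. The normalisation step (lower-casing and trimming), the point value and the function names are assumptions made for illustration; the slides do not specify how the real game compares strings or how many points a match is worth.

```python
from typing import List


def normalize(text: str) -> str:
    """Assumed normalisation: ignore case and surrounding whitespace."""
    return text.strip().lower()


def agreed_labels(typed_a: List[str], typed_b: List[str]) -> List[str]:
    """Strings typed by player A that player B also typed, in A's order."""
    seen_b = {normalize(s) for s in typed_b}
    agreed: List[str] = []
    for s in typed_a:
        label = normalize(s)
        if label in seen_b and label not in agreed:
            agreed.append(label)
    return agreed


def score(typed_a: List[str], typed_b: List[str], points_per_match: int = 100) -> int:
    """Both players score if at least one string matches (hypothetical point value)."""
    return points_per_match if agreed_labels(typed_a, typed_b) else 0


player_a = ["Dog", "grass", "ball"]
player_b = ["puppy", "ball ", "dog"]
print(agreed_labels(player_a, player_b))
print(score(player_a, player_b))
```

Since the partners cannot communicate, a match is good evidence that the agreed string really describes the image.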

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th 2003 and December 10th 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once
• By 2008
  – 200,000 players
  – 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  – choose 10 labels among those produced, and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  – 15 participants, 20 images
  – 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

PHRASE DETECTIVES: THE TASKS

• 2 tasks
  – Find The Culprit (Annotation): the user must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): the user must agree or disagree with a coreference relation entered by another user

NAME THE CULPRIT

READINGS

• R. Mihalcea and A. Csomai. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012.
• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.

Page 37: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS RESULTS

First candidate 6477

Among first 2 7159

First 3 7542

First 4 7718

First 5 7832

Accuracy up by 17 points (36)

Results for the GALATEAS languages and Arabic

LANGUAGE WIKIPEDIA SIZE RESULTS (on Wikipedia subset)

English 4M articles 8437

Italian 1M 7964

French 14M 76-77

German 16M 72-73

Dutch 16M 70-71

Polish 900K 6081

Arabic 200K 8078

The GALATEAS D2W web services

bull Available as open sourcebull Deployed within LinguaGridbull API based on the Morphosyntactic Annotation

Framework (MAF) an ISO standardbull Tested on 15M queries achieves throughput

of 600 characters per secondbull Integrated with LangLog tool

Use of the service in LangLog

(See Domoinarsquos demo)

Other applications

bull The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [ATampT] By a 4-0 vote it allowed ATampT to continue offering special discount packages to big customers called Tariff 12 rejecting appeals by ATampT competitors that the discounts were illegal hellip

WIKIPEDIA FOR NER

httpenwikipediaorgwikiFCC

The Federal Communications Commission (FCC) is an independent United States government agency created directed and empowered by Congressional statute (see 47 USC sect 151 and 47 USC sect 154)

WIKIPEDIA FOR NER

Numberofglucocorticoidreceptorsinlymphocytesandtheirsensitivitytohormoneaction

WIKIPEDIA

WIKIPEDIA FOR NER

bull Wikipedia has been used in NER systemsndash As a source of features for normal NER ndash To automatically create training materials

(DISTANT LEARNING)ndash To go beyond NE tagging towards proper ENTITY

DISAMBIGUATION

Distant learning

bull Automatically extract examples

bull positive examples from mention-to-link Wikipedia page

bull Negative examples from similar mentions with other links

bull Use positive and negative examples to train model

The Supervised Approach Using Wikipedia links to generate training data

bull Examplendash Giotto was called to work in Padua and also in Rimini (sentence taken from Wikipedia text with links avalable)ndash Giotto_di_Bondone (painter) Giotto_Griffiths (Welsh rugby player)

Giotto_Bizzarrini (automobile engineer)

bull Datasetndash +1 Giotto was called to work -- Giotto_di_Bondonendash -1 Giotto was called to work -- Giotto_Griffithsndash -1 Giotto was called to work -- Giotto_Bizzarrini

May 2012 48Truc-Vien T Nguyen

httpenwikipediaorgwikiGiotto_di_Bondone

MORE ADVANCED USES OF WIKIPEDIA

bull As a source of ONTOLOGICAL KNOWLEDGEbull DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

bull Taxonomic information category structurebull Attributes infobox text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

bull Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

bull Induce a subsumption hierarchy

INFOBOXES

bull Collaborative content

bull Semi-structured data

Infobox Writer| bgcolour = silver| name = Edgar Allan Poe| image = Edgar_Allan_Poe_2jpg| caption = This [[daguerreotype]] of Poe was taken in 1848 | birth_date = birth date|1809|1|19|mf=y| birth_place = [[Boston Massachusetts]] [[United States|US]]| death_date = death date and age|1849|10|07|1809|01|19| death_place = [[Baltimore Maryland]] [[United States|US]]| occupation = Poet short story writer editor literary critic| movement = [[Romanticism]] [[Dark romanticism]]| genre = [[Horror fiction]] [[Crime fiction]] [[Detective fiction]]| magnum_opus = The Raven| spouse = [[Virginia Eliza Clemm Poe]]

DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web

DBPEDIA

• 1,600,000 concepts
• including
  – 58,000 persons
  – 70,000 places
  – 35,000 music albums
  – 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories

The DBpedia Dataset

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. The "Developers Guide to Semantic Web Toolkits" lists development toolkits, organised by programming language, that can process DBpedia data.

REPRESENTING EXTRACTED INFORMATION

http://en.wikipedia.org/wiki/Calgary

http://dbpedia.org/resource/Calgary

dbpedia:native_name "Calgary"
dbpedia:altitude "1048"
dbpedia:population_city "988193"
dbpedia:population_metro "1079310"
mayor_name: dbpedia:Dave_Bronconnier
governing_body: dbpedia:Calgary_City_Council

Extracting Infobox Data (RDF Representation)
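The Calgary graph can be written out as RDF triples. The sketch below serialises it in an N-Triples-like form; the `http://dbpedia.org/property/` namespace for infobox properties is an assumption made for the illustration, as are the helper names, and datatypes on the literals are left out.

```python
RESOURCE = "http://dbpedia.org/resource/"
PROPERTY = "http://dbpedia.org/property/"  # assumed namespace for infobox properties

def literal_triple(subject: str, prop: str, value: str) -> str:
    """Triple whose object is a plain string literal."""
    return f'<{RESOURCE}{subject}> <{PROPERTY}{prop}> "{value}" .'

def resource_triple(subject: str, prop: str, obj: str) -> str:
    """Triple whose object is another DBpedia resource."""
    return f"<{RESOURCE}{subject}> <{PROPERTY}{prop}> <{RESOURCE}{obj}> ."

# Values copied from the Calgary slide.
triples = [
    literal_triple("Calgary", "native_name", "Calgary"),
    literal_triple("Calgary", "altitude", "1048"),
    literal_triple("Calgary", "population_city", "988193"),
    literal_triple("Calgary", "population_metro", "1079310"),
    resource_triple("Calgary", "mayor_name", "Dave_Bronconnier"),
    resource_triple("Calgary", "governing_body", "Calgary_City_Council"),
]
print("\n".join(triples))
```

Literal-valued properties end in a quoted string, while `mayor_name` and `governing_body` point at other resources, which is what turns the extracted infobox data into a graph.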

SPARQL

• SPARQL is a query language for RDF
  – RDF is a directed, labeled graph data format for representing information in the Web
  – This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century

The DBpedia SPARQL Endpoint
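As an illustration of the "all films by Quentin Tarantino" query, the sketch below builds a request URL for the endpoint without sending it. The property `dbo:director`, the resource name `dbr:Quentin_Tarantino`, and the `format` request parameter are assumptions about the current DBpedia vocabulary and endpoint, not something stated on the slides.

```python
from urllib.parse import urlencode, urlparse, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:director links a film to its director.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
"""

def build_request_url(endpoint: str, query: str) -> str:
    """Encode the query as a GET request against a SPARQL endpoint."""
    params = {
        "query": query.strip(),
        "format": "application/sparql-results+json",  # assumed result format
    }
    return endpoint + "?" + urlencode(params)

url = build_request_url(ENDPOINT, QUERY)
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) would return the matching films; the network call is left out so the sketch runs offline.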

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

www.phrasedetectives.com

• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – CyCORP
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University

WEB COLLABORATION PROJECTS


OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52,000 assertions)
  IsA (IsA "apple" "fruit")
  PartOf (PartOf "CPU" "computer")
  PropertyOf (PropertyOf "coffee" "wet")
  MadeOf (MadeOf "bread" "flour")
  DefinedAs (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
  PrerequisiteEventOf (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf (SubeventOf "play sport" "score goal")
  FirstSubeventOf (FirstSubeventOf "start fire" "light match")
  LastSubeventOf (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
  CapableOf (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
  LocationOf (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
  EffectOf (EffectOf "view video" "entertainment")
  DesirousEffectOf (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf (DesireOf "person" "not be depressed")
  MotivationOf (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
  UsedFor (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo (ConceptuallyRelatedTo "bad breath" "mint")

CONCEPT NET
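Assertions of this kind are relation-concept-concept triples, so a semantic network can be held in a very small data structure. The sketch below is only an illustration with a handful of the examples listed above; the index layout and the name `index_by_concept` are assumptions, not ConceptNet's actual API.

```python
from collections import defaultdict

# A few assertions taken from the relation-type list.
ASSERTIONS = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("MadeOf", "bread", "flour"),
    ("CapableOf", "dentist", "pull tooth"),
    ("UsedFor", "fireplace", "burn wood"),
    ("ConceptuallyRelatedTo", "bad breath", "mint"),
]

def index_by_concept(assertions):
    """Map each left-hand concept to its outgoing (relation, concept) edges."""
    index = defaultdict(list)
    for relation, left, right in assertions:
        index[left].append((relation, right))
    return index

index = index_by_concept(ASSERTIONS)
print(index["dentist"])
```

Looking up a concept returns its outgoing edges, which is the basic operation behind following links in such a network.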

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch

ESP

• The first GWAP, developed by von Ahn's group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web

ESP the game

ESP THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
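The matching rule can be sketched as follows. This is a simplified illustration: the function name `first_agreed_label` is hypothetical, labels are compared case-insensitively (an assumption), and the real game processes guesses as they arrive rather than as two finished lists.

```python
from typing import Iterable, Optional

def first_agreed_label(
    labels_a: Iterable[str], labels_b: Iterable[str]
) -> Optional[str]:
    """Return the first label typed by player A that player B also typed."""
    typed_by_b = {label.strip().lower() for label in labels_b}
    for label in labels_a:
        normalized = label.strip().lower()
        if normalized in typed_by_b:
            return normalized
    return None

# Two players describing the same image without seeing each other's guesses.
match = first_agreed_label(["dog", "Grass", "ball"], ["puppy", "grass", "park"])
no_match = first_agreed_label(["dog"], ["cat"])
print(match, no_match)
```

Because a label only counts when both players produce it independently, an agreed string is likely to be a reasonable description of the image.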

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th, 2003 and December 10th, 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once
• By 2008
  – 200,000 players
  – 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1,000 with more than 5 labels
  – 83% of game labels also produced by participants
• Manual assessment of labels (‘would you use these labels to describe this image?’)
  – 15 participants, 20 images
  – 85% of words rated useful
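The comparison between game labels and participant labels amounts to an overlap measure. The sketch below shows one plausible way to compute it for a single image; the slides do not give the exact procedure, so the function `share_of_game_labels_confirmed` and the toy labels are assumptions made for illustration.

```python
from typing import Iterable

def share_of_game_labels_confirmed(
    game_labels: Iterable[str], participant_labels: Iterable[str]
) -> float:
    """Fraction of distinct game labels that participants also produced."""
    game = {label.lower() for label in game_labels}
    if not game:
        return 0.0
    participants = {label.lower() for label in participant_labels}
    return len(game & participants) / len(game)

# Toy data for one image: five of the six game labels are confirmed.
game = ["car", "road", "tree", "sky", "red", "sign"]
participants = ["car", "tree", "sky", "road", "red", "cloud"]
share = share_of_game_labels_confirmed(game, participants)
print(f"{share:.0%}")
```

Averaging this share over the evaluated images gives a single figure comparable in spirit to the percentage reported above.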

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

• 2 tasks
  – Find The Culprit (Annotation): User must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): User must agree/disagree with a coreference relation entered by another user


PHRASE DETECTIVES THE TASKS

NAME THE CULPRIT

READINGS

• Mihalcea, R. and Csomai, A. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal
• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using Encyclopedic Knowledge. Proceedings of 6th workshop on Analytics for Noisy Unstructured Text Data
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH
• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012

READINGS

• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems

  • 807 - TEXT ANALYTICS
  • WIKIPEDIA
  • Slide 3
  • Slide 4
  • WIKIPEDIA FOR TEXT ANALYTICS
  • Wikipedia as Thesaurus for text classification clustering
  • Wikipedia as Thesaurus
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
  • Slide 13
  • The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
  • WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
  • Using Wikipedia Categories for text classification
  • WIKIPEDIA FOR TEXT CLASSIFICATION
  • USING WIKIPEDIA FOR TEXT CLASSIFICATION
  • TEXT WIKIFICATION
  • WIKIFICATION
  • Wikification pipeline
  • Keyword Extraction
  • Keyword Extraction using Wikipedia
  • Slide 24
  • Slide 25
  • KEYPHRASENESS
  • COMMONNESS
  • Slide 28
  • The Wikipedia Dump from July 2011
  • Some statistics (all Wikidumps from July 2011)
  • Surface forms titles articles
  • Slide 32
  • Slide 33
  • Training a supervised wikifier
  • Results on standard datasets
  • Wikifying queries the Bridgeman datasets
  • Results on Bridgeman 1000 Y3
  • Results for the GALATEAS languages and Arabic
  • The GALATEAS D2W web services
  • Use of the service in LangLog
  • Other applications
  • WIKIPEDIA FOR NER
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • The Supervised Approach Using Wikipedia links to generate training data
  • MORE ADVANCED USES OF WIKIPEDIA
  • SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
  • Wikipedia category network
  • Deriving a taxonomy from Wikipedia (AAAI 2007)
  • Slide 53
  • INFOBOXES
  • Slide 56
  • Slide 57
  • Slide 58
  • SPARQL
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • OPEN MIND COMMONSENSE
  • Slide 65
  • CONCEPT NET
  • GAMES WITH A PURPOSE
  • GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
  • EXAMPLES OF GWAP
  • ESP
  • ESP the game
  • ESP THE GAME
  • THE TASK
  • SCORING BY MATCHING
  • SOME STATISTICS
  • QUALITY OF THE LABELS
  • GOOGLE IMAGE LABELLER
  • Slide 78
  • RESULTS
  • PHRASE DETECTIVES
  • Slide 81
  • NAME THE CULPRIT
  • READINGS
  • Slide 84
Page 38: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

Results for the GALATEAS languages and Arabic

LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78

The GALATEAS D2W web services

• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves throughput of 600 characters per second
• Integrated with LangLog tool

Use of the service in LangLog

(See Domoina's demo)

Other applications

• The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …

WIKIPEDIA FOR NER

http://en.wikipedia.org/wiki/FCC

The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

WIKIPEDIA FOR NER

Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action

WIKIPEDIA

  • Slide 11
  • The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
  • Slide 13
  • The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
  • WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
  • Using Wikipedia Categories for text classification
  • WIKIPEDIA FOR TEXT CLASSIFICATION
  • USING WIKIPEDIA FOR TEXT CLASSIFICATION
  • TEXT WIKIFICATION
  • WIKIFICATION
  • Wikification pipeline
  • Keyword Extraction
  • Keyword Extraction using Wikipedia
  • Slide 24
  • Slide 25
  • KEYPHRASENESS
  • COMMONNESS
  • Slide 28
  • The Wikipedia Dump from July 2011
  • Some statistics (all Wikidumps from July 2011)
  • Surface forms titles articles
  • Slide 32
  • Slide 33
  • Training a supervised wikifier
  • Results on standard datasets
  • Wikifying queries the Bridgeman datasets
  • Results on Bridgeman 1000 Y3
  • Results for the GALATEAS languages and Arabic
  • The GALATEAS D2W web services
  • Use of the service in LangLog
  • Other applications
  • WIKIPEDIA FOR NER
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • The Supervised Approach Using Wikipedia links to generate training data
  • MORE ADVANCED USES OF WIKIPEDIA
  • SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
  • Wikipedia category network
  • Deriving a taxonomy from Wikipedia (AAAI 2007)
  • Slide 53
  • INFOBOXES
  • Slide 56
  • Slide 57
  • Slide 58
  • SPARQL
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • OPEN MIND COMMONSENSE
  • Slide 65
  • CONCEPT NET
  • GAMES WITH A PURPOSE
  • GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
  • EXAMPLES OF GWAP
  • ESP
  • ESP the game
  • ESP THE GAME
  • THE TASK
  • SCORING BY MATCHING
  • SOME STATISTICS
  • QUALITY OF THE LABELS
  • GOOGLE IMAGE LABELLER
  • Slide 78
  • RESULTS
  • PHRASE DETECTIVES
  • Slide 81
  • NAME THE CULPRIT
  • READINGS
  • Slide 84
Page 40: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

Use of the service in LangLog

(See Domoina’s demo)

Other applications

• The UK Data Archive

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …

WIKIPEDIA FOR NER

http://en.wikipedia.org/wiki/FCC

The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

WIKIPEDIA FOR NER

Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action

WIKIPEDIA

WIKIPEDIA FOR NER

• Wikipedia has been used in NER systems
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging towards proper ENTITY DISAMBIGUATION

Distant learning

• Automatically extract examples
• Positive examples: mentions whose link points to the Wikipedia page in question
• Negative examples: similar mentions whose links point to other pages
• Use the positive and negative examples to train a model

The Supervised Approach Using Wikipedia links to generate training data

• Example
  – Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
  – Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  – +1 Giotto was called to work -- Giotto_di_Bondone
  – -1 Giotto was called to work -- Giotto_Griffiths
  – -1 Giotto was called to work -- Giotto_Bizzarrini
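The Giotto dataset above can be generated mechanically from one linked mention. The following is a minimal Python sketch, assuming the list of candidate pages for the surface form is already available; the function name `make_examples` is illustrative and not taken from the lecture.

```python
def make_examples(context, linked_target, candidates):
    """Build (label, context, candidate) rows for one linked mention.

    The candidate that the Wikipedia link points to is the positive example;
    every other candidate for the same surface form is a negative example.
    """
    rows = []
    for candidate in candidates:
        label = 1 if candidate == linked_target else -1
        rows.append((label, context, candidate))
    return rows


candidates = ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"]
rows = make_examples("Giotto was called to work", "Giotto_di_Bondone", candidates)

# Print the rows in the same "+1 context -- candidate" layout used on the slide.
for label, context, candidate in rows:
    print(f"{label:+d} {context} -- {candidate}")
```

A real system would turn each row into features (context words, link statistics and so on) before training; that step is left out here.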

Truc-Vien T. Nguyen, May 2012

http://en.wikipedia.org/wiki/Giotto_di_Bondone

MORE ADVANCED USES OF WIKIPEDIA

• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

• Taxonomic information: category structure
• Attributes: infobox, text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Induce a subsumption hierarchy

INFOBOXES

• Collaborative content

• Semi-structured data

{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
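Because an infobox is semi-structured, its fields can be pulled out with very little code. The sketch below is one possible approach in Python, assuming the wikitext has one `| key = value` field per line, as in the page source; `parse_infobox` and the shortened `INFOBOX` sample are illustrative, and real infoboxes with multi-line values would need a proper wikitext parser.

```python
import re

# Shortened sample of the Poe infobox, one field per line (assumed layout).
INFOBOX = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| movement = [[Romanticism]], [[Dark romanticism]]
| magnum_opus = The Raven
}}"""

# Matches [[Target]] or [[Target|label]] and captures only the link target.
LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def parse_infobox(wikitext):
    """Map each '| key = value' line to (raw value, list of link targets)."""
    fields = {}
    for line in wikitext.splitlines():
        match = re.match(r"\|\s*(\w+)\s*=\s*(.*)", line.strip())
        if match:
            key, value = match.group(1), match.group(2).strip()
            fields[key] = (value, LINK.findall(value))
    return fields


fields = parse_infobox(INFOBOX)
print(fields["movement"][1])
```

Keeping the link targets separate from the raw value is what lets an attribute such as `birth_place` point at another page rather than at a plain string.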

DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web

DBPEDIA

• 1,600,000 concepts, including
  – 58,000 persons
  – 70,000 places
  – 35,000 music albums
  – 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories

The DBpedia Dataset

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At the Developers Guide to Semantic Web Toolkits you can find a development toolkit in your preferred programming language to process DBpedia data.

REPRESENTING EXTRACTED INFORMATION

http://en.wikipedia.org/wiki/Calgary

http://dbpedia.org/resource/Calgary
  dbpedia:native_name "Calgary"
  dbpedia:altitude "1048"
  dbpedia:population_city "988193"
  dbpedia:population_metro "1079310"
  mayor_name dbpedia:Dave_Bronconnier
  governing_body dbpedia:Calgary_City_Council

Extracting Infobox Data (RDF Representation)
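The Calgary example can be written down as subject-predicate-object triples. Here is a minimal Python sketch that stores them as tuples and prints them in an N-Triples-like form. The two namespace URIs are assumptions standing in for the slide's "dbpedia:" prefix, `to_ntriples` is an illustrative helper, and literal values are not escaped.

```python
# Assumed namespaces; the slide abbreviates both as "dbpedia:".
RESOURCE = "http://dbpedia.org/resource/"
PROPERTY = "http://dbpedia.org/property/"

calgary = RESOURCE + "Calgary"

# Each entry: (subject, predicate, object, object_is_resource)
TRIPLES = [
    (calgary, PROPERTY + "native_name", "Calgary", False),
    (calgary, PROPERTY + "altitude", "1048", False),
    (calgary, PROPERTY + "population_city", "988193", False),
    (calgary, PROPERTY + "population_metro", "1079310", False),
    (calgary, PROPERTY + "mayor_name", RESOURCE + "Dave_Bronconnier", True),
    (calgary, PROPERTY + "governing_body", RESOURCE + "Calgary_City_Council", True),
]


def to_ntriples(triples):
    """Serialize triples one per line: URIs in angle brackets, literals in quotes."""
    lines = []
    for subj, pred, obj, is_resource in triples:
        obj_text = f"<{obj}>" if is_resource else f'"{obj}"'
        lines.append(f"<{subj}> <{pred}> {obj_text} .")
    return "\n".join(lines)


print(to_ntriples(TRIPLES))
```

The distinction between literal objects (altitude, population) and resource objects (mayor, governing body) is what turns the infobox into a graph: resource objects are themselves subjects of further triples.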

SPARQL

• SPARQL is a query language for RDF
  – RDF is a directed, labeled graph data format for representing information in the Web
  – This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century

The DBpedia SPARQL Endpoint
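To make the endpoint concrete, the Python sketch below builds (but does not send) a GET request URL for a simplified version of the "born in Berlin" question. The property and resource names `dbo:birthPlace` and `dbr:Berlin` are assumptions about the DBpedia vocabulary rather than something stated in the lecture, and `build_request_url` is an illustrative helper.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Simplified, illustrative query: entities whose birth place is Berlin.
# dbo:birthPlace and dbr:Berlin are assumed vocabulary names.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?person WHERE {
  ?person dbo:birthPlace dbr:Berlin .
} LIMIT 10
"""


def build_request_url(query, endpoint=ENDPOINT):
    """Encode a SPARQL query as the 'query' parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_request_url(QUERY)
print(url)
```

Restricting the result to German musicians born in the 19th century would add further triple patterns and a date filter; those are omitted because the exact class and property names would be additional assumptions.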

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

www.phrasedetectives.com

• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – CyCORP
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University

WEB COLLABORATION PROJECTS


OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh, 2004)

THINGS (52,000 assertions)
• IsA: (IsA "apple" "fruit")
• PartOf: (PartOf "CPU" "computer")
• PropertyOf: (PropertyOf "coffee" "wet")
• MadeOf: (MadeOf "bread" "flour")
• DefinedAs: (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
• PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
• SubeventOf: (SubeventOf "play sport" "score goal")
• FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
• LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
• CapableOf: (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
• LocationOf: (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
• EffectOf: (EffectOf "view video" "entertainment")
• DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
• DesireOf: (DesireOf "person" "not be depressed")
• MotivationOf: (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
• UsedFor: (UsedFor "fireplace" "burn wood")
• CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")

ASSOCIATION: K-LINES (1.25 million assertions)
• SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
• ThematicKLine: (ThematicKLine "wedding dress" "veil")
• ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
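Each assertion above pairs a relation name with two concepts, so the collection behaves like a labelled graph. The following is a minimal Python sketch of storing and querying a few of these assertions in memory; `Assertion` and `related` are illustrative names and not part of ConceptNet's own software.

```python
from collections import namedtuple

# Hypothetical container for one assertion: relation plus its two concepts.
Assertion = namedtuple("Assertion", ["relation", "left", "right"])

# A few of the example assertions from the table.
ASSERTIONS = [
    Assertion("IsA", "apple", "fruit"),
    Assertion("UsedFor", "fireplace", "burn wood"),
    Assertion("ThematicKLine", "wedding dress", "veil"),
    Assertion("ConceptuallyRelatedTo", "bad breath", "mint"),
]


def related(concept, assertions=ASSERTIONS):
    """Return (relation, other concept) pairs for assertions mentioning `concept`."""
    pairs = []
    for a in assertions:
        if a.left == concept:
            pairs.append((a.relation, a.right))
        elif a.right == concept:
            pairs.append((a.relation, a.left))
    return pairs


print(related("fireplace"))
```

Looking a concept up on either side of an assertion is a deliberate simplification: it treats every relation as a plain association, which loses the direction that relations such as IsA or PartOf carry.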

CONCEPT NET

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions

• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch

ESP

• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web

ESP the game

ESP THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is and can’t communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
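The matching rule in the last bullet can be sketched in a few lines of Python. This is a simplified illustration only: the function names and the lower-casing step are assumptions, and timing, scoring values and any other rules of the actual game are ignored.

```python
def normalize(label):
    """Lower-case and trim a typed label so trivial differences do not block a match."""
    return label.strip().lower()


def first_match(guesses_a, guesses_b):
    """Return the first label typed by player A that player B also typed, or None.

    Both arguments are lists of strings in the order they were typed.
    """
    typed_by_b = {normalize(g) for g in guesses_b}
    for guess in guesses_a:
        if normalize(guess) in typed_by_b:
            return normalize(guess)
    return None


player_a = ["dog", "Grass", "ball"]
player_b = ["puppy", "grass", "park"]
print(first_match(player_a, player_b))
```

The agreed string is what gets stored as a label for the image: two strangers who cannot communicate are unlikely to agree on a word unless it actually describes what they both see.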

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th, 2003 and December 10th, 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once
• By 2008
  – 200,000 players
  – 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1,000 with more than 5 labels
  – 83% of game labels also produced by participants
• Manual assessment of labels (‘would you use these labels to describe this image?’)
  – 15 participants, 20 images
  – 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

• 2 tasks
  – Find The Culprit (Annotation): the user must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): the user must agree/disagree with a coreference relation entered by another user

www.phrasedetectives.com

PHRASE DETECTIVES THE TASKS

NAME THE CULPRIT

READINGS

• Mihalcea, R. and Csomai, A. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM’07, Lisbon, Portugal.

• V. Nguyen & M. Poesio. 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.

• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio. 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.

• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012.

READINGS

• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.

• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi. 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.

  • 807 - TEXT ANALYTICS
  • WIKIPEDIA
  • Slide 3
  • Slide 4
  • WIKIPEDIA FOR TEXT ANALYTICS
  • Wikipedia as Thesaurus for text classification clustering
  • Wikipedia as Thesaurus
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • The concept Artificial Intelligence belongs to four categories: Artificial intelligence, Cybernetics, Formal sciences & Technology in society
  • Slide 13
  • The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
  • WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
  • Using Wikipedia Categories for text classification
  • WIKIPEDIA FOR TEXT CLASSIFICATION
  • USING WIKIPEDIA FOR TEXT CLASSIFICATION
  • TEXT WIKIFICATION
  • WIKIFICATION
  • Wikification pipeline
  • Keyword Extraction
  • Keyword Extraction using Wikipedia
  • Slide 24
  • Slide 25
  • KEYPHRASENESS
  • COMMONNESS
  • Slide 28
  • The Wikipedia Dump from July 2011
  • Some statistics (all Wikidumps from July 2011)
  • Surface forms titles articles
  • Slide 32
  • Slide 33
  • Training a supervised wikifier
  • Results on standard datasets
  • Wikifying queries the Bridgeman datasets
  • Results on Bridgeman 1000 Y3
  • Results for the GALATEAS languages and Arabic
  • The GALATEAS D2W web services
  • Use of the service in LangLog
  • Other applications
  • WIKIPEDIA FOR NER
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • The Supervised Approach Using Wikipedia links to generate training data
  • MORE ADVANCED USES OF WIKIPEDIA
  • SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
  • Wikipedia category network
  • Deriving a taxonomy from Wikipedia (AAAI 2007)
  • Slide 53
  • INFOBOXES
  • Slide 56
  • Slide 57
  • Slide 58
  • SPARQL
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • OPEN MIND COMMONSENSE
  • Slide 65
  • CONCEPT NET
  • GAMES WITH A PURPOSE
  • GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
  • EXAMPLES OF GWAP
  • ESP
  • ESP the game
  • ESP THE GAME
  • THE TASK
  • SCORING BY MATCHING
  • SOME STATISTICS
  • QUALITY OF THE LABELS
  • GOOGLE IMAGE LABELLER
  • Slide 78
  • RESULTS
  • PHRASE DETECTIVES
  • Slide 81
  • NAME THE CULPRIT
  • READINGS
  • Slide 84
Page 42: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

WIKIPEDIA FOR NER

[The FCC] took [three specific actions] regarding [ATampT] By a 4-0 vote it allowed ATampT to continue offering special discount packages to big customers called Tariff 12 rejecting appeals by ATampT competitors that the discounts were illegal hellip

WIKIPEDIA FOR NER

httpenwikipediaorgwikiFCC

The Federal Communications Commission (FCC) is an independent United States government agency created directed and empowered by Congressional statute (see 47 USC sect 151 and 47 USC sect 154)

WIKIPEDIA FOR NER

Numberofglucocorticoidreceptorsinlymphocytesandtheirsensitivitytohormoneaction

WIKIPEDIA

WIKIPEDIA FOR NER

bull Wikipedia has been used in NER systemsndash As a source of features for normal NER ndash To automatically create training materials

(DISTANT LEARNING)ndash To go beyond NE tagging towards proper ENTITY

DISAMBIGUATION

Distant learning

bull Automatically extract examples

bull positive examples from mention-to-link Wikipedia page

bull Negative examples from similar mentions with other links

bull Use positive and negative examples to train model

The Supervised Approach Using Wikipedia links to generate training data

bull Examplendash Giotto was called to work in Padua and also in Rimini (sentence taken from Wikipedia text with links avalable)ndash Giotto_di_Bondone (painter) Giotto_Griffiths (Welsh rugby player)

Giotto_Bizzarrini (automobile engineer)

bull Datasetndash +1 Giotto was called to work -- Giotto_di_Bondonendash -1 Giotto was called to work -- Giotto_Griffithsndash -1 Giotto was called to work -- Giotto_Bizzarrini

May 2012 48Truc-Vien T Nguyen

httpenwikipediaorgwikiGiotto_di_Bondone

MORE ADVANCED USES OF WIKIPEDIA

bull As a source of ONTOLOGICAL KNOWLEDGEbull DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

bull Taxonomic information category structurebull Attributes infobox text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

bull Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

bull Induce a subsumption hierarchy

INFOBOXES

bull Collaborative content

bull Semi-structured data

Infobox Writer| bgcolour = silver| name = Edgar Allan Poe| image = Edgar_Allan_Poe_2jpg| caption = This [[daguerreotype]] of Poe was taken in 1848 | birth_date = birth date|1809|1|19|mf=y| birth_place = [[Boston Massachusetts]] [[United States|US]]| death_date = death date and age|1849|10|07|1809|01|19| death_place = [[Baltimore Maryland]] [[United States|US]]| occupation = Poet short story writer editor literary critic| movement = [[Romanticism]] [[Dark romanticism]]| genre = [[Horror fiction]] [[Crime fiction]] [[Detective fiction]]| magnum_opus = The Raven| spouse = [[Virginia Eliza Clemm Poe]]

DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web

DBPEDIA

• 1,600,000 concepts
• including
  – 58,000 persons
  – 70,000 places
  – 35,000 music albums
  – 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories

The DBpedia Dataset

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data.

REPRESENTING EXTRACTED INFORMATION

http://en.wikipedia.org/wiki/Calgary

http://dbpedia.org/resource/Calgary

dbpedia:native_name "Calgary"
dbpedia:altitude "1048"
dbpedia:population_city "988193"
dbpedia:population_metro "1079310"
mayor_name dbpedia:Dave_Bronconnier
governing_body dbpedia:Calgary_City_Council

Extracting Infobox Data (RDF Representation)
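The Calgary example can be written down as subject-predicate-object triples. The following Python sketch stores them as plain tuples and answers simple pattern queries. The `dbpedia:` prefix spelling is an assumption (the slide's punctuation was lost in extraction), and `match` is a hypothetical helper rather than part of any RDF library.

```python
CALGARY = "http://dbpedia.org/resource/Calgary"

# (subject, predicate, object) triples read off the slide; prefixes assumed.
TRIPLES = [
    (CALGARY, "dbpedia:native_name", "Calgary"),
    (CALGARY, "dbpedia:altitude", "1048"),
    (CALGARY, "dbpedia:population_city", "988193"),
    (CALGARY, "dbpedia:population_metro", "1079310"),
    (CALGARY, "dbpedia:mayor_name", "dbpedia:Dave_Bronconnier"),
    (CALGARY, "dbpedia:governing_body", "dbpedia:Calgary_City_Council"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


for _, predicate, value in match(TRIPLES, subject=CALGARY):
    print(predicate, value)
```

Note that some objects are literals (the altitude) while others are themselves resources (the mayor), which is what makes the data a graph rather than a flat table.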

SPARQL

• SPARQL is a query language for RDF
• RDF is a directed, labeled graph data format for representing information in the Web
• This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century

The DBpedia SPARQL Endpoint
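As an illustration, one of the questions above ("all films by Quentin Tarantino") might be phrased in SPARQL and packaged as a request URL roughly as follows. This is a sketch under assumptions: the choice of `dbo:director` and the resource name `dbr:Quentin_Tarantino` reflect the current DBpedia ontology as far as I know, not the slides, and `build_request_url` is a hypothetical helper. The code only builds the URL and does not contact the server.

```python
from urllib.parse import urlencode, urlparse, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:director links a film to its director.
QUERY = """PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}"""


def build_request_url(endpoint, query):
    """Encode the query in the `query` parameter of a GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL could then be fetched with any HTTP client; result formats and rate limits depend on the endpoint and are not covered here.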

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

www.phrasedetectives.com

• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – CyCORP
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University

WEB COLLABORATION PROJECTS

OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52,000 assertions)
  IsA (IsA "apple" "fruit")
  PartOf (PartOf "CPU" "computer")
  PropertyOf (PropertyOf "coffee" "wet")
  MadeOf (MadeOf "bread" "flour")
  DefinedAs (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
  PrerequisiteEventOf (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf (SubeventOf "play sport" "score goal")
  FirstSubeventOf (FirstSubeventOf "start fire" "light match")
  LastSubeventOf (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
  CapableOf (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
  LocationOf (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
  EffectOf (EffectOf "view video" "entertainment")
  DesirousEffectOf (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf (DesireOf "person" "not be depressed")
  MotivationOf (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
  UsedFor (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo (ConceptuallyRelatedTo "bad breath" "mint")

CONCEPT NET
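ConceptNet assertions are binary relations between short natural-language phrases. A rough Python sketch of how such assertions could be read and indexed is given below. It assumes the quoted notation `(Relation "first" "second")`, and both `parse_assertion` and the index layout are illustrative choices rather than the ConceptNet toolkit's own API.

```python
import shlex
from collections import defaultdict


def parse_assertion(text):
    """Parse '(IsA "apple" "fruit")' into ('IsA', 'apple', 'fruit')."""
    inner = text.strip()[1:-1]  # drop the outer parentheses
    relation, first, second = shlex.split(inner)  # shlex keeps quoted phrases whole
    return relation, first, second


RAW = [
    '(IsA "apple" "fruit")',
    '(DefinedAs "meat" "flesh of animal")',
    '(CapableOf "dentist" "pull tooth")',
    '(UsedFor "fireplace" "burn wood")',
]

# Index assertions by their first argument so a concept's relations can be looked up.
by_concept = defaultdict(list)
for raw in RAW:
    relation, first, second = parse_assertion(raw)
    by_concept[first].append((relation, second))

print(by_concept["dentist"])
```

Because the arguments are free-form phrases rather than word senses, lookups like this depend on how phrases are normalized, which this sketch does not attempt.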

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions

• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune

• Other games
  – Peekaboom
  – Phetch

ESP

• The first GWAP, developed by von Ahn and their group (2003, 2004)

• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision

• The goal: label the majority of the images on the Web

ESP the game

ESP THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
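The matching rule can be sketched as follows. This is a simplified illustration, not the game's actual code: it ignores timing, and the lowercasing and whitespace normalization in `normalize` are assumptions made for this example.

```python
def normalize(label):
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(label.lower().split())


def agreed_labels(typed_a, typed_b):
    """Return the labels typed by player A that player B also typed."""
    seen_b = {normalize(s) for s in typed_b}
    agreed = []
    for s in typed_a:
        label = normalize(s)
        if label in seen_b and label not in agreed:
            agreed.append(label)
    return agreed


player_a = ["Dog", "grass ", "ball"]
player_b = ["cat", "dog", "green grass"]
print(agreed_labels(player_a, player_b))  # prints ['dog']
```

The design point is that a label is only accepted when two players who cannot communicate produce it independently, which acts as a built-in quality check.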

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th, 2003 and December 10th, 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once

• By 2008
  – 200,000 players
  – 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned

• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants

• Manual assessment of labels ('would you use these labels to describe this image?')
  – 15 participants, 20 images
  – 85% of words rated useful
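The comparison between game labels and participant labels comes down to an overlap ratio. A minimal Python sketch, using invented labels for a single hypothetical image (the study's actual data is not reproduced here):

```python
def overlap_fraction(game_labels, participant_labels):
    """Fraction of distinct game labels that participants also produced."""
    game = set(game_labels)
    if not game:
        return 0.0
    return len(game & set(participant_labels)) / len(game)


# Invented example data, for illustration only.
game = ["dog", "grass", "ball", "park", "brown", "running"]
participants = ["dog", "ball", "grass", "park", "brown", "tree"]

print(f"{overlap_fraction(game, participants):.0%}")  # prints 83%
```

A study-level figure would presumably aggregate this ratio over all images and participants; how that aggregation was done is not stated on the slide.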

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

• 2 tasks
  – Find The Culprit (Annotation): user must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): user must agree/disagree with a coreference relation entered by another user

www.phrasedetectives.com

PHRASE DETECTIVES THE TASKS

NAME THE CULPRIT

READINGS

• Mihalcea, R. and Csomai, A. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.

• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.

• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.

• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012.

READINGS

• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.

• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.

  • 807 - TEXT ANALYTICS
  • WIKIPEDIA
  • Slide 3
  • Slide 4
  • WIKIPEDIA FOR TEXT ANALYTICS
  • Wikipedia as Thesaurus for text classification clustering
  • Wikipedia as Thesaurus
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • The concept Artificial Intelligence belongs to four categories: Artificial intelligence, Cybernetics, Formal sciences & Technology in society
  • Slide 13
  • The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
  • WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
  • Using Wikipedia Categories for text classification
  • WIKIPEDIA FOR TEXT CLASSIFICATION
  • USING WIKIPEDIA FOR TEXT CLASSIFICATION
  • TEXT WIKIFICATION
  • WIKIFICATION
  • Wikification pipeline
  • Keyword Extraction
  • Keyword Extraction using Wikipedia
  • Slide 24
  • Slide 25
  • KEYPHRASENESS
  • COMMONNESS
  • Slide 28
  • The Wikipedia Dump from July 2011
  • Some statistics (all Wikidumps from July 2011)
  • Surface forms titles articles
  • Slide 32
  • Slide 33
  • Training a supervised wikifier
  • Results on standard datasets
  • Wikifying queries the Bridgeman datasets
  • Results on Bridgeman 1000 Y3
  • Results for the GALATEAS languages and Arabic
  • The GALATEAS D2W web services
  • Use of the service in LangLog
  • Other applications
  • WIKIPEDIA FOR NER
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • The Supervised Approach Using Wikipedia links to generate training data
  • MORE ADVANCED USES OF WIKIPEDIA
  • SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
  • Wikipedia category network
  • Deriving a taxonomy from Wikipedia (AAAI 2007)
  • Slide 53
  • INFOBOXES
  • Slide 56
  • Slide 57
  • Slide 58
  • SPARQL
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • OPEN MIND COMMONSENSE
  • Slide 65
  • CONCEPT NET
  • GAMES WITH A PURPOSE
  • GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
  • EXAMPLES OF GWAP
  • ESP
  • ESP the game
  • ESP THE GAME
  • THE TASK
  • SCORING BY MATCHING
  • SOME STATISTICS
  • QUALITY OF THE LABELS
  • GOOGLE IMAGE LABELLER
  • Slide 78
  • RESULTS
  • PHRASE DETECTIVES
  • Slide 81
  • NAME THE CULPRIT
  • READINGS
  • Slide 84

WIKIPEDIA FOR NER

http://en.wikipedia.org/wiki/FCC

The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

WIKIPEDIA FOR NER

Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action

WIKIPEDIA

WIKIPEDIA FOR NER

Page 44: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

WIKIPEDIA FOR NER

Numberofglucocorticoidreceptorsinlymphocytesandtheirsensitivitytohormoneaction

WIKIPEDIA

WIKIPEDIA FOR NER

bull Wikipedia has been used in NER systemsndash As a source of features for normal NER ndash To automatically create training materials

(DISTANT LEARNING)ndash To go beyond NE tagging towards proper ENTITY

DISAMBIGUATION

Distant learning

bull Automatically extract examples

bull positive examples from mention-to-link Wikipedia page

bull Negative examples from similar mentions with other links

bull Use positive and negative examples to train model

The Supervised Approach Using Wikipedia links to generate training data

bull Examplendash Giotto was called to work in Padua and also in Rimini (sentence taken from Wikipedia text with links avalable)ndash Giotto_di_Bondone (painter) Giotto_Griffiths (Welsh rugby player)

Giotto_Bizzarrini (automobile engineer)

bull Datasetndash +1 Giotto was called to work -- Giotto_di_Bondonendash -1 Giotto was called to work -- Giotto_Griffithsndash -1 Giotto was called to work -- Giotto_Bizzarrini

May 2012 48Truc-Vien T Nguyen

httpenwikipediaorgwikiGiotto_di_Bondone

MORE ADVANCED USES OF WIKIPEDIA

bull As a source of ONTOLOGICAL KNOWLEDGEbull DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

bull Taxonomic information category structurebull Attributes infobox text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

bull Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

bull Induce a subsumption hierarchy

INFOBOXES

bull Collaborative content

bull Semi-structured data

Infobox Writer| bgcolour = silver| name = Edgar Allan Poe| image = Edgar_Allan_Poe_2jpg| caption = This [[daguerreotype]] of Poe was taken in 1848 | birth_date = birth date|1809|1|19|mf=y| birth_place = [[Boston Massachusetts]] [[United States|US]]| death_date = death date and age|1849|10|07|1809|01|19| death_place = [[Baltimore Maryland]] [[United States|US]]| occupation = Poet short story writer editor literary critic| movement = [[Romanticism]] [[Dark romanticism]]| genre = [[Horror fiction]] [[Crime fiction]] [[Detective fiction]]| magnum_opus = The Raven| spouse = [[Virginia Eliza Clemm Poe]]

DBpediaorg is a effort to bull extract structured information from Wikipediabull make this information available on the Web under an

open licensebull interlink the DBpedia dataset with other datasets on the

Web

DBPEDIA

1048607 1600000 concepts

1048607 including

1048698 58000 persons

1048698 70000 places

1048698 35000 music albums

1048698 12000 films

1048607 described by 91 million triples

1048607 using 8141 different properties

1048607 557000 links to pictures

1048607 1300000 links external web pages

1048607 207000 Wikipedia categories

1048607 75000 YAGO categories

The DBpedia Dataset

The DBpediaorg project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web It uses the SPARQL query language to query this data At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data

REPRESENTING EXTRACTED INFORMATION

httpenwikipediaorgwikiCalgary

httpdbpediaorgresourceCalgary

dbpedianative_name Calgaryrdquo

dbpediaaltitude ldquo1048rdquo

dbpediapopulation_city ldquo988193rdquo

dbpediapopulation_metro ldquo1079310rdquo

mayor_name

dbpediaDave_Bronconnier

governing_body

dbpediaCalgary_City_Council

Extracting Infobox Data (RDF Representation)

SPARQL

bull SPARQL is a query language for RDF

bullRDF is a directed labeled graph data format for representing information in the Web bullThis specification defines the syntax and semantics of the SPARQL query language for RDF

bull SPARQL can be used to express queries across diverse data sources whether the data is stored natively as RDF or viewed as RDF via middleware

1048607 httpdbpediaorgsparql

1048607 hosted on a OpenLink Virtuoso server

1048607 can answer SPARQL queries like

1048698 Give me all Sitcoms that are set in NYC

1048698 All tennis players from Moscow

1048698 All films by Quentin Tarentino

1048698 All German musicians that were born in Berlin in the 19th century

The DBpedia SPARQL Endpoint

bull Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing effortsndash Other initiatives Citizen Science Cognition and

Language Laboratory hellipbull This has been taken advantage of in AI

ndash Open Mind Commonsense (Singh) (collecting facts)

ndash Semantic Wikis

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

wwwphrasedetectivescom

bull Open Mind Common Sense ndash Singh

bull Crater mapping (results) ndash Kanefsky

bull Learner Learner2 1001 Paraphrases ndash Chklovski

bull FACTory ndash CyCORP

bull Hot or Not ndash 8 Days

bull ESP Phetch Verbosity Peekaboom ndash von Ahn

bull Galaxy Zoo ndash Oxford University

WEB COLLABORATION PROJECTS

wwwphrasedetectivescom

OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52000 assertions)

IsA (IsA apple fruit) Part of (PartOf CPU computer) PropertyOf (PropertyOf coffee wet) MadeOf (MadeOf bread flour) DefinedAs (DefinedAs meat flesh of animal)

EVENTS (38000 assertions)

PrerequisiteeventOf (PrerequisiteEventOf read letter open envelope) SubeventOf (SubeventOf play sport score goal) FirstSubeventOF (FirstSubeventOf start fire light match) LastSubeventOf (LastSubeventOf attend classical concert applaud)

AGENTS (104000 assertions)

CapableOf (CapableOf dentist pull tooth)

SPATIAL (36000 assertions)

LocationOf (LocationOf army in war)

TEMPORAL time amp sequence

CAUSAL (17000 assertions)

EffectOf (EffectOf view video entertainment) DesirousEffectOf (DesirousEffectOf sweat take shower)

AFFECTIONAL (mood feeling emotions) (34000 assertions)

DesireOf (DesireOf person not be depressed) MotivationOf (MotivationOf play game compete)

FUNCTIONAL (115000 assertions)

IsUsedFor (UsedFor fireplace burn wood) CapableOfReceivingAction (CapableOfReceivingAction drink serve)

ASSOCIATION K-LINES (125 million assertions)

SuperThematicKLine (SuperThematicKLine western civilization civilization) ThematicKLine (ThematicKLine wedding dress veil) ConceptuallyRelatedTo (ConceptuallyRelatedTo bad breath mint)

CONCEPT NET

GAMES WITH A PURPOSE

bull Luis von Ahn pioneered a new approach to resource creation on the Web GAMES WITH A PURPOSE or GWAP in which people as a side effect of playing perform tasks lsquocomputers are unable to performrsquo (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

bull GWAP do not rely on altruism or financial incentives to entice people to perform certain actions

bull The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

bull Games at wwwgwapcomndash ESPndash Verbosityndash TagATune

bull Other gamesndash Peekaboomndash Phetch

ESP

bull The first GWAP developed by von Ahn and their group (2003 2004)

bull The problem obtain accurate description of images to be usedndash To train image search enginesndash To develop machine learning approaches to vision

bull The goal label the majority of the images on the Web

ESP the game

ESP THE GAMEbull Two partners are picked at random from the

large number of players onlinebull They are not told who their partner is and canrsquot

communicate with thembull They are both shown the same imagebull The goal guess how their partner will describe

the image and type that descriptionndash Hence the ESP game

bull If any of the strings typed by one player matches the string typed by the other player they score points

THE TASK

SCORING BY MATCHING

SOME STATISTICS

bull In the 4 months between August 9th 2003 and December 10th 2003ndash 13630 playersndash 12 million labels for 293760 imagesndash 80 of players played more than once

bull By 2008 ndash 200000 playersndash 50 million labels

QUALITY OF THE LABELSbull For IMAGE SEARCH

ndash choose 10 labels among those produced and look at which images are returned

bull Compare labels produced by players with labels produced by participants in an experimentndash 15 participants 20 images among the 1000 with more

than 5 labelsndash 83 of game labels also produced by participants

bull Manual assessment of labels (lsquowould you use these labels to describe this imagersquo)ndash 15 participants 20 imagesndash 85 of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

wwwphrasedetectivesorg

bull 2 tasks

ndash Find The Culprit (Annotation)User must identify the closest antecedent of a markable if it is anaphoric

ndash Detectives Conference (Validation)User must agreedisagree with a coreference relation entered by another user

wwwphrasedetectivescom

PHRASE DETECTIVES THE TASKS

NAME THE CULPRIT

READINGS

bull Mihalcea R and Csomai A Wikify linking documents to encyclopedic knowledge Proceedings of CIKMrsquo07 Lisbon Portugal

bull V Nguyen amp M Poesio 2012 Entity disambiguation and linking over queries using Encyclopedic Knowledge Proceedings of 6th workshop on Analytics for Noisy Unstructured Text Data

bull D Lungley M Trevisan V Nguyen M Althobaiti M Poesio 2013 GALATEAS D2W A Multi-lingual Disambiguation to Wikipedia Web Service Proc Of ENRICH

bull V Nastaseamp M Strube Transforming Wikipedia into a large scale multilingual concept network Artificial Intelligence 2012

READINGS

bull L von Ahn and L Dabbish (2008) Designing games with a purpose Communications of the ACM v 51 n8 58-67

bull Poesio Chamberlain Kruschwitz Robaldo amp Ducceschi 2013 Phrase Detectives Utilizing Collective Intelligence for Internet-Scale Language Resource Creation ACM Transactions on Intelligent Interactive Systems


WIKIPEDIA

WIKIPEDIA FOR NER

• Wikipedia has been used in NER systems:
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging towards proper ENTITY DISAMBIGUATION

Distant learning

• Automatically extract examples

• Positive examples: a mention paired with the Wikipedia page it links to

• Negative examples: the same mention paired with the other pages that similar mentions link to

• Use positive and negative examples to train a model

The Supervised Approach: Using Wikipedia links to generate training data

• Example:
  – "Giotto was called to work in Padua, and also in Rimini" (sentence taken from Wikipedia text, with links available)
  – Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)

• Dataset:
  – +1  Giotto was called to work -- Giotto_di_Bondone
  – -1  Giotto was called to work -- Giotto_Griffiths
  – -1  Giotto was called to work -- Giotto_Bizzarrini
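The dataset construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CANDIDATES` table is a hypothetical, hard-coded stand-in for a surface-form dictionary built from a Wikipedia dump, the `[[Target|anchor]]` markup is assumed to be the only link syntax present, and the function name is invented for this example.

```python
import re

# Hypothetical lookup from a surface form to candidate Wikipedia titles.
# A real system would build this from a Wikipedia dump; here it is hard-coded.
CANDIDATES = {
    "Giotto": ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
}

# Matches piped wiki links of the form [[Target|anchor text]].
LINK = re.compile(r"\[\[([^\]|]+)\|([^\]]+)\]\]")


def training_examples(wikitext):
    """Yield (label, context, mention, candidate) tuples from linked text."""
    # The context is the sentence with link markup replaced by the anchor text.
    plain = LINK.sub(lambda m: m.group(2), wikitext)
    for match in LINK.finditer(wikitext):
        target, mention = match.group(1), match.group(2)
        # The linked page is the positive example; other candidates are negative.
        for candidate in CANDIDATES.get(mention, [target]):
            label = 1 if candidate == target else -1
            yield label, plain, mention, candidate


sentence = "[[Giotto_di_Bondone|Giotto]] was called to work in Padua, and also in Rimini"
examples = list(training_examples(sentence))
for label, context, mention, candidate in examples:
    print(f"{label:+d}  {context}  --  {candidate}")
```

Each tuple would then be turned into features (for instance, words around the mention compared with the candidate page) before training a classifier; that step is left out here.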

Truc-Vien T. Nguyen, May 2012

http://en.wikipedia.org/wiki/Giotto_di_Bondone

MORE ADVANCED USES OF WIKIPEDIA

• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

• Taxonomic information: category structure
• Attributes: infobox, text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Start with the category tree

• Induce a subsumption hierarchy

INFOBOXES

• Collaborative content

• Semi-structured data

{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
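To make "semi-structured" concrete, the sketch below reads attribute/value pairs out of infobox markup like the Poe example. It is a minimal illustration, not how any particular extractor works: it assumes one `| key = value` field per line and the usual `{{...}}` and `[[...]]` wiki syntax, and the function names are invented for this example.

```python
import re

# A shortened copy of the infobox, assuming standard wiki template syntax.
INFOBOX = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| movement = [[Romanticism]], [[Dark romanticism]]
| magnum_opus = The Raven
}}"""

FIELD = re.compile(r"^\|\s*(\w+)\s*=\s*(.*)$")              # | key = value
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")    # [[Target]] or [[Target|text]]


def parse_infobox(text):
    """Return {attribute: raw value} for one-line '| key = value' fields."""
    fields = {}
    for line in text.splitlines():
        match = FIELD.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def link_targets(value):
    """Return the article titles that a field value links to."""
    return WIKILINK.findall(value)


poe = parse_infobox(INFOBOX)
print(poe["magnum_opus"])
print(link_targets(poe["birth_place"]))
```

Values that are plain strings stay literals, while values containing links point to other articles; this is the distinction that an RDF representation makes between literal and resource objects.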

DBPEDIA

DBpedia.org is an effort to:
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web

The DBpedia Dataset

• 1,600,000 concepts
• including
  – 58,000 persons
  – 70,000 places
  – 35,000 music albums
  – 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories

REPRESENTING EXTRACTED INFORMATION

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At "Developers Guide to Semantic Web Toolkits" you can find a development toolkit in your preferred programming language to process DBpedia data.

Extracting Infobox Data (RDF Representation)

http://en.wikipedia.org/wiki/Calgary

http://dbpedia.org/resource/Calgary
  dbpedia:native_name       "Calgary"
  dbpedia:altitude          "1048"
  dbpedia:population_city   "988193"
  dbpedia:population_metro  "1079310"
  mayor_name                dbpedia:Dave_Bronconnier
  governing_body            dbpedia:Calgary_City_Council
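The Calgary example can be read as a set of subject/predicate/object triples. The sketch below stores them as plain Python tuples and matches them against a pattern with wildcards, which is the basic idea behind querying an RDF graph. It is an illustration only: the `dbpedia:` prefix on `mayor_name` and `governing_body` is an assumption made for uniformity, and `match` is a hypothetical helper, not part of any RDF library.

```python
# Triples transcribed from the Calgary example; prefixes are kept as plain strings.
# The "dbpedia:" prefix on mayor_name and governing_body is assumed here.
TRIPLES = [
    ("dbpedia:Calgary", "dbpedia:native_name", "Calgary"),
    ("dbpedia:Calgary", "dbpedia:altitude", "1048"),
    ("dbpedia:Calgary", "dbpedia:population_city", "988193"),
    ("dbpedia:Calgary", "dbpedia:population_metro", "1079310"),
    ("dbpedia:Calgary", "dbpedia:mayor_name", "dbpedia:Dave_Bronconnier"),
    ("dbpedia:Calgary", "dbpedia:governing_body", "dbpedia:Calgary_City_Council"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


altitude = match(TRIPLES, s="dbpedia:Calgary", p="dbpedia:altitude")
print(altitude)
```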

SPARQL

• SPARQL is a query language for RDF
  – RDF is a directed, labeled graph data format for representing information in the Web
  – This specification defines the syntax and semantics of the SPARQL query language for RDF

• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

The DBpedia SPARQL Endpoint

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like:
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century
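As an illustration, "all films by Quentin Tarantino" might be phrased as the SPARQL query below, wrapped in Python only to build the request URL. The sketch rests on assumptions that are not in the slides: the `dbo:director` property and `dbr:Quentin_Tarantino` resource names follow the current DBpedia vocabulary, and the `format` parameter is a Virtuoso convention for choosing the result format. No network request is made.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:director and dbr:Quentin_Tarantino (not from the slides).
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
"""


def request_url(query, endpoint=ENDPOINT):
    """Build a GET URL that passes the query in the 'query' parameter."""
    params = {
        "query": query.strip(),
        "format": "application/sparql-results+json",  # assumed Virtuoso option
    }
    return endpoint + "?" + urlencode(params)


url = request_url(QUERY)
print(url)
```

Fetching `url` with any HTTP client would, under these assumptions, return one binding of `?film` per matching resource.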

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI:
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis

www.phrasedetectives.com

WEB COLLABORATION PROJECTS

• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – CyCORP
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University

OPEN MIND COMMONSENSE

CONCEPT NET

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52,000 assertions)
  IsA (IsA "apple" "fruit")
  PartOf (PartOf "CPU" "computer")
  PropertyOf (PropertyOf "coffee" "wet")
  MadeOf (MadeOf "bread" "flour")
  DefinedAs (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
  PrerequisiteEventOf (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf (SubeventOf "play sport" "score goal")
  FirstSubeventOf (FirstSubeventOf "start fire" "light match")
  LastSubeventOf (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
  CapableOf (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
  LocationOf (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
  EffectOf (EffectOf "view video" "entertainment")
  DesirousEffectOf (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf (DesireOf "person" "not be depressed")
  MotivationOf (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
  UsedFor (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo (ConceptuallyRelatedTo "bad breath" "mint")
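Assertions of this kind form a labeled graph over concepts. A minimal sketch, assuming each assertion is stored as a (relation, left, right) tuple and indexed by concept in both directions; the index layout and names below are choices made for this example, not the ConceptNet toolkit's API.

```python
from collections import defaultdict

# A few assertions copied from the relation table above.
ASSERTIONS = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("MadeOf", "bread", "flour"),
    ("CapableOf", "dentist", "pull tooth"),
    ("UsedFor", "fireplace", "burn wood"),
    ("ConceptuallyRelatedTo", "bad breath", "mint"),
]


def build_index(assertions):
    """Index assertions by concept so that lookups work from either argument."""
    index = defaultdict(list)
    for relation, left, right in assertions:
        index[left].append((relation, right, "forward"))
        index[right].append((relation, left, "backward"))
    return index


index = build_index(ASSERTIONS)
print(index["apple"])
print(index["mint"])
```

Indexing both directions lets a lookup start from either concept, for example finding "bad breath" when starting from "mint".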

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions

• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com:
  – ESP
  – Verbosity
  – TagATune

• Other games:
  – Peekaboom
  – Phetch

ESP

• The first GWAP, developed by von Ahn and their group (2003, 2004)

• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision

• The goal: label the majority of the images on the Web

ESP: THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
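The matching rule can be sketched as follows. This is a simplified model under stated assumptions, not the game's actual implementation: guesses are replayed in alternating order rather than by timestamp, labels are compared after lower-casing and whitespace normalisation, and both function names are invented for the example.

```python
def normalise(label):
    """Lower-case a typed label and collapse surrounding or repeated spaces."""
    return " ".join(label.lower().split())


def first_match(typed_a, typed_b):
    """Return the first label that both players have typed, or None.

    typed_a and typed_b list each player's guesses in the order typed.
    Guesses are replayed alternately (an assumption; the real game is timed).
    """
    seen_a, seen_b = set(), set()
    for step in range(max(len(typed_a), len(typed_b))):
        for typed, own, other in ((typed_a, seen_a, seen_b),
                                  (typed_b, seen_b, seen_a)):
            if step < len(typed):
                label = normalise(typed[step])
                if label in other:      # partner already typed this string
                    return label
                own.add(label)
    return None


agreed = first_match(["dog", "grass", "Frisbee"], ["puppy", "park", "frisbee ", "dog"])
print(agreed)
```

A match does not require the two players to type the same string at the same moment; it is enough that one player's guess appears anywhere in what the partner has typed so far.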

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th, 2003 and December 10th, 2003:
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once

• By 2008:
  – 200,000 players
  – 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH:
  – choose 10 labels among those produced and look at which images are returned

• Compare labels produced by players with labels produced by participants in an experiment:
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants

• Manual assessment of labels ('would you use these labels to describe this image?'):
  – 15 participants, 20 images
  – 85% of words rated useful
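The second comparison reduces to a set overlap: the share of game labels for an image that experiment participants also produced. A minimal sketch of that computation, using invented toy labels (not data from the study) and a hypothetical function name:

```python
def share_also_produced(game_labels, participant_labels):
    """Fraction of distinct game labels that participants also produced."""
    game = set(game_labels)
    if not game:
        return 0.0
    return len(game & set(participant_labels)) / len(game)


# Toy labels for a single image, invented for illustration only.
game = ["dog", "grass", "frisbee", "park", "jump", "blue"]
participants = ["dog", "frisbee", "park", "grass", "jump", "sky"]

share = share_also_produced(game, participants)
print(f"{share:.0%} of game labels were also produced by participants")
```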

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

PHRASE DETECTIVES: THE TASKS

www.phrasedetectives.org

• 2 tasks:
  – Find The Culprit (Annotation): the user must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): the user must agree/disagree with a coreference relation entered by another user

NAME THE CULPRIT

READINGS

• Mihalcea, R. and Csomai, A. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.

• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using Encyclopedic Knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.

• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.

• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012.

• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.

• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.





QUALITY OF THE LABELSbull For IMAGE SEARCH

ndash choose 10 labels among those produced and look at which images are returned

bull Compare labels produced by players with labels produced by participants in an experimentndash 15 participants 20 images among the 1000 with more

than 5 labelsndash 83 of game labels also produced by participants

bull Manual assessment of labels (lsquowould you use these labels to describe this imagersquo)ndash 15 participants 20 imagesndash 85 of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

wwwphrasedetectivesorg

bull 2 tasks

ndash Find The Culprit (Annotation)User must identify the closest antecedent of a markable if it is anaphoric

ndash Detectives Conference (Validation)User must agreedisagree with a coreference relation entered by another user

wwwphrasedetectivescom

PHRASE DETECTIVES THE TASKS

NAME THE CULPRIT

READINGS

bull Mihalcea R and Csomai A Wikify linking documents to encyclopedic knowledge Proceedings of CIKMrsquo07 Lisbon Portugal

bull V Nguyen amp M Poesio 2012 Entity disambiguation and linking over queries using Encyclopedic Knowledge Proceedings of 6th workshop on Analytics for Noisy Unstructured Text Data

bull D Lungley M Trevisan V Nguyen M Althobaiti M Poesio 2013 GALATEAS D2W A Multi-lingual Disambiguation to Wikipedia Web Service Proc Of ENRICH

bull V Nastaseamp M Strube Transforming Wikipedia into a large scale multilingual concept network Artificial Intelligence 2012

READINGS

bull L von Ahn and L Dabbish (2008) Designing games with a purpose Communications of the ACM v 51 n8 58-67

bull Poesio Chamberlain Kruschwitz Robaldo amp Ducceschi 2013 Phrase Detectives Utilizing Collective Intelligence for Internet-Scale Language Resource Creation ACM Transactions on Intelligent Interactive Systems

  • 807 - TEXT ANALYTICS
  • WIKIPEDIA
  • Slide 3
  • Slide 4
  • WIKIPEDIA FOR TEXT ANALYTICS
  • Wikipedia as Thesaurus for text classification clustering
  • Wikipedia as Thesaurus
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
  • Slide 13
  • The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
  • WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
  • Using Wikipedia Categories for text classification
  • WIKIPEDIA FOR TEXT CLASSIFICATION
  • USING WIKIPEDIA FOR TEXT CLASSIFICATION
  • TEXT WIKIFICATION
  • WIKIFICATION
  • Wikification pipeline
  • Keyword Extraction
  • Keyword Extraction using Wikipedia
  • Slide 24
  • Slide 25
  • KEYPHRASENESS
  • COMMONNESS
  • Slide 28
  • The Wikipedia Dump from July 2011
  • Some statistics (all Wikidumps from July 2011)
  • Surface forms titles articles
  • Slide 32
  • Slide 33
  • Training a supervised wikifier
  • Results on standard datasets
  • Wikifying queries the Bridgeman datasets
  • Results on Bridgeman 1000 Y3
  • Results for the GALATEAS languages and Arabic
  • The GALATEAS D2W web services
  • Use of the service in LangLog
  • Other applications
  • WIKIPEDIA FOR NER
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • The Supervised Approach Using Wikipedia links to generate training data
  • MORE ADVANCED USES OF WIKIPEDIA
  • SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
  • Wikipedia category network
  • Deriving a taxonomy from Wikipedia (AAAI 2007)
  • Slide 53
  • INFOBOXES
  • Slide 56
  • Slide 57
  • Slide 58
  • SPARQL
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • OPEN MIND COMMONSENSE
  • Slide 65
  • CONCEPT NET
  • GAMES WITH A PURPOSE
  • GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
  • EXAMPLES OF GWAP
  • ESP
  • ESP the game
  • ESP THE GAME
  • THE TASK
  • SCORING BY MATCHING
  • SOME STATISTICS
  • QUALITY OF THE LABELS
  • GOOGLE IMAGE LABELLER
  • Slide 78
  • RESULTS
  • PHRASE DETECTIVES
  • Slide 81
  • NAME THE CULPRIT
  • READINGS
  • Slide 84
Page 48: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

The Supervised Approach: Using Wikipedia links to generate training data

• Example
– "Giotto was called to work in Padua, and also in Rimini" (sentence taken from Wikipedia text with links available)
– Candidate articles: Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)

• Dataset
– +1 Giotto was called to work -- Giotto_di_Bondone
– -1 Giotto was called to work -- Giotto_Griffiths
– -1 Giotto was called to work -- Giotto_Bizzarrini
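The dataset construction above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `make_instances` and the representation of an instance as a (label, context, candidate) tuple are assumptions made for the example.

```python
def make_instances(context, linked_title, candidates):
    """Build one training instance per candidate article.

    The article the Wikipedia link actually points to gets label +1,
    every other candidate for the same surface form gets label -1.
    """
    return [
        (+1 if cand == linked_title else -1, context, cand)
        for cand in candidates
    ]


# Candidate articles for the surface form "Giotto" (taken from the slide).
candidates = ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"]

instances = make_instances(
    context="Giotto was called to work",
    linked_title="Giotto_di_Bondone",
    candidates=candidates,
)

for label, context, cand in instances:
    print(f"{label:+d} {context} -- {cand}")
```

Each linked mention in Wikipedia therefore yields one positive and several negative examples without any manual annotation.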

Truc-Vien T. Nguyen, May 2012 (slide 48)

http://en.wikipedia.org/wiki/Giotto_di_Bondone

MORE ADVANCED USES OF WIKIPEDIA

• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

• Taxonomic information: category structure
• Attributes: infobox, text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

• Induce a subsumption hierarchy

INFOBOXES

• Collaborative content

• Semi-structured data

{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
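To show why infoboxes count as semi-structured data, here is a minimal Python sketch that turns infobox wikitext into attribute-value pairs. It assumes one `| key = value` field per line and only simplifies `[[...]]` links; the names `parse_infobox`, `LINK` and `SAMPLE` are made up for the example, and real extractors (such as the DBpedia one) handle nested templates and many other cases.

```python
import re

# A shortened copy of the Poe infobox, one field per line (assumed layout).
SAMPLE = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| magnum_opus = The Raven
}}"""

# [[Target]] -> Target, [[Target|Label]] -> Label
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def parse_infobox(wikitext):
    """Return a dict of attribute -> value for lines of the form '| key = value'."""
    fields = {}
    for line in wikitext.splitlines():
        line = line.strip()
        if not line.startswith("|") or "=" not in line:
            continue  # skip the template name and the closing braces
        key, value = line[1:].split("=", 1)
        fields[key.strip()] = LINK.sub(r"\1", value).strip()
    return fields


infobox = parse_infobox(SAMPLE)
print(infobox["name"])
print(infobox["birth_place"])
```

The attribute names come straight from the template, which is what makes them usable as candidate properties in a knowledge base.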

DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web

DBPEDIA

• 1,600,000 concepts
• including
– 58,000 persons
– 70,000 places
– 35,000 music albums
– 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories

The DBpedia Dataset

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. The Developers Guide to Semantic Web Toolkits lists development toolkits for processing DBpedia data in your preferred programming language.

REPRESENTING EXTRACTED INFORMATION

http://en.wikipedia.org/wiki/Calgary

http://dbpedia.org/resource/Calgary

dbpedia:native_name "Calgary"
dbpedia:altitude "1048"
dbpedia:population_city "988193"
dbpedia:population_metro "1079310"
mayor_name dbpedia:Dave_Bronconnier
governing_body dbpedia:Calgary_City_Council

Extracting Infobox Data (RDF Representation)
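The RDF view of the Calgary infobox can be pictured as a list of (subject, predicate, object) triples. The Python sketch below is only an illustration of that data model: the plain-tuple representation and the helpers `objects` and `is_resource` are assumptions for the example, not part of DBpedia or of any RDF library.

```python
DBR = "http://dbpedia.org/resource/"  # resource namespace shown on the slide

CALGARY = DBR + "Calgary"

# Each infobox attribute becomes one triple about the Calgary resource.
# Objects are either literals (plain strings) or links to other resources.
triples = [
    (CALGARY, "native_name", "Calgary"),
    (CALGARY, "altitude", "1048"),
    (CALGARY, "population_city", "988193"),
    (CALGARY, "population_metro", "1079310"),
    (CALGARY, "mayor_name", DBR + "Dave_Bronconnier"),
    (CALGARY, "governing_body", DBR + "Calgary_City_Council"),
]


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def is_resource(value):
    """True if the object is a link to another node rather than a literal."""
    return value.startswith("http://")


print(objects(CALGARY, "altitude"))
print(objects(CALGARY, "mayor_name"))
```

Because some objects are themselves resources, the triples form a directed labeled graph rather than a flat table.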

SPARQL

• SPARQL is a query language for RDF

• RDF is a directed, labeled graph data format for representing information in the Web
• The SPARQL specification defines the syntax and semantics of the SPARQL query language for RDF

• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
– Give me all sitcoms that are set in NYC
– All tennis players from Moscow
– All films by Quentin Tarantino
– All German musicians who were born in Berlin in the 19th century

The DBpedia SPARQL Endpoint
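As a rough illustration, the query "all films by Quentin Tarantino" might be written as below and packaged into a request URL for the endpoint. The sketch only builds the URL and sends nothing over the network. The property and class names (`dbo:director`, `dbo:Film`) and the `format` parameter are assumptions about the current public DBpedia service and may not match the dataset version described on the slides; the helper names are hypothetical.

```python
from urllib.parse import urlencode, urlparse, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint named on the slide

# Assumed vocabulary: dbo:Film and dbo:director from the DBpedia ontology.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film a dbo:Film ;
        dbo:director dbr:Quentin_Tarantino .
}
"""


def build_request_url(query, endpoint=ENDPOINT):
    """Encode a SPARQL query as a GET request URL (nothing is sent)."""
    params = {
        "query": query.strip(),
        # Assumed result format parameter; JSON results are easy to parse.
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urlencode(params)


def decode_query(url):
    """Recover the SPARQL text from a request URL, e.g. for logging."""
    return parse_qs(urlparse(url).query)["query"][0]


url = build_request_url(QUERY)
print(url)
```

Fetching `url` with any HTTP client would then return the variable bindings for `?film`, provided the service and vocabulary are as assumed.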

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
– Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
– Open Mind Commonsense (Singh) (collecting facts)
– Semantic Wikis

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

www.phrasedetectives.com

• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – Cycorp
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University

WEB COLLABORATION PROJECTS


OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52,000 assertions)
– IsA (IsA "apple" "fruit")
– PartOf (PartOf "CPU" "computer")
– PropertyOf (PropertyOf "coffee" "wet")
– MadeOf (MadeOf "bread" "flour")
– DefinedAs (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
– PrerequisiteEventOf (PrerequisiteEventOf "read letter" "open envelope")
– SubeventOf (SubeventOf "play sport" "score goal")
– FirstSubeventOf (FirstSubeventOf "start fire" "light match")
– LastSubeventOf (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
– CapableOf (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
– LocationOf (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
– EffectOf (EffectOf "view video" "entertainment")
– DesirousEffectOf (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
– DesireOf (DesireOf "person" "not be depressed")
– MotivationOf (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
– UsedFor (UsedFor "fireplace" "burn wood")
– CapableOfReceivingAction (CapableOfReceivingAction "drink" "serve")

ASSOCIATION: K-LINES (1.25 million assertions)
– SuperThematicKLine (SuperThematicKLine "western civilization" "civilization")
– ThematicKLine (ThematicKLine "wedding dress" "veil")
– ConceptuallyRelatedTo (ConceptuallyRelatedTo "bad breath" "mint")
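Assertions of this kind are naturally stored as (relation, concept, concept) triples and indexed by concept, which gives a small semantic network. The Python sketch below uses a handful of the example assertions listed above; the index layout and the `related` helper are assumptions for illustration and do not reflect the actual ConceptNet toolkit.

```python
from collections import defaultdict

# A few of the example assertions, as (relation, left concept, right concept).
assertions = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("CapableOf", "dentist", "pull tooth"),
    ("EffectOf", "view video", "entertainment"),
    ("ConceptuallyRelatedTo", "bad breath", "mint"),
]

# Index outgoing edges by their left-hand concept.
by_concept = defaultdict(list)
for relation, left, right in assertions:
    by_concept[left].append((relation, right))


def related(concept):
    """Outgoing (relation, concept) edges for a concept, or [] if unknown."""
    return by_concept.get(concept, [])


print(related("apple"))
print(related("dentist"))
```

With over a million assertions, the same lookup lets a text-analytics system retrieve the commonsense context of any concept mentioned in a document.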

CONCEPT NET

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions

• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
– ESP
– Verbosity
– TagATune

• Other games
– Peekaboom
– Phetch

ESP

• The first GWAP, developed by von Ahn and his group (2003, 2004)

• The problem: obtain accurate descriptions of images, to be used
– To train image search engines
– To develop machine learning approaches to vision

• The goal: label the majority of the images on the Web

ESP the game

ESP: THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can’t communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
– Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
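The matching rule can be sketched as follows. This is a simplified offline version written for illustration: the real game compares guesses as they are typed in real time, and the case and whitespace normalisation as well as the name `first_agreement` are assumptions, not details given on the slides.

```python
def first_agreement(guesses_a, guesses_b):
    """Return the first string typed by player A that player B also typed.

    Guesses are compared after lowercasing and trimming whitespace
    (an assumed normalisation). Returns None if the players never agree.
    """
    typed_by_b = {guess.strip().lower() for guess in guesses_b}
    for guess in guesses_a:
        normalised = guess.strip().lower()
        if normalised in typed_by_b:
            return normalised
    return None


# Hypothetical guesses from two players looking at the same image.
player_a = ["car", "Road", "tree"]
player_b = ["sky", "road", "clouds"]

label = first_agreement(player_a, player_b)
print(label)
```

The agreed string becomes a label for the image; since the two players cannot communicate, agreement is evidence that the label really describes what is shown.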

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th 2003 and December 10th 2003
– 13,630 players
– 1.2 million labels for 293,760 images
– 80% of players played more than once

• By 2008
– 200,000 players
– 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
– choose 10 labels among those produced and look at which images are returned

• Compare labels produced by players with labels produced by participants in an experiment
– 15 participants, 20 images among the 1000 with more than 5 labels
– 83% of game labels also produced by participants

• Manual assessment of labels (‘would you use these labels to describe this image?’)
– 15 participants, 20 images
– 85% of words rated useful
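The comparison between game labels and participant labels amounts to measuring overlap between two label sets. The Python sketch below shows one way such a percentage could be computed; the labels are invented for the example and the function `overlap_rate` is hypothetical, so the result is not a reproduction of the reported figures.

```python
def overlap_rate(game_labels, participant_labels):
    """Fraction of game labels that participants also produced."""
    game = {label.lower() for label in game_labels}
    human = {label.lower() for label in participant_labels}
    if not game:
        return 0.0
    return len(game & human) / len(game)


# Hypothetical labels for a single image.
game_labels = ["dog", "grass", "ball", "park", "sun", "leash"]
participant_labels = ["dog", "ball", "grass", "park", "sun", "man"]

rate = overlap_rate(game_labels, participant_labels)
print(f"{rate:.0%} of game labels were also produced by participants")
```

Averaging this rate over the sampled images gives a single quality figure for the labels the game produces.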

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

• 2 tasks
– Find The Culprit (Annotation): User must identify the closest antecedent of a markable if it is anaphoric
– Detectives Conference (Validation): User must agree/disagree with a coreference relation entered by another user


PHRASE DETECTIVES: THE TASKS

NAME THE CULPRIT

READINGS

• R. Mihalcea and A. Csomai. 2007. Wikify!: Linking documents to encyclopedic knowledge. Proceedings of CIKM ’07, Lisbon, Portugal.

• V. Nguyen and M. Poesio. 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.

• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti and M. Poesio. 2013. GALATEAS D2W: A multi-lingual disambiguation to Wikipedia web service. Proc. of ENRICH.

• V. Nastase and M. Strube. 2012. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.

READINGS

• L. von Ahn and L. Dabbish. 2008. Designing games with a purpose. Communications of the ACM, 51(8), 58–67.

• Poesio, Chamberlain, Kruschwitz, Robaldo and Ducceschi. 2013. Phrase Detectives: Utilizing collective intelligence for Internet-scale language resource creation. ACM Transactions on Interactive Intelligent Systems.

Page 50: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA

bull Taxonomic information category structurebull Attributes infobox text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007)

bull Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007)

bull Induce a subsumption hierarchy

INFOBOXES

bull Collaborative content

bull Semi-structured data

Infobox Writer| bgcolour = silver| name = Edgar Allan Poe| image = Edgar_Allan_Poe_2jpg| caption = This [[daguerreotype]] of Poe was taken in 1848 | birth_date = birth date|1809|1|19|mf=y| birth_place = [[Boston Massachusetts]] [[United States|US]]| death_date = death date and age|1849|10|07|1809|01|19| death_place = [[Baltimore Maryland]] [[United States|US]]| occupation = Poet short story writer editor literary critic| movement = [[Romanticism]] [[Dark romanticism]]| genre = [[Horror fiction]] [[Crime fiction]] [[Detective fiction]]| magnum_opus = The Raven| spouse = [[Virginia Eliza Clemm Poe]]

DBpediaorg is a effort to bull extract structured information from Wikipediabull make this information available on the Web under an

open licensebull interlink the DBpedia dataset with other datasets on the

Web

DBPEDIA

1048607 1600000 concepts

1048607 including

1048698 58000 persons

1048698 70000 places

1048698 35000 music albums

1048698 12000 films

1048607 described by 91 million triples

1048607 using 8141 different properties

1048607 557000 links to pictures

1048607 1300000 links external web pages

1048607 207000 Wikipedia categories

1048607 75000 YAGO categories

The DBpedia Dataset

The DBpediaorg project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web It uses the SPARQL query language to query this data At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data

REPRESENTING EXTRACTED INFORMATION

httpenwikipediaorgwikiCalgary

httpdbpediaorgresourceCalgary

dbpedianative_name Calgaryrdquo

dbpediaaltitude ldquo1048rdquo

dbpediapopulation_city ldquo988193rdquo

dbpediapopulation_metro ldquo1079310rdquo

mayor_name

dbpediaDave_Bronconnier

governing_body

dbpediaCalgary_City_Council

Extracting Infobox Data (RDF Representation)

SPARQL

bull SPARQL is a query language for RDF

bullRDF is a directed labeled graph data format for representing information in the Web bullThis specification defines the syntax and semantics of the SPARQL query language for RDF

bull SPARQL can be used to express queries across diverse data sources whether the data is stored natively as RDF or viewed as RDF via middleware

1048607 httpdbpediaorgsparql

1048607 hosted on a OpenLink Virtuoso server

1048607 can answer SPARQL queries like

1048698 Give me all Sitcoms that are set in NYC

1048698 All tennis players from Moscow

1048698 All films by Quentin Tarentino

1048698 All German musicians that were born in Berlin in the 19th century

The DBpedia SPARQL Endpoint

bull Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing effortsndash Other initiatives Citizen Science Cognition and

Language Laboratory hellipbull This has been taken advantage of in AI

ndash Open Mind Commonsense (Singh) (collecting facts)

ndash Semantic Wikis

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

wwwphrasedetectivescom

• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – Cycorp
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University

WEB COLLABORATION PROJECTS


OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52,000 assertions)
• IsA: (IsA "apple" "fruit")
• PartOf: (PartOf "CPU" "computer")
• PropertyOf: (PropertyOf "coffee" "wet")
• MadeOf: (MadeOf "bread" "flour")
• DefinedAs: (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
• PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
• SubeventOf: (SubeventOf "play sport" "score goal")
• FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
• LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
• CapableOf: (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
• LocationOf: (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
• EffectOf: (EffectOf "view video" "entertainment")
• DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
• DesireOf: (DesireOf "person" "not be depressed")
• MotivationOf: (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
• UsedFor: (UsedFor "fireplace" "burn wood")
• CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
• SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
• ThematicKLine: (ThematicKLine "wedding dress" "veil")
• ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")

CONCEPT NET
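Each ConceptNet assertion is a relation name plus two concepts, so a small sample can be held as tuples and indexed by concept. The sketch below is a toy illustration, not ConceptNet's actual storage or API; the argument boundaries in multi-word examples and the helper `related` are assumptions.

```python
from collections import defaultdict

# A few assertions from the relation table, as (relation, concept1, concept2).
# The split of multi-word arguments is an assumed reading of the examples.
assertions = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("MadeOf", "bread", "flour"),
    ("CapableOf", "dentist", "pull tooth"),
    ("UsedFor", "fireplace", "burn wood"),
    ("EffectOf", "view video", "entertainment"),
]

# Index assertions by their first concept for quick lookup.
by_concept = defaultdict(list)
for relation, left, right in assertions:
    by_concept[left].append((relation, right))


def related(concept, relation=None):
    """Return (relation, concept) pairs leaving `concept`, optionally filtered."""
    return [(r, c) for r, c in by_concept.get(concept, [])
            if relation is None or r == relation]
```

For example, `related("dentist", "CapableOf")` looks up what a dentist is capable of in this toy sample.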

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch

ESP

• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web

ESP the game

ESP THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is and can’t communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game
• If any of the strings typed by one player matches a string typed by the other player, they score points
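The matching rule can be sketched as a check for any label both players have typed. This is a simplified illustration and not the game's actual implementation: it ignores timing and scoring, and the normalization step (trimming and lowercasing) and the function names are assumptions.

```python
def normalize(label):
    """Lowercase and trim a typed label (an assumed normalization)."""
    return label.strip().lower()


def first_agreement(labels_a, labels_b):
    """Return the first label typed by player A that player B also typed, or None."""
    typed_by_b = {normalize(b) for b in labels_b}
    for a in labels_a:
        if normalize(a) in typed_by_b:
            return normalize(a)
    return None
```

Because neither player can see what the other types, an agreed label is more likely to describe the image than to reflect one player's idiosyncratic choice.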

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th 2003 and December 10th 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once
• By 2008
  – 200,000 players
  – 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants
• Manual assessment of labels (‘would you use these labels to describe this image?’)
  – 15 participants, 20 images
  – 85% of words rated useful
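The comparison between game labels and participant labels amounts to an overlap percentage over the set of game labels for an image. A minimal sketch, with a hypothetical function name and made-up labels rather than data from the study:

```python
def overlap_percentage(game_labels, participant_labels):
    """Percentage of distinct game labels that participants also produced."""
    game = set(game_labels)
    if not game:
        return 0.0
    shared = game & set(participant_labels)
    return 100.0 * len(shared) / len(game)


# Toy example: three of the four game labels were also given by participants.
score = overlap_percentage(
    ["dog", "grass", "ball", "sky"],
    ["dog", "ball", "sky", "tree"],
)
```

Here `score` is 75.0; the reported figure would come from the same kind of ratio computed over the images in the experiment.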

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

• 2 tasks
  – Find The Culprit (Annotation): User must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): User must agree/disagree with a coreference relation entered by another user

www.phrasedetectives.com

PHRASE DETECTIVES THE TASKS

NAME THE CULPRIT

READINGS

• Mihalcea, R. and Csomai, A. Wikify! Linking documents to encyclopedic knowledge. Proceedings of CIKM’07, Lisbon, Portugal.
• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012.

READINGS

• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.


EXAMPLES OF GWAP

bull Games at wwwgwapcomndash ESPndash Verbosityndash TagATune

bull Other gamesndash Peekaboomndash Phetch

ESP

bull The first GWAP developed by von Ahn and their group (2003 2004)

bull The problem obtain accurate description of images to be usedndash To train image search enginesndash To develop machine learning approaches to vision

bull The goal label the majority of the images on the Web

ESP the game

ESP THE GAMEbull Two partners are picked at random from the

large number of players onlinebull They are not told who their partner is and canrsquot

communicate with thembull They are both shown the same imagebull The goal guess how their partner will describe

the image and type that descriptionndash Hence the ESP game

bull If any of the strings typed by one player matches the string typed by the other player they score points

THE TASK

SCORING BY MATCHING

SOME STATISTICS

bull In the 4 months between August 9th 2003 and December 10th 2003ndash 13630 playersndash 12 million labels for 293760 imagesndash 80 of players played more than once

bull By 2008 ndash 200000 playersndash 50 million labels

QUALITY OF THE LABELSbull For IMAGE SEARCH

ndash choose 10 labels among those produced and look at which images are returned

bull Compare labels produced by players with labels produced by participants in an experimentndash 15 participants 20 images among the 1000 with more

than 5 labelsndash 83 of game labels also produced by participants

bull Manual assessment of labels (lsquowould you use these labels to describe this imagersquo)ndash 15 participants 20 imagesndash 85 of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

wwwphrasedetectivesorg

bull 2 tasks

ndash Find The Culprit (Annotation)User must identify the closest antecedent of a markable if it is anaphoric

ndash Detectives Conference (Validation)User must agreedisagree with a coreference relation entered by another user

wwwphrasedetectivescom

PHRASE DETECTIVES THE TASKS

NAME THE CULPRIT

READINGS

bull Mihalcea R and Csomai A Wikify linking documents to encyclopedic knowledge Proceedings of CIKMrsquo07 Lisbon Portugal

bull V Nguyen amp M Poesio 2012 Entity disambiguation and linking over queries using Encyclopedic Knowledge Proceedings of 6th workshop on Analytics for Noisy Unstructured Text Data

bull D Lungley M Trevisan V Nguyen M Althobaiti M Poesio 2013 GALATEAS D2W A Multi-lingual Disambiguation to Wikipedia Web Service Proc Of ENRICH

bull V Nastaseamp M Strube Transforming Wikipedia into a large scale multilingual concept network Artificial Intelligence 2012

READINGS

bull L von Ahn and L Dabbish (2008) Designing games with a purpose Communications of the ACM v 51 n8 58-67

bull Poesio Chamberlain Kruschwitz Robaldo amp Ducceschi 2013 Phrase Detectives Utilizing Collective Intelligence for Internet-Scale Language Resource Creation ACM Transactions on Intelligent Interactive Systems

  • 807 - TEXT ANALYTICS
  • WIKIPEDIA
  • Slide 3
  • Slide 4
  • WIKIPEDIA FOR TEXT ANALYTICS
  • Wikipedia as Thesaurus for text classification clustering
  • Wikipedia as Thesaurus
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
  • Slide 13
  • The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
  • WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
  • Using Wikipedia Categories for text classification
  • WIKIPEDIA FOR TEXT CLASSIFICATION
  • USING WIKIPEDIA FOR TEXT CLASSIFICATION
  • TEXT WIKIFICATION
  • WIKIFICATION
  • Wikification pipeline
  • Keyword Extraction
  • Keyword Extraction using Wikipedia
  • Slide 24
  • Slide 25
  • KEYPHRASENESS
  • COMMONNESS
  • Slide 28
  • The Wikipedia Dump from July 2011
  • Some statistics (all Wikidumps from July 2011)
  • Surface forms titles articles
  • Slide 32
  • Slide 33
  • Training a supervised wikifier
  • Results on standard datasets
  • Wikifying queries the Bridgeman datasets
  • Results on Bridgeman 1000 Y3
  • Results for the GALATEAS languages and Arabic
  • The GALATEAS D2W web services
  • Use of the service in LangLog
  • Other applications
  • WIKIPEDIA FOR NER
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • The Supervised Approach Using Wikipedia links to generate training data
  • MORE ADVANCED USES OF WIKIPEDIA
  • SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
  • Wikipedia category network
  • Deriving a taxonomy from Wikipedia (AAAI 2007)
  • Slide 53
  • INFOBOXES
  • Slide 56
  • Slide 57
  • Slide 58
  • SPARQL
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • OPEN MIND COMMONSENSE
  • Slide 65
  • CONCEPT NET
  • GAMES WITH A PURPOSE
  • GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
  • EXAMPLES OF GWAP
  • ESP
  • ESP the game
  • ESP THE GAME
  • THE TASK
  • SCORING BY MATCHING
  • SOME STATISTICS
  • QUALITY OF THE LABELS
  • GOOGLE IMAGE LABELLER
  • Slide 78
  • RESULTS
  • PHRASE DETECTIVES
  • Slide 81
  • NAME THE CULPRIT
  • READINGS
  • Slide 84
Page 54: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

INFOBOXES

• Collaborative content
• Semi-structured data

{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
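To make "semi-structured" concrete, here is a minimal Python sketch that reads infobox fields into a dictionary and turns them into (subject, property, value) triples. It assumes one field per line, and the function names and the subject identifier are hypothetical. It is not the actual DBpedia extraction code, which has to cope with nested templates, units and many layout variants.

```python
import re

# Sample infobox text, one field per line (values taken from the Poe example).
INFOBOX = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| magnum_opus = The Raven
}}"""

# Matches [[Target]] and [[Target|label]], capturing only the target title.
LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def parse_infobox(text):
    """Return a dict mapping field names to their raw values."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|") or "=" not in line:
            continue  # skip the template header and the closing braces
        key, value = line[1:].split("=", 1)
        fields[key.strip()] = value.strip()
    return fields


def link_targets(value):
    """Return the page titles that a field value links to."""
    return [target.strip() for target in LINK.findall(value)]


def to_triples(subject, fields):
    """Turn each field into one (subject, property, value) triple."""
    return [(subject, key, value) for key, value in fields.items()]


fields = parse_infobox(INFOBOX)
triples = to_triples("Edgar_Allan_Poe", fields)
```

Splitting on the first "=" only keeps values such as "mf=y" intact; the wiki links inside a value are what later become links between resources.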

DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web

DBPEDIA

• 1,600,000 concepts
• including
  – 58,000 persons
  – 70,000 places
  – 35,000 music albums
  – 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories

The DBpedia Dataset

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. The Developers Guide to Semantic Web Toolkits lists development toolkits for processing DBpedia data in your preferred programming language.

REPRESENTING EXTRACTED INFORMATION

http://en.wikipedia.org/wiki/Calgary

http://dbpedia.org/resource/Calgary

dbpedia:native_name "Calgary"
dbpedia:altitude "1048"
dbpedia:population_city "988193"
dbpedia:population_metro "1079310"
mayor_name dbpedia:Dave_Bronconnier
governing_body dbpedia:Calgary_City_Council

Extracting Infobox Data (RDF Representation)
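The Calgary graph can be written down as plain triples. The Python sketch below is illustrative only: it stores each edge as a tuple and prints it in N-Triples style. The property namespace `http://dbpedia.org/property/` is an assumption about what the `dbpedia:` prefix on the slide expands to, and `to_ntriples` is a hypothetical helper.

```python
RESOURCE = "http://dbpedia.org/resource/"
PROPERTY = "http://dbpedia.org/property/"  # assumed namespace for infobox properties

calgary = RESOURCE + "Calgary"

# (subject, predicate, object, object_is_resource)
TRIPLES = [
    (calgary, PROPERTY + "native_name", "Calgary", False),
    (calgary, PROPERTY + "altitude", "1048", False),
    (calgary, PROPERTY + "population_city", "988193", False),
    (calgary, PROPERTY + "population_metro", "1079310", False),
    (calgary, PROPERTY + "mayor_name", RESOURCE + "Dave_Bronconnier", True),
    (calgary, PROPERTY + "governing_body", RESOURCE + "Calgary_City_Council", True),
]


def to_ntriples(triples):
    """Serialise triples one per line: <subject> <predicate> object ."""
    lines = []
    for subject, predicate, obj, is_resource in triples:
        if is_resource:
            rendered = "<" + obj + ">"
        else:
            # Escape backslashes and double quotes inside literals.
            escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
            rendered = '"' + escaped + '"'
        lines.append("<" + subject + "> <" + predicate + "> " + rendered + " .")
    return "\n".join(lines)


ntriples = to_ntriples(TRIPLES)
```

Literal values (the altitude, the populations) end a path in the graph, whereas resource values such as the mayor are nodes that can carry triples of their own.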

SPARQL

• SPARQL is a query language for RDF
• RDF is a directed, labeled graph data format for representing information in the Web
• The SPARQL specification defines the syntax and semantics of the query language
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
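The core idea behind a SPARQL query is the triple pattern: a triple in which some positions are variables. The following Python sketch matches one such pattern against a small list of triples. It is a toy illustration of the idea, not an implementation of SPARQL, and the function name `match` is hypothetical.

```python
TRIPLES = [
    ("dbpedia:Calgary", "native_name", "Calgary"),
    ("dbpedia:Calgary", "altitude", "1048"),
    ("dbpedia:Calgary", "mayor_name", "dbpedia:Dave_Bronconnier"),
    ("dbpedia:Calgary", "governing_body", "dbpedia:Calgary_City_Council"),
]


def match(pattern, triples):
    """Yield one variable binding (a dict) per triple that fits the pattern.

    Pattern terms starting with '?' are variables; all other terms must be
    equal to the corresponding term of the triple.
    """
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice must take the same value both times.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding


mayors = list(match(("?city", "mayor_name", "?mayor"), TRIPLES))
about_calgary = list(match(("dbpedia:Calgary", "?property", "?value"), TRIPLES))
```

A full query combines several patterns that share variables, which is how a question like "musicians born in Berlin" joins a type constraint with a birthplace constraint.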

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century

The DBpedia SPARQL Endpoint
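As a sketch of how one of these questions might be sent to the endpoint, the Python below builds a GET request URL for "all films by Quentin Tarantino". The property `dbo:director` and the resource name `dbr:Quentin_Tarantino` are assumptions about how the current DBpedia data models films and may not match the dataset described on these slides. The `format` parameter is likewise assumed to be accepted by the server, and `build_request_url` is a hypothetical helper. No request is actually sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:director links a film to its director.
QUERY = """PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}"""


def build_request_url(endpoint, query):
    """Encode a SPARQL query as an HTTP GET URL using the 'query' parameter."""
    params = {
        "query": query,
        # Assumed to be honoured by the server; asks for JSON results.
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urlencode(params)


url = build_request_url(ENDPOINT, QUERY)
```

The resulting URL could then be fetched with any HTTP client, for example `urllib.request.urlopen(url)`, and the response parsed as JSON.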

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

www.phrasedetectives.com

• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – Cycorp
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University

WEB COLLABORATION PROJECTS

OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52,000 assertions)
  – IsA: (IsA "apple" "fruit")
  – PartOf: (PartOf "CPU" "computer")
  – PropertyOf: (PropertyOf "coffee" "wet")
  – MadeOf: (MadeOf "bread" "flour")
  – DefinedAs: (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
  – PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  – SubeventOf: (SubeventOf "play sport" "score goal")
  – FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  – LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
  – CapableOf: (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
  – LocationOf: (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
  – EffectOf: (EffectOf "view video" "entertainment")
  – DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  – DesireOf: (DesireOf "person" "not be depressed")
  – MotivationOf: (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
  – IsUsedFor: (UsedFor "fireplace" "burn wood")
  – CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
  – SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  – ThematicKLine: (ThematicKLine "wedding dress" "veil")
  – ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
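Each assertion in the table is a relation name applied to two concepts, so a handful of them can be held in a simple index keyed by the first concept. The Python sketch below is only an illustration of that data shape; `build_index` is a hypothetical helper and this is not the ConceptNet toolkit's API.

```python
from collections import defaultdict

# A few assertions from the table, as (relation, first concept, second concept).
ASSERTIONS = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("MadeOf", "bread", "flour"),
    ("CapableOf", "dentist", "pull tooth"),
    ("EffectOf", "view video", "entertainment"),
    ("ThematicKLine", "wedding dress", "veil"),
]


def build_index(assertions):
    """Map each first concept to the (relation, second concept) pairs about it."""
    index = defaultdict(list)
    for relation, first, second in assertions:
        index[first].append((relation, second))
    return index


index = build_index(ASSERTIONS)
about_dentist = index["dentist"]
```

Indexing the second concept in the same way would turn the assertions into a graph that can be walked in both directions, which is what makes it a semantic network.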

CONCEPT NET

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch

ESP

• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web

ESP the game

ESP THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
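The matching rule above can be sketched in a few lines of Python. The normalisation step (lower-casing and collapsing whitespace) and the point value are assumptions made for illustration, and the function names are hypothetical. The real game has further rules that this transcript does not describe.

```python
POINTS_PER_MATCH = 100  # hypothetical value, not taken from the game


def normalise(label):
    """Lower-case a label and collapse runs of whitespace (an assumed rule)."""
    return " ".join(label.lower().split())


def first_match(guesses_a, guesses_b):
    """Return the first label typed by player A that player B also typed."""
    typed_by_b = {normalise(guess) for guess in guesses_b}
    for guess in guesses_a:
        if normalise(guess) in typed_by_b:
            return normalise(guess)
    return None


def score_round(guesses_a, guesses_b):
    """Return the agreed label (or None) and the points both players earn."""
    label = first_match(guesses_a, guesses_b)
    return label, (POINTS_PER_MATCH if label is not None else 0)


agreed = score_round(["dog", "Puppy", "grass"], ["animal", "puppy "])
missed = score_round(["cat"], ["dog"])
```

Because neither player sees the other's guesses, the only way to score is to type something the image really suggests, which is why an agreed label is likely to be a good description.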

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th 2003 and December 10th 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once
• By 2008
  – 200,000 players
  – 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  – 15 participants, 20 images
  – 85% of words rated useful
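A figure like "83% of game labels also produced by participants" is an overlap measure. The Python sketch below shows one plausible way to compute it for a single image. The slides do not say exactly how the published figure was aggregated, so the function, its name and the example labels are all illustrative assumptions.

```python
def percent_also_produced(game_labels, participant_labels):
    """Percentage of distinct game labels that participants also produced."""
    game = set(game_labels)
    if not game:
        return 0.0
    shared = game & set(participant_labels)
    return 100.0 * len(shared) / len(game)


# Made-up labels for one image, for illustration only.
overlap = percent_also_produced(
    ["dog", "grass", "ball", "sky"],
    ["dog", "ball", "grass", "tree"],
)
```

Note that the measure is asymmetric: it asks how many game labels are confirmed by participants, not how many participant labels the game recovered.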

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

• 2 tasks
  – Find The Culprit (Annotation): User must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): User must agree/disagree with a coreference relation entered by another user
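The two tasks suggest a simple record: an annotation proposed by one player, plus the agree/disagree judgements it collects from others. The Python sketch below is a hypothetical data structure for that idea. The class, its fields and the "net support" score are assumptions, and the slides do not say how the game actually aggregates judgements.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Annotation:
    """One player's answer for a markable, plus other players' validations."""
    markable: str
    antecedent: Optional[str]  # None means "judged not anaphoric"
    votes: Counter = field(default_factory=Counter)

    def validate(self, agrees):
        """Record one agree/disagree judgement from another player."""
        self.votes["agree" if agrees else "disagree"] += 1

    def net_support(self):
        """Agreements minus disagreements (a hypothetical aggregate)."""
        return self.votes["agree"] - self.votes["disagree"]


annotation = Annotation(markable="it", antecedent="the car")
for judgement in (True, True, False):
    annotation.validate(judgement)
support = annotation.net_support()
```

Keeping validations separate from the annotation itself means a contested answer can be shown to more players before it is accepted or discarded.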

www.phrasedetectives.com

PHRASE DETECTIVES: THE TASKS

NAME THE CULPRIT

READINGS

• R. Mihalcea and A. Csomai. Wikify! Linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen & M. Poesio. 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio. 2013. GALATEAS D2W: A multi-lingual Disambiguation to Wikipedia web service. Proc. of ENRICH.
• V. Nastase & M. Strube. 2012. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.

READINGS

• L. von Ahn and L. Dabbish. 2008. Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi. 2013. Phrase Detectives: Utilizing collective intelligence for Internet-scale language resource creation. ACM Transactions on Interactive Intelligent Systems.



1048607 1600000 concepts

1048607 including

1048698 58000 persons

1048698 70000 places

1048698 35000 music albums

1048698 12000 films

1048607 described by 91 million triples

1048607 using 8141 different properties

1048607 557000 links to pictures

1048607 1300000 links external web pages

1048607 207000 Wikipedia categories

1048607 75000 YAGO categories

The DBpedia Dataset

The DBpediaorg project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web It uses the SPARQL query language to query this data At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data

REPRESENTING EXTRACTED INFORMATION

httpenwikipediaorgwikiCalgary

httpdbpediaorgresourceCalgary

dbpedianative_name Calgaryrdquo

dbpediaaltitude ldquo1048rdquo

dbpediapopulation_city ldquo988193rdquo

dbpediapopulation_metro ldquo1079310rdquo

mayor_name

dbpediaDave_Bronconnier

governing_body

dbpediaCalgary_City_Council

Extracting Infobox Data (RDF Representation)

SPARQL

bull SPARQL is a query language for RDF

bullRDF is a directed labeled graph data format for representing information in the Web bullThis specification defines the syntax and semantics of the SPARQL query language for RDF

bull SPARQL can be used to express queries across diverse data sources whether the data is stored natively as RDF or viewed as RDF via middleware

1048607 httpdbpediaorgsparql

1048607 hosted on a OpenLink Virtuoso server

1048607 can answer SPARQL queries like

1048698 Give me all Sitcoms that are set in NYC

1048698 All tennis players from Moscow

1048698 All films by Quentin Tarentino

1048698 All German musicians that were born in Berlin in the 19th century

The DBpedia SPARQL Endpoint

bull Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing effortsndash Other initiatives Citizen Science Cognition and

Language Laboratory hellipbull This has been taken advantage of in AI

ndash Open Mind Commonsense (Singh) (collecting facts)

ndash Semantic Wikis

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

wwwphrasedetectivescom

bull Open Mind Common Sense ndash Singh

bull Crater mapping (results) ndash Kanefsky

bull Learner Learner2 1001 Paraphrases ndash Chklovski

bull FACTory ndash CyCORP

bull Hot or Not ndash 8 Days

bull ESP Phetch Verbosity Peekaboom ndash von Ahn

bull Galaxy Zoo ndash Oxford University

WEB COLLABORATION PROJECTS

wwwphrasedetectivescom

OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52000 assertions)

IsA (IsA apple fruit) Part of (PartOf CPU computer) PropertyOf (PropertyOf coffee wet) MadeOf (MadeOf bread flour) DefinedAs (DefinedAs meat flesh of animal)

EVENTS (38000 assertions)

PrerequisiteeventOf (PrerequisiteEventOf read letter open envelope) SubeventOf (SubeventOf play sport score goal) FirstSubeventOF (FirstSubeventOf start fire light match) LastSubeventOf (LastSubeventOf attend classical concert applaud)

AGENTS (104000 assertions)

CapableOf (CapableOf dentist pull tooth)

SPATIAL (36000 assertions)

LocationOf (LocationOf army in war)

TEMPORAL time amp sequence

CAUSAL (17000 assertions)

EffectOf (EffectOf view video entertainment) DesirousEffectOf (DesirousEffectOf sweat take shower)

AFFECTIONAL (mood feeling emotions) (34000 assertions)

DesireOf (DesireOf person not be depressed) MotivationOf (MotivationOf play game compete)

FUNCTIONAL (115000 assertions)

IsUsedFor (UsedFor fireplace burn wood) CapableOfReceivingAction (CapableOfReceivingAction drink serve)

ASSOCIATION K-LINES (125 million assertions)

SuperThematicKLine (SuperThematicKLine western civilization civilization) ThematicKLine (ThematicKLine wedding dress veil) ConceptuallyRelatedTo (ConceptuallyRelatedTo bad breath mint)

CONCEPT NET

GAMES WITH A PURPOSE

bull Luis von Ahn pioneered a new approach to resource creation on the Web GAMES WITH A PURPOSE or GWAP in which people as a side effect of playing perform tasks lsquocomputers are unable to performrsquo (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions

• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  - ESP
  - Verbosity
  - TagATune

• Other games
  - Peekaboom
  - Phetch

ESP

• The first GWAP, developed by von Ahn and his group (2003, 2004)

• The problem: obtain accurate descriptions of images, to be used
  - To train image search engines
  - To develop machine learning approaches to vision

• The goal: label the majority of the images on the Web

ESP the game

ESP THE GAME

• Two partners are picked at random from the large number of players online

• They are not told who their partner is and can't communicate with them

• They are both shown the same image

• The goal: guess how their partner will describe the image, and type that description
  - Hence the ESP game

• If any of the strings typed by one player matches the string typed by the other player, they score points
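The matching rule in the last bullet can be sketched in a few lines. This is only an illustration under stated assumptions: the case and whitespace normalisation, the fixed `points_per_match` value, and the function names are all invented here, and the sketch collects every agreed label rather than modelling how a real round ends.

```python
def normalise(label):
    """Lower-case and trim a typed label so trivial variants still match."""
    return label.strip().lower()


def agreed_labels(guesses_a, guesses_b):
    """Return labels typed by both players, in the order player A typed them."""
    seen_b = {normalise(g) for g in guesses_b}
    matches = []
    for guess in guesses_a:
        label = normalise(guess)
        if label in seen_b and label not in matches:
            matches.append(label)
    return matches


def score(guesses_a, guesses_b, points_per_match=100):
    """Hypothetical scoring: a fixed number of points per agreed label."""
    return points_per_match * len(agreed_labels(guesses_a, guesses_b))


player_a = ["dog", "Grass", "ball"]
player_b = ["puppy", "grass", "park", "Ball "]
print(agreed_labels(player_a, player_b))
print(score(player_a, player_b))
```

Because neither player sees the other's input, a label only counts when both arrive at it independently, which is what makes agreed labels plausible descriptions of the image.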

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th 2003 and December 10th 2003
  - 13,630 players
  - 1.2 million labels for 293,760 images
  - 80% of players played more than once

• By 2008
  - 200,000 players
  - 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  - choose 10 labels among those produced and look at which images are returned

• Compare labels produced by players with labels produced by participants in an experiment
  - 15 participants, 20 images among the 1000 with more than 5 labels
  - 83% of game labels also produced by participants

• Manual assessment of labels ('would you use these labels to describe this image?')
  - 15 participants, 20 images
  - 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

• 2 tasks
  - Find The Culprit (Annotation): User must identify the closest antecedent of a markable if it is anaphoric
  - Detectives Conference (Validation): User must agree/disagree with a coreference relation entered by another user

www.phrasedetectives.com

PHRASE DETECTIVES THE TASKS
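To make the two-step workflow concrete, the sketch below tallies validation votes for one annotation. The slides do not say how votes are combined, so the acceptance rule here (a minimum number of votes and more agreement than disagreement) is purely an assumption, as are the names `tally` and `accepted` and the example annotation.

```python
from collections import Counter


def tally(votes):
    """Count 'agree' / 'disagree' votes for one proposed coreference link."""
    counts = Counter(votes)
    return counts["agree"], counts["disagree"]


def accepted(votes, min_votes=2):
    """Hypothetical rule: accept if enough validators voted and most agreed."""
    agree, disagree = tally(votes)
    return agree + disagree >= min_votes and agree > disagree


# One annotation from the first task: markable -> proposed antecedent.
annotation = {"markable": "it", "antecedent": "the report"}
votes = ["agree", "agree", "disagree"]
print(annotation, tally(votes), accepted(votes))
```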

NAME THE CULPRIT

READINGS

• Mihalcea, R. and Csomai, A. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.

• Nguyen, V. & Poesio, M. 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.

• Lungley, D., Trevisan, M., Nguyen, V., Althobaiti, M. & Poesio, M. 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.

• Nastase, V. & Strube, M. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012.

READINGS

• von Ahn, L. and Dabbish, L. (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.

• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi. 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.

  • 807 - TEXT ANALYTICS
  • WIKIPEDIA
  • Slide 3
  • Slide 4
  • WIKIPEDIA FOR TEXT ANALYTICS
  • Wikipedia as Thesaurus for text classification clustering
  • Wikipedia as Thesaurus
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
  • Slide 13
  • The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
  • WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
  • Using Wikipedia Categories for text classification
  • WIKIPEDIA FOR TEXT CLASSIFICATION
  • USING WIKIPEDIA FOR TEXT CLASSIFICATION
  • TEXT WIKIFICATION
  • WIKIFICATION
  • Wikification pipeline
  • Keyword Extraction
  • Keyword Extraction using Wikipedia
  • Slide 24
  • Slide 25
  • KEYPHRASENESS
  • COMMONNESS
  • Slide 28
  • The Wikipedia Dump from July 2011
  • Some statistics (all Wikidumps from July 2011)
  • Surface forms titles articles
  • Slide 32
  • Slide 33
  • Training a supervised wikifier
  • Results on standard datasets
  • Wikifying queries the Bridgeman datasets
  • Results on Bridgeman 1000 Y3
  • Results for the GALATEAS languages and Arabic
  • The GALATEAS D2W web services
  • Use of the service in LangLog
  • Other applications
  • WIKIPEDIA FOR NER
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • The Supervised Approach Using Wikipedia links to generate training data
  • MORE ADVANCED USES OF WIKIPEDIA
  • SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
  • Wikipedia category network
  • Deriving a taxonomy from Wikipedia (AAAI 2007)
  • Slide 53
  • INFOBOXES
  • Slide 56
  • Slide 57
  • Slide 58
  • SPARQL
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • OPEN MIND COMMONSENSE
  • Slide 65
  • CONCEPT NET
  • GAMES WITH A PURPOSE
  • GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
  • EXAMPLES OF GWAP
  • ESP
  • ESP the game
  • ESP THE GAME
  • THE TASK
  • SCORING BY MATCHING
  • SOME STATISTICS
  • QUALITY OF THE LABELS
  • GOOGLE IMAGE LABELLER
  • Slide 78
  • RESULTS
  • PHRASE DETECTIVES
  • Slide 81
  • NAME THE CULPRIT
  • READINGS
  • Slide 84
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At the Developers Guide to Semantic Web Toolkits you can find a development toolkit in your preferred programming language to process DBpedia data.

REPRESENTING EXTRACTED INFORMATION

http://en.wikipedia.org/wiki/Calgary

http://dbpedia.org/resource/Calgary

dbpedia:native_name "Calgary"

dbpedia:altitude "1048"

dbpedia:population_city "988193"

dbpedia:population_metro "1079310"

mayor_name dbpedia:Dave_Bronconnier

governing_body dbpedia:Calgary_City_Council

Extracting Infobox Data (RDF Representation)
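The infobox statements on this slide are subject-predicate-object triples, and a minimal way to picture them is as plain Python tuples. The sketch below is an illustration only: the tuple layout and the helpers `objects` and `describe` are assumptions, and the predicate strings are copied from the slide without expanding the `dbpedia:` prefix.

```python
# Subject of every statement on the slide.
CALGARY = "http://dbpedia.org/resource/Calgary"

# (subject, predicate, object) triples; predicate names are copied from the
# slide, and the plain-tuple layout is an assumption of this sketch.
TRIPLES = [
    (CALGARY, "dbpedia:native_name", "Calgary"),
    (CALGARY, "dbpedia:altitude", "1048"),
    (CALGARY, "dbpedia:population_city", "988193"),
    (CALGARY, "dbpedia:population_metro", "1079310"),
    (CALGARY, "mayor_name", "dbpedia:Dave_Bronconnier"),
    (CALGARY, "governing_body", "dbpedia:Calgary_City_Council"),
]


def objects(triples, subject, predicate):
    """Return all objects for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def describe(triples, subject):
    """Collect predicate -> list of objects for one subject."""
    description = {}
    for s, p, o in triples:
        if s == subject:
            description.setdefault(p, []).append(o)
    return description


print(objects(TRIPLES, CALGARY, "dbpedia:population_city"))
print(sorted(describe(TRIPLES, CALGARY)))
```

Literal values such as the population stay as strings here, while the mayor and the governing body point at other resources, which is what turns the set of triples into a graph.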

SPARQL

• SPARQL is a query language for RDF

• RDF is a directed, labeled graph data format for representing information in the Web

• This specification defines the syntax and semantics of the SPARQL query language for RDF

• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

• http://dbpedia.org/sparql

• hosted on an OpenLink Virtuoso server

• can answer SPARQL queries like
  - Give me all sitcoms that are set in NYC
  - All tennis players from Moscow
  - All films by Quentin Tarantino
  - All German musicians that were born in Berlin in the 19th century

The DBpedia SPARQL Endpoint
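One of the example questions, the films of one director, might be phrased against the endpoint roughly as follows. This is a hedged sketch rather than a tested recipe: the vocabulary terms `dbo:Film` and `dbo:director`, the resource name, and the `format` request parameter are assumptions to be checked against the live service, and the `run` helper is defined but not called because it needs network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:Film, dbo:director and the resource name below are
# taken to be the relevant DBpedia terms; check them against the live data.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film a dbo:Film ;
        dbo:director dbr:Quentin_Tarantino .
}
"""


def build_url(query, endpoint=ENDPOINT):
    """Encode the query as a GET request URL using the 'query' parameter."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


def run(query):
    """Send the query and return the bound values of ?film (needs network)."""
    with urlopen(build_url(query)) as response:
        results = json.load(response)
    return [row["film"]["value"] for row in results["results"]["bindings"]]


print(build_url(QUERY))
```

Keeping URL construction separate from the network call makes the request easy to inspect or paste into a browser before anything is sent.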

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  - Other initiatives: Citizen Science, Cognition and Language Laboratory, ...

• This has been taken advantage of in AI
  - Open Mind Commonsense (Singh) (collecting facts)
  - Semantic Wikis

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION




The DBpedia SPARQL Endpoint

• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century
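
As a rough illustration of how one of the example questions ("All tennis players from Moscow") might be posed to the endpoint, the Python sketch below writes a SPARQL query and builds the GET URL for it. Several things here are assumptions rather than facts from the slides: "from Moscow" is read as "born in Moscow", the vocabulary terms `dbo:TennisPlayer`, `dbo:birthPlace` and `dbr:Moscow` are assumed to exist in the DBpedia release being queried, and the `format` parameter is assumed to be accepted by the Virtuoso server. The helper name `build_request_url` is hypothetical.

```python
from urllib.parse import urlencode

# Endpoint address as given on the slide.
ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:TennisPlayer, dbo:birthPlace, dbr:Moscow.
# "From Moscow" is interpreted here as "born in Moscow".
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?player WHERE {
  ?player a dbo:TennisPlayer ;
          dbo:birthPlace dbr:Moscow .
}
LIMIT 20
"""


def build_request_url(query, endpoint=ENDPOINT):
    """Return a GET URL carrying the query in the standard 'query' parameter.

    The 'format' parameter is an assumption about the server, not part of
    the SPARQL protocol itself.
    """
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


url = build_request_url(QUERY)
print(url)
```

The sketch stops at building the URL so that it runs without network access; fetching the URL (for example with `urllib.request.urlopen`) would return whatever the live endpoint currently holds.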

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION

• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis


WEB COLLABORATION PROJECTS

• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – CyCORP
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University


OPEN MIND COMMONSENSE

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52,000 assertions)
  IsA (IsA "apple" "fruit")
  PartOf (PartOf "CPU" "computer")
  PropertyOf (PropertyOf "coffee" "wet")
  MadeOf (MadeOf "bread" "flour")
  DefinedAs (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
  PrerequisiteEventOf (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf (SubeventOf "play sport" "score goal")
  FirstSubeventOf (FirstSubeventOf "start fire" "light match")
  LastSubeventOf (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
  CapableOf (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
  LocationOf (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
  EffectOf (EffectOf "view video" "entertainment")
  DesirousEffectOf (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf (DesireOf "person" "not be depressed")
  MotivationOf (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
  UsedFor (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo (ConceptuallyRelatedTo "bad breath" "mint")
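
To make the shape of these assertions concrete, here is a minimal Python sketch that stores a few of the example assertions above as (relation, concept, concept) triples and looks them up by their first concept. It is only an illustration: the way each example is split into two arguments is my reading of the slide, and the names `ASSERTIONS`, `build_index` and `related` are hypothetical, not part of ConceptNet's own tooling.

```python
from collections import defaultdict

# A handful of the slide's example assertions as (relation, first, second).
# The split into two arguments is an assumption about the original notation.
ASSERTIONS = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("MadeOf", "bread", "flour"),
    ("CapableOf", "dentist", "pull tooth"),
    ("UsedFor", "fireplace", "burn wood"),
    ("EffectOf", "view video", "entertainment"),
    ("ConceptuallyRelatedTo", "bad breath", "mint"),
]


def build_index(assertions):
    """Index assertions by their first concept for quick lookup."""
    index = defaultdict(list)
    for relation, first, second in assertions:
        index[first].append((relation, second))
    return index


def related(index, concept, relation=None):
    """Return (relation, other concept) pairs for a concept,
    optionally restricted to one relation type."""
    return [
        (rel, other)
        for rel, other in index.get(concept, [])
        if relation is None or rel == relation
    ]


index = build_index(ASSERTIONS)
print(related(index, "apple"))               # [('IsA', 'fruit')]
print(related(index, "dentist", "CapableOf"))  # [('CapableOf', 'pull tooth')]
```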

CONCEPT NET

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions

• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune

• Other games
  – Peekaboom
  – Phetch

ESP

• The first GWAP, developed by von Ahn and their group (2003, 2004)

• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision

• The goal: label the majority of the images on the Web

ESP the game

ESP THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can’t communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
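
The matching rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the game's actual implementation: the normalization step (lower-casing and collapsing whitespace) is my own choice, the function names are hypothetical, and the slides do not say how many points a match is worth, so the sketch only reports which labels the two players agree on.

```python
def normalize(label):
    """Lower-case a typed string and collapse whitespace (an assumed rule)."""
    return " ".join(label.lower().split())


def agreed_labels(labels_a, labels_b):
    """Return the normalized strings typed by both players,
    in the order player A typed them, without repeats."""
    typed_by_b = {normalize(label) for label in labels_b}
    agreed = []
    for label in labels_a:
        norm = normalize(label)
        if norm in typed_by_b and norm not in agreed:
            agreed.append(norm)
    return agreed


# Two players describing the same image independently (made-up input).
player_a = ["Dog", "grass", "ball"]
player_b = ["puppy", "ball", "dog "]
print(agreed_labels(player_a, player_b))  # ['dog', 'ball']
```

Because neither player can see what the other types, a string both of them produce is likely to be a reasonable description of the image, which is what makes the agreed strings usable as labels.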

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th 2003 and December 10th 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once

• By 2008
  – 200,000 players
  – 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned

• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants

• Manual assessment of labels (‘would you use these labels to describe this image?’)
  – 15 participants, 20 images
  – 85% of words rated useful
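
One plausible way to compute the second measure above (the share of game labels that experiment participants also produced) is sketched below in Python. The slides do not spell out the exact procedure, so treat this as an assumption; the function name `overlap_fraction` is hypothetical and the labels are made-up illustration data, not results from the study.

```python
def overlap_fraction(game_labels, participant_labels):
    """Fraction of distinct game labels that participants also produced.

    Comparison is case-insensitive; returns 0.0 when there are no game labels.
    """
    game = {label.lower() for label in game_labels}
    if not game:
        return 0.0
    human = {label.lower() for label in participant_labels}
    return len(game & human) / len(game)


# Made-up labels for a single image.
game = ["dog", "grass", "ball", "puppy"]
participants = ["Dog", "ball", "grass", "tree"]
print(overlap_fraction(game, participants))  # 0.75
```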

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

PHRASE DETECTIVES: THE TASKS

www.phrasedetectives.org

• 2 tasks
  – Find The Culprit (Annotation): User must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): User must agree/disagree with a coreference relation entered by another user

NAME THE CULPRIT

READINGS

• Mihalcea, R. and Csomai, A. Wikify!: Linking documents to encyclopedic knowledge. Proceedings of CIKM’07, Lisbon, Portugal.

• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.

• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.

• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012.

READINGS

• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.

• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.

  • 807 - TEXT ANALYTICS
  • WIKIPEDIA
  • Slide 3
  • Slide 4
  • WIKIPEDIA FOR TEXT ANALYTICS
  • Wikipedia as Thesaurus for text classification clustering
  • Wikipedia as Thesaurus
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
  • Slide 13
  • The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
  • WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
  • Using Wikipedia Categories for text classification
  • WIKIPEDIA FOR TEXT CLASSIFICATION
  • USING WIKIPEDIA FOR TEXT CLASSIFICATION
  • TEXT WIKIFICATION
  • WIKIFICATION
  • Wikification pipeline
  • Keyword Extraction
  • Keyword Extraction using Wikipedia
  • Slide 24
  • Slide 25
  • KEYPHRASENESS
  • COMMONNESS
  • Slide 28
  • The Wikipedia Dump from July 2011
  • Some statistics (all Wikidumps from July 2011)
  • Surface forms titles articles
  • Slide 32
  • Slide 33
  • Training a supervised wikifier
  • Results on standard datasets
  • Wikifying queries the Bridgeman datasets
  • Results on Bridgeman 1000 Y3
  • Results for the GALATEAS languages and Arabic
  • The GALATEAS D2W web services
  • Use of the service in LangLog
  • Other applications
  • WIKIPEDIA FOR NER
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • The Supervised Approach Using Wikipedia links to generate training data
  • MORE ADVANCED USES OF WIKIPEDIA
  • SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
  • Wikipedia category network
  • Deriving a taxonomy from Wikipedia (AAAI 2007)
  • Slide 53
  • INFOBOXES
  • Slide 56
  • Slide 57
  • Slide 58
  • SPARQL
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • OPEN MIND COMMONSENSE
  • Slide 65
  • CONCEPT NET
  • GAMES WITH A PURPOSE
  • GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
  • EXAMPLES OF GWAP
  • ESP
  • ESP the game
  • ESP THE GAME
  • THE TASK
  • SCORING BY MATCHING
  • SOME STATISTICS
  • QUALITY OF THE LABELS
  • GOOGLE IMAGE LABELLER
  • Slide 78
  • RESULTS
  • PHRASE DETECTIVES
  • Slide 81
  • NAME THE CULPRIT
  • READINGS
  • Slide 84
Page 64: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52000 assertions)

IsA (IsA apple fruit) Part of (PartOf CPU computer) PropertyOf (PropertyOf coffee wet) MadeOf (MadeOf bread flour) DefinedAs (DefinedAs meat flesh of animal)

EVENTS (38000 assertions)

PrerequisiteeventOf (PrerequisiteEventOf read letter open envelope) SubeventOf (SubeventOf play sport score goal) FirstSubeventOF (FirstSubeventOf start fire light match) LastSubeventOf (LastSubeventOf attend classical concert applaud)

AGENTS (104000 assertions)

CapableOf (CapableOf dentist pull tooth)

SPATIAL (36000 assertions)

LocationOf (LocationOf army in war)

TEMPORAL time amp sequence

CAUSAL (17000 assertions)

EffectOf (EffectOf view video entertainment) DesirousEffectOf (DesirousEffectOf sweat take shower)

AFFECTIONAL (mood feeling emotions) (34000 assertions)

DesireOf (DesireOf person not be depressed) MotivationOf (MotivationOf play game compete)

FUNCTIONAL (115000 assertions)

IsUsedFor (UsedFor fireplace burn wood) CapableOfReceivingAction (CapableOfReceivingAction drink serve)

ASSOCIATION K-LINES (125 million assertions)

SuperThematicKLine (SuperThematicKLine western civilization civilization) ThematicKLine (ThematicKLine wedding dress veil) ConceptuallyRelatedTo (ConceptuallyRelatedTo bad breath mint)

CONCEPT NET

GAMES WITH A PURPOSE

bull Luis von Ahn pioneered a new approach to resource creation on the Web GAMES WITH A PURPOSE or GWAP in which people as a side effect of playing perform tasks lsquocomputers are unable to performrsquo (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions

• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune

• Other games
  – Peekaboom
  – Phetch

ESP

• The first GWAP, developed by von Ahn and his group (2003, 2004)

• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision

• The goal: label the majority of the images on the Web

ESP the game

ESP: THE GAME

• Two partners are picked at random from the large number of players online

• They are not told who their partner is and can’t communicate with them

• They are both shown the same image

• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game

• If any of the strings typed by one player matches the string typed by the other player, they score points
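
The matching rule above can be sketched as follows. This is only an illustration of agreement-based scoring, not the game's actual implementation: the normalisation step (lower-casing and collapsing whitespace), the POINTS_PER_MATCH value and the function names are all assumptions.

```python
from typing import Iterable, Optional, Tuple

POINTS_PER_MATCH = 100  # hypothetical value; the slides do not give the real score


def normalize(label: str) -> str:
    """Lower-case a typed string and collapse whitespace (assumed normalisation)."""
    return " ".join(label.lower().split())


def agreed_label(guesses_a: Iterable[str], guesses_b: Iterable[str]) -> Optional[str]:
    """Return the first string typed by player A that player B also typed, if any."""
    typed_by_b = {normalize(g) for g in guesses_b}
    for guess in guesses_a:
        candidate = normalize(guess)
        if candidate and candidate in typed_by_b:
            return candidate
    return None


def play_round(guesses_a: Iterable[str], guesses_b: Iterable[str]) -> Tuple[Optional[str], int]:
    """Return the agreed label (or None) and the points both players earn."""
    label = agreed_label(guesses_a, guesses_b)
    return label, (POINTS_PER_MATCH if label is not None else 0)


# Two players see the same image and type guesses independently.
label, points = play_round(["Dog", "grass"], ["park", " dog "])
```

Because the partners cannot communicate, a string that both of them type is likely to describe the image, which is why an agreed string can be kept as a label.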

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th 2003 and December 10th 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once

• By 2008
  – 200,000 players
  – 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned

• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1,000 with more than 5 labels
  – 83% of game labels also produced by participants

• Manual assessment of labels (‘would you use these labels to describe this image?’)
  – 15 participants, 20 images
  – 85% of words rated useful
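
The comparison between game labels and participant labels comes down to an overlap rate: the share of game labels that experimental participants also produced. The sketch below shows that calculation on invented toy labels; the data and the function name overlap_rate are illustrative assumptions and do not reproduce the study's figures.

```python
from typing import Iterable


def overlap_rate(game_labels: Iterable[str], participant_labels: Iterable[str]) -> float:
    """Fraction of distinct game labels that participants also produced."""
    game = {label.strip().lower() for label in game_labels}
    produced = {label.strip().lower() for label in participant_labels}
    if not game:
        raise ValueError("no game labels to evaluate")
    return len(game & produced) / len(game)


# Invented labels for a single image (not data from the ESP study).
game = ["dog", "grass", "ball", "park", "sky", "frisbee"]
participants = ["Dog", "ball", "grass", "park", "sky", "tree"]
rate = overlap_rate(game, participants)  # 5 of the 6 game labels are shared
```

Averaging this rate over all evaluated images would give a single figure comparable to the percentage reported above.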

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

• 2 tasks
  – Find The Culprit (Annotation): User must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): User must agree/disagree with a coreference relation entered by another user
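
One way to combine the two tasks is to let every annotation and every validation judgement count as a vote on a candidate interpretation. The sketch below assumes a simple rule (annotators plus agreements minus disagreements); this rule, the Interpretation class and best_interpretation are illustrative assumptions, not the scoring used by Phrase Detectives itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Interpretation:
    """One candidate reading of a markable, with the votes it has received."""
    antecedent: Optional[str]  # None stands for "not anaphoric"
    annotators: int = 0        # players who chose it in annotation mode
    agreements: int = 0        # validators who agreed with it
    disagreements: int = 0     # validators who disagreed with it

    def score(self) -> int:
        # Assumed aggregation rule: supporting votes minus opposing votes.
        return self.annotators + self.agreements - self.disagreements


def best_interpretation(candidates: List[Interpretation]) -> Interpretation:
    """Return the candidate with the highest aggregated score."""
    if not candidates:
        raise ValueError("no interpretations collected for this markable")
    return max(candidates, key=lambda c: c.score())


candidates = [
    Interpretation("the detective", annotators=2, agreements=3, disagreements=1),
    Interpretation("the butler", annotators=1, agreements=0, disagreements=2),
    Interpretation(None, annotators=1, agreements=1, disagreements=1),
]
winner = best_interpretation(candidates)
```

Keeping all candidates with their votes, rather than only the winner, preserves cases where players genuinely disagree about the antecedent.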

www.phrasedetectives.com

PHRASE DETECTIVES: THE TASKS

NAME THE CULPRIT

READINGS

• Mihalcea, R. and Csomai, A. Wikify! Linking documents to encyclopedic knowledge. Proceedings of CIKM’07, Lisbon, Portugal.

• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using Encyclopedic Knowledge. Proceedings of 6th workshop on Analytics for Noisy Unstructured Text Data.

• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.

• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012.

READINGS

• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.

• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.

Page 70: 807 - TEXT ANALYTICS Massimo Poesio Lecture 7: Wikipedia for Text Analytics

ESP the game

ESP THE GAME

• Two partners are picked at random from the large number of players online
• They are not told who their partner is and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence the ESP game
• If any of the strings typed by one player matches a string typed by the other player, they score points
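The matching rule above can be sketched in a few lines of Python. This is only an illustration of "score when both players typed the same string": the function names, the case and whitespace normalisation, and the `POINTS_PER_MATCH` value are assumptions made for the example, not details given in the slides, and the real game's timing rules are ignored.

```python
POINTS_PER_MATCH = 100  # hypothetical value; the slides do not give the real scoring


def normalize(label: str) -> str:
    """Lower-case and collapse whitespace so trivially different strings match."""
    return " ".join(label.lower().split())


def first_agreed_label(guesses_a, guesses_b):
    """Return the first string typed by player A that player B also typed, or None."""
    typed_by_b = {normalize(g) for g in guesses_b}
    for guess in guesses_a:
        candidate = normalize(guess)
        if candidate in typed_by_b:
            return candidate
    return None


def score_round(guesses_a, guesses_b):
    """Return (agreed_label, points) for one image shown to both players."""
    label = first_agreed_label(guesses_a, guesses_b)
    return label, (POINTS_PER_MATCH if label is not None else 0)


if __name__ == "__main__":
    print(score_round(["Dog", "grass", "puppy"], ["animal", "puppy", "dog"]))
```

The agreed string is what becomes a label for the image: two strangers who cannot communicate are unlikely to type the same word unless it genuinely describes what they both see.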

THE TASK

SCORING BY MATCHING

SOME STATISTICS

• In the 4 months between August 9th 2003 and December 10th 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once
• By 2008
  – 200,000 players
  – 50 million labels

QUALITY OF THE LABELS

• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  – 15 participants, 20 images
  – 85% of words rated useful
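The comparison between game labels and participant labels amounts to an overlap measure. The sketch below shows one plausible way to compute such a figure; the function names, the lower-casing step and the toy data are assumptions for illustration, and the original study's exact protocol may differ.

```python
def overlap_rate(game_labels, participant_labels):
    """Fraction of the game's labels for one image that participants also produced."""
    game = {label.lower() for label in game_labels}
    if not game:
        return 0.0
    produced = {label.lower() for label in participant_labels}
    return len(game & produced) / len(game)


def mean_overlap(images):
    """Average the per-image overlap over (game_labels, participant_labels) pairs."""
    rates = [overlap_rate(game, participants) for game, participants in images]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    # Toy data, invented for the example.
    images = [
        (["dog", "grass", "ball", "sky"], ["Dog", "ball", "tree", "sky"]),
        (["car"], ["car", "road"]),
    ]
    print(mean_overlap(images))
```

A high overlap suggests that labels collected as a side effect of play are close to what people produce when explicitly asked to describe an image.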

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

PHRASE DETECTIVES

www.phrasedetectives.org

• 2 tasks
  – Find The Culprit (Annotation): User must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): User must agree/disagree with a coreference relation entered by another user
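One way to see how the two tasks fit together is to combine them into a single score per candidate interpretation. The sketch below is an assumption made for illustration, not the scheme described in the slides: each annotation counts +1 for the antecedent chosen, each validation counts +1 for an agreement and -1 for a disagreement, and the highest-scoring interpretation wins. All names are hypothetical.

```python
from collections import defaultdict

NOT_ANAPHORIC = "<not anaphoric>"  # hypothetical marker for "no antecedent"


def score_interpretations(annotations, validations):
    """Score each candidate antecedent of one markable.

    annotations: antecedents chosen by players in annotation mode.
    validations: (antecedent, agrees) pairs from players in validation mode.
    """
    scores = defaultdict(int)
    for antecedent in annotations:
        scores[antecedent] += 1
    for antecedent, agrees in validations:
        scores[antecedent] += 1 if agrees else -1
    return dict(scores)


def best_interpretation(annotations, validations):
    """Return the highest-scoring interpretation, or None if nobody has judged it."""
    scores = score_interpretations(annotations, validations)
    if not scores:
        return None
    return max(scores, key=scores.get)


if __name__ == "__main__":
    annotations = ["the detective", "the detective", "Holmes"]
    validations = [("Holmes", False), ("the detective", True)]
    print(score_interpretations(annotations, validations))
    print(best_interpretation(annotations, validations))
```

Keeping validation separate from annotation means a doubtful answer can be confirmed or rejected by other players without anyone having to redo the annotation itself.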

www.phrasedetectives.com

PHRASE DETECTIVES: THE TASKS

NAME THE CULPRIT

READINGS

• Mihalcea, R. and Csomai, A. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal
• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using Encyclopedic Knowledge. Proceedings of 6th workshop on Analytics for Noisy Unstructured Text Data
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH
• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012

READINGS

• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems
