knowledge extraction in web media: at the frontier of nlp, machine learning and semantics

Julien [email protected]

@julienplu

Knowledge extraction in Web media: at the frontier of NLP, Machine Learning and Semantics

Use Case: Bringing Context to Documents

2016/04/14 - PhD Symposium WWW 2016 - Montréal

NEWSWIRES

TWEETS

SEARCHQUERIES

SUBTITLES

Use Case: Bringing Context to Documents

James Patrick Page, OBE (born 9 January 1944) is an English musician, songwriter, and record producer who achieved international success as the guitarist and founder of the rock band Led Zeppelin. Know More

Sort name: Page, Jimmy
Type: Person
Gender: Male
Born: 1944-01-09 (72 years ago)
Born in: Heston, Hounslow, London, United Kingdom

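Pays d’origine : Royaume-Uni
Genre musical : Blues rock, rock psychédélique
Années actives : 1962-1968 et depuis 1992
Labels : Columbia
(English: Country of origin: United Kingdom; Musical genre: blues rock, psychedelic rock; Active years: 1962-1968 and since 1992; Labels: Columbia)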

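The Yardbirds est un groupe de rock britannique des années 1960, formé en mai 1963 à Londres en Angleterre dont les guitaristes ont été Eric Clapton, Jeff Beck puis Jimmy Page. Know More
(English: The Yardbirds are a British rock band of the 1960s, formed in May 1963 in London, England, whose guitarists were Eric Clapton, Jeff Beck and then Jimmy Page.)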


Six Different Problems

1. Identity of an entity
Ø Arena; Arena (magazine); Arena (TV series)
Ø Bucks County, Pennsylvania; Milwaukee Bucks

2. Knowledge bases have different coverage

3. High computation to handle large streams

4. Various types for an entity (granularity)
Ø Yannick Noah is a Tennis Player and a Singer

5. Different types of documents written in multiple languages

6. Are all phrases entities? (e.g. dates or roles)


Research Questions

1. How to adapt an entity linking system depending on different criteria?

2. How to design an entity linking system in order to be able to process a large amount of data in near real time?


State Of The Art

§ The key role of entities:

Ø 70% of search queries contain at least one entity [1]

Ø Bring context to videos [2]

Ø Help generate summaries [3]

§ Current systems (e.g. TagME [4], AIDA [5], Babelfy [6] or DBpedia Spotlight [7]) offer few parameters and often cannot be adapted to even one of the criteria listed above (text, entity type, knowledge base, language)

§ Those solutions are often not able to handle large streams of text

[1] Jeffrey Pound, Peter Mika, Hugo Zaragoza: Ad-hoc object retrieval in the web of data. WWW 2010
[2] José Luis Redondo García, Giuseppe Rizzo, Raphaël Troncy: The Concentric Nature of News Semantic Snapshots: Knowledge Extraction for Semantic Annotation of News Items. K-CAP 2015
[3] Shruti Chhabra, Srikanta Bedathur: Towards Generating Text Summaries for Entity Chains. ECIR 2014
[4] Paolo Ferragina, Ugo Scaiella: TAGME: on-the-fly annotation of short text fragments (by wikipedia entities). CIKM 2010
[5] Mohamed Amir Yosef, Johannes Hoffart, Ilaria Bordino, Marc Spaniol, Gerhard Weikum: AIDA: An Online Tool for Accurate Disambiguation of Named Entities in Text and Tables. PVLDB 4(12)
[6] Andrea Moro, Alessandro Raganato, Roberto Navigli: Entity Linking meets Word Sense Disambiguation: a Unified Approach. TACL 2014
[7] Pablo N. Mendes, Max Jakob, Andrés García-Silva, Christian Bizer: DBpedia spotlight: shedding light on the web of documents. I-SEMANTICS 2011


Methodology

We have split up this thesis into six tasks:

[Timeline figure: Start thesis, Today, End thesis]

(1) Text adaptivity

(1) Entity type adaptivity

(1) Knowledge base adaptivity

(1) Language adaptivity

(1-2) ADEL Modular framework

(2) Distributed and scalable architecture


ADEL: Modular Framework (Extractors)

§ POS Tagger:
Ø bidirectional CMM (left to right and right to left)

§ NER Combiner:
Ø Use a combination of CRF models with Gibbs sampling (Monte Carlo as graph inference method). A simple CRF model could be:

[Figure: linear-chain CRF over a French example sentence ("Jimmy Page, knowing the professionalism of John Paul Jones"); each word has a feature set X and a label]

Word:   Jimmy  Page  ,  connaissant  le  professionnalisme  de  John  Paul  Jones
Label:  PER    PER   O  O            O   O                  O   PER   PER   PER

X: set of features for the current word: word capitalized, previous word is “de”, next word is an NNP, …

Suppose P(PER | X, PER, O, LOC) = P(PER | X, neighbors(PER)); then X with PER is a CRF
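The feature set X listed on this slide can be sketched as a small function. This is only an illustration: the function name `word_features`, the feature keys and the POS tags in the usage note are assumptions, not ADEL's actual feature extractor.

```python
def word_features(tokens, pos_tags, i):
    """Return the three example features from the slide for the word at index i.

    tokens and pos_tags are parallel lists; the POS tags are assumed to come
    from an upstream tagger (hypothetical input, not produced here).
    """
    return {
        # the current word starts with an uppercase letter
        "is_capitalized": tokens[i][:1].isupper(),
        # the previous word is the French particle "de"
        "prev_is_de": i > 0 and tokens[i - 1].lower() == "de",
        # the next word is tagged as a proper noun
        "next_is_nnp": i + 1 < len(tokens) and pos_tags[i + 1] == "NNP",
    }
```

For the example sentence, calling `word_features(tokens, pos_tags, 7)` on the word "John" would set all three features to true, provided the tagger labels "Paul" as NNP.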


ADEL: Modular Framework (Overlap Resolution)

§ Detect overlaps among the extractors' outputs using the boundaries of the entities

§ Different heuristics can be applied:
Ø Merge: (“United States” and “States of America” => “United States of America”) default behavior
Ø Simple Substring: (“Florence” and “Florence May Harding” => “Florence” and “May Harding”)
Ø Smart Substring: (“Giants of New York” and “New York” => “Giants” and “New York”)
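The default Merge heuristic can be sketched on character offsets. This is a minimal sketch that assumes each extractor reports mentions as (start, end) offsets with an exclusive end; the function name `merge_overlaps` is hypothetical and the two substring heuristics are not covered.

```python
def merge_overlaps(text, spans):
    """Merge overlapping (start, end) spans and return the merged surface forms.

    Assumption: spans are character offsets into text, with end exclusive.
    """
    merged = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            # the span overlaps the previous one: extend the previous boundary
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return [text[s:e] for s, e in merged]
```

With the spans of “United States” and “States of America” taken from the same sentence, the function returns the single mention “United States of America”.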


ADEL: Modular Framework (Indexing)

§ Create index from DBpedia and Wikipedia

§ Integrate external data such as PageRank and HITS scores from the Hasso Plattner Institute


ADEL: Modular Framework (Linking)

§ Generate candidate links for all extracted mentions:
Ø If any, they go to the linking method
Ø If not, they are linked to NIL

§ Linking method:
Ø ADEL linear formula:

$$ r(l) = \left( a \cdot L(m, \mathit{title}) + b \cdot \max(L(m, R)) + c \cdot \max(L(m, D)) \right) \cdot PR(l) $$

r(l): the score of the candidate l
L: the Levenshtein distance
m: the extracted mention
title: the title of the candidate l
R: the set of redirect pages associated to the candidate l
D: the set of disambiguation pages associated to the candidate l
PR: PageRank associated to the candidate l

a, b and c are weights following the properties: a > b > c and a + b + c = 1
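The linear formula above can be sketched in a few lines of Python. Several things here are assumptions and not taken from the slides: the `Candidate` class, the weights 0.5 / 0.3 / 0.2 (chosen only so that a > b > c and a + b + c = 1), and above all the `similarity` function, which turns the Levenshtein distance into a value in [0, 1] so that a higher score means a better candidate. The slide only says "Levenshtein distance", so the measure is passed in as the parameter `L` and can be swapped.

```python
from dataclasses import dataclass, field


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the classic two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumption: case-insensitive distance normalized into a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


@dataclass
class Candidate:
    """Hypothetical container for one candidate link l."""
    title: str
    pagerank: float
    redirects: list = field(default_factory=list)
    disambiguations: list = field(default_factory=list)


def score(mention, cand, a=0.5, b=0.3, c=0.2, L=similarity):
    """r(l) = (a*L(m, title) + b*max L(m, R) + c*max L(m, D)) * PR(l)."""
    best_r = max((L(mention, r) for r in cand.redirects), default=0.0)
    best_d = max((L(mention, d) for d in cand.disambiguations), default=0.0)
    return (a * L(mention, cand.title) + b * best_r + c * best_d) * cand.pagerank
```

Ranking the candidates of a mention is then `max(candidates, key=lambda l: score(mention, l))`; a mention with no candidates is linked to NIL before this step.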


ADEL: Modular Framework (Pruning)

§ k-NN machine learning algorithm

§ Why a pruning module?

Ø Useful to correct the errors from the extractor by removing wrong annotations. Example:
• France played against Russia for a friendly match.
• Yesterday, I went to see Against in concert.

Ø Useful to adapt the annotations in order to follow a given guideline. Example: suppose we are participating in two different challenges, 2014 NEEL, which counts dates as entities, and OKE2015, which does not (annotated entities in brackets).
• 1st challenge: [Jimmy Page] was born on [January 9th, 1944].
• 2nd challenge: [Jimmy Page] was born on January 9th, 1944.
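A k-NN pruner can be sketched as a keep/prune classifier over annotation features. The slides do not say which features ADEL uses, so the three features below (capitalization, linking score, date flag), the training points and the function names are all hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points closest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set: (is_capitalized, linking_score, is_date) -> label
TRAIN = [
    ((1.0, 0.9, 0.0), "keep"),
    ((1.0, 0.8, 0.0), "keep"),
    ((1.0, 0.7, 0.0), "keep"),
    ((0.0, 0.1, 0.0), "prune"),
    ((0.0, 0.2, 0.0), "prune"),
    ((0.0, 0.3, 1.0), "prune"),
]


def prune(annotations, train=TRAIN, k=3):
    """Keep only the annotations that the k-NN classifier labels as 'keep'."""
    return [a for a in annotations if knn_predict(train, a["features"], k) == "keep"]
```

Retraining on a dataset annotated with another guideline (for example one that counts dates as entities) changes which annotations survive, without touching the extractors.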


Text Adaptivity

§ Experiments on different kinds of text by benchmarking ADEL over different challenges
Ø Tweets: NEEL2014, NEEL2015 and NEEL2016
Ø News articles: OKE2015 and OKE2016

§ Need to adapt the extractors to use a proper model for each kind of text
Ø Retrain the NER extractor with a training dataset


Type Adaptivity

§ Challenges have their own definition of types

§ In ADEL, types come from the NER extractor and from the knowledge base in use
Ø NER types differ from KB types
Ø NER types and KB types differ from the challenges' types

§ Need a mapping between those different types. It is currently built manually.

Challenge                Types
OKE2015 and OKE2016      Person, Place, Organization, Role
NEEL2015 and NEEL2016    Person, Location, Organization, Product, Event, Thing
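A manual type mapping of this kind can be sketched as a lookup table per challenge. The challenge type sets come from the slide; the source labels (uppercase NER labels and `dbo:` classes) and the individual mapping entries are illustrative assumptions, not ADEL's actual mapping.

```python
# Target types per challenge family, as listed on the slide.
CHALLENGE_TYPES = {
    "OKE": {"Person", "Place", "Organization", "Role"},
    "NEEL": {"Person", "Location", "Organization", "Product", "Event", "Thing"},
}

# Hypothetical manual mapping from NER labels and KB classes to challenge types.
TYPE_MAPPING = {
    "OKE": {
        "PERSON": "Person",
        "LOCATION": "Place",
        "ORGANIZATION": "Organization",
        "dbo:Person": "Person",
        "dbo:Place": "Place",
        "dbo:Organisation": "Organization",
    },
    "NEEL": {
        "PERSON": "Person",
        "LOCATION": "Location",
        "ORGANIZATION": "Organization",
        "dbo:Person": "Person",
        "dbo:Place": "Location",
        "dbo:Organisation": "Organization",
        "dbo:Event": "Event",
    },
}


def map_type(source_type, challenge):
    """Return the challenge type for a NER or KB type, or None when unmapped."""
    return TYPE_MAPPING[challenge].get(source_type)
```

The same KB class can map to different targets: `map_type("dbo:Place", "OKE")` gives `Place`, while `map_type("dbo:Place", "NEEL")` gives `Location`.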


Knowledge Base Adaptivity

§ Joint work with Vrije Universiteit Amsterdam

§ ReCon: defines several heuristics in order to re-rank candidate links provided by our system on newswire articles
Ø H1: process the article text first and disambiguate the article title at the end, because titles are often too ambiguous
Ø H2: detect co-referential entities throughout the article
Ø H3: topic modeling to exploit a contextual knowledge base about the found topic
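Heuristics H1 and H2 can be illustrated with a deliberately simplified sketch. It is not the ReCon implementation: the article layout (a dict with `title` and `paragraphs`) and the token-subset test used for co-reference are assumptions made for the example, and H3 is left out.

```python
def order_for_disambiguation(article):
    """H1: process the body paragraphs first and the (ambiguous) title last."""
    return list(article["paragraphs"]) + [article["title"]]


def resolve_coreferences(mentions):
    """H2 (simplified): map a short mention to the longest mention containing its tokens.

    Assumption: 'Page' co-refers with 'Jimmy Page' when all tokens of the
    shorter mention appear in the longer one.
    """
    resolved = {}
    for mention in mentions:
        tokens = set(mention.lower().split())
        # proper supersets only, so a mention never resolves to itself here
        longer = [other for other in mentions if tokens < set(other.lower().split())]
        resolved[mention] = max(longer, key=len) if longer else mention
    return resolved
```

Once “Page” is resolved to “Jimmy Page”, it can reuse the link chosen for the full name instead of being disambiguated on its own.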


Language Adaptivity

§ No results yet. The goal is to let the user choose the natural language used in the text

§ Test the framework on ETAPE which is a NER challenge on French TV content from 2012


Distributed and Scalable Architecture

§ No results yet. The goal is to deploy the framework so that the tasks run in a distributed and scalable way

§ Make each task (extraction, linking and pruning) independent of the others and move it out of the global architecture (taking the way Docker is designed as a model)

§ Stress test the new architecture over large streams such as Twitter streaming API to detect the possible bottlenecks
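The idea of independent tasks can be sketched locally with one worker per stage connected by queues. This is only a single-machine analogue of the planned architecture, which has no results yet: the toy `extract`, `link` and `prune` functions, the one-entry knowledge base and the blocklist are placeholders invented for the example.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down

TOY_KB = {"Page": "dbpedia:Jimmy_Page"}  # toy knowledge base (placeholder)
BLOCKLIST = {"Yesterday"}                # toy pruning rule (placeholder)


def extract(text):
    """Toy extractor: every capitalized token is a mention."""
    return {"text": text, "mentions": [w for w in text.split() if w[:1].isupper()]}


def link(doc):
    """Toy linker: look each mention up in the toy KB, NIL otherwise."""
    return {**doc, "links": {m: TOY_KB.get(m, "NIL") for m in doc["mentions"]}}


def prune(doc):
    """Toy pruner: drop annotations whose mention is on the blocklist."""
    return {**doc, "links": {m: l for m, l in doc["links"].items() if m not in BLOCKLIST}}


def run_stage(func, inbox, outbox):
    """Consume items from inbox, apply func, and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            break
        outbox.put(func(item))


def run_pipeline(texts, stages):
    """Run each stage in its own thread; stages only communicate through queues."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [threading.Thread(target=run_stage, args=(f, queues[i], queues[i + 1]))
               for i, f in enumerate(stages)]
    for t in threads:
        t.start()
    for text in texts:
        queues[0].put(text)
    queues[0].put(STOP)
    results = []
    while (item := queues[-1].get()) is not STOP:
        results.append(item)
    for t in threads:
        t.join()
    return results
```

Because the stages only share queues, each one could in principle be replaced by a separate service or container, and timing each queue under a large input stream is one way to look for bottlenecks.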


Evaluation Over Multiple Datasets in Linking

§ 2014 NEEL Challenge with ADEL v1 using the neleval scorer

            E2E    UTwente  DataTXT  ADEL   AIDA   Hyderabad  SAP
F-measure   70.06  54.93    49.9     46.29  45.37  45.23      39.02

§ NEEL2015 Challenge

            ousia  acubelab  ADEL  uniba  ualberta  uva   cen_neel
F-measure   76.2   52.3      47.9  46.4   41.5      31.6  0

§ NEEL2016 Challenge

            ADEL   kea    Insight  mit    ju     unimib
F-measure   61.98  54.86  38.28    36.09  35.48  33.53

§ OKE2015 Challenge with ADEL v1 using the GERBIL scorer

            ADEL   FOX    FRED
F-measure   60.75  49.88  34.73

§ OKE2016 Challenge with ADEL v2 using the neleval scorer

            ADEL
F-measure   56.5


Conclusions

§ Combining multiple techniques coming from different domains for entity recognition and linking

§ Having developed different methods in order to make an entity linking system adaptive to one or multiple criteria

§ Bringing a new approach with ADEL while also reusing existing approaches with the POS and NER extractors

§ Testing ADEL over different datasets and participating in challenges


Future Work

§ Knowledge base adaptivity
Ø Further evaluate the knowledge base and text adaptive features using the ERD dataset
Ø Evaluate the knowledge base adaptive feature using the TAC KBP dataset
Ø Experiment with the knowledge base adaptive feature using the 3cixty and ad-hoc tourism datasets

§ Language adaptivity
Ø Evaluate the language adaptive feature using the ETAPE and TAC KBP datasets

§ Modular Framework
Ø Improve the linking and the pruning with new methods (e.g. evaluate deep learning methods)

§ Type adaptivity
Ø Further evaluate the approach over more fine-grained types using the ETAPE challenge. This will bring more issues, especially with the scorers

§ Engineer and evaluate a distributed and scalable architecture on large data streams


Questions?

Thank you for listening!

