Web Scale Named Entity Mining

Web scale Named Entity Mining "There's simply too much information out there" WI-IAT 2011

Upload: francois-pouilloux

Post on 20-Jun-2015


DESCRIPTION

Presentation given at the Industry Day of the 2011 IEEE/WIC/ACM International Conference on Web Intelligence (http://wi-iat-2011.org/).

TRANSCRIPT

Page 1: Web Scale Named Entity Mining

Web scale

Named Entity Mining

"There's simply too much information out there"

WI-IAT 2011

Page 2: Web Scale Named Entity Mining

In memoriam: Herbert A. Simon …

Page 3: Web Scale Named Entity Mining

stuck

April 2011

Page 4: Web Scale Named Entity Mining

Herbert Simon's Brookings Institute Lecture "Designing Organizations for an Information-Rich World"

Johns Hopkins University, September 1, 1969

Page 5: Web Scale Named Entity Mining

1. Tales & legends

Page 6: Web Scale Named Entity Mining

Find & procure a crystal-clear plastic to replace LEXAN 943 polycarbonate

Main constraints:

• more resistant to detergent agents than LEXAN 943 (problem of cracking under the combined effect of mechanical stress and exposure to detergent agents)

• compatible with existing tools: shrinkage must be close to that of LEXAN 943

• optical characteristics close to LEXAN 943

• weldable by ultrasonic welding

• compliant with fire & smoke resistance requirements: 2 according to NFF16-101/102 and V0 according to standard UL 94

Deadline: one week

organization centric search

Page 7: Web Scale Named Entity Mining

Where is the SA-24 Grinch (9K338 Igla-S) portable air defense missile system sold or operated?

location centric search

Page 8: Web Scale Named Entity Mining

Recent information (past month) about the call for proposals "outils Web innovants en entreprise" (innovative Web tools for business)?

time centric search

Page 9: Web Scale Named Entity Mining

"pro" searches focus on named entities:

Orgs, People, Location, Time

Page 10: Web Scale Named Entity Mining

2. Introducing WebNEM

Page 11: Web Scale Named Entity Mining

relevant query? query again? where?

+ browsing/ranking results

Attention-greedy & burdensome

product specifications

get manufacturer or distributor

find compliant products

Page 12: Web Scale Named Entity Mining

"SA-24 Grinch

9K338 Igla-S"

Goal : Attention-saver process

Page 13: Web Scale Named Entity Mining

exploratory data analysis

of high dimensional data

Page 14: Web Scale Named Entity Mining

"In exploratory data analysis of high dimensional data

one of the main tasks is the formation of a

simplified, usually visual, overview of data sets.

....

Clustering and projection

are among the examples of useful methods

to achieve this task."

Fernando Lourenço, Victor Lobo, Fernando Bação: Binary-based similarity measures for categorical data and their application in self-organizing maps. JOCLAD 2004 - XI Jornadas de Classificação e Análise de Dados, April 1-3, Lisbon (2004)

Lourenço, Lobo, Bação – JOCLAD 2004

Page 15: Web Scale Named Entity Mining

WebNEM

collection of

relevant data,

anywhere in the web

+ projection on

Named Entities space

topical web crawler

named entity recognition

visualization/exploratory analysis tools

Page 16: Web Scale Named Entity Mining

"Web scale" collection : brute force

never-ending crawl → a priori "whole" Web indexing → general index ("everywhere")

user query → fast answer, "any" topic

huge resources required (data size based)

Page 17: Web Scale Named Entity Mining

"Web scale" collection : our approach

"close to optimal" resources

(usage based)

user query → on-demand topical crawl → tailored index (anywhere relevant, built on order)

delayed answer, but less garbage

Web slices
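
The on-demand topical crawl can be pictured as a best-first traversal that only expands pages judged relevant to the user query. The sketch below is a minimal illustration under stated assumptions, not the Squido crawler: the in-memory `TOY_WEB`, the `relevance` scoring (share of query terms present) and the `threshold` parameter are all hypothetical, and no network access is involved.

```python
import heapq

# Hypothetical in-memory "web": url -> (page text, outgoing links).
TOY_WEB = {
    "a": ("hydrogen storage for fuel cells", ["b", "c"]),
    "b": ("fuel cells and hydrogen tanks", ["d"]),
    "c": ("celebrity gossip and recipes", ["e"]),
    "d": ("metal hydrides for hydrogen storage", []),
    "e": ("more gossip", []),
}

def relevance(text, query_terms):
    """Fraction of query terms that occur in the page text (assumed scoring)."""
    words = set(text.lower().split())
    return sum(term in words for term in query_terms) / len(query_terms)

def topical_crawl(seeds, query, web, threshold=0.5, max_pages=100):
    """Best-first crawl: visit the most promising known page first and
    only follow links found on pages scoring at least `threshold`."""
    terms = query.lower().split()
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated priority
    heapq.heapify(frontier)
    seen = set(seeds)
    collected = []
    while frontier and len(collected) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = web[url]
        score = relevance(text, terms)
        if score < threshold:
            continue  # off-topic page: neither kept nor expanded
        collected.append(url)
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                # A link inherits its parent's score as crawl priority.
                heapq.heappush(frontier, (-score, link))
    return collected

web_slice = topical_crawl(["a"], "hydrogen storage", TOY_WEB)
```

In this toy run the off-topic page `c` is fetched but discarded, so its outlink `e` is never visited: resources follow usage rather than data size.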

Page 18: Web Scale Named Entity Mining

Projection : when to extract entities ?

Named Entity Recognition is resource intensive

process step | data size | required response time
crawl time | whole web (10^10) | asynchronous
crawl time | web slice (10^4) | asynchronous
query time | collection (10^2) | real-time

Page 19: Web Scale Named Entity Mining

www.squido.fr

our SaaS Web mining system

large scale

Named Entity extraction (EN/FR)

beta released to customers

June 2011

Page 20: Web Scale Named Entity Mining

WebNEM with Squido

Pipeline components: user queries, topic, focused crawl, page cleaning, shallow entity extraction, index, search, visualization

On user collections: deep entity extraction, visualization

Page 21: Web Scale Named Entity Mining

Page cleaning

instead of this, work on this

fast heuristic DOM processing
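
One way to picture "fast heuristic DOM processing" is a single pass over the markup that skips obvious chrome (scripts, navigation, footers) and drops blocks that consist mostly of link text. The sketch below uses only Python's standard `html.parser`. The tag lists, the `max_link_density` threshold and the names `PageCleaner` and `clean_page` are assumptions for illustration; the slides do not describe the actual heuristics.

```python
from html.parser import HTMLParser

# Assumed tag sets; a real cleaner would tune these on evaluation data.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "article", "section"}

class PageCleaner(HTMLParser):
    """Collects text blocks, ignoring chrome and link-heavy blocks."""

    def __init__(self, max_link_density=0.5):
        super().__init__()
        self.max_link_density = max_link_density
        self.skip_depth = 0   # >0 while inside a skipped element
        self.in_link = 0      # >0 while inside an <a> element
        self.text_chars = 0   # non-blank characters in the current block
        self.link_chars = 0   # of which inside links
        self.buffer = []
        self.blocks = []

    def flush(self):
        """Close the current block and keep it if it is not link-heavy."""
        text = " ".join("".join(self.buffer).split())
        density = self.link_chars / max(self.text_chars, 1)
        if text and density <= self.max_link_density:
            self.blocks.append(text)
        self.buffer, self.text_chars, self.link_chars = [], 0, 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "a":
            self.in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(self.skip_depth - 1, 0)
        elif tag == "a":
            self.in_link = max(self.in_link - 1, 0)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self.skip_depth:
            return
        self.buffer.append(data)
        n = len(data.strip())
        self.text_chars += n
        if self.in_link:
            self.link_chars += n

def clean_page(html):
    cleaner = PageCleaner()
    cleaner.feed(html)
    cleaner.close()
    cleaner.flush()
    return cleaner.blocks

SAMPLE_HTML = """<html><body><nav><a href="/">Home</a> <a href="/about">About</a></nav>
<div><a href="/x">Related story one</a> <a href="/y">Related story two</a></div>
<p>Hydrogen can be stored in metal hydrides. See <a href="/ref">this review</a>.</p>
<script>var x = 1;</script><footer>Copyright</footer></body></html>"""

main_text = clean_page(SAMPLE_HTML)
```

Only the paragraph survives here; the navigation bar, the "related stories" block, the script and the footer are discarded before any linguistic processing runs.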

Page 22: Web Scale Named Entity Mining

Shallow extraction

Web docs → format parse → detect language → tokenize → sentence split → gazetteers → grammar → index
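
The tokenize, sentence split and gazetteer steps of the shallow pipeline can be sketched in a few lines. This is a simplified illustration only: the toy `GAZETTEERS` lists, the regular expressions and the longest-match lookup are assumptions, and the language detection and grammar stages are left out.

```python
import re

# Hypothetical toy gazetteers; a real system would load large lists.
GAZETTEERS = {
    "Location": {"austin", "lisbon", "barcelona"},
    "Organization": {"honda", "argonne national laboratory"},
}
MAX_ENTRY_TOKENS = 3  # longest gazetteer entry, in tokens

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Word tokens; hyphenated words are kept as a single token."""
    return re.findall(r"\w+(?:-\w+)*", sentence)

def lookup(candidate):
    """Return the entity type of a candidate string, or None."""
    key = candidate.lower()
    for entity_type, names in GAZETTEERS.items():
        if key in names:
            return entity_type
    return None

def shallow_extract(text):
    """Return (entity type, surface form) pairs via longest-match lookup."""
    found = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        i = 0
        while i < len(tokens):
            # Try the longest span first, then shorter ones.
            for n in range(min(MAX_ENTRY_TOKENS, len(tokens) - i), 0, -1):
                candidate = " ".join(tokens[i:i + n])
                entity_type = lookup(candidate)
                if entity_type:
                    found.append((entity_type, candidate))
                    i += n
                    break
            else:
                i += 1  # no entry starts at this token
    return found

entities = shallow_extract(
    "Honda opened a lab in Austin. Argonne National Laboratory tests batteries."
)
```

A lookup like this is cheap enough to run at crawl time, but it shares the weaknesses of any list-based approach: it cannot tell "Ann Arbor" the city from a person's name without further context.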

Page 23: Web Scale Named Entity Mining

Deep extraction

POS tagger, morpho analyzer, NP/VP chunker, grammar, orthomatcher, index

≅ shallow extraction + elaborate linguistics

Page 24: Web Scale Named Entity Mining

3. Annoyances

Page 25: Web Scale Named Entity Mining

Linguistic processing throughput

deep extraction: too expensive when crawling

shallow extraction: OK, with a penalty on quality

workaround:

• asynchronous deep extraction on smaller collections

• query-time sanitization

Page 26: Web Scale Named Entity Mining

Page cleaning

need evaluation

goal : ↗accuracy ? cost : ↘ recall ?

performance impact ?

↘ +1 processing step

↗ less text in later steps

Page 27: Web Scale Named Entity Mining

"Multiple dates" usage ?

<DATE TYPE="DateDay" D="11" M="2" Y="2008">February 10-13, 2008</DATE>

<DATE TYPE="DateDay" D="11" M="2" Y="2008">February 9-13, 2008</DATE>

<DATE TYPE="DateDay" D="12" M="11" Y="2007">November 11-13, 2007</DATE>

<DATE TYPE="DateDay" D="14" M="10" Y="2008">October 12-17, 2008</DATE>

<DATE TYPE="DateDay" D="16" M="2" Y="2009">February 15-18, 2009</DATE>

<DATE TYPE="DateDay" D="17" M="9" Y="2007">September 16-19, 2007</DATE>

<DATE TYPE="DateDay" D="2" M="5" Y="2008">May 2, 2008</DATE>

<DATE TYPE="DateDay" D="26" M="5" Y="2009">May 24-29, 2009</DATE>

<DATE TYPE="DateDay" D="27" M="10" Y="2009">October 25-29, 2009</DATE>

<DATE TYPE="DateDay" D="7" M="10" Y="2008">October 5-9 2008</DATE>

<DATE TYPE="DateDay" D="8" M="2" Y="2009">February 7-10, 2009</DATE>

<DATE TYPE="DateDay" D="8" M="5" Y="2007">May 6-11, 2007</DATE>

<DATE TYPE="DateDay" D="9" M="10" Y="2007">October 7-12, 2007</DATE>

<DATE TYPE="DateMonth" M="11" Y="2009">November, 2009</DATE>

<DATE TYPE="DateMonth" M="2" Y="2009">February, 2009</DATE>

<DATE TYPE="DateMonth" M="8" Y="2008">August 2008</DATE>

retrieve by date? sort by date?
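
In the annotations above, every date range appears to be reduced to a single day at the (rounded-down) midpoint of the range, for example "February 10-13, 2008" becomes D="11". The sketch below reproduces that behaviour for the English patterns shown. The midpoint rule is inferred from the examples rather than stated on the slide, and `normalize_date` and its regular expression are hypothetical.

```python
import re

MONTHS = {name: i for i, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# Month, optional day or day range, optional comma, four-digit year.
DATE_RE = re.compile(
    r"(?P<month>[A-Z][a-z]+)"
    r"(?:\s+(?P<first>\d{1,2})(?:-(?P<last>\d{1,2}))?)?"
    r",?\s+(?P<year>\d{4})")

def normalize_date(text):
    """Map a date mention to a DateDay or DateMonth record, or None."""
    m = DATE_RE.fullmatch(text.strip())
    if not m or m["month"] not in MONTHS:
        return None
    record = {"M": MONTHS[m["month"]], "Y": int(m["year"])}
    if m["first"] is None:
        return {"TYPE": "DateMonth", **record}
    first = int(m["first"])
    last = int(m["last"] or first)
    # Assumed rule: a range collapses to its midpoint, rounded down.
    return {"TYPE": "DateDay", "D": (first + last) // 2, **record}

def sort_key(record):
    """Chronological key; month-only dates sort before days of that month."""
    return (record["Y"], record["M"], record.get("D", 0))

mentions = ["February 10-13, 2008", "August 2008", "May 2, 2008"]
timeline = sorted((normalize_date(m) for m in mentions), key=sort_key)
```

Collapsing a range to one day makes sorting straightforward, but retrieval by date then misses documents whose range covers the requested day, which is one reading of the open question on this slide.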

Page 28: Web Scale Named Entity Mining

Publishing date ?

critical for

time centric

searches

published 05/2011, tagged as 7 Jul 2011

Page 29: Web Scale Named Entity Mining

& many more…

wrong spelling

Tapei → Taipei

location is also a first name

"University of Michigan, Ann Arbor, MI" → Ann Arbor (person)

compound first names

"Jean-Claude Marin" → Claude Marin

wrong character case (very frequent in titles) breaks all case-based rules

barrack obama → not extracted

How To Buy Electric Trucks → Buy Electric (organization)

In Virginia Life Is Sweet → Virginia Life (person)

polymorphism

"Nagy Bocsa", "Nagy-Bocsa", "Nagy"

sanitize parser output for tokenization: transliteration, case, punctuation, …
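
The sanitization of surface forms (transliteration, case, punctuation) can be sketched as a normalization key under which variants of the same entity collapse. This is an illustrative assumption, not the system's actual rules: `entity_key` and `group_variants` are hypothetical names, and "transliteration" is approximated here by stripping diacritics only.

```python
import re
import unicodedata
from collections import defaultdict

def entity_key(surface):
    """Normalize an entity surface form: strip accents, fold case,
    turn punctuation into spaces and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", surface)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    spaced = re.sub(r"[^\w\s]|_", " ", no_accents)
    return " ".join(spaced.casefold().split())

def group_variants(surfaces):
    """Group surface forms that share the same normalized key."""
    groups = defaultdict(list)
    for surface in surfaces:
        groups[entity_key(surface)].append(surface)
    return dict(groups)

variants = group_variants(["Nagy Bocsa", "Nagy-Bocsa", "Nagy"])
```

The hyphenated and spaced forms merge, while the shortened form "Nagy" stays separate: linking partial mentions to the full name needs context, which this key alone does not provide.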

Page 30: Web Scale Named Entity Mining

4. Results

Page 31: Web Scale Named Entity Mining

Reminder

Next results are obtained

automatically

from unstructured content

picked on the web

by an autonomous system,

without previous knowledge

of the topic or the visited Web sites

Page 32: Web Scale Named Entity Mining

Let's try it with a use case

"hydrogen storage for fuel cells"

What's inside a collection

of 66 highly ranked documents ?

run a few cycles (shallow extraction only)

some 10^4 pages

entity weight function (tf-idf, …)

People, Orgs, Location, Time
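
An entity weight function of the tf-idf family can be sketched as follows. The slide names tf-idf only as one option, so the exact formula below (relative frequency times the natural log of inverse document frequency, summed over the collection) is an assumption, as are the function name `entity_tfidf` and the toy input.

```python
import math
from collections import Counter

def entity_tfidf(doc_entities):
    """doc_entities: one list of extracted entities per document.
    Returns {entity: tf-idf weight summed over the collection}."""
    n_docs = len(doc_entities)
    # Number of documents in which each entity occurs at least once.
    doc_freq = Counter(e for doc in doc_entities for e in set(doc))
    weights = Counter()
    for doc in doc_entities:
        counts = Counter(doc)
        for entity, count in counts.items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[entity])
            weights[entity] += tf * idf
    return dict(weights)

# Toy collection of three documents (hypothetical data).
docs = [
    ["Honda", "Honda", "Austin"],
    ["Honda", "Argonne"],
    ["Honda", "Austin"],
]
weights = entity_tfidf(docs)
ranked = sorted(weights, key=weights.get, reverse=True)
```

An entity present in every document gets weight zero, so ubiquitous names sink while rarer ones rise to the top of the ranking. That is what makes the outliers easier to spot.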

Page 33: Web Scale Named Entity Mining

Special attention paid

to so-called outliers

Page 34: Web Scale Named Entity Mining

Organizations > 900 : overload…

page cleaning + entity sanitization

=> better details & accuracy

Page 35: Web Scale Named Entity Mining

↗ attention ↘ information: top 50

academic team?

H2 military usage?

new questions are instantly popping up

Page 36: Web Scale Named Entity Mining

People

authors lead to

relevant content

(classic IR method,

even in libraries !)

?

Page 37: Web Scale Named Entity Mining

Countries

political threats

on Lithium battery

supplies

argument in favor of

H2 technology

Page 38: Web Scale Named Entity Mining

Cities

"Austin is in a unique position

to offer its electric grid as a

real world proving ground"

"Direct Methanol Fuel Cells"

⇒alternative to H2

!

!

!

Page 39: Web Scale Named Entity Mining

Multiple-dates timeline

axes: domains, time (history → outlook)

changeover from nickel to lithium will be complete by 2016 and 2018

Honda President Takanobu Ito says around 10 percent of Honda’s global sales will be hybrids by 2015

Page 40: Web Scale Named Entity Mining

In a few clicks...

DMFC alternative to H2

Austin,

TX

hydrogen storage

for fuel cells ?

changeover from

nickel to lithium

by 2016/2018

Page 41: Web Scale Named Entity Mining

5. Perspectives

Page 42: Web Scale Named Entity Mining

To clean or not to clean ?

performance impact: run pipeline with/without cleaning, time full pipeline

"attention" impact: corpus, label examples +/-, clean set vs. full set
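
Comparing the clean set with the full set against labelled examples comes down to precision and recall. The sketch below shows the arithmetic on made-up data: the entity sets are hypothetical (the false positives are borrowed from the case-error examples earlier in the deck) and do not report any measured result.

```python
def precision_recall(extracted, gold):
    """Precision and recall of an extracted entity set against gold labels."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical labelled examples (+) and pipeline outputs.
gold = {"Honda", "Austin", "Argonne National Laboratory"}
full_set = {"Honda", "Austin", "Argonne National Laboratory",
            "Buy Electric", "Virginia Life"}   # without page cleaning
clean_set = {"Honda", "Austin"}                # with page cleaning

full_p, full_r = precision_recall(full_set, gold)
clean_p, clean_r = precision_recall(clean_set, gold)
```

In this invented case cleaning raises precision from 0.6 to 1.0 and lowers recall from 1.0 to about 0.67, which is exactly the trade-off the evaluation is meant to quantify.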

Page 43: Web Scale Named Entity Mining

Publishing date extraction

heuristic DOM processing: prototype ready

need large scale evaluation

build gold standard from RSS feeds
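
Building a gold standard from RSS feeds could work roughly as follows: each feed item pairs a page URL with a declared publication date, which can then be compared with the date a heuristic extracts from the page itself. The sketch uses only the Python standard library. The sample feed, the function names and the exact-match accuracy metric are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import date
from email.utils import parsedate_to_datetime

def gold_dates_from_rss(rss_xml):
    """Map each item's <link> to the calendar date of its <pubDate>."""
    gold = {}
    for item in ET.fromstring(rss_xml).iter("item"):
        link = item.findtext("link")
        pub = item.findtext("pubDate")
        if link and pub:
            gold[link.strip()] = parsedate_to_datetime(pub.strip()).date()
    return gold

def date_accuracy(predicted, gold):
    """Share of gold pages whose predicted publishing date matches exactly."""
    if not gold:
        return 0.0
    hits = sum(predicted.get(url) == day for url, day in gold.items())
    return hits / len(gold)

# Hypothetical RSS 2.0 feed with two items.
SAMPLE_RSS = """<rss version="2.0"><channel><title>Example feed</title>
<item><link>http://example.org/a</link>
<pubDate>Mon, 02 May 2011 09:30:00 GMT</pubDate></item>
<item><link>http://example.org/b</link>
<pubDate>Thu, 07 Jul 2011 12:00:00 +0200</pubDate></item>
</channel></rss>"""

gold = gold_dates_from_rss(SAMPLE_RSS)

# Hypothetical output of a page-level date extractor, off by a day on page b.
predicted = {
    "http://example.org/a": date(2011, 5, 2),
    "http://example.org/b": date(2011, 7, 8),
}
accuracy = date_accuracy(predicted, gold)
```

Exact-day matching is the strictest choice; a tolerance of a day or two may be more realistic given time zones and late edits.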

Page 44: Web Scale Named Entity Mining

A zest of Linked Data ?

too slow & fat

for crawling...

use it "offline"

disambiguation, gazetteers, infoboxes, ...

Page 45: Web Scale Named Entity Mining

Play with graphs

entity co-occurrence, page similarity, ...
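
An entity co-occurrence graph can be built by counting, for each pair of entities, the number of documents in which both appear. The sketch below is a minimal version under that assumption; `cooccurrence_graph` and the toy documents are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(doc_entities):
    """Return weighted edges {(entity_a, entity_b): shared document count}.
    Each pair is stored once, with the two names in alphabetical order."""
    edges = Counter()
    for doc in doc_entities:
        for a, b in combinations(sorted(set(doc)), 2):
            edges[(a, b)] += 1
    return edges

# Toy collection of two documents (hypothetical data).
docs = [
    ["Honda", "Austin", "Honda"],
    ["Austin", "Honda", "Argonne"],
]
edges = cooccurrence_graph(docs)
strongest = edges.most_common(1)[0]
```

The resulting edge weights can feed standard graph tools, for instance to cluster entities or to rank the neighbours of an entity the user clicks on.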

Page 46: Web Scale Named Entity Mining

UI/user experience

• search facets

• word clouds

• maps

• dashboards

• infoboxes

• highlighting

• graphs

Page 47: Web Scale Named Entity Mining

Lexical Taxonomies Induction

22nd International Joint Conference on Artificial Intelligence (IJCAI 2011),

Barcelona, Spain, July 19-22nd, 2011

another kind of projection

Page 48: Web Scale Named Entity Mining

6. Digest

a. A real need for attention saving…

b. WebNEM results are encouraging

c. Work in progress, lots of paths to explore

Page 49: Web Scale Named Entity Mining

"There's simply

too much

information out

there."

"Leaders feel

misled. Stupid.

Trapped."

Page 50: Web Scale Named Entity Mining

Final word by Herbert Simon

"Filtering by intelligent programs

is the main part of the answer"

[to information overload]

Page 51: Web Scale Named Entity Mining

MANY THANKS!

joint work of

www.linkedin.com/pub/st%C3%A9phanie-jacquemont/20/271/767

www.linkedin.com/in/fpouilloux

www.ixxo.fr

www.slideshare.net/fpouilloux

Page 52: Web Scale Named Entity Mining

CREDITS

Photos

2. Home page, The 2011 IEEE/WIC/ACM International Conference on Web Intelligence

4. Designing Organizations for an Information-Rich World, The Herbert A. Simon Collection

5. Vlad the Impaler, Wikimedia commons

7. Missile 9M342 of the portable anti-aircraft missile system Igla-S, ©vitalykuzmin.net

10. Internet Map 2005, ©www.opte.org

33. The Inspector, ©DePatie-Freleng Enterprises

36. Nanomaterials for Solid State Hydrogen Storage, book cover, ©springer.com

40. EnerDel/Argonne lithium-ion battery, ©Argonne National Laboratory

40. Pennybacker Bridge - Austin, TX, ©Andy Heatwole

41. 20060206211301_132363.jpg, pulpo.org, ©Jumpedforjoy

44. Linking Open Data cloud diagram, ©Richard Cyganiak and Anja Jentzsch, lod-cloud.net

44. Taji crawl, ©The U.S. Army, www.flickr.com/soldiersmediacenter

48. Views of the solar corona by the Transition Region and Coronal Explorer, Stanford-Lockheed Institute for Space Research, NASA Small Explorer program

49. Hyperformance book cover, www.tjwaters.com

50. Dr Simon solving puzzles, The Herbert A. Simon Collection

Websites

• wi-iat-2011.org

• The Herbert A. Simon Collection, Carnegie Mellon University Libraries, diva.library.cmu.edu/webapp/simon/index.html

• www.google.com

• online.barrons.com

• www.me.utexas.edu/~dmfc-muri

• www.alsace-industrie.fr

• www.hybridcars.com

• www.me.utexas.edu/blogs/meyersresearchgroup

Bibliography

• Simon, H. A. (1971), "Designing Organizations for an Information-Rich World", Carnegie Mellon University Libraries, diva.library.cmu.edu/webapp/simon/item.jsp?q=/box00055/fld04178/bdl0002/doc0001

• Waters, T. J. (2011), "Hyperformance", www.tjwaters.com/hyperformance-excerpt.html

• Navigli, R., Velardi, P., Faralli, S. (2011), "A Graph-based Algorithm for Inducing Lexical Taxonomies from Scratch", Proc. of the 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011), Barcelona, Spain, July 19-22, 2011, pp. 1872-1877.