
Post on 18-Dec-2015


Entity-oriented filtering of large streams

John R. Frank, jrf@mit.edu

Ian Soboroff, ian.soboroff@nist.gov

Max Kleiman-Weiner, maxkw@mit.edu

Dan A. Roberts, drob@mit.edu

http://trec-kba.org

Date: Tue, 13 Mar 2012 02:45:40 +0000

From: Google Alerts <googlealerts-noreply@google.com>

Subject: Google Alert - "John R. Frank"

=== Web - 2 new results for ["John R. Frank"] ===

John R. Frank

SPOKANE, Wash. - John R. Frank, 55, died March 4, 2012, in Coeur d' Alene,

Idaho. Survivors include: his wife, Miki; daughter, Patricia Frank; ...

<http://www.hutchnews.com/obituaries/Frank--John-CP>

In Memory of John R Frank

Biography. John R. Frank, age 55, passed away at Sacred Heart Medical

Center in Spokane, WA, on March 4, 2012. John was born in Hutchison, KS, ...

<http://www.englishfuneralchapel.com/sitemaker/sites/Englis1/obit.cgi?user=583335Frank>

2012 Task: Filtering to Recommend Citations

1) Initialize with a target WP entity
• state of WP from Jan 2012

2) Iterate over stream of text items
• Oct-Dec 2011: train on labels

3) For each, output confidence between 0 and 1
• Jan-Apr 2012: labels hidden
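The three steps above can be sketched as a simple loop. This is a minimal illustration, not the track's reference system: `StreamItem`, `name_match_confidence`, and `filter_stream` are hypothetical names, and the scorer is a toy that only checks which name variants of the target entity appear in the text.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class StreamItem:
    """Hypothetical stand-in for one text item in the stream."""
    doc_id: str
    text: str


def name_match_confidence(entity_names: List[str], text: str) -> float:
    """Toy scorer: fraction of the entity's name variants found in the text."""
    if not entity_names:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for name in entity_names if name.lower() in lowered)
    return hits / len(entity_names)


def filter_stream(
    entity_names: List[str], stream: Iterable[StreamItem]
) -> Iterator[Tuple[str, float]]:
    """Steps 2 and 3: iterate over the stream, emit a confidence in [0, 1]."""
    for item in stream:
        yield item.doc_id, name_match_confidence(entity_names, item.text)


# Step 1: initialize with a target entity (name variants are assumed here).
names = ["Masaru Emoto", "Emoto"]
stream = [
    StreamItem("d1", "Masaru Emoto is a Japanese photographer."),
    StreamItem("d2", "Water covers 70% of the planet."),
]
results = list(filter_stream(names, stream))
```

A real system would replace the toy scorer with a model trained on the labeled portion of the stream; the loop structure stays the same.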

Content Stream
• 462M texts, 40% English
• 4,973 hourly chunks of ~10^5 docs/hour
• News, blogs, forums, and link shortening
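The per-hour volume follows from the other two figures on the slide. A quick check, treating the rounded "462M" as exact (an assumption):

```python
import math

total_texts = 462_000_000  # "462M texts", rounded on the slide
hourly_chunks = 4_973      # number of hourly chunks in the corpus

docs_per_hour = total_texts / hourly_chunks
order_of_magnitude = round(math.log10(docs_per_hour))
days_covered = hourly_chunks / 24

print(f"{docs_per_hour:,.0f} docs/hour, about 10^{order_of_magnitude}")
print(f"{days_covered:.0f} days of stream")
```

The average comes out a little under 10^5 documents per hour, spread over roughly 207 days of hourly chunks.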


Your KBA System

Entities in Wikipedia or another Knowledge Base

Automatically recommend new edits

Diffeo

Sponsors:

s3://aws-publicdatasets/trec/kba/kba-stream-corpus-2012/

Accelerate?

rate of assimilation << stream size

# editors << # entities << # mentions

(definition of a “large” KB)

How many days must a news article wait before being cited in Wikipedia?

Complex entity with many relationships and attributes.

Has many interests, including trying to take over UK soccer teams.

His empire includes many entities…

Note: Usmanov not mentioned in this text!

Citation #18

Elaborate link trails…

Example KBA Rating Task

Published: March 31, 2012

Impact of Thoughts on Water

By Denis Gorce-Bourge

Water covers 70% of our Blue planet and our body is made of about 70% water.

Masaru Emoto is a Japanese Photographer and scientist. He is known over the world for his remarkable work on water and its deep connection with individual and collective consciousness.

For decades, Masaru took pictures of frozen crystals of water and tested the direct influence of the environment on the quality of those crystals.

Pollution has a direct impact on the beauty of a frozen crystal but as well words, music and thoughts. He tested the quality of water crystals by exposing it to various conditions: to written words like hate and violence and Love and gratitude. The results were just astonishing. The crystal exposed to Love and gratitude was beautiful and perfectly formed where the other one was severely degraded. He demonstrated as well the impact of Heavy Metal music versus Mozart or Beethoven and how the vibration of music impacts water.

The very shape of water crystals is modified by violence, aggression, and negative words.


Interannotator Agreement

coref: 97.6% +/- 1.4% (N=5365)
central: 69.5% +/- 2.7% (N=1352)
relevant: 70.9% +/- 2.0% (N=2403)
neutral: 58.4% +/- 3.4% (N=884)
garbage: 84.9% +/- 2.0% (N=2599)
central relevant: 82.6% +/- 1.8% (N=3200)
central relevant neutral: 89.0% +/- 1.7% (N=3551)
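The slides do not say how the "+/-" values were computed. As an observation only, each one matches 1/sqrt(N) expressed in percent and rounded to one decimal, which the sketch below checks against the figures listed above. The function name `inv_sqrt_n_percent` is introduced here for illustration.

```python
import math

# (category, agreement %, reported +/- %, N), copied from the slide.
reported = [
    ("coref", 97.6, 1.4, 5365),
    ("central", 69.5, 2.7, 1352),
    ("relevant", 70.9, 2.0, 2403),
    ("neutral", 58.4, 3.4, 884),
    ("garbage", 84.9, 2.0, 2599),
    ("central relevant", 82.6, 1.8, 3200),
    ("central relevant neutral", 89.0, 1.7, 3551),
]


def inv_sqrt_n_percent(n: int) -> float:
    """Rule-of-thumb uncertainty 1/sqrt(N), expressed in percent."""
    return 100.0 / math.sqrt(n)


for category, agreement, err, n in reported:
    estimate = round(inv_sqrt_n_percent(n), 1)
    print(f"{category}: reported +/- {err}%, 1/sqrt(N) gives {estimate}%")
```

This is a sample-size-only error bar: it does not depend on the agreement rate itself, so it is wider than a binomial standard error would be for the high-agreement categories.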

TRECing the continental divide between NLP and IR

IR:
• User task centric
• Variation in interpretation
• Scores cascading lists
• Constructionist, emergence

NLP:
• Data parsing centric
• Universal annotation
• Scores probabilities
• Reductionist

string matching

task generator

91% recall, 15% precision, 26% F1
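The F1 figure is the harmonic mean of the stated precision and recall. A short check, using the standard F1 definition (the function name is illustrative):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(precision=0.15, recall=0.91)
print(f"F1 = {100 * f1:.0f}%")
```

With recall this high and precision this low, F1 stays close to the precision value, since the harmonic mean is dominated by the smaller of the two.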

KBA 2013

More entity types with an emphasis on temporality in the stream.

Target Entities: People and Organizations
KB: Wikipedia or maybe Freebase
Centrally Relevant: Citation worthy
Training Data: Judgments from early stream
Annotation: High recall on all mentioning docs.

Target Entities: Pharmaceutical Compounds
KB: Merck KB?
Centrally Relevant: Reporting of Adverse Drug Reaction (ADR) (an event)
Training Data: (same)
Annotation: Focus recall on first person reporting & negative reactions?

Target Entities: Event-type Entities, defined by a cluster of entities and possibly a Type-of-Event from a taxonomy
KB: WP/FB for cluster of entities, possibly also event itself.
Centrally Relevant: Provides causality info
Training Data: Judgments on docs for that Type-of-Event but different specific event. Find training data from TDT? Use citations in Category:Current_events?
Annotation: Judge post-hoc?

KBX

Cold Start queries focused on: nil entities related to target cluster and/or causality of event

Pool top-K filtered docs, or use each KBA run as separate KBP input.

(1000x filter)

Must coordinate choice of KBA target entities with desired content of KBs for Cold Start queries.
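Pooling the top-K filtered documents across runs could look like the sketch below. It assumes each KBA run is a list of (doc_id, confidence) pairs; the names `pool_top_k`, `run_a`, and `run_b` are hypothetical, and the slides do not specify a pooling procedure.

```python
from typing import Dict, List, Set, Tuple

Run = List[Tuple[str, float]]  # (doc_id, confidence) pairs from one KBA run


def pool_top_k(runs: Dict[str, Run], k: int) -> Set[str]:
    """Union of the k highest-confidence doc ids from every run."""
    pooled: Set[str] = set()
    for scored_docs in runs.values():
        ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
        pooled.update(doc_id for doc_id, _ in ranked[:k])
    return pooled


runs = {
    "run_a": [("d1", 0.9), ("d2", 0.4), ("d3", 0.1)],
    "run_b": [("d3", 0.8), ("d4", 0.7), ("d1", 0.2)],
}
pool = pool_top_k(runs, k=2)
```

The alternative named on the slide, using each KBA run as a separate KBP input, would skip the union and pass each run's top-K list downstream on its own.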

KBA Stream Corpus 2012 (or the new Stream Corpus 2013)
• 462M texts, 40% English
• 4,973 hourly chunks of ~10^5 docs/hour
• News, blogs, forums, and link shortening


Output KB


Clusters of related entities and/or event-type entities


KBP

KBA

Sponsors: Thank You.

Diffeo

Thanks for your time.

John R. Frank, jrf@mit.edu

http://trec-kba.org
