Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams
Alexander Kotov, ChengXiang Zhai, Richard Sproat
University of Illinois at Urbana-Champaign


Post on 29-Mar-2015


Page 1:

Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Alexander Kotov, ChengXiang Zhai, Richard Sproat

University of Illinois at Urbana-Champaign

Page 2:

Roadmap

• Problem definition
• Previous work
• Approach
• Experiments
• Summary

Page 3:

Motivation

• Web data is generated by a large number of textual streams (news, blogs, tweets, etc.)
• Bursts of entity mentions (people, locations) correspond to a particular event
• Bursts of entity mentions are influenced by bursts of other entities

Intuition: bursts of semantically related entities should be temporally correlated

Page 4:

Problem definition

[Figure: mention counts over time (from t_0 to t_T) for entity 1 and entity 2, annotated with three obstacles to deciding whether the two streams match: sparsity, magnitude, and time lag.]

Page 5:

Temporally correlated bursts

Problem: given a collection of textual streams, discover named entities with correlated bursts

• Provide multilingual summaries of real-life events
• Estimate the social impact of a particular event in different countries
• Differentiate between local and global events
• Discover transliterations of named entities

Page 6:

Roadmap

• Problem definition
• Previous work
• Approach
• Experiments
• Summary

Page 7:

Previous work

Burst detection:
• infinite-state automaton (Kleinberg '02)
• factorial HMMs (Krause '06)
• wavelet transformation (Zhu '03)

Stream correlation:
• distance-based measures: Pearson coefficient (Chien '05)
• singular spectrum transformation (Ide '05)
• topic-based (PLSA, LDA) (Wang '09)

Page 8:

Previous work

• Smoothing is efficient for large amounts of data, but not precise
• Existing methods do not abstract away from the raw data
• Distance-based measures suffer from magnitude and sparsity problems
• Temporal lags are not considered
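
The temporal-lag limitation can be illustrated with a small sketch. The two count streams below are hypothetical, not taken from the dataset: both contain the same one-day burst, but the second stream lags by one day. The Pearson coefficient treats the lagged pair as uncorrelated or worse.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily mention counts (illustration only).
entity_a = [0, 0, 9, 0, 0, 0]
entity_b = [0, 0, 0, 9, 0, 0]  # same burst, one day later

aligned = pearson(entity_a, entity_a)  # identical streams
lagged = pearson(entity_a, entity_b)   # one-day lag

print(aligned, lagged)
```

Here the identical streams score 1.0, while the one-day lag drives the coefficient below zero even though the bursts plausibly describe the same event.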

Page 9:

Roadmap

• Problem definition
• Previous work
• Approach
• Experiments
• Summary

Page 10:

Approach

• Difference in magnitude: normalization with a Markov-Modulated Poisson Process (MMPP)
• Temporal lag: flexible alignment of bursts using dynamic programming

Page 11:

Markov-Modulated Poisson Process

• Ergodic Markov chain over a finite number of states
• Each state is associated with a Poisson distribution
• "Burstiness" of a state is represented by the intensity parameter of its Poisson distribution
• States are labeled by the rank of the intensity parameter
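
The definition above can be sketched as a generative process. This is a minimal discrete-time illustration (one Poisson draw per time step), not the authors' implementation; the function names, transition matrix, and intensities are assumptions chosen for the example.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) sample with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_mmpp(transition, rates, length, seed=0):
    """Simulate state labels and counts from a discrete-time MMPP.

    `rates` must be sorted in ascending order so that the label
    (index + 1) equals the rank of the intensity parameter.
    """
    if list(rates) != sorted(rates):
        raise ValueError("rates must be sorted by intensity")
    rng = random.Random(seed)
    state = 0  # start in the least bursty state (an assumption)
    states, counts = [], []
    for _ in range(length):
        states.append(state + 1)
        counts.append(sample_poisson(rates[state], rng))
        # Move to the next state according to the Markov chain.
        state = rng.choices(range(len(rates)), weights=transition[state])[0]
    return states, counts


# Hypothetical 3-state chain; every entry is positive, so it is ergodic.
transition = [
    [0.90, 0.08, 0.02],
    [0.20, 0.70, 0.10],
    [0.10, 0.20, 0.70],
]
rates = [1.0, 6.0, 20.0]  # hypothetical intensities, lowest to highest

states, counts = simulate_mmpp(transition, rates, length=30)
print(states)
print(counts)
```

State 3 tends to emit the largest counts, so runs of state 3 correspond to bursts.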

Page 12:

Normalization

[Figure: two mention-count streams over time, each shown together with the sequence of MMPP states (1 to 3) assigned to it; the rows are labeled "mention counts" and "MMPP states".]
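
The mapping from mention counts to MMPP states can be sketched with Viterbi decoding. The slides do not say how states are inferred or how parameters are estimated, so this sketch assumes the intensities and transition probabilities are already known; all names and numbers are hypothetical.

```python
import math


def poisson_log_pmf(k, rate):
    """Log-probability of observing count k under Poisson(rate)."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def viterbi_states(counts, rates, transition, initial):
    """Most likely state sequence (1-based labels) for a count stream."""
    n_states = len(rates)
    log_t = [[math.log(p) for p in row] for row in transition]
    score = [
        math.log(initial[s]) + poisson_log_pmf(counts[0], rates[s])
        for s in range(n_states)
    ]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for c in counts[1:]:
        prev = score
        pointers, score = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: prev[r] + log_t[r][s])
            pointers.append(best)
            score.append(prev[best] + log_t[best][s] + poisson_log_pmf(c, rates[s]))
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [s + 1 for s in path]


# Hypothetical parameters and counts (illustration only).
rates = [1.0, 5.0, 18.0]
transition = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
initial = [1 / 3, 1 / 3, 1 / 3]
counts = [1, 0, 2, 1, 14, 21, 15, 2, 1, 0]

normalized = viterbi_states(counts, rates, transition, initial)
print(normalized)  # [1, 1, 1, 1, 3, 3, 3, 1, 1, 1]
```

After decoding, streams with very different raw magnitudes are compared on the same small scale of state labels.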

Page 13:

Normalization

• MMPP consistently outperforms the baseline
• The optimal performance is achieved when the number of states is 3

Page 14:

Burst alignment

Input:
• a pair of normalized mention count (MC) streams of the same length;
• a threshold for "bursty" states;
• a reward constant;
• a penalty function.

Output: a table [symbols and definition not preserved in the transcript]

Page 15:

Burst alignment

[Figure: three panels labeled "perfect alignment", "exponential penalty", and "logarithmic penalty".]

Page 16:

Burst alignment

• Quadratic penalty function in combination with a reward constant of 2 is optimal
• Maximum permitted temporal gap is 1 day
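
The transcript does not preserve the alignment recurrence, so the following is only a sketch of one way a dynamic program could combine the stated ingredients: a threshold for bursty states, a reward constant, a penalty function of the temporal gap, and a maximum permitted gap. The recurrence, the function name `align_bursts`, and the example streams are assumptions, not the authors' algorithm.

```python
def align_bursts(a, b, threshold=2, reward=2.0,
                 penalty=lambda gap: gap ** 2, max_gap=1):
    """Score the best monotone alignment of bursty positions in a and b.

    a, b: sequences of normalized state labels, one per day.
    A pair of positions can be matched only if both states reach
    `threshold` and the positions are at most `max_gap` days apart;
    a match earns `reward` minus `penalty(gap)`.
    """
    n, m = len(a), len(b)
    # table[i][j] = best score using the first i days of a and j days of b.
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(table[i - 1][j], table[i][j - 1])  # skip a day
            gap = abs(i - j)
            both_bursty = a[i - 1] >= threshold and b[j - 1] >= threshold
            if both_bursty and gap <= max_gap:
                best = max(best, table[i - 1][j - 1] + reward - penalty(gap))
            table[i][j] = best
    return table[n][m]


# Hypothetical normalized streams (state labels 1 to 3).
stream_a = [1, 1, 3, 3, 1, 1]
stream_b = [1, 1, 1, 3, 3, 1]  # same burst, one day later
stream_c = [3, 1, 1, 1, 1, 1]
stream_d = [1, 1, 1, 1, 1, 3]  # burst five days away from stream_c

same = align_bursts(stream_a, stream_a)      # 4.0
shifted = align_bursts(stream_a, stream_b)   # 2.0
far = align_bursts(stream_c, stream_d)       # 0.0
print(same, shifted, far)
```

With the quadratic penalty and a reward of 2, a one-day lag still earns a positive score, while bursts farther apart than the permitted gap contribute nothing.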

Page 17:

Roadmap

• Problem definition
• Previous work
• Approach
• Experiments
• Summary

Page 18:

Dataset

• News data crawled from RSS feeds over 4 months
• Basic named entity recognition
• Basic stemming

Page 19:

Correlated Bursts

Real-life events:
• Pattern 1: World Economic Forum in Davos, Switzerland, and death of actor Heath Ledger
• Pattern 2: death of Bobby Fischer
• Pattern 3: assassination of Benazir Bhutto
• Pattern 4: major trading loss incident at a French bank and death of George Habash

Page 20:

Mining transliterations

Static aligned corpora:
+ identical or semantically related contents
+ temporal topical alignment
- limited coverage

Web:
+ covers almost any domain
- difference in burst magnitude
- temporal lag between bursts

Page 21:

Transliteration

• MMPP+DP outperforms one baseline (CS) in all entropy categories and the other baseline (PC) for low- and medium-entropy (more "bursty") entities
• The combination MMPP+DP performs better than MMPP alone
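
The slides do not define how entity entropy is computed. One plausible reading, shown here purely as an assumption, is the Shannon entropy of an entity's mention counts normalized into a distribution over days: a stream concentrated in a few days (bursty) has low entropy, and a stream spread evenly has high entropy.

```python
import math


def stream_entropy(counts):
    """Shannon entropy (in bits) of a count stream normalized over time."""
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return sum(-p * math.log2(p) for p in probs)


# Hypothetical daily mention counts (illustration only).
bursty = [0, 0, 8, 0]  # all mentions on a single day
flat = [2, 2, 2, 2]    # mentions spread evenly

low = stream_entropy(bursty)   # 0.0
high = stream_entropy(flat)    # 2.0
print(low, high)
```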

Page 22:

Roadmap

• Problem definition
• Previous work
• Approach
• Experiments
• Summary

Page 23:

Summary

• Novel multi-stream text mining problem
• Our approach can effectively discover correlated bursts corresponding to major and minor real-life events
• Effective for unsupervised discovery of transliterations
• Method is data independent and not limited to the textual domain

Page 24:

Contributions

First method to use MMPP for burst detection in textual streams

Algorithm for temporally flexible stream correlation based on bursts

Unsupervised method for language-independent transliteration without any linguistic knowledge

Page 25:

Future work

• Applying the proposed method to non-textual data (e.g., sensor streams)
• Burst correlations between entities across different types of Web 2.0 data (news and tweets, news and blogs, news and tags, etc.)