automatically tracking events in the news · automatically tracking events in the news ... joint...
TRANSCRIPT
1
Automatically Tracking Events in
the News
Kathleen McKeown
Department of Computer Science
Columbia University
2
Vision
Generating presentations that connect
Events
Opinions
Personal accounts
Their impact on the world
3
Two Tasks
Monitoring events over time
Predicting their impact on financial
markets
Joint work with Tony Jebara and David
Yao
4
Machine learning framework
Data (often labeled)
Extraction of “features” from text data
Prediction of output
5
Machine learning framework
Data (often labeled)
Extraction of “features” from text data
Prediction of output
What data is available for learning?
6
Machine learning framework
Data (often labeled)
Extraction of “features” from text data
Prediction of output
What features yield good predictions?
7
Two Tasks
Monitoring events over time
Predicting their impact on financial
markets
Joint work with Tony Jebara and David
Yao
8
Monitor events over time
Input: streaming data
News, social media, web pages
At every hour, what’s new
9
10
11
Text Compression
12
Data
NIST evaluation on temporal
summarization
hourly web crawl
October 2011 - February 2013
16.1TB
Different categories of disaster
Climate, man-made, social unrest
13
14
15
16
17
18
19
20
21
22
23
24
25
26
Temporal Summarization Approach
At time t:
1. Predict salience for input sentences Disaster-specific features for predicting
salience
2. Remove redundant sentences
3. Cluster and select exemplar sentences
for t Incorporate salience prediction as a prior
Kedzie & al, Bloomberg Social Good
Workshop, KDD 2014
Kedzie & al, ACL 2015
27
28
Predicting Salience: Model Features
Basic sentence level features
sentence length
punctuation count
number of capitalized words
number of event type synonyms, hypernyms, and hyponyms
29
Predicting Salience: Model Features
Basic sentence level features
sentence length
punctuation count
number of capitalized words
number of event type synonyms, hypernyms, and hyponyms
Why is the number of capitalized words
important?
30
Predicting Salience: Model Features
Basic sentence level features
sentence length
punctuation count
number of capitalized words
number of event type synonyms, hypernyms, and hyponyms
High Salience
Nicaragua's disaster management said it had issued a local
tsunami alert.
Medium Salience
People streamed out of homes, schools and oce buildings as
far north as Mexico City.
Low Salience
Add to Digg Add to del.icio.us Add to Facebook Add to
Myspace
31
Predicting Salience: Model Features
Basic sentence level features
sentence length
punctuation count
number of capitalized words
number of event type synonyms, hypernyms, and hyponyms
High Salience
Nicaragua's disaster management said it had issued a local
tsunami alert.
Medium Salience
People streamed out of homes, schools and oce buildings as
far north as Mexico City.
Low Salience
Add to Digg Add to del.icio.us Add to Facebook Add to
Myspace
32
Predicting Salience: Model Features
Basic sentence level features
sentence length
punctuation count
number of capitalized words
number of event type synonyms, hypernyms, and hyponyms
Why are synonyms, hypernyms and
hyponyms important?
33
Predicting Salience: Model Features
Basic sentence level features
sentence length
punctuation count
number of capitalized words
number of event type synonyms, hypernyms, and hyponyms
High Salience
Nicaragua's disaster management said it had issued a local
tsunami alert.
Medium Salience
People streamed out of homes, schools and oce buildings as
far north as Mexico City.
Low Salience
Add to Digg Add to del.icio.us Add to Facebook Add to
Myspace
34
Predicting Salience: Model Features
Basic sentence level features
Language Models (5-gram Kneser-Ney model)
generic news corpus (10 years AP and NY Times articles)
domain specific corpus (disaster related Wikipedia
articles)
What does a generic language model
capture?
What does a domain specific language
model capture?
35
Predicting Salience: Model Features
Basic sentence level features
Language Models (5-gram Kneser-Ney model)
generic news corpus (10 years AP and NY Times articles)
domain specific corpus (disaster related Wikipedia
articles)
High Salience
Nicaragua's disaster management said it had issued a local
tsunami alert.
Medium Salience
People streamed out of homes, schools and oce buildings as
far north as Mexico City.
Low Salience
Add to Digg Add to del.icio.us Add to Facebook Add to
Myspace
36
Predicting Salience: Model Features
Basic sentence level features
Language Models (5-gram Kneser-Ney model)
generic news corpus (10 years AP and NY Times articles)
domain specific corpus (disaster related Wikipedia
articles)
High Salience
Nicaragua's disaster management said it had issued a local
tsunami alert.
Medium Salience
People streamed out of homes, schools and oce buildings as
far north as Mexico City.
Low Salience
Add to Digg Add to del.icio.us Add to Facebook Add to
Myspace
37
Predicting Salience: Model Features
Basic sentence level features
Language Models (5-gram Kneser-Ney model)
generic news corpus (10 years AP and NY Times articles)
domain specific corpus (disaster related Wikipedia
articles)
High Salience
Nicaragua's disaster management said it had issued a local
tsunami alert.
Medium Salience
People streamed out of homes, schools and oce buildings as
far north as Mexico City.
Low Salience
Add to Digg Add to del.icio.us Add to Facebook Add to
Myspace
38
Predicting Salience: Model Features
Basic sentence level features
Language Models (5-gram Kneser-Ney model)
Geographic Features
tag input with Named-Entity tagger
get coordinates for locations and mean distance to event
High Salience
Nicaragua's disaster management said it had issued a local
tsunami alert.
Medium Salience
People streamed out of homes, schools and oce buildings as
far north as Mexico City.
Low Salience
Add to Digg Add to del.icio.us Add to Facebook Add to
Myspace
39
Predicting Salience: Model Features
Basic sentence level features
Language Models (5-gram Kneser-Ney model)
Geographic Features
tag input with Named-Entity tagger
get coordinates for locations and mean distance to event
High Salience
Nicaragua's disaster management said it had issued a local
tsunami alert.
Medium Salience
People streamed out of homes, schools and oce buildings as
far north as Mexico City.
Low Salience
Add to Digg Add to del.icio.us Add to Facebook Add to
Myspace
40
Predicting Salience: Model Features
Basic sentence level features
Language Models (5-gram Kneser-Ney model)
Geographic Features
Temporal Features measuring burstiness of words
How might we measure the burstiness of
words?
41
Determining Redundancy
Use a semantic similarity metric
Discard sentences with similarity to
previous sentences
42
43
44
45
46
47
SubEvent Identification
➔Decompose articles on a main event into
related sub-events:
Manhattan Blackout Breezy Point fire Public Transit
Outage
Hurricane
Sandy
48
49
Two Tasks
Monitoring events over time
Predicting their impact on financial
markets
Joint work with Tony Jebara and David
Yao
50
COLUMBIA DATA SCIENCE INSTITUTE Can we predict the effect that a particular event – such as extreme weather or political activity -- would have on financial markets?
Machine Learning
Natural Language
Processing
Financial and Economic Indices
Data Science
Take market financial data and traditional news feeds.
Use Natural Language Processing to transform raw data into structured event streams.
Apply Machine Learning tools to uncover previously hidden relationships between events and market behavior.
51
Fin
anci
al D
ata
New
s Feeds
Infe
rence
Bayesi
an N
etw
ork
Str
uct
ure
Dis
coveryE
ventt
Str
eam
s
Predicting Market Impact
52
NLP – Event Features
Binary features
calculated from news
Event type:
11 possible categories
Event location:
US or World
Sampled daily and hourly
53
Financials, Macroeconomics
Financial market indices (65)
• Stock market (15 broad + 27 sector)
• Bond market (8)
• Volatility index (6)
• Commodities (9)
Macroeconomic indicators (8)
• real GDP (2)
• CPI (1)
• PPI (1)
• Income (1)
• Consumption (1)
• Employment/Unemployment situation (2)
Index/indicators selection
54
Transform and Binarize
Compute relative change in indices
55
Equal frequency discretization (S&P 500)
56
ML – Structure Learning
To learn structure we assume samples are drawn iid from
unknown pairwise binary graphical model (no hidden
variables)
Training data: 2009-2014 Newsblaster data and financial
indicators
Use method of Ravikumar, Wainwright, Lafferty [2010]
Asymptotically optimal given mild assumptions & regularizer l1
57
SCI TECH ECONOMIC
DISASTER
RVX VXD VXO
RVX – Russell 2000 Volatility
VXD – Dow Volatility
VXO – S&P 100 Volatility
58
Results
We can predict
relation between event
and market impact
with significantly higher
average test likelihood
for held-out test days
Daily observations are
orders of magnitude more
likely under our model…
Naïve Tree D-Tree Ours-53.7 -37.8 -34.4 -26.4
59
Impact
Learn from past events to predict the
impact of new events on financial markets
Identify new events as they occur in news
feeds
Graph modeling enables identification of
structural relationships between detected
events and financial events
60
Thank You!
The research presented here has been
supported in part by DARPA GRAPH,
and NSF.