Search, Exploration and
Analytics of Evolving Data
Nattiya Kanhabua
L3S Research Center
Hannover, Germany
The 1st Keystone Training School on
Keyword Search over Big Data
23 July 2015, Malta
Lecturer
Education qualification
2007 - 2011: Ph.D. degree, Norwegian University of Science and Technology, Norway
Thesis: “Time-aware Approaches to Information Retrieval”
2003 - 2005: M.Sc. in Computer Science, Asian Institute of Technology, Thailand
Thesis: “Agent-based Simulation of Trade in Barter Trade Exchanges”
1997 - 2001: B.Eng. in Computer Engineering, Kasetsart University, Thailand
Project: “Software Process Enhancement and Control System”
Work experience
2011 - now: Postdoc, L3S Research Center, Germany
05/2015: Visiting researcher, University of Trento, Italy
03-05/2010: Research intern, Yahoo! Research, Spain
2007 - 2011: Temporary Scientific Staff, NTNU, Norway
2006 - 2007: Research assistant, University of Trento, Italy
06-10/2006: Research assistant, AIT, Thailand
2005 - 2006: Analyst programmer, IFDS Group, UK
2002 - 2003: Research assistant, Kasetsart University, Thailand
2001 - 2002: System analyst, Accenture, Thailand & Singapore
Skills
• 7+ years of research experience in information retrieval, data mining, machine learning, predictive methods and spatio-temporal analysis
• 3+ years of research experience in Big Data, e.g., large-scale processing and MapReduce
Hadoop Pig Mahout HBase
Tomcat Servlet Lucene MySQL
Python JAVA JSP PHP
Weka R UML JSON
Eclipse NLP RDF WARC
2 23 July 2015 The 1st Keystone Summer School: Keyword Search
over Big Data
Schedule
9:00 – 10:30 Part I
Introduction to Temporal Dynamics
Temporal Information Extraction
Temporal Query Analysis (I)
10:30 – 11:00 Coffee break
11:00 – 12:30 Part II
Temporal Query Analysis (II)
Time-aware Retrieval and Ranking
Applications of Temporal IR
Conclusions and Outlook
Additional Resource
Book: Temporal Information Retrieval
Foundations and Trends® in Information Retrieval
Volume 9, Issue 2, pp 91-208, 2015
Download: http://goo.gl/TunlBb
References can be found in the book
Introduction to Temporal Dynamics
What are temporal dynamics?
Why do they occur and impact search?
When and how to leverage temporal information for IR?
Temporal Dynamics
Figure: Internet Growth/Usage Phases/Tech Events
(created by Mark Schueler, used with permission)
Temporal Web Dynamics
The Web is changing over time in many aspects, e.g., size, content, structure and how it is accessed through user interactions or queries.
Size: web pages are added/deleted all the time
Content: web pages are edited/modified
Query: users’ information needs change
[Risvik et al., CN 2002; Ke et al., CN 2006] [WebDyn 2010; Dumais, SIAM-SDM 2012]
Web and Index Sizes
Timeline (1995 to 2020):
2000: First billion-URL index. The world’s largest! ≈5000 PCs in clusters!
2004: Index grows to 4.2 billion pages
2008: Google counts 1 trillion unique URLs
2009: TBs or PBs of data/index. Tens of thousands of PCs
?
Web and Index Sizes
http://www.worldwidewebsize.com/
Content Change
The content of the Web changes constantly over time, e.g., web
documents are added, modified or deleted continuously.
National and international initiatives collect and preserve parts of
the Web [Gomes et al., TPDL 2011; Costa et al., TempWeb 2013]
Figure: WayBack Machine
a web archive search tool by
Internet Archive
Content Change
Challenge:
Document representation and retrieval
Categorization of Content Change
Implication:
Crawling, Indexing, Ranking
User Interaction Dynamics
Browsing and querying (or search) behavior
User preference, e.g., likes, comments, interests
User’s profiles [Rybak et al., ECIR 2014]
Query Popularity Change
Challenge:
Time-sensitive queries
Query understanding and processing
Google Insights for Search: http://www.google.com/insights/search/
Query: Halloween
Categorization of Web Search Queries
http://www.google.com/insights/search
Implication:
Query Analysis, Ranking
Temporal Information Extraction
(1) Document Creation Time
(2) Document Focus Time
(3) Entity and Event Evolution
Motivation
Incorporating time into search can increase retrieval effectiveness
Only when temporal information is available
Research problem:
How to determine the publication time of a document?
How to extract temporal information from document contents?
Two Time Aspects
1. Publication or modified time
Task: determining timestamps of documents
Method: rule-based technique, or temporal language models
2. Content or focus time
Task: temporal information extraction
Method: natural language processing, or time and event recognition
algorithms
content time
publication time
Determining Document Creation Time
Problem Statement: It is hard to find a trustworthy time for a web page
Time gap between crawling and indexing
Decentralization and relocation of web documents
No standard metadata for time/date
“I found a bible-like document, but I have no idea when it was created.”
“Let me see… This document was probably written in 850 A.C. with 95% confidence.”
“For a given document with uncertain timestamp, can the contents be used to determine the timestamp with a sufficiently high confidence?”
Current Approaches
1. Content-based
Temporal language model [de Jong et al., AHC 2005;
Kanhabua and Nørvåg, ECDL 2008]
Classifier using features based on text’s time expressions
[Chambers, ACL 2012; Ge et al., EMNLP 2013]
Using burstiness of terms for estimating timestamps
[Kotsakos et al., SIGIR 2014]
2. Non content-based
Finding the oldest version of a page in a web archive [Jatowt
et al., WIDM 2007]
Leveraging external resources [Hauff and Azzopardi, ECIR 2005;
Nunes et al., WIDM 2007; SalahEldeen and Nelson, TempWeb 2013]
Content-based Approach
Temporal Language Models
Based on the statistical usage of words over time
Compare each word of a non-timestamped document with a reference corpus
Tentative timestamp: the time partition whose word usage overlaps most with the document

Temporal Language Models (reference corpus)
Partition   Word         Freq
1999        tsunami      1
1999        Japan        1
1999        tidal wave   1
2004        tsunami      1
2004        Thailand     1
2004        earthquake   1

A non-timestamped document: tsunami, Thailand
Similarity Scores
Score(1999) = 1
Score(2004) = 1 + 1 = 2
Most likely timestamp is 2004
Normalized Log-likelihood Ratio
Normalized log-likelihood ratio [Kraaij, SIGIR Forum 2005]
Variant of Kullback-Leibler divergence
Similarity of a document and time partitions
C is the background model estimated on the corpus
Linear interpolation smoothing to avoid the zero probability of unseen words
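The scoring described on this slide can be sketched in Python. The sketch assumes the usual form of the normalized log-likelihood ratio, Score(d, p) = Σ_w P(w|d) · log(P(w|p) / P(w|C)), with P(w|p) smoothed by linear interpolation with the background model C. The function name, the interpolation weight `lam = 0.9` and the toy corpus layout are illustrative assumptions, not values from the slides.

```python
import math
from collections import Counter

def nllr_scores(doc_words, partitions, lam=0.9):
    """Score each time partition against a document with a smoothed NLLR.

    partitions maps a partition label to the list of words observed in it.
    lam is an assumed interpolation weight, not a value from the slides.
    """
    # Background model C: word counts over the whole reference corpus.
    background = Counter()
    for words in partitions.values():
        background.update(words)
    total_bg = sum(background.values())

    doc_counts = Counter(doc_words)
    doc_len = len(doc_words)

    scores = {}
    for label, words in partitions.items():
        counts = Counter(words)
        total = len(words)
        score = 0.0
        for w, c in doc_counts.items():
            p_bg = background[w] / total_bg
            if p_bg == 0.0:
                continue  # word never seen in the corpus: skipped in this sketch
            # Linear interpolation keeps P(w|p) above zero for unseen words.
            p_part = lam * (counts[w] / total) + (1.0 - lam) * p_bg
            score += (c / doc_len) * math.log(p_part / p_bg)
        scores[label] = score
    return scores

# Toy reference corpus taken from the table on the slides.
corpus = {
    1999: ["tsunami", "japan", "tidal wave"],
    2004: ["tsunami", "thailand", "earthquake"],
}
scores = nllr_scores(["tsunami", "thailand"], corpus)
best = max(scores, key=scores.get)
```

On this toy corpus the 2004 partition receives the higher score, in line with the simple overlap count shown on the slides.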
Improving Temporal LMs
Enhancement techniques
1. Semantic-based data preprocessing
2. Search statistics to enhance similarity scores
3. Temporal entropy as term weights
[Kanhabua et al., ECDL 2008] (Slide provided by the authors)
Semantic-based Preprocessing
Intuition: Direct comparison between extracted words and corpus partitions has limited accuracy
Approach: Integrate semantic-based techniques into document preprocessing
Semantic-based preprocessing techniques and their descriptions:
Part-of-speech tagging: Select only interesting classes of words, e.g., nouns, verbs, and adjectives
Collocation extraction: Co-occurrence of different words can alter the meaning, e.g., “United States”
Word sense disambiguation: Identify the correct sense of a word from context, e.g., “bank”
Concept extraction: Compare concepts instead of original words, e.g., “tsunami” and “tidal wave” have the common concept of “disaster”
Word filtering: Select the top-ranked words according to TF-IDF scores for a comparison
[Kanhabua et al., ECDL 2008] (Slide provided by the authors)
Leveraging Search Statistics
Intuition: Search statistics from Google Zeitgeist (GZ) can increase the probability of a tentative time partition
Approach: Linearly combine a GZ score with the normalized log-likelihood ratio
P(wi) is the probability that wi occurs:
P(wi) = 1.0 if wi is a gaining query
P(wi) = 0.5 if wi is a declining query
f(R) converts a rank into a weight; a higher-ranked query is more important
ipf is the inverse partition frequency, ipf = log N/n
[Kanhabua et al., ECDL 2008] (Slide provided by the authors)
Temporal Entropy
Intuition: A term weight depends on how good the term is for separating time partitions (discriminative)
Approach: Propose temporal entropy, based on a term selection method presented in Lochbaum and Streeter
A measure of the temporal information that a word conveys
Captures the importance of a term in a document collection, whereas TF-IDF weights a term in a particular document
Tells how good a term is at separating a partition from others
A term occurring in few partitions has higher temporal entropy than one appearing in many partitions
The higher the temporal entropy of a term, the better it represents a partition
Np is the total number of partitions in a corpus
The probability of a partition p containing a term wi
[Kanhabua et al., ECDL 2008] (Slide provided by the authors)
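A minimal Python sketch of temporal entropy, assuming the formulation TE(wi) = 1 + (1 / log Np) · Σ_p P(p|wi) · log P(p|wi), where P(p|wi) is the term's frequency in partition p divided by its frequency over all partitions. This formula and the function name are assumptions made for illustration; the slides only name the two quantities involved.

```python
import math

def temporal_entropy(term_freqs, num_partitions):
    """Temporal entropy of one term.

    term_freqs: frequency of the term in each partition where it occurs.
    num_partitions: total number of partitions in the corpus (Np).
    """
    total = sum(term_freqs)
    if total == 0 or num_partitions < 2:
        return 0.0
    entropy_sum = 0.0
    for tf in term_freqs:
        if tf > 0:
            p = tf / total          # P(p | w): share of the term's mass in p
            entropy_sum += p * math.log(p)
    # Normalizing by log(Np) keeps the value between 0 and 1.
    return 1.0 + entropy_sum / math.log(num_partitions)

focused = temporal_entropy([5], num_partitions=4)           # one partition only
spread = temporal_entropy([5, 5, 5, 5], num_partitions=4)   # evenly spread
```

Under this assumed form, a term confined to a single partition gets the maximum weight of 1, while a term spread evenly over all partitions gets a weight of about 0, which matches the behavior described on the slide.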
Non Content-based Approaches
Dating a document using its neighbors
1. Web pages linking to the document
I.e., incoming links
2. Web pages pointed to by the document
I.e., outgoing links
3. Media assets associated with the document
E.g., images
The average of the last-modified dates of the neighbors is used as the timestamp
[Hauff and Azzopardi, 2005; Nunes et al., WIDM 2007]
Non Content-based Approaches
Drawbacks:
Rely on the availability and accuracy of other information
Cover only pages from most recent years
Cannot determine the age of the actual contents
[SalahEldeen and Nelson, 2013]
Determining Document Focus Time
Three types of temporal expressions
1. Explicit: time mentions being mapped directly to a time point or
interval, e.g., “July 4, 2012”
2. Implicit: imprecise time point or interval, e.g., “Independence Day
2012”
3. Relative: resolved to a time point or interval using other types or
the publication date, e.g., “next month”
Time and event recognition [Mani and Wilson, ACL 2000]
A mix of hand-crafted and machine-learnt rules
Ranking the most relevant temporal expressions [Strötgen et al.,
TempWeb 2012]
Time Taggers for Calculating Focus Time
HeidelTime: http://heideltime.ifi.uni-heidelberg.de/heideltime
Timestamp: 2013/7/15
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
Limitations
Documents may lack any temporal expressions
Temporal expressions may be weakly related to the document’s theme
Temporal taggers are not perfect
Estimating document focus time without using temporal expressions
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
Focus Time of Documents
Def. A document has focus time t if its content refers to t
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
Estimating Focus Time: Concept
Use time-referenced documents for estimating the focus time of a target document
Figure: a target document containing the words A, B and C is compared with news article collections in which these words appear next to date mentions (e.g., 1915, 1935, 1948, 1978, 2003, May 2011, 2012), yielding the target document focus time
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
Word Graph
Word co-occurrence graph from large collections of news articles
Link weight estimated by Jaccard coefficient using sentence as unit
Figure: example word graph with nodes war, nazi, 1945, 1939, auschwitz, jews, germany, jalta, hiroshima
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
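The word graph construction can be sketched as follows. The sketch treats each sentence as a bag of tokens (words and year mentions alike) and weights a link by the Jaccard coefficient of the two sentence sets. The function name and the toy sentences are illustrative assumptions.

```python
from itertools import combinations
from collections import defaultdict

def build_word_graph(sentences):
    """Return {(w1, w2): jaccard} for tokens that co-occur in a sentence.

    sentences is a list of token lists; keys are alphabetically ordered pairs.
    """
    sentences_with = defaultdict(set)   # token -> ids of sentences containing it
    for idx, sentence in enumerate(sentences):
        for word in set(sentence):
            sentences_with[word].add(idx)

    graph = {}
    for w1, w2 in combinations(sorted(sentences_with), 2):
        both = sentences_with[w1] & sentences_with[w2]
        if both:
            either = sentences_with[w1] | sentences_with[w2]
            graph[(w1, w2)] = len(both) / len(either)
    return graph

sentences = [
    ["war", "1945", "hiroshima"],
    ["war", "1939", "germany"],
    ["war", "1945", "germany"],
]
graph = build_word_graph(sentences)
```

Comparing all token pairs is quadratic in the vocabulary size, which is fine for a toy example; a large news collection would call for an inverted index or pair counting per sentence instead.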
Estimating Direct Word-Year Association
Word-year associations derived from graph
Word w is strongly associated with year y if it frequently co-occurs with y
One association score per word and year, from 1900 to 2010:
A(war, 1900), A(war, 1901), …, A(war, 1944), A(war, 1945), …, A(war, 2009), A(war, 2010)
A(hiroshima, 1900), A(hiroshima, 1901), …, A(hiroshima, 1944), A(hiroshima, 1945), …, A(hiroshima, 2009), A(hiroshima, 2010)
A(word, 1900), A(word, 1901), …, A(word, 1944), A(word, 1945), …, A(word, 2009), A(word, 2010)
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
Second Level Term-Year Association
Word w is strongly associated with year y if many other words that frequently co-occur with w are also strongly associated with y
$A^{2}(w_i, y) = \frac{1}{V} \sum_{j=1}^{V} A(w_i, w_j) \, A_{dir}(w_j, y)$
Figure: example word graph with nodes war, nazi, 1945, 1939, auschwitz, jews, germany, jalta, hiroshima, israel
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
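A hedged sketch of the second-level association described on this slide. It assumes the score is the average, over all vocabulary words, of the word-word link weight multiplied by the other word's direct association with the year. The data layout (`word_assoc` keyed by word pairs, `direct_year_assoc` keyed by word, then year) and all names are hypothetical.

```python
def second_level_association(word, year, word_assoc, direct_year_assoc):
    """Average over vocabulary words w_j of A(word, w_j) * A_dir(w_j, year).

    word_assoc: {(w1, w2): weight}, looked up in either order.
    direct_year_assoc: {word: {year: weight}}; its keys define the vocabulary.
    """
    vocabulary = list(direct_year_assoc)
    if not vocabulary:
        return 0.0
    total = 0.0
    for other in vocabulary:
        link = word_assoc.get((word, other), word_assoc.get((other, word), 0.0))
        total += link * direct_year_assoc[other].get(year, 0.0)
    return total / len(vocabulary)

# Toy data: "israel" never co-occurs with 1945 directly, but its neighbors do.
word_assoc = {("israel", "jews"): 0.6, ("israel", "war"): 0.4}
direct_year_assoc = {
    "jews": {1945: 0.5},
    "war": {1945: 0.8, 1939: 0.6},
    "israel": {1948: 0.9},
}
a_1945 = second_level_association("israel", 1945, word_assoc, direct_year_assoc)
a_1939 = second_level_association("israel", 1939, word_assoc, direct_year_assoc)
```

The example shows the intended effect: a word with no direct link to a year still receives a non-zero score through its strongly associated neighbors.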
Estimating Document-Year Association
If a document contains many words strongly associated with year y, the document is strongly associated with y
Figure: A(word, year) curves over time (1900 to 2000) for words A, B and C; a document containing A once and B and C twice each has the document-year association A + 2B + 2C
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
Finding Discriminative Features
Not every word is useful for estimating text focus time
E.g., “man”, “city” have stable associations with years
Temporal entropy: measure of variability of word associations
Temporal kurtosis: measure of peakedness of word associations
E.g., “war”, “earthquake” vs. “hitler”, “stalingrad”
Figure: A(word, year) curves over 1900 to 2000 for words A and B, with Temporal_Entropy(A) < Temporal_Entropy(B) and Temporal_Kurtosis(A) > Temporal_Kurtosis(B)
Temporal entropy and temporal kurtosis are used as temporal weights for words
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
Importance of Words in Document
Words weakly related to the document theme should be skipped
Target Document:
“President Obama took part in the celebrations of the Polish Independence Day. The US president met main Polish politicians in Warsaw. Poland regained independence at the end of the World War I following Bolshevik Revolution. It then lost the independence as a result of Nazi and Soviet invasions led by Hitler and Stalin. Poland is located in East Europe.”
Document to graph conversion: independence, poland, war, hitler, …
TextRank:
0.90 independence
0.82 poland
0.74 war
0.61 nazi
0.56 hitler
0.54 ….
TextRank scores used as discriminatory semantic weights for words
[Mihalcea and Tarau, EMNLP 2004]
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
Estimating Focus Time
Weighted sum (temporality and semantics) of the A(word, year) curves of the words in the document (A, B, C)
Focus time: Interval based (threshold)
Focus time: Instant based
Figure: document score over time (1900 to 2000) with the interval-based and instant-based focus time
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
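The two focus time representations can be sketched in Python. The sketch assumes that the temporal and semantic weights of a word are multiplied, that the instant-based focus time is the year with the highest score, and that the interval-based focus time is the set of years whose score reaches a threshold. These choices and all names are assumptions for illustration; the slides do not spell them out.

```python
def document_year_scores(doc_words, assoc, temporal_weight, semantic_weight, years):
    """Weighted sum of word-year associations over all word occurrences.

    doc_words keeps repetitions, so a word occurring twice counts twice.
    Missing weights default to 1.0 (an assumption of this sketch).
    """
    scores = {}
    for y in years:
        scores[y] = sum(
            temporal_weight.get(w, 1.0)
            * semantic_weight.get(w, 1.0)
            * assoc.get(w, {}).get(y, 0.0)
            for w in doc_words
        )
    return scores

def instant_focus_time(scores):
    """Single year with the highest score (assumed reading of 'instant based')."""
    return max(scores, key=scores.get)

def interval_focus_time(scores, threshold):
    """All years whose score reaches the threshold, in ascending order."""
    return sorted(y for y, s in scores.items() if s >= threshold)

assoc = {
    "A": {1940: 0.2, 1945: 0.9},
    "B": {1945: 0.5, 1950: 0.1},
    "C": {1940: 0.3, 1945: 0.4},
}
doc = ["A", "B", "C", "B", "C"]          # corresponds to A + 2B + 2C
scores = document_year_scores(doc, assoc, {}, {}, [1940, 1945, 1950])
instant = instant_focus_time(scores)
interval = interval_focus_time(scores, threshold=0.5)
```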
Combined Approach
Combining estimated focus time and temporal expressions in text
Representing dates on timeline - Gaussian Kernel Density Estimate
Mixture of Gaussian distributions with means centered on extracted
dates
S_Comb(d, y) is computed from S_Est(d, y) and S_TempExp(d, y)
Figure: a target document whose text mentions 1935, 2011, 1932, 1940, 1932 and 2001, shown as a density over the timeline points 1932, 1935, 1940, 2001, 2011
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
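The timeline representation of extracted dates can be sketched as a mixture of equally weighted Gaussians centered on the extracted years. The bandwidth of one year, the equal weighting and the function name are assumptions for illustration; how this density is combined with the estimated focus time is not reproduced here.

```python
import math

def date_density(extracted_years, years, bandwidth=1.0):
    """Gaussian kernel density estimate over a range of years.

    Each extracted year contributes one Gaussian with the given standard
    deviation (bandwidth); the value 1.0 is an assumed default.
    """
    if not extracted_years:
        return {y: 0.0 for y in years}
    norm = 1.0 / (len(extracted_years) * bandwidth * math.sqrt(2.0 * math.pi))
    return {
        y: norm * sum(
            math.exp(-0.5 * ((y - m) / bandwidth) ** 2) for m in extracted_years
        )
        for y in years
    }

# Year mentions from the example target document on the slide.
mentions = [1935, 2011, 1932, 1940, 1932, 2001]
density = date_density(mentions, range(1900, 2014))
peak_year = max(density, key=density.get)
```

Because 1932 is mentioned twice, it receives the highest density in this example. A wider bandwidth would blur neighboring mentions together and can shift the peak between them.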
Experimental Settings: Word Graphs
News articles collected from Google News Archive using country names as queries
Germany (87k), UK (149k), France (110k), Japan (97k), Israel (92k)
Published within [1990, 2010]
Dates falling in [1900, 2013] were found using regular expressions
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
Experimental Settings: Test Datasets
Datasets on events related to countries:
Wiki: 250 Wikipedia pages about events
Books: 735 paragraphs from 2 text books about history (timelines)
Web: 812 paragraphs from web pages on history (BBC timelines,
etc.)
Dataset  total #doc  avg. #sent  avg. time span of events  avg. year of events  avg. #dates
Wiki     250         179         3.4 years                 1958                 14.5
Book     735         43          4.4 years                 1982                 4.5
Web      819         18.3        1.3 years                 1957                 2.4
Experimental Settings: Baselines
Baselines:
Random
Date-based (using only dates in document text)
LDA-based
1. 100 topics over sentences containing year mentions
2. Finding topic distribution of each year
3. Calculating document-year association based on topic distribution
of documents
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
Experimental Settings: Measurements
Measures:
Average error (in years), used for the instant-based representation
Pearson Correlation Coefficient (-1..+1) between ground truth years and years in focus time, used for the interval-based representation
Figure: ground truth compared with estimated focus time for both representations
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
Experimental Results
Average error
Dataset  Random baseline  LDA baseline  Date-based baseline  Proposed (no dates)  Proposed combined (with dates)
Wiki     36.5             27.2          3                    18.3                 2.83
Books    39.3             37.3          48.1                 23.5                 20.4
Web      40.5             41.4          53.4                 23.6                 20.7

Pearson Correlation Coefficient
Dataset  Random baseline  LDA baseline  Date-based baseline  Proposed (no dates)  Proposed combined (with dates)
Wiki     0                0.1           0.65                 0.29                 0.66
Books    0                0.04          0.01                 0.25                 0.30
Web      0                0.02          -0.03                0.26                 0.41
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
Effect of Time Distance on Focus Time
How well can we estimate the focus time of documents about the distant past?
Figure: panels for Wiki, Books and Web (instant-based focus time representation)
Question?
Temporal Query Analysis
(1) Temporal query intent
(2) Dynamic query subtopics
Temporal Queries
Temporal information needs
Searching temporal document collections
E.g., digital libraries, web/news archives
Users: historians, librarians, journalists or students
Temporal queries exist in both standard collections and the Web
Relevancy is dependent on time
Documents are about events at particular time
Types of Temporal Queries
Two types of temporal queries
1. Explicit: time is provided, e.g., “Presidential election 2012”
2. Implicit: time is not provided, e.g., “Germany World Cup”
Temporal intent can be implicitly inferred
I.e., refer to the World Cup event in 2006
Studies of web search query logs show a significant fraction of temporal queries
1.5% of web queries are explicit
~7% of web queries are implicit
13.8% of queries contain explicit time and 17.1% of queries have temporal intent implicitly provided
[Nunes et al., ECIR 2008; Metzler et al., SIGIR 2009; Zhang et al., EMNLP 2010]
Figure: Variances of
temporal queries and
their dynamics
Understanding Temporal Query Intent
Current approaches:
1. Mining temporal patterns in query logs
[Vlachos et al., SIGMOD 2004; Radinsky et al., WWW 2012]
2. Analyzing top-k search results
[Jones and Diaz, TOIS 2007; Campos et al., CIKM 2012]
Motivation
Temporal queries are a significant fraction of Web
search queries [Zhang et al., EMNLP 2010]
13.8% of explicit temporal queries
17.1% of implicit temporal queries
Characteristics:
Certain temporal patterns, i.e., spikes, periodicity
(hourly or daily), seasonality and trends
Underlying temporal information needs without
temporal patterns observed
Tasks:
Understand temporal search intent
Enable advanced enhancement techniques
Automatic method for detecting events in search streams
E.g., US Election 2016, Brazil FIFA World Cup
[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
Preliminaries
Data model:
Set of queries Q issued at different time points
Set of clicked URLs U and click-through data
Temporal document collection D
q: keywords or term(q), and hitting time(q)
yq: time series data extracted from Q, U and D
Two-step approach:
Automatically extract a set of candidate queries {q1, ..., qn} from Q
Classify candidates as event-related queries {e1, ..., em} using
machine learning techniques
[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
Identifying Event Candidates
Time and keyword-based clustering:
Step 1: Partition query logs into one-week intervals
• Group queries from the same event
• A partition may contain multiple, unrelated events
Step 2: Cluster queries by lexical similarity
• Pre-process and sort queries alphabetically
• Compute Jaccard similarity of a query pair
Example cluster “Easter”: easter 2006, easter 2007, easter 20crafts, easter activities, easter animation, easter animations, easter background, easter basket, easter bread, easter bucket, easter bunny, easter bunny decorations, easter bunny lights
[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
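The second clustering step can be sketched as follows for the queries of one weekly partition. The sketch sorts the queries alphabetically and greedily assigns each one to the first cluster whose first member is similar enough in terms of word-level Jaccard similarity. Comparing only against the first member and the function names are assumptions; the threshold of 0.2 follows the Jaccard threshold listed in the experimental settings.

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two queries."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def cluster_queries(queries, threshold=0.2):
    """Greedy single-pass clustering over alphabetically sorted queries."""
    clusters = []
    for q in sorted(set(queries)):
        for cluster in clusters:
            # Compare with the cluster's first (alphabetically smallest) query.
            if jaccard(cluster[0], q) > threshold:
                cluster.append(q)
                break
        else:
            clusters.append([q])
    return clusters

week = [
    "easter bunny", "easter 2006", "easter basket",
    "world cup 2006", "easter", "world cup tickets",
]
clusters = cluster_queries(week)
```

Sorting first means that short head queries such as "easter" tend to open a cluster, so the longer variants that follow are compared against them.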
Event-related Query Classification
Classify a query as event-related or not:
Periodic and seasonal events
Popular and trending events
Sporadic (rare) and unseen events
General time-sensitive queries
Underlying temporal information needs
Features:
Time-series features, e.g., seasonality or trends
Popularity-based features, e.g., click-through and burstiness
Statistical features, e.g., probability distribution of results,
temporal KL-divergence and skewness (kurtosis)
[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
Seasonality
Detect seasonal queries [Shokouhi, SIGIR 2011]
E.g., annual events such as US Open and Easter, or a 4-year recurring event such as the FIFA World Cup
Method: time-series decomposition using Holt-Winters adaptive exponential smoothing
Input: time-series data extracted from external document collections, YD
Compute a cosine similarity as seasonality
Y is the original time-series data
S is the seasonality component
Figure: time series for the queries “Easter” and “World cup”
[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
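A minimal sketch of the seasonality feature as the cosine similarity between the original series Y and a seasonal component S. Holt-Winters decomposition is not implemented here; as a plainly simpler stand-in, the seasonal component is taken to be the per-position mean over all cycles of an assumed period. Function names and the toy series are illustrative.

```python
import math

def cosine_similarity(y, s):
    """Cosine similarity between two equally long series."""
    dot = sum(a * b for a, b in zip(y, s))
    norm_y = math.sqrt(sum(a * a for a in y))
    norm_s = math.sqrt(sum(b * b for b in s))
    if norm_y == 0.0 or norm_s == 0.0:
        return 0.0
    return dot / (norm_y * norm_s)

def naive_seasonal_component(y, period):
    """Per-position mean over all cycles; a stand-in for Holt-Winters."""
    if period <= 0 or period > len(y):
        raise ValueError("period must be between 1 and len(y)")
    means = []
    for pos in range(period):
        values = y[pos::period]
        means.append(sum(values) / len(values))
    return [means[i % period] for i in range(len(y))]

# A series peaking at the same position in every cycle of length 4 ...
seasonal = [0, 0, 10, 0] * 3
# ... and one whose peak moves from cycle to cycle.
drifting = [10, 0, 0, 0, 0, 10, 0, 0, 0, 0, 10, 0]

seasonality_a = cosine_similarity(seasonal, naive_seasonal_component(seasonal, 4))
seasonality_b = cosine_similarity(drifting, naive_seasonal_component(drifting, 4))
```

The perfectly periodic series reaches a similarity of about 1, while the drifting one scores clearly lower.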
Autocorrelation
Detect trending events by their predictability
Cross correlation with itself or between its
past and future values at different time lags
The stronger the inter-day dependencies, the higher the value of autocorrelation
With lag = 1, the second time series is shifted by one day; this is called 1st-order autocorrelation
[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
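First-order autocorrelation can be sketched with the standard sample estimator: the covariance between the series and a copy of itself shifted by the lag, divided by the variance of the series. The function name and the toy series are illustrative assumptions.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    if lag <= 0 or lag >= n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0.0:
        return 0.0  # a constant series carries no dependency information
    cov = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return cov / var

trend = [1, 2, 3, 4, 5, 6, 7, 8]          # each day resembles the previous one
alternating = [1, 9, 1, 9, 1, 9, 1, 9]    # each day contradicts the previous one
r_trend = autocorrelation(trend)
r_alternating = autocorrelation(alternating)
```

A steadily trending series yields a positive value, while a series that flips every day yields a negative one.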
Temporal KL-divergence
Analyze a temporal distribution in a result set
Measure the difference between the distribution over time
of top-k documents of q and the document collection C
P(t|q) is the probability of generating a publication date t
given q
P(t|C) is the probability of a publication date t in the
collection
[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
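Temporal KL-divergence can be sketched with the standard definition, Σ_t P(t|q) · log(P(t|q) / P(t|C)), estimating both distributions by relative frequencies of publication dates. The function name, the use of years as dates and the toy data are illustrative assumptions; no smoothing is applied.

```python
import math
from collections import Counter

def temporal_kl(result_dates, collection_dates):
    """KL divergence between the date distribution of top-k results and
    the date distribution of the whole collection."""
    p_q = Counter(result_dates)
    p_c = Counter(collection_dates)
    n_q, n_c = len(result_dates), len(collection_dates)
    kl = 0.0
    for t, count in p_q.items():
        if p_c[t] == 0:
            # Top-k results come from the collection, so this should not occur.
            raise ValueError("result date missing from the collection")
        p_t_q = count / n_q
        p_t_c = p_c[t] / n_c
        kl += p_t_q * math.log(p_t_q / p_t_c)
    return kl

collection = [2004] * 5 + [2005] * 5 + [2006] * 5 + [2007] * 5
bursty = temporal_kl([2006] * 9 + [2007], collection)       # results pile up in 2006
flat = temporal_kl([2004, 2005, 2006, 2007], collection)    # results mirror the collection
```

A query whose results concentrate on one period diverges strongly from the collection, whereas a query whose results follow the collection's distribution has a divergence of about 0.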
Surprise Score
Detect unseen events or surprisingly popular
queries [Radinsky et al., WWW 2012]
Assume an unplanned event happening when there is
a significant prediction error
Compute the sum of squared errors of prediction
(SSE) using a simple linear regression model
[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
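The surprise score can be sketched as the sum of squared errors of a least-squares line. The sketch assumes the regression is fitted to the query's time series against the time index 0..n-1; which variables are actually regressed is not stated on the slide, so this choice, the function name and the toy series are assumptions.

```python
def surprise_score(series):
    """SSE of a least-squares line fitted to the series against time."""
    n = len(series)
    if n < 2:
        return 0.0
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(series) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, series))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Squared difference between each observation and the fitted line.
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, series))

steady = [10, 12, 14, 16, 18, 20]   # perfectly predictable growth
spiky = [10, 12, 14, 90, 18, 20]    # one unexpected burst
score_steady = surprise_score(steady)
score_spiky = surprise_score(spiky)
```

The predictable series has an error of about 0, while the single burst produces a large error, which is the signal used to flag an unplanned event.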
Experiments
Query logs:
• Two datasets, i.e., AOL and MSN
• AOL: 30M queries March 1 - May 31, 2006
• MSN: 15M queries from May 2006
Temporal collection:
• The New York Times Annotated Corpus
• 1.8M documents from 1987 - 2007
Setting:
• HeidelTime for time extraction and OpenNLP for entity extraction
• Cleansing-step parameters: Jaccard similarity threshold>0.2; edit
distance<3; overlap n-gram=2
• For burstiness features, default parameters for the burst detection
technique provided by CISHELL
In total, 837 event-related queries
[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
Experimental Results (I)
Feature selection:
• Study high-impact (best) features
• Investigate their importance independent of classification algorithms
• InfoGainAttributeEval method in WEKA
Main findings:
• Discriminative features are mostly derived from D and Q
• TemporalKL and kurtosis are among the influential features
• Trend-based features, such as autocorrelation, burst weight, and trending level, play an important role
• Seasonality computed from Q has less impact than the one extracted from D
[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
Experimental Results (II)
Query classification:
• Several classifiers, i.e., support vector
machine (SVM), AdaBoost, decision tree
(J48), and neural network (NN)
• Metrics: accuracy, precision, recall, F-measure
using 10-fold cross validation
Main findings:
• J48 is the best performing algorithm
• TemporalKL achieves an accuracy of 84%
• Adding autocorrelation, kurtosis, and
seasonality increases the performance
• However, the performance drops after adding
further features such as max. query frequency
[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
Analyzing Top-k Search Results
Using temporal language models
Determine time of queries when no time is given explicitly
Re-rank search results using the determined time
Exploiting time from search snippets
Extract temporal expressions (i.e., years) from the contents of top-k
retrieved web snippets for a given query
Content-based language-independent approach
[Kanhabua and Nørvåg, ECDL 2010; Campos et al., CIKM 2012]
Determining Time of Queries
Approach I. Dating using keywords*
Approach II. Dating using top-k documents*
Queries are short keywords
Inspired by pseudo-relevance feedback
Approach III. Using timestamp of top-k documents
No temporal language models are used
*Using Temporal Language Models proposed by de Jong et al.
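Approach III can be illustrated with a short sketch, since it needs no temporal language model. The sketch assumes that each of the top-k documents carries a publication year and that candidate query times are ranked by their share of the top-k results; the function name and the cut-off n are hypothetical.

```python
from collections import Counter

def query_time_from_timestamps(topk_years, n=3):
    """Rank candidate query times by how often they occur among top-k documents."""
    counts = Counter(topk_years)
    total = sum(counts.values())
    # Most frequent year first; ties are broken by the earlier year (assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(year, count / total) for year, count in ranked[:n]]

profile = query_time_from_timestamps([2005, 2005, 2004, 2006, 2005, 2004])
```

The returned list of (year, weight) pairs plays the role of the query's temporal profile in this simplified setting.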
[Kanhabua and Nørvåg, ECDL 2010] (Slide provided by the authors)
I. Dating using Keywords
[Figure: the query's temporal profiles]
[Kanhabua and Nørvåg, ECDL 2010] (Slide provided by the authors)
II. Dating using Top-k Documents
[Figure: the query's temporal profiles]
[Kanhabua and Nørvåg, ECDL 2010] (Slide provided by the authors)
III. Using Timestamp of Documents
[Figure: the query's temporal profiles]
[Kanhabua and Nørvåg, ECDL 2010] (Slide provided by the authors)
Re-ranking Search Results
Intuition: documents published close to the time of queries are
more relevant
Assign document priors based on publication dates
[Figure: a query against a news archive; determined time: 2005, 2004, 2006, ...; D2009 in the initial retrieved results, D2005 in the re-ranked results]
[Kanhabua and Nørvåg, ECDL 2010] (Slide provided by the authors)
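The re-ranking idea can be sketched as below. This is only an illustration under stated assumptions: each result carries a retrieval score and a publication year, the determined query times come with weights, and the document prior is simply the weight of the matching year plus a small floor. The exact prior used by the authors is not given on the slide.

```python
def rerank(results, query_times, floor=0.01):
    """Re-rank (doc_id, score, pub_year) tuples using a publication-date prior."""
    rescored = []
    for doc_id, score, pub_year in results:
        # Assumption: prior = weight of the document's year among query times.
        prior = query_times.get(pub_year, 0.0) + floor
        rescored.append((doc_id, score * prior))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

initial = [("D2009", 0.9, 2009), ("D2005", 0.8, 2005)]
times = {2005: 0.5, 2004: 0.3, 2006: 0.2}
reranked = rerank(initial, times)
```

With these toy values, the document from 2005 moves above the document from 2009 although its initial retrieval score is lower.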
Change of Query Subtopics over Time
query: ncaa
14/03/2006: march madness began
18/03/2006: ncaa women tournament began
01/04/2006: final four began
[Nguyen and Kanhabua, ECIR 2014]
Mining Temporal Anchor Texts
Anchor texts are complementary descriptions
of target pages, widely used to improve search
Characteristics:
Short summary (a few words) of target pages
Collective wisdom of people other than authors
Similar behavior to real-world queries and titles
Capturing aboutness or what a document is about
Main ideas:
Temporal anchor texts mined from the edit history of
Wikipedia as a hook for tracking entity evolution
Large-scale analysis and a more robust discovery of
evolving information using limited resources
Mining Temporal Anchor Texts
1. Partition Wikipedia revisions using the one-month granularity
2. For each Wikipedia snapshot, identify named entity articles/pages
3. Extract anchor texts from all articles linking to an entity page
4. Rank aggregated entity-anchor relationships at a particular time t
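Steps 1, 3 and 4 above can be sketched as follows. The sketch assumes that links have already been extracted as (revision date, source page, entity page, anchor text) tuples and that entity pages were identified beforehand (step 2); ranking by raw link count is an assumption, and the toy links are made up for illustration.

```python
from collections import Counter, defaultdict
from datetime import date

def build_snapshots(links):
    """Group anchor texts by (month, entity) using one-month granularity."""
    snapshots = defaultdict(Counter)
    for revision_date, _source_page, entity, anchor_text in links:
        month = (revision_date.year, revision_date.month)   # step 1
        snapshots[(month, entity)][anchor_text] += 1         # step 3
    return snapshots

def ranked_anchors(snapshots, entity, year, month):
    """Step 4: anchor texts of an entity at time t, most frequent first."""
    counts = snapshots.get(((year, month), entity), Counter())
    return counts.most_common()

# Toy input, made up for illustration only.
entity = "President_of_the_United_States"
links = [
    (date(2004, 11, 3), "PageA", entity, "George W. Bush"),
    (date(2004, 11, 20), "PageB", entity, "George W. Bush"),
    (date(2004, 11, 21), "PageC", entity, "President Bush"),
    (date(2008, 11, 5), "PageA", entity, "Barack Obama"),
]
snapshots = build_snapshots(links)
```

Calling ranked_anchors(snapshots, entity, 2004, 11) then returns the anchor texts of that monthly snapshot in descending order of frequency.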
[Kanhabua and Nørvåg, JCDL 2010]
[Figure: example for President_of_the_United_States with the anchor texts George W. Bush (time: 11/2004), President Bush (43) (time: 10/2005) and Barack Obama (time: 11/2008)]
1. Multi-word title with all words capitalized,
except prepositions, determiners, etc.
E.g., President_of_the_United_States => entity
2. Single-word titles with multiple capital
letters
E.g., UNICEF and WHO => entities
3. At least 75% of the title's occurrences in the article text itself
are capitalized (not counting sentence beginnings)
Recognizing Named Entity Articles
[Bunescu and Pasca, EACL 2006]
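The three heuristics can be sketched as a single predicate. This is a rough illustration, not the cited implementation: the function-word list is a small made-up stop list, the third rule is applied only to single-word titles (an assumption), and sentence beginnings are detected naively from the preceding punctuation.

```python
import re

# Assumption: a small illustrative list of words that may stay lowercase.
FUNCTION_WORDS = {"of", "the", "in", "on", "for", "and", "a", "an", "to", "at", "by"}

def is_entity_title(title, article_text=""):
    """Apply the three capitalization heuristics to a Wikipedia page title."""
    words = title.replace("_", " ").split()
    if len(words) > 1:
        # Rule 1: all words capitalized, except prepositions, determiners, etc.
        content = [w for w in words if w.lower() not in FUNCTION_WORDS]
        return bool(content) and all(w[0].isupper() for w in content)
    if not words:
        return False
    word = words[0]
    # Rule 2: single-word title with multiple capital letters.
    if sum(ch.isupper() for ch in word) >= 2:
        return True
    # Rule 3: at least 75% of mid-sentence occurrences are capitalized.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    capitalized = total = 0
    for match in pattern.finditer(article_text):
        before = article_text[:match.start()].rstrip()
        if not before or before[-1] in ".!?":
            continue  # sentence-initial occurrence is ignored
        total += 1
        capitalized += match.group()[0].isupper()
    return total > 0 and capitalized / total >= 0.75
```

For example, is_entity_title("President_of_the_United_States") and is_entity_title("UNICEF") both hold, while a common noun that mostly appears in lowercase inside its own article is rejected.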
Weight anchor texts by importance with respect
to a target entity at a particular time:
• Link-independent: inlink pages are independent and
equally important to the target page
• Compute based on the whole collection of Wikipedia
entity pages at a particular time t
• Two variants: 1) article links, and 2) distinct pages
Temporal Anchor Weighting
[Dou et al., SIGIR 2009]
Experiments
Data collection:
• A dump of English Wikipedia edit history (2.8 TB)
• All pages and revisions from 03/2001 to 03/2008
• 85 snapshots + 4 additional snapshots
(24/05/2008, 27/07/2008, 08/10/2008, 06/03/2009)
Tools:
• Preprocess/store revisions using MWDumper
http://www.mediawiki.org/wiki/Mwdumper
• Store anchor texts: MySQL databases
Top-100 Named Entities
Evolving Context
“Barack Obama”
Time: 05/2008
1. Senator Barack Obama
2. Senator Obama's legislative accomplishments
3. Illinois
4. U.S. Sen. Barack Obama
Time: 07/2008
1. Senator Barack Obama
2. Illinois Senator Barack Obama
3. Barack Hussein Obama II
4. Senator Obama's legislative accomplishments
Time: 10/2008
1. Senator Barack Obama
2. Illinois Senator Barack Obama
3. Barak Obama, U.S. Senator, Illinois, 2008 Democratic nominee for U.S. President
4. presidential candidacy announcement
Time: 03/2009
1. President Barack Obama
2. Senator Barack Obama
3. U.S. President Barack Obama
4. 44th President of the United States
5. Obama Administration
Main Findings
Evolving information & context
• Role changes for political entities
• Geographic name changes for
locations
• Trends or things in vogue for
celebrities
• Products in demand for
technology
Question?
Time-aware Retrieval and Ranking
(1) Recency-based Ranking
(2) Time-dependent Ranking
RECAP
Two time dimensions
1. Publication or modified time
2. Content or focus time
Searching the past
Historical or temporal information needs
A journalist working on the historical story of a particular news article
A Wikipedia contributor finding relevant information that has not been written about yet
“Temporal document collections”: Web archives, news archives, blogs, emails
E.g., retrieve documents about Pope Benedict XVI written before 2005
Term-based IR approaches may give unsatisfactory results
Temporal Query Examples
A temporal query consists of:
Query keywords
Temporal expressions
A document consists of:
Terms, i.e., bag-of-words
Publication time and temporal expressions
Temporal Query Examples
[Berberich et al., ECIR 2010]
Assign prior probabilities using an exponential function
E.g., a more recent creation date obtains a higher probability
Current approaches:
Time-based language model [Li and Croft, CIKM 2003]
Using retention functions [Peetz and de Rijke, ECIR 2013]
Incorporating freshness into web authority [Dai and Davison,
SIGIR 2010]
Recency-based Ranking
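A recency prior of this kind can be sketched as follows. The sketch assumes an exponential form, rate * exp(-rate * age), where age is the time between a document's creation date and a reference time, and assumes that the prior is combined with a query-likelihood score in log space; the rate value and the function names are illustrative.

```python
import math

def recency_prior(doc_age_days, rate=0.01):
    """Exponential prior: newer documents (smaller age) get a higher value."""
    return rate * math.exp(-rate * max(doc_age_days, 0.0))

def recency_score(log_query_likelihood, doc_age_days, rate=0.01):
    """Combine a text-based log score with the log of the recency prior."""
    return log_query_likelihood + math.log(recency_prior(doc_age_days, rate))

fresh = recency_score(-5.0, doc_age_days=1)
stale = recency_score(-5.0, doc_age_days=365)
```

Two documents with the same textual score are then ordered by age, with the rate parameter controlling how quickly older documents are penalized.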
Time must be explicitly modeled in order to increase the
effectiveness of ranking
To order search results so that the most relevant ones come first
Time uncertainty should be taken into account
Two temporal expressions can refer to the same time period even
though they are written differently
E.g. the query “Independence Day 2011”
A retrieval model relying on term-matching only will fail to
retrieve documents mentioning “July 4, 2011”
Time-dependent Ranking
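The "Independence Day 2011" example can be made concrete with a small sketch. It assumes that temporal expressions have already been normalized to date intervals by some extraction tool; the lookup table below is hypothetical and interprets "Independence Day" as the US holiday on July 4.

```python
from datetime import date

# Hypothetical normalization results; a real system would obtain these
# from a temporal tagger rather than from a hand-written table.
NORMALIZED = {
    "Independence Day 2011": (date(2011, 7, 4), date(2011, 7, 4)),
    "July 4, 2011": (date(2011, 7, 4), date(2011, 7, 4)),
    "July 2011": (date(2011, 7, 1), date(2011, 7, 31)),
    "December 2011": (date(2011, 12, 1), date(2011, 12, 31)),
}

def overlaps(a, b):
    """True if two closed date intervals (begin, end) share at least one day."""
    return a[0] <= b[1] and b[0] <= a[1]

def same_period(expr_a, expr_b):
    """Compare two temporal expressions by their normalized intervals."""
    return overlaps(NORMALIZED[expr_a], NORMALIZED[expr_b])
```

Here same_period("Independence Day 2011", "July 4, 2011") holds although the two strings share no date terms, which is exactly the case that term matching alone misses.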
Time-dependent Ranking
Two main approaches:
1. Mixture model [Kanhabua et al., ECDL 2010]
Linearly combining textual and temporal similarity
2. Probabilistic model [Berberich et al., ECIR 2010]
Generating a query from the textual part and temporal part
of a document independently
Mixture Model
Linearly combine textual and temporal similarity
α weights the relative importance of the two similarity scores
Both scores are normalized before combining
Textual similarity can be determined using any term-based retrieval model
E.g., tf.idf or a unigram language model
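The linear combination can be sketched as below. The sketch assumes the form (1 - α) * textual + α * temporal and min-max normalization of both score sets; which term α multiplies and the normalization scheme are assumptions, since the formula itself is not reproduced in the slide text.

```python
def min_max(scores):
    """Scale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def mixture_scores(text_scores, time_scores, alpha=0.5):
    """Combine normalized textual and temporal similarity per document."""
    text_n, time_n = min_max(text_scores), min_max(time_scores)
    return {
        doc_id: (1 - alpha) * text_n[doc_id] + alpha * time_n.get(doc_id, 0.0)
        for doc_id in text_n
    }

text = {"d1": 2.0, "d2": 1.0}
time = {"d1": 0.0, "d2": 1.0}
combined = mixture_scores(text, time, alpha=0.5)
```

Setting alpha to 0 reduces the model to purely textual ranking and setting it to 1 ranks by temporal similarity alone, so the parameter can be tuned per collection.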
How to determine temporal similarity?
Temporal Similarity
[Figure: similarity score plotted over time for documents d1 and d2 and the query <q>, with distances Dist(d1,q) and Dist(d2,q)]
[Kanhabua et al., ECDL 2010]
Temporal Similarity
Assume that temporal expressions in the query are generated
independently from a two-step generative model:
P(tq|td) can be estimated based on publication time using an
exponential decay function [Kanhabua et al., ECDL 2010]
Linear interpolation smoothing is applied to eliminate zero
probabilities
I.e., an unseen temporal expression tq in d
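The generative view can be sketched as follows. The sketch assumes that a time td is first drawn uniformly from the document's time points and tq is then generated from td, that P(tq|td) decays as decay_rate ** (lam * |tq - td| / mu), and that smoothing interpolates with a background probability of tq in the collection. The parameter names and default values are assumptions for illustration.

```python
def decay_prob(tq, td, decay_rate=0.5, lam=0.5, mu=2.0):
    """P(tq|td): exponential decay in the distance between the two times."""
    return decay_rate ** (lam * abs(tq - td) / mu)

def time_given_doc(tq, doc_times):
    """Two-step model: pick td uniformly from the document, then generate tq."""
    if not doc_times:
        return 0.0  # unseen case, handled by smoothing below
    return sum(decay_prob(tq, td) for td in doc_times) / len(doc_times)

def temporal_similarity(query_times, doc_times, background, smoothing=0.1):
    """P(q_time|d_time) as a product over independently generated query times."""
    prob = 1.0
    for tq in query_times:
        p_doc = time_given_doc(tq, doc_times)
        p_bg = background.get(tq, 0.0)
        # Linear interpolation smoothing avoids zero probabilities.
        prob *= (1 - smoothing) * p_doc + smoothing * p_bg
    return prob

background = {2011: 0.2}
exact = temporal_similarity([2011], [2011], background)
unseen = temporal_similarity([2011], [], background)
```

A document without any usable time point still receives a small non-zero probability from the background model instead of being excluded outright.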
Comparison of time-aware ranking
Five time-aware ranking models
LMT [Berberich et al., ECIR 2010]
LMTU [Berberich et al., ECIR 2010]
TS [Kanhabua et al., ECDL 2010]
TSU [Kanhabua et al., ECDL 2010]
FuzzySet [Kalczynski et al., Inf. Process. 2005]
[Kanhabua et al., SIGIR 2011]
Experiment:
New York Times Annotated Corpus
40 temporal queries [Berberich et al., ECIR 2010]
Result:
TSU outperforms other methods significantly for most metrics
Conclusions:
Although TSU achieves the best performance, it can only be applied to a
collection with time metadata
LMT, LMTU can be applied to any collection without time metadata,
but time extraction is needed
Discussion
Applications for Temporal IR
(1) Searching the Future
(2) Time-aware Recontextualization
Searching the Future
People are naturally curious about the future
What will happen to EU economies in the next 5 years?
What will be the potential effects of climate change?
Previous work
Searching the future
Extract temporal expressions from news articles
Retrieve future information using a probabilistic model, i.e.,
multiplying textual similarity and a time confidence
Supporting analysis of future-related information in news and
Web
Extract future mentions from news snippets obtained from search
engines
Summarize and aggregate results using clustering methods, but no
ranking
[Baeza-Yates, SIGIR Forum 2005; Jatowt et al., JCDL 2009]
Recorded Future
http://www.recordedfuture.com/
Yahoo! Time Explorer
[Matthews et al., HCIR 2010]
Ranking News Predictions
Over 32% of 2.5M documents from Yahoo! News (July’09 –
July’10) contain at least one prediction
Retrieve predictions related to a news story in news archives and
rank by relevance
Related News Predictions
[Kanhabua et al., SIGIR 2011]
Four classes of features
Term similarity, entity-based similarity, topic similarity and temporal
similarity
Rank results using a learning-to-rank technique
Approach
[Kanhabua et al., SIGIR 2011]
Step 1: Document annotation.
Extract temporal expressions
using time and event recognition
Normalize them to dates so they
can be anchored on a timeline
Output: sentences annotated
with named entities and dates,
i.e., predictions
Step 2: Retrieving predictions.
Automatically generate a query
from a news article being read
Retrieve predictions that match
the query
Rank predictions by relevance
(i.e., a prediction is “relevant” if it
is about the topics of the article)
System Architecture
[Kanhabua et al., SIGIR 2011]
Capture the term similarity between q and p
1. TF-IDF scoring function
Problem: keyword matching, short texts
Predictions may not match the query terms
2. Field-aware ranking function, e.g., BM25F
Search the context of a prediction, i.e., surrounding sentences
Term Similarity
[Kanhabua et al., SIGIR 2011]
Measure the similarity between q
and p using annotated entities in
dp, p, q
Features commonly employed in
entity ranking
Entity-based Similarity
[Kanhabua et al., SIGIR 2011]
Compute the similarity between q and p on topic
Latent Dirichlet allocation [Blei et al., J. Mach. Learn. Res. 2003] for
modeling topics
1. Train a topic model
2. Infer topics
3. Compute topic similarity
Topic Similarity
[Kanhabua et al., SIGIR 2011]
Temporal Similarity
Hypothesis I. Predictions that are closer in time to the query are
more relevant
Hypothesis II. Predictions extracted from more recent documents
are more relevant
[Kanhabua et al., SIGIR 2011]
Learning-to-rank: Given an unseen (q, p), p is ranked using a
model trained over a set of labeled query/prediction pairs
SVM-MAP [Yue et al., SIGIR 2007]
RankSVM [Joachims, KDD 2002]
SGD-SVM [Zhang, ICML 2004]
PegasosSVM [Shalev-Shwartz et al., ICML 2007]
PA-Perceptron [Crammer et al., J. Mach. Learn. Res. 2006]
Ranking Method
[Kanhabua et al., SIGIR 2011]
42 future-related topics
Relevance Judgments
[Kanhabua et al., SIGIR 2011]
New York Times Annotated Corpus
1.8 million articles, over 20 years
More than 25% contain at least one prediction
Annotation process uses several language processing tools
OpenNLP for tokenizing, sentence splitting, part-of-speech tagging,
shallow parsing
SuperSense tagger for named entity recognition
TARSQI for extracting temporal expressions
Apache Lucene for indexing and retrieving.
44,335,519 sentences and 548,491 predictions
939,455 future dates (avg. future date/prediction is 1.7)
Experiments
[Kanhabua et al., SIGIR 2011]
Results:
Topic features play an important role in ranking
The five features with the lowest weights are entity-based
features
Open issues:
Extract predictions from other sources, e.g., Wikipedia, blogs,
comments, etc.
Sentiment analysis for future-related information
Discussion
[Kanhabua et al., SIGIR 2011]
Prior to 1964, many of the cigarette
companies advertised their brand by
falsely claiming that their product did not
have serious health risks. A couple of
examples would be "Play safe with Philip
Morris" and "More doctors smoke
Camels". Such claims were made both to
increase the sales of their product and to
combat the increasing public knowledge of
smoking's negative health effects.
[Figure: advertisement poster from the 1950s]
Time-aware Contextualization
[Tran et al., WSDM 2015] (Slide provided by the authors)
Physician
http://en.wikipedia.org/wiki/Physician
Camel (cigarette)
http://en.wikipedia.org/wiki/Camel_(cigarette)
Cigarette
http://en.wikipedia.org/wiki/Cigarette
Entity linking is not sufficient
Wikipedia pages tend to contain large amounts of content
Relevant information might be distributed over various articles
The crucial temporal aspect is missing in pure linking approaches
Entity Linking
[Tran et al., WSDM 2015] (Slide provided by the authors)
Problem Statement
Time-aware contextualization aims to associate an information item
d with time-aware, concise and coherent context information c for
easing its understanding
Several sub-goals of the information search process have to be
combined with each other
c has to be relevant for d
c has to complement the information already available in d
c has to consider the time of creation of d
the context information should be concise to avoid overloading the user
[Tran et al., WSDM 2015] (Slide provided by the authors)
Approach Overview
[Figure: system diagram with the components User, Article, Context Hook Identification, Query Formulation, Context Retrieval, Context Ranking, Context, Contextualization Units Extraction, and Contextualization Units Index]
[Tran et al., WSDM 2015] (Slide provided by the authors)
The goal is to generate a set of queries for a given document to
retrieve candidates as input for the re-ranking step
We explore two families of query formulation methods
Document-based methods : title, lead, title+lead
Hook-based methods: each_hook, all_hooks, and query performance
prediction (qpp_r@k) with the following features
Linguistic features
Document frequency
Scope
Temporal document frequency
Temporal scope
Temporal similarity
Query Formulation
[Tran et al., WSDM 2015] (Slide provided by the authors)
Context retrieval:
Learning to rank context:
• The ranking algorithm needs to balance two goals, i.e., high topical and
temporal relevance as well as complementarity for providing additional
information
• Use supervised machine learning that takes as input a set of labeled
examples and various complementarity features
Topic diversity
Text difference
Entity difference
Anchor text difference
Distributional similarity
Cosine distance
Relevance
Temporal similarity
Context Ranking
[Tran et al., WSDM 2015] (Slide provided by the authors)
Experiments
Datasets:
51 news articles from New York Times Corpus
Wikipedia (2013), 26 million contextualization units (paragraphs)
9,464 manually labeled examples (article/context pairs)
Learning to rank algorithms: RankBoost, Random Forests and AdaRank
Baselines
Entity linking (Milne and Witten)
Language model (LM)
Time-aware language model (LM-T)
[Tran et al., WSDM 2015] (Slide provided by the authors)
Evaluating Query Formulation Methods
[Figure: P@1, P@3, P@10 and MAP (y-axis from 0 to 0.7) for title+lead, all_hooks and qpp_r@100]
Wikification technique
achieves a low recall of 0.229
Hook-based approaches
outperform the document-
based approaches
Query performance
prediction method obtains the
highest results on all metrics
[Tran et al., WSDM 2015] (Slide provided by the authors)
The Effect of Complementarity Features
[Figure: P@1, P@3, P@10 and MAP (y-axis from 0 to 1) for LM-T and RF]
Purely using the time dimension
in context retrieval is not sufficient
in the contextualization task
Complementarity plays an
important role in contextualization
[Tran et al., WSDM 2015] (Slide provided by the authors)
Conclusions and Outlook
Introduced the general topic of web evolution.
Pinpointed a number of issues related to temporal IR.
Focused on temporal information extraction, temporal query
analysis, as well as time-aware retrieval and ranking.
Wrapped up with related applications to temporal IR.
Future directions:
Real-time web mining
Spatio-temporal search and analytics
Brain-inspired information access
Thank you!