Search, Exploration and Analytics of Evolving Data

Nattiya Kanhabua

L3S Research Center

Hannover, Germany

The 1st Keystone Training School on

Keyword Search over Big Data

23 July 2015, Malta

Lecturer

Education qualification

2007 - 2011: Ph.D. degree, Norwegian University of Science and Technology, Norway

Thesis: “Time-aware Approaches to Information Retrieval”

2003 - 2005: M.Sc. in Computer Science, Asian Institute of Technology, Thailand

Thesis: “Agent-based Simulation of Trade in Barter Trade Exchanges”

1997 - 2001: B.Eng. in Computer Engineering, Kasetsart University, Thailand

Project: “Software Process Enhancement and Control System”

Work experience

2011 - now: Postdoc, L3S Research Center, Germany

05/2015: Visiting researcher, University of Trento, Italy

03-05/2010: Research intern, Yahoo! Research, Spain

2007 - 2011: Temporary Scientific Staff, NTNU, Norway

2006 - 2007: Research assistant, University of Trento, Italy

06-10/2006: Research assistant, AIT, Thailand

2005 - 2006: Analyst programmer, IFDS Group, UK

2002 - 2003: Research assistant, Kasetsart University, Thailand

2001 - 2002: System analyst, Accenture, Thailand & Singapore

Skills

• 7+ years of research experience in information retrieval, data mining, machine learning, predictive methods and spatio-temporal analysis

• 3+ years of research experience in Big Data, e.g., large-scale processing and MapReduce

Hadoop Pig Mahout HBase

Tomcat Servlet Lucene MySQL

Python JAVA JSP PHP

Weka R UML JSON

Eclipse NLP RDF WARC

23 July 2015 The 1st Keystone Summer School: Keyword Search over Big Data

Schedule

9:00 – 10:30 Part I

Introduction to Temporal Dynamics

Temporal Information Extraction

Temporal Query Analysis (I)

10:30 – 11:00 Coffee break

11:00 – 12:30 Part II

Temporal Query Analysis (II)

Time-aware Retrieval and Ranking

Applications of Temporal IR

Conclusions and Outlook

Additional Resource

Book: Temporal Information Retrieval

Foundations and Trends® in Information Retrieval

Volume 9, Issue 2, pp 91-208, 2015

Download: http://goo.gl/TunlBb

References can be found in the book



Introduction to Temporal Dynamics

What are temporal dynamics?

Why do they occur and impact search?

When and how to leverage temporal information for IR?



Temporal Dynamics

Figure: Internet Growth/Usage Phases/Tech Events

(created by Mark Schueler, used with permission)

Temporal Web Dynamics

The Web changes over time in many aspects, e.g., size, content, structure, and how it is accessed through user interactions or queries.

Size: web pages are added and deleted all the time

Content: web pages are edited and modified

Query: users’ information needs change

[Risvik et al., CN 2002; Ke et al., CN 2006] [WebDyn 2010; Dumais, SIAM-SDM 2012]

Web and Index Sizes

Timeline, 1995 to 2015 (later extended to 2020):

2000: First billion-URL index, the world’s largest, on ≈5000 PCs in clusters

2004: Index grows to 4.2 billion pages

2008: Google counts 1 trillion unique URLs

2009: TBs or PBs of data/index, tens of thousands of PCs

Beyond: ?

See also: http://www.worldwidewebsize.com/

Content Change

The content of the Web changes constantly over time, e.g., web documents are added, modified or deleted continuously.

National and international initiatives collect and preserve parts of

the Web [Gomes et al., TPDL 2011; Costa et al., TempWeb 2013]

Figure: Wayback Machine, a web archive search tool by the Internet Archive

Content Change

Challenge:

Document representation and retrieval


Categorization of Content Change

Implication:

Crawling, Indexing, Ranking

User Interaction Dynamics

Browsing and querying (or search) behavior

User preference, e.g., likes, comments, interests

User’s profiles [Rybak et al., ECIR 2014]



Query Popularity Change

Challenge:

Time-sensitive queries

Query understanding and processing

Google Insights for Search: http://www.google.com/insights/search/

Query: Halloween


Categorization of Web Search Queries

http://www.google.com/insights/search

Implication:

Query Analysis, Ranking

Temporal Information Extraction

(1) Document Creation Time

(2) Document Focus Time

(3) Entity and Event Evolution



Motivation

Incorporating time into search can increase retrieval effectiveness

Only when temporal information is available

Research problem:

How to determine the publication time of a document?

How to extract temporal information from document contents?

Two Time Aspects

1. Publication or modified time

Task: determining timestamps of documents

Method: rule-based techniques or temporal language models

2. Content or focus time

Task: temporal information extraction

Method: natural language processing, or time and event recognition algorithms

Figure labels: content time, publication time

Determining Document Creation Time

Problem statement: it is hard to find a trustworthy time for a web page

Time gap between crawling and indexing

Decentralization and relocation of web documents

No standard metadata for time/date





I found a bible-like document, but I have no idea when it was created.

“For a given document with uncertain timestamp, can the contents be used to determine the timestamp with a sufficiently high confidence?”

Let me see… This document was probably written in 850 A.C. with 95% confidence.

Current Approaches

1. Content-based

Temporal language model [de Jong et al., AHC 2005; Kanhabua and Nørvåg, ECDL 2008]

Classifier using features based on the text’s time expressions [Chambers, ACL 2012; Ge et al., EMNLP 2013]

Using burstiness of terms for estimating timestamps [Kotsakos et al., SIGIR 2014]

2. Non content-based

Finding the oldest version of a page in a web archive [Jatowt et al., WIDM 2007]

Leveraging external resources [Hauff and Azzopardi, ECIR 2005; Nunes et al., WIDM 2007; SalahEldeen and Nelson, TempWeb 2013]

Content-based Approach

Temporal Language Models

Based on the statistical usage of words over time

Compare each word of a non-timestamped document with a reference corpus

Tentative timestamp: the time partition that overlaps most in word usage

Partition   Word         Freq
1999        tsunami      1
1999        Japan        1
1999        tidal wave   1
2004        tsunami      1
2004        Thailand     1
2004        earthquake   1

A non-timestamped document: tsunami, Thailand

Similarity Scores

Score(1999) = 1

Score(2004) = 1 + 1 = 2

Most likely timestamp is 2004
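The toy scoring above can be sketched in a few lines of Python. This is only an illustration of the word-overlap idea on the slide's six-row reference table; the function names `overlap_scores` and `most_likely_partition` are made up for this example and are not part of any published implementation.

```python
from collections import defaultdict

# Toy reference corpus from the slide: (time partition, word, frequency).
REFERENCE = [
    ("1999", "tsunami", 1),
    ("1999", "Japan", 1),
    ("1999", "tidal wave", 1),
    ("2004", "tsunami", 1),
    ("2004", "Thailand", 1),
    ("2004", "earthquake", 1),
]

def overlap_scores(doc_words, reference=REFERENCE):
    """Sum, per partition, the frequencies of the words shared with the document."""
    scores = defaultdict(int)
    for partition, word, freq in reference:
        if word in doc_words:
            scores[partition] += freq
    return dict(scores)

def most_likely_partition(doc_words, reference=REFERENCE):
    """Return the partition with the highest overlap score (hypothetical helper)."""
    scores = overlap_scores(doc_words, reference)
    return max(scores, key=scores.get)

document = {"tsunami", "Thailand"}
print(overlap_scores(document))          # {'1999': 1, '2004': 2}
print(most_likely_partition(document))   # 2004
```

Raw overlap counts favor large partitions and ignore how common a word is overall, which is one reason to move to a normalized, smoothed score.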



Normalized Log-likelihood Ratio

Normalized log-likelihood ratio [Kraaij, SIGIR Forum 2005]

Variant of Kullback-Leibler divergence

Similarity of a document and time partitions

C is the background model estimated on the corpus

Linear interpolation smoothing to avoid the zero probability of unseen words
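A minimal sketch of such a score, assuming the commonly used form NLLR(d, p) = Σ_w P(w|d) · log(P(w|p) / P(w|C)), with P(w|p) smoothed by linear interpolation against the background model C. The interpolation weight `lam` and the function name `nllr` are assumptions made for this example, not values taken from the slides.

```python
import math
from collections import Counter

def nllr(doc_tokens, partition_tokens, corpus_tokens, lam=0.9):
    """Normalized log-likelihood ratio of a document against one time partition.

    lam is an assumed interpolation weight between the partition model
    and the background (corpus) model.
    """
    doc = Counter(doc_tokens)
    part = Counter(partition_tokens)
    corp = Counter(corpus_tokens)
    n_doc, n_part, n_corp = len(doc_tokens), len(partition_tokens), len(corpus_tokens)
    score = 0.0
    for word, tf in doc.items():
        p_w_c = corp[word] / n_corp
        if p_w_c == 0.0:
            continue  # word unseen in the whole corpus: skipped in this sketch
        p_w_d = tf / n_doc
        # Linear interpolation keeps P(w|p) above zero for words unseen in p.
        p_w_p = lam * (part[word] / n_part) + (1.0 - lam) * p_w_c
        score += p_w_d * math.log(p_w_p / p_w_c)
    return score

p1999 = ["tsunami", "Japan", "tidal wave"]
p2004 = ["tsunami", "Thailand", "earthquake"]
corpus = p1999 + p2004
doc = ["tsunami", "Thailand"]
print(nllr(doc, p1999, corpus), nllr(doc, p2004, corpus))
```

On the toy data the 2004 partition scores higher than 1999, in line with the overlap example, because "Thailand" is far more likely under the 2004 model than under the background model.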

Improving Temporal LMs

Enhancement techniques

1. Semantic-based data preprocessing

Intuition: Direct comparison between extracted words and corpus partitions has limited accuracy

Approach: Integrate semantic-based techniques into document preprocessing

2. Search statistics to enhance similarity scores

Intuition: Search statistics from Google Zeitgeist (GZ) can increase the probability of a tentative time partition

Approach: Linearly combine a GZ score with the normalized log-likelihood ratio

3. Temporal entropy as term weights

Intuition: A term weight depends on how good the term is for separating time partitions (discriminative)

Approach: Propose temporal entropy, based on a term selection method presented in Lochbaum and Streeter

[Kanhabua et al., ECDL 2008] (Slide provided by the authors)


Semantic-based Preprocessing

Semantic-based preprocessing techniques and their descriptions:

Part-of-speech tagging: Select only interesting classes of words, e.g., nouns, verbs, and adjectives

Collocation extraction: Co-occurrence of different words can alter the meaning, e.g., “United States”

Word sense disambiguation: Identify the correct sense of a word from context, e.g., “bank”

Concept extraction: Compare concepts instead of original words, e.g., “tsunami” and “tidal wave” have the common concept of “disaster”

Word filtering: Select the top-ranked words according to TF-IDF scores for a comparison

[Kanhabua et al., ECDL 2008] (Slide provided by the authors)

Leveraging Search Statistics

P(wi) is the probability that wi occurs:

P(wi) = 1.0 if wi is a gaining query

P(wi) = 0.5 if wi is a declining query

f(R) converts a rank number into a weight. A higher-ranked query is more important.

An inverse partition frequency, ipf = log N/n


Temporal Entropy

Intuition: A term weight depends on how good the term is for separating time partitions (discriminative)

A measure of temporal information which a word conveys.

Captures the importance of a term in a document collection, whereas TF-IDF weights a term in a particular document.

Tells how good a term is in separating a partition from others.

A term occurring in few partitions has higher temporal entropy compared to one appearing in many partitions.

The higher the temporal entropy of a term, the better it represents a partition.

Np is the total number of partitions in a corpus

The probability of a partition p containing a term wi
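A minimal sketch of temporal entropy, assuming the form TE(wi) = 1 + (1 / log Np) · Σ_p P(p|wi) · log P(p|wi), with P(p|wi) estimated as the term's frequency in partition p divided by its total frequency over all partitions. Both the formula as written here and the function name are assumptions for illustration; the slide's own equation is not reproduced in the text.

```python
import math

def temporal_entropy(term_freq_by_partition):
    """Temporal entropy of one term.

    term_freq_by_partition: the term's frequency in each of the Np partitions,
    zeros included, so that len() gives Np.
    """
    n_partitions = len(term_freq_by_partition)
    total = sum(term_freq_by_partition)
    if n_partitions < 2 or total == 0:
        raise ValueError("need at least two partitions and one occurrence")
    entropy_sum = 0.0
    for tf in term_freq_by_partition:
        if tf > 0:
            p = tf / total  # assumed estimate of P(p | w)
            entropy_sum += p * math.log(p)
    return 1.0 + entropy_sum / math.log(n_partitions)

print(temporal_entropy([5, 0, 0, 0]))  # concentrated in one partition
print(temporal_entropy([2, 2, 2, 2]))  # spread evenly over all partitions
```

Under this definition a term confined to a single partition gets the maximum weight of 1, while a term spread evenly over all partitions gets a weight close to 0, which matches the intuition stated above.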


Non Content-based Approaches

Dating a document using its neighbors

1. Web pages linking to the document

I.e., incoming links

2. Web pages pointed by the document

I.e., outgoing links

3. Media assets associated with the document

E.g., images

Averaging the last-modified dates of its neighbors as timestamps

[Hauff and Azzopardi, 2005; Nunes et al., WIDM 2007]
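The neighbor-averaging idea can be sketched as follows. This is a simplified illustration, assuming the last-modified dates of the neighbors (incoming links, outgoing links, media assets) have already been collected as timezone-aware datetimes; `estimate_timestamp` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def estimate_timestamp(neighbor_last_modified):
    """Average the last-modified datetimes of a document's neighbors.

    Neighbors with an unknown date are passed as None and skipped.
    Returns None when no neighbor has a usable date.
    """
    known = [d.timestamp() for d in neighbor_last_modified if d is not None]
    if not known:
        return None
    mean_epoch = sum(known) / len(known)
    return datetime.fromtimestamp(mean_epoch, tz=timezone.utc)

neighbors = [
    datetime(2005, 1, 1, tzinfo=timezone.utc),  # e.g., a page linking to the document
    datetime(2005, 1, 3, tzinfo=timezone.utc),  # e.g., an embedded image
    None,                                       # neighbor without a Last-Modified value
]
print(estimate_timestamp(neighbors))
```

The sketch makes the dependence on neighbor metadata explicit: if no neighbor exposes a date, no estimate can be produced.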

Non Content-based Approaches

Drawbacks:

Rely on the availability and accuracy of other information

Cover only pages from most recent years

Cannot determine the age of the actual contents

[SalahEldeen and Nelson, 2013]

Determining Document Focus Time

Three types of temporal expressions

1. Explicit: time mentions being mapped directly to a time point or interval, e.g., “July 4, 2012”

2. Implicit: imprecise time point or interval, e.g., “Independence Day 2012”

3. Relative: resolved to a time point or interval using other types or the publication date, e.g., “next month”

Time and event recognition [Mani and Wilson, ACL 2000]

A mix of hand-crafted and machine-learnt rules

Ranking the most relevant temporal expressions [Strötgen et al., TempWeb 2012]

Time Taggers for Calculating Focus Time

HeidelTime: http://heideltime.ifi.uni-heidelberg.de/heideltime

Timestamp: 2013/7/15

Limitations

Document may lack any temporal expressions

Temporal expressions may be weakly related to the document’s theme

Temporal taggers are not perfect

Estimating document focus time without using temporal expressions

[Jatowt et al., CIKM 2013] (Slide provided by the authors)


Focus Time of Documents

Def. A document has focus time t if its content refers to t


Estimating Focus Time: Concept

Use time-referenced documents for estimating focus time of target document

Figure: in news article collections, words (A, B, C) co-occur with dates (e.g., 1915, 1935, 1948, 1978, 2003, May 2011, 2012); these co-occurrences are used to estimate the focus time of a target document containing A, B, and C


Word Graph

Word co-occurrence graph from large collections of news articles

Link weight estimated by Jaccard coefficient using sentence as unit

Figure: example graph with nodes war, nazi, 1945, 1939, auschwitz, jews, germany, jalta, hiroshima
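A minimal sketch of how such a co-occurrence graph could be built, assuming sentences are already tokenized and year mentions are kept as ordinary tokens. The link weight is the Jaccard coefficient over the sets of sentences in which each word appears; the function name is made up for this example.

```python
from collections import defaultdict
from itertools import combinations

def jaccard_word_graph(sentences):
    """Build {(w1, w2): weight} with w1 < w2, using the sentence as the unit.

    weight = |sentences containing both| / |sentences containing either|
    """
    occurs_in = defaultdict(set)  # word -> ids of the sentences containing it
    for sid, tokens in enumerate(sentences):
        for word in set(tokens):
            occurs_in[word].add(sid)
    graph = {}
    for w1, w2 in combinations(sorted(occurs_in), 2):
        both = len(occurs_in[w1] & occurs_in[w2])
        if both:
            either = len(occurs_in[w1] | occurs_in[w2])
            graph[(w1, w2)] = both / either
    return graph

sentences = [
    ["war", "1945", "hiroshima"],
    ["war", "1939", "germany"],
    ["hiroshima", "1945"],
]
graph = jaccard_word_graph(sentences)
print(graph[("1945", "hiroshima")])  # 1.0
print(graph[("1945", "war")])
```

Comparing every pair of words is quadratic in the vocabulary size, so a real collection would need an inverted index or pruning; the sketch only shows the weighting.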


Estimating Direct Word-Year Association

Word-year associations derived from graph

Word w is strongly associated with year y if it frequently co-occurs with y

One association value per word and year:

A(war, 1900), A(war, 1901), …, A(war, 1944), A(war, 1945), …, A(war, 2009), A(war, 2010)

A(hiroshima, 1900), A(hiroshima, 1901), …, A(hiroshima, 1944), A(hiroshima, 1945), …, A(hiroshima, 2009), A(hiroshima, 2010)

A(word, 1900), A(word, 1901), …, A(word, 1944), A(word, 1945), …, A(word, 2009), A(word, 2010)


Second Level Term-Year Association

Word w is strongly associated with year y if many other words that frequently co-occur with w are also strongly associated with y

$$A_{2}(w_i, y) = \frac{1}{|V|} \sum_{j=1}^{|V|} A(w_i, w_j)\, A_{dir}(w_j, y)$$

Figure: word graph with nodes war, nazi, 1945, 1939, auschwitz, jews, germany, jalta, hiroshima, israel
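A minimal sketch of the second-level association, assuming it averages, over a vocabulary V, the product of the word-to-word link weight and the neighbor's direct association with the year. The graph layout (a dictionary keyed by sorted pairs), the toy weights, and the helper names are all assumptions for illustration.

```python
def weight(graph, a, b):
    """Look up an undirected link weight; missing links count as 0."""
    key = (a, b) if a < b else (b, a)
    return graph.get(key, 0.0)

def second_level_association(graph, vocabulary, word, year):
    """Average of weight(word, other) * weight(other, year) over the vocabulary."""
    total = sum(
        weight(graph, word, other) * weight(graph, other, year)
        for other in vocabulary
        if other != word
    )
    return total / len(vocabulary)

# Toy weights (invented): "nagasaki" never co-occurs with "1945" directly,
# but it co-occurs with "hiroshima", which does.
graph = {
    ("1945", "hiroshima"): 1.0,
    ("1945", "war"): 0.2,
    ("hiroshima", "nagasaki"): 0.5,
    ("hiroshima", "war"): 0.5,
}
vocabulary = ["hiroshima", "nagasaki", "war"]
print(weight(graph, "nagasaki", "1945"))                                # 0.0
print(second_level_association(graph, vocabulary, "nagasaki", "1945"))
```

The example shows why the second level helps: a word with no direct link to a year can still pick up an association through its neighbors.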


Estimating Document-Year Association

If a document contains many words strongly associated with year y, the document is strongly associated with y

Figure: A(word, year) curves over time (1900 to 2000) for words A, B, and C; a document that contains A once and B and C twice has the document-year association A + 2B + 2C


Finding Discriminative Features

Not every word is useful for estimating text focus time

E.g., “man”, “city” have stable associations with years

Temporal entropy: measure of variability of word associations

Temporal kurtosis: measure of peakedness of word associations

E.g., “war”, “earthquake” vs. “hitler”, “stalingrad”

Figure: A(word, year) over 1900 to 2000 for two words A and B, with Temporal_Entropy(A) < Temporal_Entropy(B)

Figure: A(word, year) over 1900 to 2000 for two words A and B, with Temporal_Kurtosis(A) > Temporal_Kurtosis(B)

Temporal entropy and temporal kurtosis are used as temporal weights for words


Importance of Words in Document

Words weakly related to document theme should be skipped

Target Document:

President Obama took part in the celebrations of the Polish Independence Day. The US president met main Polish politicians in Warsaw. Poland regained independence at the end of the World War I following Bolshevik Revolution. It then lost the independence as a result of Nazi and Soviet invasions led by Hitler and Stalin. Poland is located in East Europe.

Document to graph conversion (nodes: independence, poland, war, hitler, …), followed by TextRank:

0.90 independence
0.82 poland
0.74 war
0.61 nazi
0.56 hitler
0.54 …

TextRank scores used as discriminatory semantic weights for words [Mihalcea and Tarau, EMNLP 2004]


Estimating Focus Time

Figure: A(word, year) curves over time (1900 to 2000) for the words A, B, and C of a document, combined as a weighted sum (temporality and semantics)

Focus time: interval based (threshold)

Focus time: instant based


Combined Approach

Combining estimated focus time and temporal expressions in text

Representing dates on timeline: Gaussian kernel density estimate

Mixture of Gaussian distributions with means centered on extracted dates

The combined score S_Comb(d, y) is computed from the estimated score S_Est(d, y) and the temporal-expression score S_TempExp(d, y)

Figure: target document with extracted dates 1932, 1935, 1940, 2001, and 2011 placed on a timeline
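A minimal sketch of the kernel density idea for the extracted dates, assuming equal mixture weights and a fixed bandwidth of two years. Both choices, and the function name `date_density`, are assumptions for illustration and are not taken from the slides.

```python
import math

def date_density(extracted_years, year, bandwidth=2.0):
    """Equal-weight mixture of Gaussians, one centered on each extracted year."""
    if not extracted_years:
        return 0.0
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    total = sum(
        norm * math.exp(-0.5 * ((year - y) / bandwidth) ** 2)
        for y in extracted_years
    )
    return total / len(extracted_years)

# Dates as they appear in the example target document (1932 occurs twice).
years = [1935, 2011, 1932, 1940, 1932, 2001]
best = max(range(1900, 2014), key=lambda y: date_density(years, y))
print(best, date_density(years, best))
```

Because nearby dates reinforce each other, the density peaks inside the 1932 to 1935 cluster rather than at the isolated mentions of 2001 or 2011.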


Experimental Settings: Word Graphs

News articles collected from Google News Archive using country names as queries

Germany (87k), UK (149k), France (110k), Japan (97k), Israel (92k)

Published within [1990, 2010]

Dates falling in [1900, 2013] were found using regular expressions


Experimental Settings: Test Datasets

Datasets on events related to countries:

Wiki: 250 Wikipedia pages about events

Books: 735 paragraphs from 2 text books about history (timelines)

Web: 812 paragraphs from web pages on history (BBC timelines,

etc.)

Dataset   total #docs   avg. #sentences   avg. time span of events   avg. year of events   avg. #dates
Wiki      250           179               3.4 years                  1958                  14.5
Book      735           43                4.4 years                  1982                  4.5
Web       819           18.3              1.3 years                  1957                  2.4

Experimental Settings: Baselines

Baselines:

Random

Date-based (using only dates in document text)

LDA-based

1. 100 topics over sentences containing year mentions

2. Finding topic distribution of each year

3. Calculating document-year association based on topic distribution of documents


Experimental Settings: Measurements

Measures:

Average error (in years), for the instant-based representation

Pearson correlation coefficient (-1..+1) between ground truth years and years in focus time, for the interval-based representation
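Both measures are simple to compute. The sketch below assumes that, for the interval-based representation, ground truth and estimated focus time are encoded as per-year indicator vectors (1 if the year is covered, 0 otherwise); that encoding and the function names are assumptions for illustration.

```python
import math

def average_error(true_years, estimated_years):
    """Mean absolute difference in years (instant-based representation)."""
    return sum(abs(t - e) for t, e in zip(true_years, estimated_years)) / len(true_years)

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either vector is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

print(average_error([1945, 1989], [1940, 1991]))  # 3.5

# One value per year on a short timeline: 1 = year covered, 0 = not covered.
truth = [0, 1, 1, 0, 0, 1]
estimate = [0, 1, 0, 0, 1, 1]
print(pearson(truth, estimate))
```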


Experimental Results

Average error (years)

Dataset   Random baseline   LDA baseline   Date-based baseline   Proposed (no dates)   Proposed combined (with dates)
Wiki      36.5              27.2           3                     18.3                  2.83
Books     39.3              37.3           48.1                  23.5                  20.4
Web       40.5              41.4           53.4                  23.6                  20.7

Pearson correlation coefficient

Dataset   Random baseline   LDA baseline   Date-based baseline   Proposed (no dates)   Proposed combined (with dates)
Wiki      0                 0.1            0.65                  0.29                  0.66
Books     0                 0.04           0.01                  0.25                  0.30
Web       0                 0.02           -0.03                 0.26                  0.41


Effect of Time Distance on Focus Time

How well can we estimate the focus time of documents about the distant past?

Figure: results on Wiki, Books, and Web with the instant-based focus time representation

Question?

Temporal Query Analysis

(1) Temporal query intent

(2) Dynamic query subtopics



Temporal Queries

Temporal information needs

Searching temporal document collections

E.g., digital libraries, web/news archives

Users: historians, librarians, journalists or students

Temporal queries exist in both standard collections and the Web

Relevancy is dependent on time

Documents are about events at particular time



Types of Temporal Queries

Two types of temporal queries

1. Explicit: time is provided, e.g., "Presidential election 2012"

2. Implicit: time is not provided, e.g., "Germany World Cup"

Temporal intent can be implicitly inferred

I.e., refer to the World Cup event in 2006

Studies of web search query logs show a significant fraction of temporal queries

1.5% of web queries are explicit

~7% of web queries are implicit

13.8% of queries contain explicit time and 17.1% of queries have temporal intent implicitly provided

[Nunes et al., ECIR 2008; Metzler et al., SIGIR 2009; Zhang et al., EMNLP 2010]

Figure: Variances of temporal queries and their dynamics

Understanding Temporal Query Intent

Current approaches:

1. Mining temporal patterns in query logs [Vlachos et al., SIGMOD 2004; Radinsky et al., WWW 2012]

2. Analyzing top-k search results [Jones and Diaz, TOIS 2007; Campos et al., CIKM 2012]

Motivation

Temporal queries are a significant fraction of Web

search queries [Zhang et al., EMNLP 2010]

13.8% of explicit temporal queries

17.1% of implicit temporal queries

Characteristics:

Certain temporal patterns, i.e., spikes, periodicity

(hourly or daily), seasonality and trends

Underlying temporal information needs without

temporal patterns observed

Tasks:

Understand temporal search intent

Enable advanced enhancement techniques

Automatic method for detecting events in search streams, e.g., “US Election 2016” or “Brazil FIFA World Cup”

[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)

Preliminaries

Data model:

Set of queries Q issued at different time points

Set of clicked URLs U and click-through data

Temporal document collection D

q: keywords or term(q), and hitting time(q)

yq: time series data extracted from Q, U and D

Two-step approach:

Automatically extract a set of candidate queries {q1, ..., qn} from Q

Classify candidates as event-related queries {e1, ..., em} using

machine learning techniques


Identifying Event Candidates

Time and keyword-based clustering:

Step 1: Partition query logs into one-week windows

• Group queries from the same event

• Possibly contain multiple, unrelated events

Step 2: Cluster queries by lexical similarity

• Pre-process and sort queries alphabetically

• Compute Jaccard similarity of a query pair

Example cluster “Easter”: easter 2006, easter 2007, easter 20crafts, easter activities, easter animation, easter animations, easter background, easter basket, easter bread, easter bucket, easter bunny, easter bunny decorations, easter bunny lights
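Step 2 can be sketched as a single pass over the alphabetically sorted queries of one weekly partition. The sketch assumes token-level Jaccard similarity and compares each query with the first query of the current cluster; the threshold of 0.2 follows the cleansing parameter reported in the experiments, while the seed comparison and the function names are assumptions for illustration.

```python
def jaccard(query_a, query_b):
    """Jaccard similarity between the token sets of two queries."""
    tokens_a, tokens_b = set(query_a.split()), set(query_b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

def cluster_queries(queries, threshold=0.2):
    """Group lexically similar queries from one time partition."""
    cleaned = sorted({q.strip().lower() for q in queries if q.strip()})
    clusters = []
    for query in cleaned:
        # Compare with the seed (first query) of the most recent cluster.
        if clusters and jaccard(clusters[-1][0], query) > threshold:
            clusters[-1].append(query)
        else:
            clusters.append([query])
    return clusters

week = ["easter bunny", "Easter basket", "easter 2006", "world cup", "world cup 2006"]
for cluster in cluster_queries(week):
    print(cluster)
# ['easter 2006', 'easter basket', 'easter bunny']
# ['world cup', 'world cup 2006']
```

Sorting first means lexically related queries end up adjacent, so one linear pass is enough and no all-pairs comparison is needed.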


Event-related Query Classification

Classify a query as event-related or not:

Periodic and seasonal events

Popular and trending events

Sporadic (rare) and unseen events

General time-sensitive queries

Underlying temporal information needs

Features:

Time-series features, e.g., seasonality or trends

Popularity-based features, e.g., click-through and burstiness

Statistical features, e.g., probability distribution of results, temporal KL-divergence and skewness (kurtosis)


Seasonality

Detect seasonal queries [Shokouhi, SIGIR 2011]

E.g., annual events such as the US Open and Easter (query: Easter), or a 4-year recurring event such as the FIFA World Cup (query: World cup)

Method: time-series decomposition using Holt-Winters adaptive exponential smoothing

Input: time-series data extracted from external document collections, YD

Compute a cosine similarity as seasonality

Y is the original time-series data

S is the seasonality component
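A minimal sketch of the seasonality feature as the cosine similarity between Y and S. To stay self-contained it does not implement Holt-Winters; instead it uses a naive seasonal profile (the mean of each position in the cycle) as a stand-in for the seasonality component, which is an assumption made only for this example.

```python
import math

def seasonal_component(series, period):
    """Naive stand-in for S: per-position mean over all cycles, repeated."""
    if period < 1 or len(series) < period:
        raise ValueError("series must cover at least one full period")
    profile = []
    for k in range(period):
        values = series[k::period]
        profile.append(sum(values) / len(values))
    return [profile[i % period] for i in range(len(series))]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def seasonality(series, period):
    """Cosine similarity between the series Y and its seasonal component S."""
    return cosine(series, seasonal_component(series, period))

periodic = [0, 0, 10, 0] * 3                        # same peak position in every cycle
irregular = [10, 0, 0, 0, 0, 10, 0, 0, 0, 0, 10, 0]  # peak drifts between cycles
print(seasonality(periodic, 4), seasonality(irregular, 4))
```

A query whose frequency peaks at the same point in every cycle scores close to 1, while one whose peaks drift scores lower.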


Autocorrelation

Detect trending events by their predictability

Cross-correlation of a time series with itself, i.e., between its past and future values at different time lags

The stronger the inter-day dependencies, the higher the value of autocorrelation

With lag = 1, the second time series is shifted by one day; this is called 1st-order autocorrelation
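A minimal sketch of 1st-order autocorrelation for a daily query-frequency series, using the standard sample estimator (lagged covariance divided by the overall variance). The function name and the toy series are illustrative only.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of a series with a copy of itself shifted by `lag`."""
    n = len(series)
    if n <= lag:
        raise ValueError("series must be longer than the lag")
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        return 0.0  # constant series: no dependency to measure
    covariance = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance / variance

trending = [1, 2, 3, 4, 5, 6, 7, 8]     # each day predicts the next
alternating = [1, -1, 1, -1, 1, -1]     # each day predicts the opposite
print(autocorrelation(trending))        # 0.625
print(autocorrelation(alternating))
```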


Temporal KL-divergence

Analyze a temporal distribution in a result set

Measure the difference between the distribution over time

of top-k documents of q and the document collection C

P(t|q) is the probability of generating a publication date t

given q

P(t|C) is the probability of a publication date t in the

collection
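A minimal sketch of temporal KL-divergence, assuming the usual form Σ_t P(t|q) · log(P(t|q) / P(t|C)) with both distributions estimated from raw counts of publication dates (years here). The estimation by simple counting and the function name are assumptions for illustration.

```python
import math
from collections import Counter

def temporal_kl(result_dates, collection_dates):
    """KL divergence between the date distribution of top-k results and the collection."""
    p_q, p_c = Counter(result_dates), Counter(collection_dates)
    n_q, n_c = len(result_dates), len(collection_dates)
    divergence = 0.0
    for date, count in p_q.items():
        if p_c[date] == 0:
            # Top-k documents come from the collection, so this should not happen.
            raise ValueError("result date missing from the collection")
        p_t_q = count / n_q
        p_t_c = p_c[date] / n_c
        divergence += p_t_q * math.log(p_t_q / p_t_c)
    return divergence

collection = list(range(1987, 2007))    # one document per year, 1987 to 2006
bursty_results = [2006] * 10            # all top-10 results published in 2006
flat_results = list(range(1987, 2007))  # results spread like the collection
print(temporal_kl(bursty_results, collection))
print(temporal_kl(flat_results, collection))
```

A query whose results concentrate on a few dates diverges strongly from the collection, while a query whose results follow the collection's own distribution scores 0.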


Surprise Score

Detect unseen events or surprisingly popular

queries [Radinsky et al. , WWW 2012]

Assume an unplanned event happening when there is

a significant prediction error

Compute the sum of squared errors of prediction

(SSE) using a simple linear regression model
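A minimal sketch of an SSE-based surprise score, assuming a least-squares line is fitted to the query-frequency series over its time index and the squared residuals are summed. How the cited work fits and evaluates its model may differ; the function name and toy series are illustrative.

```python
def surprise_score(series):
    """Sum of squared errors of a simple linear regression over the time index."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations")
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, series)) / sxx
    intercept = mean_y - slope * mean_x
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, series))

steady = [1, 2, 3, 4, 5]     # perfectly predictable growth
spiky = [1, 1, 1, 50, 1]     # sudden, unplanned burst on day 4
print(surprise_score(steady), surprise_score(spiky))
```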


Experiments

Query logs:

• Two datasets, i.e., AOL and MSN

• AOL: 30M queries March 1 - May 31, 2006

• MSN: 15M queries from May 2006

Temporal collection:

• The New York Times Annotated Corpus

• 1.8M documents from 1987 - 2007

Setting:

• HeidelTime for time extraction and OpenNLP for entity extraction

• Cleansing-step parameters: Jaccard similarity threshold > 0.2; edit distance < 3; overlap n-gram = 2

• For burstiness features, default parameters for the burst detection technique provided by CISHELL

In total, 837 event-related queries


Experimental Results (I)

Feature selection:

• Study high-impact (best) features

• Investigate their importance independent from classification algorithms

• InfoGainAttributeEval method in WEKA

Main findings:

• Discriminative features are mostly derived from D and Q

• TemporalKL and kurtosis are among the influential features

• Trend-based features, such as autocorrelation, burst weight, and trending level, play an important role

• Seasonality computed from Q has less impact than the one extracted from D


Experimental Results (II)

Query classification:

• Several classifiers, i.e., support vector machine (SVM), AdaBoost, decision tree (J48), and neural network (NN)

• Metrics: accuracy, precision, recall, F-measure using 10-fold cross validation

Main findings:

• J48 is the best performing algorithm

• TemporalKL achieves accuracy of 84%

• Adding autocorrelation, kurtosis, and seasonality increases the performance

• However, the performance drops after adding further features such as max. query frequency


Analyzing Top-k Search Results

Using temporal language models

Determine time of queries when no time is given explicitly

Re-rank search results using the determined time

Exploiting time from search snippets

Extract temporal expressions (i.e., years) from the contents of top-k

retrieved web snippets for a given query

Content-based language-independent approach

[Kanhabua and Nørvåg, ECDL 2010; Campos et al., CIKM 2012]

Determining Time of Queries

Approach I. Dating using keywords*

Approach II. Dating using top-k documents*

Queries are short keywords

Inspired by pseudo-relevance feedback

Approach III. Using timestamp of top-k documents

No temporal language models are used

*Using Temporal Language Models proposed by de Jong et al.

[Kanhabua and Nørvåg, ECDL 2010] (Slide provided by the authors)

I. Dating using Keywords

Figure: query’s temporal profiles

II. Dating using Top-k Documents

Figure: query’s temporal profiles

III. Using Timestamp of Documents

Figure: query’s temporal profiles


Re-ranking Search Results

Intuition: documents published close to the time of queries are more relevant

Assign document priors based on publication dates

Figure: a query is issued against a news archive and its time is determined (2005, 2004, 2006, ...); initial retrieved results (e.g., D2009) are turned into re-ranked results (e.g., D2005)


Change of Query Subtopics over Time

Query: ncaa

14/03/2006: march madness began

18/03/2006: ncaa women tournament began

01/04/2006: final four began

[Nguyen and Kanhabua, ECIR 2014]

Mining Temporal Anchor Texts

Anchor texts are complementary descriptions for target pages, widely used to improve search

Characteristics:

Short summary (a few words) of target pages

Collective wisdom of people other than authors

Similar behavior to real-world queries and titles

Capturing aboutness or what a document is about

Main ideas:

Temporal anchor texts mined from the edit history of Wikipedia as a hook for tracking entity evolution

Large-scale analysis and a more robust discovery of evolving information using limited resources


1. Partition Wikipedia revisions using the one-month granularity

2. For each Wikipedia snapshot, identify named entity articles/pages

3. Extract anchor texts from all articles linking to an entity page

4. Rank aggregated entity-anchor relationships at a particular time t

Example: anchor texts pointing to the entity page President_of_the_United_States include “George W. Bush” (time: 11/2004), “President Bush (43)” (time: 10/2005), and “Barack Obama” (time: 11/2008)

[Kanhabua and Nørvåg, JCDL 2010]

Recognizing Named Entity Articles

1. Multi-word title with all words capitalized, except prepositions, determiners, etc.

E.g., President_of_the_United_States => entity

2. Single-word titles with multiple capital letters

E.g., UNICEF and WHO => entities

3. 75% of occurrences in the article text itself are capitalized (not beginning of sentence)

[Bunescu and Pasca, EACL 2006]
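The three heuristics can be sketched as a single predicate over an article title and its text. The stopword list, the sentence-boundary test, and the function name are simplifications assumed for this example; the cited work may define them differently.

```python
import re

# Assumed, deliberately short list of words that may stay lowercase in a title.
STOPWORDS = {"of", "the", "and", "in", "for", "on", "at", "to", "a", "an"}

def is_named_entity(title, article_text=""):
    """Apply the three title heuristics to a Wikipedia-style title."""
    words = [w for w in title.split("_") if w]
    # Rule 1: multi-word title, all content words capitalized.
    if len(words) > 1:
        content = [w for w in words if w.lower() not in STOPWORDS]
        return bool(content) and all(w[0].isupper() for w in content)
    # Rule 2: single-word title with multiple capital letters.
    if sum(ch.isupper() for ch in title) > 1:
        return True
    # Rule 3: at least 75% of non-sentence-initial occurrences are capitalized.
    counted = capitalized = 0
    pattern = r"\b" + re.escape(title) + r"\b"
    for match in re.finditer(pattern, article_text, flags=re.IGNORECASE):
        before = article_text[: match.start()].rstrip()
        if not before or before[-1] in ".!?":
            continue  # occurrence starts a sentence, so its case is uninformative
        counted += 1
        capitalized += match.group()[0].isupper()
    return counted > 0 and capitalized / counted >= 0.75

print(is_named_entity("President_of_the_United_States"))   # True
print(is_named_entity("UNICEF"))                            # True
print(is_named_entity("Norway", "They flew to Norway and left Norway by ship."))   # True
print(is_named_entity("Tsunami", "A tsunami hit the coast and the tsunami was large."))  # False
```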

Temporal Anchor Weighting

Weight anchor texts by importance with respect to a target entity at a particular time:

• Link-independent: inlink pages are independent and equally important to the target page

• Compute based on the whole collection of Wikipedia entity pages at a particular time t

• Two variants: 1) article links, and 2) distinct pages

[Dou et al., SIGIR 2009]

Experiments

Data collection:

• A dump of English Wikipedia edit history (2.8 TB)

• All pages and revisions 03/2001 to 03/2008

• 85 snapshots + 4 additional snapshots

(24/05/2008, 27/07/2008, 08/10/2008, 06/03/2009)

Tools:

• Preprocess/store revisions using MWDumper http://www.mediawiki.org/wiki/Mwdumper

• Store anchor texts: MySQL databases

Top-100 Named Entities



Evolving Context

Anchor texts for “Barack Obama” over time (05/2008, 07/2008, 10/2008, 03/2009), one ranked list per snapshot:

1. Senator Barack Obama
2. Senator Obama's legislative accomplishments
3. Illinois
4. U.S. Sen. Barack Obama

1. …
2. Illinois Senator Barack Obama
3. Barack Hussein Obama II
4. Senator Obama's legislative accomplishments

1. Senator Barack Obama
2. … Obama
3. Barak Obama, U.S. Senator, Illinois, 2008 Democratic nominee for U.S. President
4. presidential candidacy announcement

1. President Barack Obama
2. …
3. U.S. President Barack Obama
4. 44th President of the United States
5. Obama Administration

Main Findings

Evolving information & context

• Role changes for political entities

• Geographic name changes for locations

• Trends or things in vogue for celebrities

• Products in demand for technology

Questions?

Time-aware Retrieval and Ranking

(1) Recency-based Ranking

(2) Time-dependent Ranking



RECAP

Two time dimensions

1. Publication or modified time

2. Content or focus time



Searching the past

Historical or temporal information needs

A journalist working on the historical story of a particular news article

A Wikipedia contributor finding relevant information that has not been written about yet

“Temporal document collections”: Web archives, news archives, blogs, emails

Example: retrieve documents about Pope Benedict XVI written before 2005

Term-based IR approaches may give unsatisfactory results

Temporal Query Examples

A temporal query consists of:

Query keywords

Temporal expressions

A document consists of:

Terms, i.e., bag-of-words

Publication time and temporal expressions



Temporal Query Examples

[Berberich et al., ECIR 2010]

Assign prior probabilities using an exponential function

E.g., a document with a more recent creation date obtains a higher prior probability

Current approaches:

Time-based language model [Li and Croft, CIKM 2003]

Using retention functions [Peetz and de Rijke, ECIR 2013]

Incorporating freshness into web authority [Dai and Davison, SIGIR 2010]

Recency-based Ranking
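
The exponential prior can be illustrated with a short sketch. It assumes the common form prior = rate · exp(−rate · age) and combines it with a query-likelihood score in log space. The function names, the age unit (days) and the default rate of 0.01 are illustrative assumptions, not values from the cited papers.

```python
import math


def log_recency_prior(doc_age_days: float, rate: float = 0.01) -> float:
    """Log of the exponential prior rate * exp(-rate * age)."""
    # Computed in log space so that very old documents do not underflow to zero.
    return math.log(rate) - rate * doc_age_days


def recency_score(log_query_likelihood: float, doc_age_days: float,
                  rate: float = 0.01) -> float:
    """Rank score: log P(q|d) plus the log recency prior of d."""
    return log_query_likelihood + log_recency_prior(doc_age_days, rate)
```

With equal textual scores, a 10-day-old document outranks a 100-day-old one; a larger `rate` makes the preference for fresh documents stronger.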



Time must be explicitly modeled in order to increase the effectiveness of ranking

To order search results so that the most relevant ones come first

Time uncertainty should be taken into account

Two temporal expressions can refer to the same time period even though they are written differently

E.g., the query “Independence Day 2011”

A retrieval model relying on term-matching only will fail to retrieve documents mentioning “July 4, 2011”

Time-dependent Ranking



Time-dependent Ranking

Two main approaches:

1. Mixture model [Kanhabua et al., ECDL 2010]

Linearly combining textual and temporal similarity

2. Probabilistic model [Berberich et al., ECIR 2010]

Generating a query from the textual part and temporal part of a document independently

Mixture Model

Linearly combine textual and temporal similarity

α controls the relative importance of the two similarity scores

Both scores are normalized before combining

Textual similarity can be determined using any term-based retrieval model

E.g., tf.idf or a unigram language model
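
A minimal sketch of the linear combination, assuming α weights the temporal score and (1 − α) the textual score, and assuming min-max normalization. The slide states that scores are normalized but not how, so both choices and the function names are assumptions.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def mixture_scores(text_scores, time_scores, alpha=0.5):
    """Combine normalized textual and temporal similarity per document.

    Both inputs map the same document ids to raw scores.
    alpha = 0 ranks by text only, alpha = 1 by time only.
    """
    text_n, time_n = min_max(text_scores), min_max(time_scores)
    return {doc: (1 - alpha) * text_n[doc] + alpha * time_n[doc]
            for doc in text_scores}
```

Normalizing first keeps one score from dominating merely because it lives on a larger scale.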



How to determine temporal similarity?



Temporal Similarity

[Figure: similarity score plotted against time for documents d1 and d2 and query q, with distances Dist(d1,q) and Dist(d2,q)]

[Kanhabua et al., ECDL 2010]

Temporal Similarity

Assume that temporal expressions in the query are generated independently from a two-step generative model

P(tq|td) can be estimated based on publication time using an exponential decay function [Kanhabua et al., ECDL 2010]

Linear interpolation smoothing is applied to eliminate zero probabilities

I.e., an unseen temporal expression tq in d
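
The decay-based estimate can be sketched as below, assuming a decay of the form DecayRate^(λ · |tq − td| / μ) and a constant background probability for the linear interpolation. Times are plain day numbers. The parameter values, the constant background model and the function names are illustrative assumptions, not settings reported by the authors.

```python
def p_tq_given_td(tq, td, decay_rate=0.5, lam=0.5, mu=365.0):
    """Exponential decay over the distance between query time and document time.

    tq, td: times as day numbers; mu: unit of time distance (here one year).
    """
    return decay_rate ** (lam * abs(tq - td) / mu)


def temporal_similarity(query_times, doc_time, gamma=0.1, background=1e-3):
    """P(q_time | d_time), treating the query's temporal expressions as independent."""
    score = 1.0
    for tq in query_times:
        p = p_tq_given_td(tq, doc_time)
        # Linear interpolation with a (simplified, constant) background probability.
        score *= (1 - gamma) * p + gamma * background
    return score
```

A document published on the query date gets the maximum score, and the score shrinks smoothly as the publication time moves away from it.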



Comparison of time-aware ranking

Five time-aware ranking models

LMT [Berberich et al., ECIR 2010]

LMTU [Berberich et al., ECIR 2010]

TS [Kanhabua et al., ECDL 2010]

TSU [Kanhabua et al., ECDL 2010]

FuzzySet [Kalczynski et al., Inf. Process. Manage. 2005]

[Kanhabua et al., SIGIR 2011]

Experiment:

New York Times Annotated Corpus

40 temporal queries [Berberich et al., ECIR 2010]

Result:

TSU outperforms other methods significantly for most metrics

Conclusions:

Although TSU achieves the best performance, it can only be applied to a collection with time metadata

LMT and LMTU can be applied to any collection without time metadata, but time extraction is needed

Discussion



Applications for Temporal IR

(1) Searching the Future

(2) Time-aware Recontextualization



Searching the Future

People are naturally curious about the future

What will happen to EU economies in the next 5 years?

What will be the potential effects of climate change?

Previous work

Searching the future

Extract temporal expressions from news articles

Retrieve future information using a probabilistic model, i.e., multiplying textual similarity and a time confidence

Supporting analysis of future-related information in news and Web

Extract future mentions from news snippets obtained from search engines

Summarize and aggregate results using clustering methods, but no ranking

[Baeza-Yates, SIGIR Forum 2005; Jatowt et al., JCDL 2009]

Recorded Future

http://www.recordedfuture.com/



Yahoo! Time Explorer

[Matthews et al., HCIR 2010]

Ranking News Predictions

Over 32% of 2.5M documents from Yahoo! News (July ’09 – July ’10) contain at least one prediction

Retrieve predictions related to a news story in news archives and rank by relevance

Related News Predictions

[Kanhabua et al., SIGIR 2011]

Four classes of features

Term similarity, entity-based similarity, topic similarity and temporal similarity

Rank results using a learning-to-rank technique

Approach



[Kanhabua et al., SIGIR 2011]

Step 1: Document annotation.

Extract temporal expressions using time and event recognition

Normalize them to dates so they can be anchored on a timeline

Output: sentences annotated with named entities and dates, i.e., predictions

Step 2: Retrieving predictions.

Automatically generate a query from a news article being read

Retrieve predictions that match the query

Rank predictions by relevance (i.e., a prediction is “relevant” if it is about the topics of the article)

System Architecture



Capture the term similarity between q and p

1. TF-IDF scoring function

Problem: keyword matching, short texts

Predictions may not match the query terms

2. Field-aware ranking function, e.g., BM25F

Search the context of a prediction, i.e., surrounding sentences

Term Similarity




Measure the similarity between q and p using annotated entities in dp, p, q

Features commonly employed in entity ranking

Entity-based Similarity




Compute the similarity between q and p on topic

Latent Dirichlet allocation [Blei et al., J. Mach. Learn. Res. 2003] for modeling topics

1. Train a topic model

2. Infer topics

3. Compute topic similarity

Topic Similarity
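
Step 3 can be illustrated once topic distributions for q and p are available from steps 1 and 2. The sketch below assumes cosine similarity between the two topic vectors. The slide does not name the similarity measure, so cosine and the function name are assumptions, and the topic vectors in the usage note are made-up toy values.

```python
import math


def topic_similarity(topics_q, topics_p):
    """Cosine similarity between two topic distributions of equal length."""
    dot = sum(a * b for a, b in zip(topics_q, topics_p))
    norm_q = math.sqrt(sum(a * a for a in topics_q))
    norm_p = math.sqrt(sum(b * b for b in topics_p))
    # Return 0 when either vector is all zeros to avoid division by zero.
    return dot / (norm_q * norm_p) if norm_q and norm_p else 0.0
```

For instance, a query with toy topic vector `[0.7, 0.2, 0.1]` is closer to a prediction with `[0.6, 0.3, 0.1]` than to one with `[0.1, 0.1, 0.8]`, even if the two texts share few terms.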




Hypothesis I. Predictions that are more recent to the query are more relevant

Hypothesis II. Predictions extracted from more recent documents are more relevant

Temporal Similarity


Learning-to-rank: Given an unseen (q, p), p is ranked using a model trained over a set of labeled query/prediction pairs

SVM-MAP [Yue et al., SIGIR 2007]

RankSVM [Joachims, KDD 2002]

SGD-SVM [Zhang, ICML 2004]

PegasosSVM [Shalev-Shwartz et al., ICML 2007]

PA-Perceptron [Crammer et al., J. Mach. Learn. Res. 2006]

Ranking Method




42 future-related topics

Relevance Judgments




New York Times Annotated Corpus

1.8 million articles, over 20 years

More than 25% contain at least one prediction

Annotation process uses several language processing tools

OpenNLP for tokenizing, sentence splitting, part-of-speech tagging, shallow parsing

SuperSense tagger for named entity recognition

TARSQI for extracting temporal expressions

Apache Lucene for indexing and retrieving.

44,335,519 sentences and 548,491 predictions

939,455 future dates (avg. future date/prediction is 1.7)

Experiments




Results:

Topic features play an important role in ranking

The top-5 features with the lowest weights are entity-based features

Open issues:

Extract predictions from other sources, e.g., Wikipedia, blogs, comments, etc.

Sentiment analysis for future-related information

Discussion




Prior to 1964, many of the cigarette companies advertised their brand by falsely claiming that their product did not have serious health risks. A couple of examples would be "Play safe with Philip Morris" and "More doctors smoke Camels". Such claims were made both to increase the sales of their product and to combat the increasing public knowledge of smoking's negative health effects.

Advertisement poster from the 1950s

Time-aware Contextualization

[Tran et al., WSDM 2015] (Slide provided by the authors)

Physician

http://en.wikipedia.org/wiki/Physician

Camel (cigarette)

http://en.wikipedia.org/wiki/Camel_(cigarette)

Cigarette

http://en.wikipedia.org/wiki/Cigarette

Entity linking is not sufficient

Wikipedia pages tend to contain large amounts of content

Relevant information might be distributed over various articles

The crucial temporal aspect is missing in pure linking approaches

Entity Linking


Problem Statement

Time-aware contextualization aims to associate an information item d with time-aware, concise and coherent context information c for easing its understanding

Several sub-goals of the information search process have to be combined with each other

c has to be relevant for d

c has to complement the information already available in d

c has to consider the time of creation of d

the context information should be concise to avoid overloading the user

[Tran et al., WSDM 2015] (Slide provided by the authors)

[Figure: system pipeline. Components: Article, Context Hook Identification, Query Formulation, Context Retrieval, Contextualization Units Extraction, Contextualization Units Index, Context Ranking, Context, User]

Approach Overview


The goal is to generate a set of queries for a given document to retrieve candidates as input for the re-ranking step

We explore two families of query formulation methods

Document-based methods: title, lead, title+lead

Hook-based methods: each_hook, all_hooks, and query performance prediction (qpp_r@k) with the following features

Linguistic features

Document frequency

Scope

Temporal document frequency

Temporal scope

Temporal similarity

Query Formulation


Context retrieval:

Learning to rank context:

• The ranking algorithm needs to balance two goals, i.e., high topical and temporal relevance as well as complementarity for providing additional information

• Use supervised machine learning that takes as input a set of labeled examples and various complementarity features

Topic diversity

Text difference

Entity difference

Anchor text difference

Distributional similarity

Cosine distance

Relevance

Temporal similarity

Context Ranking


Experiments

Datasets:

51 news articles from New York Times Corpus

Wikipedia (2013), 26 million contextualization units (paragraphs)

9,464 manually labeled examples (article/context pairs)

Learning to rank algorithms: RankBoost, Random Forests and AdaRank

Baselines

Entity linking (Milne and Witten)

Language model (LM)

Time-aware language model (LM-T)

[Tran et al., WSDM 2015] (Slide provided by the authors)

Evaluating Query Formulation Methods

[Figure: bar chart of P@1, P@3, P@10 and MAP for title+lead, all_hooks and qpp_r@100]

Wikification technique achieves a low recall of 0.229

Hook-based approaches outperform the document-based approaches

Query performance prediction method obtains the highest results on all metrics

[Tran et al., WSDM 2015] (Slide provided by the authors)

The Effect of Complementarity Features

[Figure: bar chart of P@1, P@3, P@10 and MAP for LM-T and RF]

Purely using the time dimension in context retrieval is not sufficient in the contextualization task

Complementarity plays an important role in contextualization


Conclusions and Outlook

Introduced the general topic of web evolution.

Pinpointed a number of issues related to temporal IR.

Focused on temporal information extraction, temporal query analysis, as well as time-aware retrieval and ranking.

Wrapped up with related applications to temporal IR.

Future directions:

Real-time web mining

Spatio-temporal search and analytics

Brain-inspired information access



Thank you!

