
Time-aware Approaches to Information Retrieval

Nattiya Kanhabua

Department of Computer and Information Science Norwegian University of Science and Technology

24 February 2012


Motivation

• Searching documents created/edited over time, e.g., web archives, news archives, blogs, or emails ("temporal document collections")

• Example: retrieve documents about Pope Benedict XVI written before 2005

• Term-based IR approaches may give unsatisfactory results


Wayback Machine¹

• A web archive search tool by the Internet Archive
  – Query by a URL, e.g., http://www.ntnu.no
  – No keyword query
  – No relevance ranking

¹ Retrieved on 15 January 2011


Google News Archive Search

• A news archive search tool by Google
  – Query by keywords
  – Rank results by relevance or date
  – Does not consider terminology changes over time


Objective of PhD thesis

• Study problems of temporal search
• Propose approaches to solve these problems

• Main research question: "How to exploit temporal information in documents, queries, and external sources in order to improve retrieval effectiveness?"


Outline of contributions

Part I - Content Analysis
  RQ1: How to determine the time of non-timestamped documents?

Part II - Query Analysis
  RQ2: How to determine the time of queries?
  RQ3: How to handle terminology changes over time?
  RQ4: How to predict the effectiveness of temporal queries?
  RQ5: How to predict the most suitable time-aware ranking?

Part III - Retrieval and Ranking Models
  RQ6: How to model time in retrieval and ranking?
  RQ7: How to combine different features and time for ranking?

PART I - CONTENT ANALYSIS


RQ1: Determining the time of documents

Problem statements
• Difficult to find a trustworthy time for web documents
  – Time gap between crawling and indexing
  – Decentralization and relocation of web documents
  – No standard metadata for time/date

• Example: "I found a Bible-like document, but I have no idea when it was created." – "Let me see… this document was probably written in 850 A.D., with 95% confidence."

"For a given document with an uncertain timestamp, can the contents be used to determine the timestamp with sufficiently high confidence?"


Preliminaries

Temporal language models store word usage statistics per time partition of a reference corpus:

  Partition | Word
  ----------|-----------
  1999      | tsunami
  1999      | Japan
  1999      | tidal wave
  2004      | tsunami
  2004      | Thailand
  2004      | earthquake

A non-timestamped document containing the words "tsunami" and "Thailand" yields the similarity scores
  Score(1999) = 1
  Score(2004) = 1 + 1 = 2
so the most likely timestamp is 2004.

Temporal Language Models [de Jong 2005]

• Based on the statistics of word usage over time
• Compare each word of a non-timestamped document with a reference corpus
• Tentative timestamp: the time partition that most overlaps in word usage (sketch below)
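To make the partition scoring concrete, here is a minimal Python sketch of this kind of dating, using a normalized log-likelihood ratio (NLLR) as the similarity measure; the smoothing weight, partition granularity, and toy counts are illustrative assumptions rather than the thesis's exact configuration.

```python
import math
from collections import Counter

def nllr(doc_terms, partition_tf, collection_tf, lam=0.1):
    """Normalized log-likelihood ratio between a document and one time
    partition of the reference corpus. partition_tf and collection_tf are
    term-frequency Counters; lam is a Jelinek-Mercer smoothing weight."""
    doc_tf = Counter(doc_terms)
    doc_len = sum(doc_tf.values())
    part_len = sum(partition_tf.values()) or 1
    coll_len = sum(collection_tf.values()) or 1
    score = 0.0
    for term, tf in doc_tf.items():
        p_doc = tf / doc_len
        p_part = ((1 - lam) * partition_tf.get(term, 0) / part_len
                  + lam * collection_tf.get(term, 0) / coll_len)
        p_coll = collection_tf.get(term, 0) / coll_len
        if p_part > 0 and p_coll > 0:
            score += p_doc * math.log(p_part / p_coll)
    return score

def date_document(doc_terms, partitions, collection_tf):
    """Pick the time partition (e.g., a year) with the highest NLLR score."""
    return max(partitions, key=lambda p: nllr(doc_terms, partitions[p], collection_tf))

partitions = {
    "1999": Counter({"tsunami": 2, "japan": 3, "tidal": 1, "wave": 1}),
    "2004": Counter({"tsunami": 5, "thailand": 4, "earthquake": 3}),
}
collection_tf = partitions["1999"] + partitions["2004"]
print(date_document(["tsunami", "thailand"], partitions, collection_tf))  # "2004"
```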


Improving document dating

Three enhancement techniques:
1. Semantic-based data preprocessing
   – Intuition: a direct comparison between extracted words and corpus partitions has limited accuracy
   – Approach: integrate semantic-based techniques into document preprocessing
2. Search statistics to enhance similarity scores
   – Intuition: search statistics from Google Zeitgeist (GZ) can increase the probability of a tentative time partition
   – Approach: linearly combine a GZ score with the normalized log-likelihood ratio (NLLR)
3. Temporal entropy as term weights
   – Intuition: a term's weight should reflect how well it discriminates between time partitions
   – Approach: temporal entropy, based on the term selection method of Lochbaum and Streeter (see the sketch below)

Nattiya Kanhabua and Kjetil Nørvåg, Improving Temporal Language Models for Determining Time of Non-Timestamped Documents, In Proceedings of the European Conference on Research and Advanced Technology for Digital Libraries (ECDL), 2008.
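As a rough illustration of the third technique, the sketch below computes an entropy-based term weight over time partitions; it follows the standard entropy term-weighting form attributed to Lochbaum and Streeter, and the exact normalization used in the thesis may differ.

```python
import math

def temporal_entropy(term_freq_by_partition, num_partitions):
    """Weight of a term based on how well it discriminates between time
    partitions. term_freq_by_partition maps partition id -> frequency of the
    term in that partition; num_partitions is the total number of partitions
    in the corpus."""
    total = sum(term_freq_by_partition.values())
    if total == 0 or num_partitions <= 1:
        return 0.0
    entropy = sum((tf / total) * math.log(tf / total)
                  for tf in term_freq_by_partition.values() if tf > 0)
    # entropy lies in [-log(num_partitions), 0]; shift and scale it to [0, 1]
    return 1.0 + entropy / math.log(num_partitions)

# A bursty, discriminative term gets a weight near 1; a term spread evenly
# over all partitions gets a weight near 0.
print(temporal_entropy({"2005": 95, "2006": 5}, num_partitions=8))    # ~0.90
print(temporal_entropy({str(y): 10 for y in range(2000, 2008)}, 8))   # 0.0
```

Such a weight would then scale each term's contribution in the partition-scoring function sketched above.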


Experiments

• Collection
  – 9,000 documents collected from the Internet Archive
  – 8-year time span, 15 news sources
  – 1,000 randomly selected documents for testing
• Results
  – The proposed techniques improve over the baseline
  – Precision = the fraction of documents correctly dated
• Open issue
  – The effectiveness of document dating is still limited: it depends highly on the quality of the reference corpus

PART II - QUERY ANALYSIS


Challenges with temporal queries

• Semantic gaps: lacking knowledge about
  1. the possibly relevant time of queries (suggest time1, time2, …, timek for a query)
  2. terminology changes over time (suggest synonym@2001, synonym@2002, …, synonym@2011 for a query)

RQ2: Determining the time of queries


Problem statements
• 1.5% of web queries contain an explicit temporal expression [Nunes 2008]
  – Time is part of the query, e.g., "U.S. Presidential election 2008"
• About 7% of web queries have an implicit temporal intent [Metzler 2009]
  – Time is not given in the query, e.g., "Germany World Cup" or "tsunami"
  – Difficult to achieve high accuracy using keywords only
  – Relevant documents associated with a particular time are not given


Our contributions

1. Determining the time of queries when no time is given
2. Re-ranking search results using the determined time

Nattiya Kanhabua and Kjetil Nørvåg, Determining Time of Queries for Re-ranking Search Results, In Proceedings of the 14th European Conference on Research and Advanced Technology for Digital Libraries (ECDL), 2010.


Determining the time of queries

• Approach I: dating using keywords*
• Approach II: dating using top-k retrieved documents*
  – Queries are short keywords
  – Inspired by pseudo-relevance feedback
• Approach III: using the timestamps of top-k retrieved documents
  – No temporal language models are used (see the sketch below)

* Using the temporal language models proposed by de Jong et al.
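A minimal sketch of Approach III, assuming a hypothetical `search` backend that returns documents ranked by relevance together with their publication years; the majority-vote aggregation is illustrative.

```python
from collections import Counter

def determine_query_time(search, query, k=5):
    """Infer likely time(s) of a query from the publication dates of its
    top-k retrieved documents. `search` stands in for any retrieval backend
    returning (doc_id, publication_year) pairs ranked by relevance."""
    top_docs = search(query)[:k]
    year_votes = Counter(year for _doc_id, year in top_docs)
    # Rank candidate years by how often they appear among the top-k documents
    return [year for year, _count in year_votes.most_common()]

# Example with a toy backend:
def toy_search(query):
    return [("d1", 2004), ("d2", 2004), ("d3", 2005), ("d4", 2004), ("d5", 1999)]

print(determine_query_time(toy_search, "tsunami", k=5))  # 2004 gets the most votes
```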


Re-ranking search results

• Intuition: documents published close to the time of the query are more relevant (see the sketch below)
  – Assign document priors based on publication dates
• Pipeline: query → determine its time against the news archive (e.g., 2005, 2004, 2006, …) → re-rank the initially retrieved results so that, e.g., a document from 2005 is promoted above one from 2009
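A hedged sketch of the re-ranking step; the exponential-decay prior and the linear combination with the retrieval score are illustrative choices, not necessarily the exact model used in the thesis.

```python
import math

def time_prior(doc_year, query_years, decay=0.5):
    """Prior favouring documents published close to any of the query's
    determined years (illustrative exponential decay on the year gap)."""
    gap = min(abs(doc_year - qy) for qy in query_years)
    return math.exp(-decay * gap)

def rerank(results, query_years, alpha=0.7):
    """results: list of (doc_id, retrieval_score, publication_year).
    Combine the original retrieval score with the temporal prior."""
    rescored = [(doc_id, alpha * score + (1 - alpha) * time_prior(year, query_years), year)
                for doc_id, score, year in results]
    return sorted(rescored, key=lambda x: x[1], reverse=True)

# A 2005 document overtakes a slightly higher-scored 2009 document
results = [("d2009", 0.82, 2009), ("d2005", 0.80, 2005)]
print(rerank(results, query_years=[2005, 2004, 2006]))
```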


Experiments: Part 1 – Determining the time of queries
• Collection
  – NYT Corpus: over 1.8M articles (1987-2007)
  – 30 time-sensitive queries from TREC Robust 2004
• Results (precision = the fraction of queries correctly dated)
  – The smaller the top-k, the better the precision (k=5 > k=10 > k=15)
  – The larger the granularity g, the better the precision (g=12 months > g=6 months)


Experiments: Part 2 – Re-ranking of search results
• Collections
  – TREC Robust 2004: 30 time-sensitive queries
  – NYT Corpus: 24 queries from Google Zeitgeist
• Results
  – Approach III (no temporal language models) outperforms all other approaches
  – Using publication dates directly is more accurate than the dating process
• Open issue
  – Time can improve retrieval effectiveness further if query dating becomes more accurate



RQ3: Handling terminology changes

Problem statements
• Queries composed of named entities (people, organizations, locations)
  – Highly dynamic in appearance, i.e., relationships between terms change over time
  – E.g., changes of roles, name alterations, or semantic shift

Scenario 1: Query "Pope Benedict XVI", written before 2005
  – Documents about "Joseph Alois Ratzinger" are relevant
Scenario 2: Query "Hillary R. Clinton", written from 1997 to 2002
  – Documents about "New York Senator" and "First Lady of the United States" are relevant


QUEST Demo: http://research.idi.ntnu.no/wislab/quest/


Our contributions

• Discover time-based synonyms using Wikipedia
  – Generally, synonyms are words with similar meanings
  – This work treats synonyms as alternative names of an entity
• Improve the accuracy of the time of synonyms
• Query expansion using time-based synonyms

Nattiya Kanhabua and Kjetil Nørvåg, Exploiting Time-based Synonyms in Searching Document Archives, In Proceedings of the ACM/IEEE Conference on Digital Libraries (JCDL), 2010.


Recognize named entities


Find synonyms

• Find a set of entity-synonym relationships at time tk
• For each entity ei ∈ Etk, extract anchor texts from Wikipedia article links (see the sketch below), e.g.:
  – Entity: President_of_the_United_States
  – Synonyms (anchor texts): "George W. Bush", "President George W. Bush", "President Bush (43)"
  – Time: 11/2004
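A sketch of the extraction step, assuming the Wikipedia snapshot has already been parsed into (anchor text, target article, snapshot time) triples; the input format is an assumption, not a particular library's API.

```python
from collections import defaultdict

def extract_entity_synonyms(links):
    """Build entity -> {synonym -> set of times} from Wikipedia anchor texts.
    `links` is a hypothetical iterable of (anchor_text, target_article,
    snapshot_time) triples obtained from a parsed Wikipedia snapshot."""
    synonyms = defaultdict(lambda: defaultdict(set))
    for anchor_text, target_article, snapshot_time in links:
        # Keep anchors that differ from the article title itself
        if anchor_text and anchor_text != target_article.replace("_", " "):
            synonyms[target_article][anchor_text].add(snapshot_time)
    return synonyms

links = [
    ("George W. Bush", "President_of_the_United_States", "11/2004"),
    ("President Bush (43)", "President_of_the_United_States", "11/2004"),
    ("Joseph Alois Ratzinger", "Pope_Benedict_XVI", "03/2005"),
]
syn = extract_entity_synonyms(links)
print(dict(syn["President_of_the_United_States"]))
```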

Initial results

• The discovered time periods are not accurate
  – Note: the time of a synonym is taken from the timestamps of the Wikipedia snapshots in which it appears (an 8-year span)


Enhancement using NYT

• Analyze the NYT Corpus to discover more accurate time periods
  – 20-year time span (1987-2007)
• Use the burst detection algorithm of [Kleinberg 2003]
  – Time periods of synonyms = burst intervals (see the sketch below)
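For illustration, the sketch below marks burst intervals with a simple global-average threshold; this is a simplified stand-in, not Kleinberg's two-state automaton that the thesis actually uses.

```python
def burst_intervals(counts, threshold=2.0):
    """Flag months whose mention count exceeds `threshold` times the average
    and merge consecutive bursty months into intervals.

    counts: chronologically ordered list of (month, mention_count) pairs,
    e.g. how often the synonym "Joseph Ratzinger" appears in NYT per month."""
    if not counts:
        return []
    avg = sum(c for _, c in counts) / len(counts)
    intervals, start, prev = [], None, None
    for month, c in counts:
        if c >= threshold * avg:
            if start is None:
                start = month
            prev = month
        elif start is not None:
            intervals.append((start, prev))
            start = None
    if start is not None:
        intervals.append((start, prev))
    return intervals

counts = [("2005-01", 3), ("2005-02", 4), ("2005-03", 5),
          ("2005-04", 60), ("2005-05", 45), ("2005-06", 6)]
print(burst_intervals(counts))  # [('2005-04', '2005-05')]
```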


Query expansion

1. A user enters an entity as a query
2. The system retrieves synonyms with respect to the query
3. The user selects synonyms to expand the query (see the sketch below)

QUEST Demo: http://research.idi.ntnu.no/wislab/quest/
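A minimal sketch of the expansion step, assuming a synonym index with validity intervals as produced by the earlier extraction and burst-detection steps; the intervals and the OR-style query syntax are illustrative.

```python
def expand_query(entity, synonym_index, query_time=None):
    """Expand an entity query with its time-based synonyms. synonym_index
    maps entity -> list of (synonym, (start_year, end_year)); query_time is
    an optional year constraining which synonyms apply."""
    terms = [entity]
    for synonym, (start, end) in synonym_index.get(entity, []):
        # Keep only synonyms whose validity interval covers the query time
        if query_time is None or start <= query_time <= end:
            terms.append(synonym)
    # A simple OR-style expanded keyword query
    return " OR ".join(f'"{t}"' for t in terms)

# Illustrative intervals, not thesis results
index = {"Pope Benedict XVI": [("Joseph Alois Ratzinger", (1981, 2005)),
                               ("Pope Benedict XVI", (2005, 2011))]}
print(expand_query("Pope Benedict XVI", index, query_time=2003))
# "Pope Benedict XVI" OR "Joseph Alois Ratzinger"
```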


Experiments

Part 1 – Synonym detection
• Collection
  – The whole history of the English Wikipedia: all pages and revisions from 03/2001 to 03/2008 (85 monthly snapshots, about 2.8 TB)
• Results
  – 500 randomly selected entity-synonym relationships were evaluated
  – Accuracy 51% over all entity types; 73% for people, organizations, and companies

Part 2 – Query expansion
• Collections
  – TREC Robust 2004 track (250 queries)
  – NewsLibrary.com, over 100M U.S. news articles (20 temporal queries)
• Results
  – Baseline: probabilistic model without query expansion
  – Query expansion significantly improves effectiveness over the baseline for both collections
• Open issue
  – Only name changes of famous persons can be discovered


Query prediction problems

Two problems are addressed:
1. Performance prediction
   – Predict the retrieval effectiveness (e.g., precision, recall, MAP) that a query will achieve with respect to a ranking model
2. Ranking prediction
   – Predict the ranking model that is most suitable for a query, i.e., the one maximizing precision, recall, or MAP


RQ4: Query performance prediction

Problem statement
• Predict the effectiveness (e.g., MAP) that a query will achieve, in advance of or during retrieval [Hauff 2010]
  – high MAP = "good" query, low MAP = "poor" query

Objective
• Apply query enhancement techniques to improve overall performance
  – Query suggestion is applied to "poor" queries
• To the best of our knowledge, the performance of temporal queries has not been predicted before


Discussion

• Contributions
  – First study of performance prediction for temporal queries
  – 10 time-based pre-retrieval predictors, considering both text and time
• Experiment
  – Collection: NYT Corpus and 40 temporal queries [Berberich 2010]
• Results
  – Time-based predictors outperform keyword-based predictors
  – Combined predictors outperform single predictors in most cases
• Open issues
  – Increase the number of queries
  – Consider time uncertainty

Nattiya Kanhabua and Kjetil Nørvåg, Time-based Query Performance Predictors (poster), In Proceedings of the 34th Annual ACM SIGIR Conference (SIGIR), 2011.


RQ5: Time-aware ranking prediction

• Problem statement
  – Two time dimensions: publication time and content time
    (content time = temporal expressions mentioned in documents)
  – The effectiveness of ranking temporal queries differs depending on whether publication time or content time is used


Discussion

• Contributions
  – First study of how the two time dimensions affect the effectiveness of ranking models
  – Three features computed from the top-k retrieved documents (see the sketch below):
    • Temporal KL-divergence [Diaz 2004]
    • Content clarity [Cronen-Townsend 2002]
    • Divergence of retrieval scores [Peng 2010]
• Results
  – A small number of top-k documents achieves better performance
  – The larger k is, the more irrelevant documents are introduced into the analysis
• Open issue
  – Compared with the optimal case, there is still room for improvement

Nattiya Kanhabua, Klaus Berberich and Kjetil Nørvåg, Time-aware Ranking Prediction (under submission).
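As an illustration of the first feature, the sketch below computes a simplified temporal KL-divergence between the publication-year distribution of the top-k results and that of the whole collection; the estimation in [Diaz 2004] (and in the thesis) differs in smoothing and score weighting.

```python
import math
from collections import Counter

def temporal_kl(topk_years, collection_years, smoothing=1e-6):
    """KL divergence between the temporal profile of the top-k results and
    the collection-wide temporal distribution. A peaked top-k profile
    (high KL) suggests a temporally focused query."""
    topk = Counter(topk_years)
    coll = Counter(collection_years)
    years = set(coll) | set(topk)
    k_total = sum(topk.values())
    c_total = sum(coll.values())
    kl = 0.0
    for y in years:
        p = (topk.get(y, 0) + smoothing) / (k_total + smoothing * len(years))
        q = (coll.get(y, 0) + smoothing) / (c_total + smoothing * len(years))
        kl += p * math.log(p / q)
    return kl

coll = [1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007] * 10
print(temporal_kl([2004, 2004, 2005, 2004, 2005], coll))   # higher (temporally focused)
print(temporal_kl([1999, 2001, 2003, 2005, 2007], coll))   # lower (spread out)
```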

PART III - RETRIEVAL AND RANKING MODELS


RQ6: Time-aware ranking models

• Problem statements
  – Time must be explicitly modeled in order to increase retrieval effectiveness
  – Time uncertainty should be taken into account: two temporal expressions can refer to the same time period even though they are not written identically
• Example
  – For the query "Independence Day 2011", a retrieval model relying on term matching will fail to retrieve documents mentioning "July 4, 2011" (see the sketch below)
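One way to make such matching concrete is to normalize temporal expressions to time intervals and score their overlap; the sketch below is an illustrative measure, not one of the thesis's ranking models (LMT, LMTU, TSU), which are probabilistic.

```python
from datetime import date

def interval_overlap_score(query_interval, doc_interval):
    """Score how well a temporal expression in a document matches the time
    of a query by interval overlap (Jaccard-style on days). Both arguments
    are (start_date, end_date) pairs."""
    qs, qe = query_interval
    ds, de = doc_interval
    overlap = (min(qe, de) - max(qs, ds)).days + 1
    if overlap <= 0:
        return 0.0
    union = (max(qe, de) - min(qs, ds)).days + 1
    return overlap / union  # 1.0 for identical intervals

# "Independence Day 2011" and "July 4, 2011" denote the same day,
# so they match perfectly even though the surface strings differ.
q = (date(2011, 7, 4), date(2011, 7, 4))
print(interval_overlap_score(q, (date(2011, 7, 4), date(2011, 7, 4))))    # 1.0
print(interval_overlap_score(q, (date(2011, 7, 1), date(2011, 7, 31))))   # ~0.03
```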


Discussion

• Contributions
  – Analyze and compare five time-aware ranking methods
• Experiment
  – Collection: NYT Corpus and 40 temporal queries [Berberich 2010]
• Results
  – TSU significantly outperforms the other methods for most metrics
• Conclusions
  – Although TSU performs best, it is limited to collections with time metadata
  – LMT and LMTU can be applied to collections without time metadata, but require extraction of temporal expressions

Nattiya Kanhabua and Kjetil Nørvåg, A Comparison of Time-aware Ranking Methods (poster), In Proceedings of the 34th Annual ACM SIGIR Conference (SIGIR), 2011.


RQ7: Ranking related news predictions

• Problem statement
  – Can combining time with other features help improve retrieval effectiveness?
• A new task called ranking related news predictions
  – Retrieve predictions related to a news story from news archives
  – Rank them according to their relevance to the news story


Related news predictions


Contributions

• Define the task of ranking related news predictions
  – Searching the future was proposed in [Baeza-Yates 2005]
• Propose four classes of features: term similarity, entity-based similarity, topic similarity, and temporal similarity
• Rank predictions using learning-to-rank [Liu 2009] (see the sketch below)
• Make the dataset, with over 6,000 judgments, publicly available

Nattiya Kanhabua, Roi Blanco and Michael Matthews, Ranking Related News Predictions, In Proceedings of the 34th Annual ACM SIGIR Conference (SIGIR), 2011.
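A minimal linear stand-in for the learning-to-rank step: the feature values and weights below are illustrative placeholders for the four feature classes, and the real system learns its model from the judged dataset rather than using a fixed weighted sum.

```python
def rank_predictions(predictions, weights):
    """Rank candidate predictions for a news story by a weighted sum over the
    four feature classes (term, entity, topic, temporal)."""
    def score(p):
        return sum(weights[name] * p["features"].get(name, 0.0) for name in weights)
    return sorted(predictions, key=score, reverse=True)

# Hypothetical, pre-computed feature values for two candidate predictions
weights = {"term": 0.2, "entity": 0.1, "topic": 0.5, "temporal": 0.2}
predictions = [
    {"text": "Elections expected in 2009",
     "features": {"term": 0.4, "entity": 0.3, "topic": 0.8, "temporal": 0.7}},
    {"text": "Oil prices may fall",
     "features": {"term": 0.5, "entity": 0.1, "topic": 0.2, "temporal": 0.3}},
]
for p in rank_predictions(predictions, weights):
    print(p["text"])
```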


Experiments

• NYT Corpus
  – More than 25% of the documents contain at least one prediction
• Feature analysis
  – Topic features play an important role in ranking
  – The five features with the lowest weights are entity-based features
• Open issues
  – Extract predictions from other sources, e.g., Wikipedia, blogs, comments, etc.
  – Sentiment analysis for future-related information


Conclusions

Solutions to all research questions:

Part I - Content Analysis
  RQ1: How to determine the time of non-timestamped documents?

Part II - Query Analysis
  RQ2: How to determine the time of queries?
  RQ3: How to handle terminology changes over time?
  RQ4: How to predict the effectiveness of temporal queries?
  RQ5: How to predict the most suitable time-aware ranking?

Part III - Retrieval and Ranking Models
  RQ6: How to model time in retrieval and ranking?
  RQ7: How to combine different features and time for ranking?


Publications

• Nattiya Kanhabua and Kjetil Nørvåg, Improving Temporal Language Models for Determining Time of Non-Timestamped Documents, In Proceedings of the European Conference on Research and Advanced Technology for Digital Libraries (ECDL), 2008.

• Nattiya Kanhabua and Kjetil Nørvåg, Using Temporal Language Models for Document Dating, In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2009.

• Nattiya Kanhabua and Kjetil Nørvåg, Determining Time of Queries for Re-ranking Search Results, In Proceedings of the 14th European Conference on Research and Advanced Technology for Digital Libraries (ECDL), 2010.

• Nattiya Kanhabua and Kjetil Nørvåg, Exploiting Time-based Synonyms in Searching Document Archives, In Proceedings of the ACM/IEEE Conference on Digital Libraries (JCDL), 2010.

• Nattiya Kanhabua and Kjetil Nørvåg, QUEST: Query Expansion Using Synonyms over Time, In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2010.

• Nattiya Kanhabua and Kjetil Nørvåg, Time-based Query Performance Predictors (poster), In Proceedings of the 34th Annual ACM SIGIR Conference (SIGIR), 2011.

• Nattiya Kanhabua and Kjetil Nørvåg, A Comparison of Time-aware Ranking Methods (poster), In Proceedings of the 34th Annual ACM SIGIR Conference (SIGIR), 2011.

• Nattiya Kanhabua, Roi Blanco and Michael Matthews, Ranking Related News Predictions, In Proceedings of the 34th Annual ACM SIGIR Conference (SIGIR), 2011.

• Nattiya Kanhabua, Klaus Berberich and Kjetil Nørvåg, Time-aware Ranking Prediction, Technical Report.


References

• [Baeza-Yates 2005] R. A. Baeza-Yates. Searching the future. In Proceedings of the SIGIR workshop on mathematical/formal methods in information retrieval (MF/IR), SIGIR ’05, 2005.

• [Berberich 2010] K. Berberich, S. J. Bedathur, O. Alonso, and G. Weikum. A language modeling approach for temporal information needs. In Proceedings of the 32nd European Conference on IR Research on Advances in Information Retrieval, ECIR ’10, pp. 13-25, 2010.

• [Cronen-Townsend 2002] S. Cronen-Townsend, Y. Zhou, and W. B. Croft. Predicting query performance. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’02, pp. 299-306, 2002.

• [Diaz 2004] F. Diaz and R. Jones. Using temporal profiles of queries for precision prediction. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’04, pp. 18-24, 2004.

• [Hauff 2010] C. Hauff, L. Azzopardi, D. Hiemstra, and F. de Jong. Query performance prediction: Evaluation contrasted with effectiveness. In Proceedings of the 32nd European Conference on IR Research on Advances in Information Retrieval, ECIR ’10, pp. 204-216, April 2010.

• [de Jong 2005] F. de Jong, H. Rode, and D. Hiemstra. Temporal language models for the disclosure of historical text. In Humanities, computers and cultural heritage: Proceedings of the 16th International Conference of the Association for History and Computing, AHC '05, pp. 161-168, 2005.

• [Kleinberg 2003] J. Kleinberg. Bursty and hierarchical structure in streams. Data Min. Knowl. Discov., 7, pp. 373-397, October 2003.

• [Liu 2009] T-Y. Liu. Learning to rank for information retrieval. Found. Trends Inf. Retr., 3(3), pp. 225-331, March 2009.

• [Metzler 2009] D. Metzler, R. Jones, F. Peng, and R. Zhang. Improving search relevance for implicitly temporal queries. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’09, pp. 700-701, 2009.

• [Nunes 2008] S. Nunes, C. Ribeiro, and G. David. Use of temporal expressions in web search. In Proceedings of the 30th European Conference on IR Research on Advances in Information Retrieval, ECIR ’08, pp. 580-584, 2008.

• [Peng 2010] J. Peng, C. Macdonald, and I. Ounis. Learning to select a ranking function. In Proceedings of the 32nd European Conference on IR Research on Advances in Information Retrieval, ECIR ’10, pp. 114-126, 2010.


Thank you
