Heterogeneous Cross Domain Ranking in Latent Space Bo Wang Joint work with Jie Tang, Wei Fan and Songcan Chen


Page 1: Heterogeneous Cross Domain Ranking in Latent Space

Heterogeneous Cross Domain Ranking in Latent Space

Bo Wang

Joint work with Jie Tang, Wei Fan and Songcan Chen

Page 2: Heterogeneous Cross Domain Ranking in Latent Space

Framework of Learning to Rank

Page 3: Heterogeneous Cross Domain Ranking in Latent Space

Example: Academic Network

Page 4: Heterogeneous Cross Domain Ranking in Latent Space

Ranking over Web 2.0

Traditional Web: standard (long) documents; relevance measures such as BM25 and the PageRank score may play a key role

Web 2.0: shorter, non-standard documents; users' click-through data and users' comments might be much more important

Page 5: Heterogeneous Cross Domain Ranking in Latent Space

Heterogeneous transfer ranking

If there isn't sufficient supervision in the domain of interest, how can one borrow labeled information from a related but heterogeneous domain to build an accurate model?

Differences from transfer learning:
What to transfer: instance type
What we care about: feature extraction

Page 6: Heterogeneous Cross Domain Ranking in Latent Space

Main Challenges

How to formalize the problem in a unified framework, given that both the feature distributions and the objects' types in the source domain and the target domain may be different?

How to transfer the knowledge of heterogeneous objects across domains?

How to preserve the preference relationships between instances across heterogeneous data sources?

Page 7: Heterogeneous Cross Domain Ranking in Latent Space

Outline

Motivation
Problem Formulation
Transfer Ranking
  Basic Idea
  The proposed algorithm
  Generalization bound
Experiment
  Ranking on Homogeneous data
  Ranking on Heterogeneous data
Conclusion

Page 8: Heterogeneous Cross Domain Ranking in Latent Space

Problem Formulation

Source domain: instance space and rank level set
Target domain: instance space and rank level set
The two domains are heterogeneous but related

Problem Definition: given the source-domain and target-domain training data, the goal is to learn a ranking function for predicting the rank levels of the test set

Page 9: Heterogeneous Cross Domain Ranking in Latent Space
Page 10: Heterogeneous Cross Domain Ranking in Latent Space

Outline

Motivation
Problem Formulation
Transfer Ranking
  Basic Idea
  The proposed algorithm
  Generalization bound
Experiment
  Ranking on Homogeneous data
  Ranking on Heterogeneous data
Conclusion

Page 11: Heterogeneous Cross Domain Ranking in Latent Space

Basic Idea

Because the feature distributions or even the objects' types may be different across domains, we resort to finding a common latent space in which the preference relationships in the source and target domains are all preserved

We can directly use a ranking loss function to evaluate how well the preferences are preserved in that latent space

We optimize the two ranking loss functions simultaneously in order to find the best latent space

Page 12: Heterogeneous Cross Domain Ranking in Latent Space

The Proposed Algorithm

Given the labeled data in the source domain, we aim to learn a ranking function which satisfies:

The ranking loss function can be defined as:

The latent space can be described by:

The Framework:
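Since the framework's formulas appear only as figures in the original slides, below is a minimal sketch of one plausible shape for the objective: two pairwise hinge losses evaluated in a shared latent space, plus a regularizer. The projection W, score weights w, and trade-off parameters lam and mu are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def pairwise_hinge_loss(X, pairs, W, w):
    """Hinge loss over preference pairs (i preferred to j) after
    projecting the instances into the shared latent space via W."""
    Z = X @ W                      # map instances into the latent space
    scores = Z @ w                 # linear ranking scores in that space
    loss = 0.0
    for i, j in pairs:             # each pair encodes score[i] > score[j]
        loss += max(0.0, 1.0 - (scores[i] - scores[j]))
    return loss

def joint_objective(Xs, pairs_s, Xt, pairs_t, W, w, lam=1.0, mu=0.1):
    """Weighted sum of the source- and target-domain ranking losses
    plus an L2 regularizer, optimized jointly over the latent space."""
    return (pairwise_hinge_loss(Xs, pairs_s, W, w)
            + lam * pairwise_hinge_loss(Xt, pairs_t, W, w)
            + mu * (np.sum(W**2) + np.sum(w**2)))
```

Minimizing this joint objective over W and w is what ties the two heterogeneous domains together: both domains' preference pairs constrain the same latent space.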

Page 13: Heterogeneous Cross Domain Ranking in Latent Space

Ranking SVM
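As background on this baseline, a linear Ranking SVM can be trained by subgradient descent on the pairwise hinge loss. This is a generic textbook-style sketch (the toy data, learning rate, and epoch count are assumptions), not the exact solver used in the paper.

```python
import numpy as np

def ranking_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Train a linear Ranking SVM by subgradient descent on the
    pairwise hinge loss max(0, 1 - w.(x_i - x_j)) over all pairs
    with y_i > y_j, plus an L2 regularizer on w."""
    pairs = [(i, j) for i in range(len(y)) for j in range(len(y)) if y[i] > y[j]]
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = w.copy()                      # from the (1/2)||w||^2 term
        for i, j in pairs:
            d = X[i] - X[j]
            if 1.0 - w @ d > 0:              # hinge active: margin violated
                grad -= C * d
        w -= lr * grad
    return w

# toy use: rank 4 items with graded relevance labels
X = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
y = [1, 2, 0, 0]
w = ranking_svm(X, y)
scores = X @ w                               # higher score = higher rank
```

Note that the learned model scores single instances, but the training constraints are defined only on pairs, which is what makes the loss a ranking loss rather than a classification loss.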

Page 14: Heterogeneous Cross Domain Ranking in Latent Space
Page 15: Heterogeneous Cross Domain Ranking in Latent Space
Page 16: Heterogeneous Cross Domain Ranking in Latent Space
Page 17: Heterogeneous Cross Domain Ranking in Latent Space

Generalization Bound

Page 18: Heterogeneous Cross Domain Ranking in Latent Space

Scalability
Let d be the total number of distinct features in the two domains; then matrix D is d×d and W is d×2, so the algorithm can be applied to very large-scale data as long as there are not too many features.

Complexity
Ranking SVM training has O((n1 + n2)^3) time and O((n1 + n2)^2) space complexity. In our algorithm Tr2SVM, with T the maximal number of iterations, training has O((2T + 1)(n1 + n2)^3) time and O((n1 + n2)^2) space complexity.
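To make the scalability argument concrete, here is a quick back-of-the-envelope check with hypothetical sizes (d, n1 and n2 are made-up numbers, not from the paper):

```python
# Per the slide, with d total features across the two domains,
# D is d x d and W is d x 2, independent of the instance count n1 + n2.
d = 10_000                    # hypothetical total feature count
n1, n2 = 500, 400             # hypothetical instances per domain

mem_D_gb = d * d * 8 / 1e9    # D at 8 bytes per float: 0.8 GB
pairs = (n1 + n2) ** 2        # pairwise terms behind the O((n1+n2)^2) space
```

So memory for D grows with the feature count, not the instance count, while the pairwise training cost grows quadratically (and the SVM solve cubically) in n1 + n2.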

Page 19: Heterogeneous Cross Domain Ranking in Latent Space

Outline

Motivation
Problem Formulation
Transfer Ranking
  Basic Idea
  The proposed algorithm
  Generalization bound
Experiment
  Ranking on Homogeneous data
  Ranking on Heterogeneous data
Conclusion

Page 20: Heterogeneous Cross Domain Ranking in Latent Space

Data Set

LETOR 2.0
Three sub-datasets: TREC2003, TREC2004, and OHSUMED (query-document pair collections)
TREC data: a topic distillation task which aims to find good entry points principally devoted to a given topic
OHSUMED data: a collection of records from medical journals

LETOR_TR
Three sub-datasets: TREC2003_TR, TREC2004_TR, and OHSUMED_TR

Page 21: Heterogeneous Cross Domain Ranking in Latent Space

Data Set (Cont’d)

Page 22: Heterogeneous Cross Domain Ranking in Latent Space

Data Set (Cont’d)

Page 23: Heterogeneous Cross Domain Ranking in Latent Space

Experiment Setting

Baselines:

Measures: MAP (mean average precision) and NDCG (normalized discounted cumulative gain)

Three transfer ranking tasks: from S1 to T1, from S2 to T2, and from S3 to T3
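For reference, the two measures can be computed as follows. This is the standard textbook formulation (binary-relevance AP, and NDCG with the 2^rel - 1 gain), not code from the paper; MAP is then the mean of the per-query average precisions.

```python
import numpy as np

def average_precision(rels):
    """AP for one ranked list of binary relevance judgments (1 = relevant)."""
    rels = np.asarray(rels)
    if rels.sum() == 0:
        return 0.0
    hits = np.cumsum(rels)                         # relevant docs seen so far
    precisions = hits[rels == 1] / (np.flatnonzero(rels) + 1)
    return float(precisions.mean())

def dcg_at_k(rels, k):
    """DCG@k with the 2^rel - 1 gain and log2 position discount."""
    rels = np.asarray(rels, dtype=float)[:k]
    return float(((2.0**rels - 1.0) / np.log2(np.arange(2, rels.size + 2))).sum())

def ndcg_at_k(rels, k):
    """NDCG@k: DCG normalized by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0
```

For example, a list ranked in perfect order of relevance gets NDCG@k = 1, and any misordering lowers the score more the higher up it occurs.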

Page 24: Heterogeneous Cross Domain Ranking in Latent Space
Page 25: Heterogeneous Cross Domain Ranking in Latent Space
Page 26: Heterogeneous Cross Domain Ranking in Latent Space

Why effective?

Why is transfer ranking effective on the LETOR_TR dataset? Because the features used in ranking already contain relevance information between queries and documents.

Page 27: Heterogeneous Cross Domain Ranking in Latent Space

Outline

Motivation
Problem Formulation
Transfer Ranking
  Basic Idea
  The proposed algorithm
  Generalization bound
Experiment
  Ranking on Homogeneous data
  Ranking on Heterogeneous data
Conclusion

Page 28: Heterogeneous Cross Domain Ranking in Latent Space

Data Set

A subset of ArnetMiner: 14,134 authors, 10,716 papers, and 1,434 conferences
The 8 most frequent queries from the log file: 'information extraction', 'machine learning', 'semantic web', 'natural language processing', 'support vector machine', 'planning', 'intelligent agents' and 'ontology alignment'

Author collection: for each query, we gathered authors from Libra, Rexa and ArnetMiner
Conference collection: for each query, we gathered conferences from Libra and ArnetMiner

Evaluation: one faculty member and two graduate students judged the relevance between queries and authors/conferences

Page 29: Heterogeneous Cross Domain Ranking in Latent Space

Feature Definition

All the features are defined between queries and virtual documents

Conference: use all the titles of papers published at a conference to form the conference's "document"

Author: use all the titles of papers authored by an expert as the expert's "document"
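A minimal sketch of how such a virtual document and one query-document feature might be built; the sample titles and the simple term-frequency feature are illustrative assumptions, not the paper's actual feature set.

```python
from collections import Counter

def virtual_document(titles):
    """Concatenate paper titles into one bag-of-words 'document'."""
    return Counter(" ".join(titles).lower().split())

def term_frequency_feature(query, doc):
    """A simple query-document feature: summed frequency of the
    query's words in the virtual document."""
    return sum(doc[w] for w in query.lower().split())

# e.g. a conference represented by the titles of its papers
conf_doc = virtual_document([
    "Learning to Rank for Information Retrieval",
    "Transfer Learning for Ranking",
])
score = term_frequency_feature("learning", conf_doc)  # 'learning' occurs twice
```

Because both authors and conferences are reduced to such word-count "documents", the same query-document features can be computed for both object types, which is what lets the two heterogeneous domains share a feature space.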

Page 30: Heterogeneous Cross Domain Ranking in Latent Space

Feature Definition (Cont’d)

Page 31: Heterogeneous Cross Domain Ranking in Latent Space

Experimental Results

Page 32: Heterogeneous Cross Domain Ranking in Latent Space

Why effective?

Why can our approach be effective on the heterogeneous network? Because of the latent dependencies between the objects, some common features can still be extracted.

Page 33: Heterogeneous Cross Domain Ranking in Latent Space

Conclusion

Page 34: Heterogeneous Cross Domain Ranking in Latent Space

Conclusion (Cont’d)

We formally define the transfer ranking problem and propose a general framework

We provide a preferred solution under the regularized framework by simultaneously minimizing two ranking loss functions in the two domains, and derive the generalization bound

The experimental results on LETOR and a heterogeneous academic network verify the effectiveness of the proposed algorithm

Page 35: Heterogeneous Cross Domain Ranking in Latent Space

Future Work

Develop new algorithms under the framework

Reduce the time complexity for online usage

Negative transfer: measure the similarity between queries and actively select similar queries

Page 36: Heterogeneous Cross Domain Ranking in Latent Space

Thanks!

Your Question. Our Passion.