Neural Networks for Information Retrieval: Semantic Matching (nn4ir.com/ecir2018/slides/03_SemanticMatching.pdf)



37

Outline

Morning program
- Preliminaries
- Semantic matching
- Learning to rank
- Entities

Afternoon program
- Modeling user behavior
- Generating responses
- Recommender systems
- Industry insights
- Q&A


38

Semantic matching

Definition: ”... conduct query/document analysis to represent the meanings of query/document with richer representations and then perform matching with the representations.” - Li et al. [2014]

A promising area within neural IR, due to the success of semantic representations in NLP and computer vision.


39

Outline

Morning program
- Preliminaries
- Semantic matching
  - Using pre-trained unsupervised representations for semantic matching
  - Learning unsupervised representations for semantic matching
  - Learning to match models
  - Learning to match using pseudo relevance
  - Toolkits
- Learning to rank
- Entities

Afternoon program
- Modeling user behavior
- Generating responses
- Recommender systems
- Industry insights
- Q&A


40

Semantic matching
Unsupervised semantic matching with pre-trained representations

Word embeddings have recently gained popularity for their ability to encode semantic and syntactic relations amongst words.

How can we use word embeddings for information retrieval tasks?


41

Semantic matching
Word embedding

Distributional Semantic Model (DSM): a model for associating words with vectors that can capture their meaning. DSMs rely on the distributional hypothesis.

Distributional Hypothesis: words that occur in the same contexts tend to have similar meanings [Harris, 1954].

Statistics on the observed contexts of words in a corpus are quantified to derive word vectors.

- The most common choice of context: the set of words that co-occur within a context window.
- Context-counting vs. context-predicting [Baroni et al., 2014].
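As a small illustration of the context-predicting flavour, the sketch below loads pre-trained word2vec-style vectors with gensim and queries them for distributionally similar words; the embedding file name is a placeholder, not part of the tutorial material.

```python
# Minimal sketch: querying a context-predicting DSM (word2vec-style vectors) with gensim.
# "embeddings.txt" is a placeholder for any model stored in word2vec text format.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("embeddings.txt", binary=False)

# Words that occur in similar contexts end up close in the embedding space.
print(vectors.most_similar("retrieval", topn=5))
print(vectors.similarity("query", "document"))
```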


42

Semantic matching

From word embeddings to query/document embeddings

Creating representations for compound units of text (e.g., documents) from representations of lexical units (e.g., words).


43

Semantic matching
From word embeddings to query/document embeddings

Obtaining representations of compound units of text (in comparison to the atomic words).

Bag of embedded words: sum or average of word vectors (a minimal sketch follows below).

- Averaging the word representations of query terms has been extensively explored in different settings [Vulic and Moens, 2015, Zamani and Croft, 2016b].
- Effective, but only for small units of text, e.g., queries [Mitra, 2015].
- Word embeddings can also be trained directly for the purpose of being averaged [Kenter et al., 2016].
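A minimal sketch of the bag-of-embedded-words idea, assuming `vectors` is any token-to-vector mapping (e.g., a gensim KeyedVectors object); names and tokenization are illustrative.

```python
# Sketch: represent a query or document by the average of its word vectors and
# match in the embedding space. The embedding source and tokenization are assumed.
import numpy as np

def average_embedding(tokens, vectors, dim=300):
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# score = cosine(average_embedding(query_tokens, vectors),
#                average_embedding(doc_tokens, vectors))
```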


44

Semantic matching
From word embeddings to query/document embeddings

- Skip-Thought Vectors
  - Conceptually similar to distributional semantics: a unit's representation is a function of its neighbouring units, except that units are sentences instead of words.
  - Similar to an auto-encoding objective: encode a sentence, but decode the neighbouring sentences.
  - Pair of LSTM-based seq2seq models with a shared encoder.

- Doc2vec (Paragraph2vec) [Le and Mikolov, 2014].

- You’ll hear more about it later in “Learning unsupervised representations from scratch”. (You might also want to take a look at Deep Learning for Semantic Composition.)


45

Semantic matching

Using similarity amongst documents, queries and terms.

Given low-dimensional representations, integrate their similarity signal within IR.


46

Semantic matching
Dual Embedding Space Model (DESM) [Nalisnick et al., 2016]

Word2vec optimizes the IN-OUT dot product, which captures the co-occurrence statistics of words in the training corpus. We can gain by using these two sets of embeddings differently:

- IN-IN and OUT-OUT cosine similarities are high for words that are similar by function or type (typical), while
- IN-OUT cosine similarities are high for words that often co-occur in the same query or document (topical).


47

Semantic matching
Pre-trained word embeddings for document retrieval and ranking

DESM [Nalisnick et al., 2016]: Using IN-OUT similarity to model document aboutness.

- A document is represented by the centroid of its word OUT vectors:

$$\vec{v}_{d,\text{OUT}} = \frac{1}{|d|} \sum_{t_d \in d} \frac{\vec{v}_{t_d,\text{OUT}}}{\|\vec{v}_{t_d,\text{OUT}}\|}$$

- Query-document similarity is the average cosine similarity over query words:

$$\text{DESM}_{\text{IN-OUT}}(q, d) = \frac{1}{|q|} \sum_{t_q \in q} \frac{\vec{v}_{t_q,\text{IN}}^{\top}\,\vec{v}_{d,\text{OUT}}}{\|\vec{v}_{t_q,\text{IN}}\|\;\|\vec{v}_{d,\text{OUT}}\|}$$

- IN-OUT captures a more topical notion of similarity than IN-IN and OUT-OUT.
- DESM is effective at, but only at, ranking documents that are already at least somewhat relevant.
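A rough sketch of IN-OUT scoring under these definitions is shown below; `in_vecs` and `out_vecs` are hypothetical term-to-vector mappings for the two word2vec embedding matrices, which most toolkits can export after training.

```python
# Sketch of DESM IN-OUT scoring: cosine between each query term's IN vector and the
# centroid of the document's normalized OUT vectors, averaged over query terms.
import numpy as np

def doc_centroid_out(doc_terms, out_vecs):
    vecs = [out_vecs[t] / np.linalg.norm(out_vecs[t]) for t in doc_terms if t in out_vecs]
    return np.mean(vecs, axis=0)

def desm_in_out(query_terms, doc_terms, in_vecs, out_vecs):
    d = doc_centroid_out(doc_terms, out_vecs)
    sims = [in_vecs[t] @ d / (np.linalg.norm(in_vecs[t]) * np.linalg.norm(d))
            for t in query_terms if t in in_vecs]
    return float(np.mean(sims)) if sims else 0.0
```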


48

Semantic matching
Pre-trained word embeddings for document retrieval and ranking

- NTLM [Zuccon et al., 2015]: Neural Translation Language Model
  - Translation Language Model: extending query likelihood:

$$p(d|q) \propto p(q|d)\,p(d)$$

$$p(q|d) = \prod_{t_q \in q} p(t_q|d)$$

$$p(t_q|d) = \sum_{t_d \in d} p(t_q|t_d)\,p(t_d|d)$$

  - Uses the similarity between term embeddings as a measure of the term-term translation probability $p(t_q|t_d)$:

$$p(t_q|t_d) = \frac{\cos(\vec{v}_{t_q}, \vec{v}_{t_d})}{\sum_{t \in V} \cos(\vec{v}_t, \vec{v}_{t_d})}$$
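Below is a small sketch of this embedding-based translation probability; `vectors` is a term-to-vector mapping and `vocab` the normalization vocabulary, both placeholders (in practice the normalization is often restricted to a smaller vocabulary for efficiency).

```python
# Sketch of NTLM's term-term translation probability from embedding cosines.
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def translation_prob(t_q, t_d, vectors, vocab):
    denom = sum(cos(vectors[t], vectors[t_d]) for t in vocab if t in vectors)
    return cos(vectors[t_q], vectors[t_d]) / denom

# p(t_q | d) then combines translation probabilities with the document language model:
# p_tq_d = sum(translation_prob(t_q, t_d, vectors, vocab) * p_lm(t_d, doc) for t_d in set(doc))
```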


49

Semantic matching
Pre-trained word embeddings for document retrieval and ranking

GLM [Ganguly et al., 2015]: Generalized Language Model
- Terms in a query are generated by sampling them independently from either the document or the collection.
- The noisy channel may transform (mutate) a term t into a term t'.

$$p(t_q|d) = \lambda\,p(t_q|d) + \alpha \sum_{t_d \in d} p(t_q, t_d|d)\,p(t_d) + \beta \sum_{t' \in N_t} p(t_q, t'|C)\,p(t') + (1 - \lambda - \alpha - \beta)\,p(t_q|C)$$

N_t is the set of nearest neighbours of term t.

$$p(t', t|d) = \frac{\operatorname{sim}(\vec{v}_{t'}, \vec{v}_{t}) \cdot \operatorname{tf}(t', d)}{\sum_{t_1 \in d} \sum_{t_2 \in d} \operatorname{sim}(\vec{v}_{t_1}, \vec{v}_{t_2}) \cdot |d|}$$


50

Semantic matching
Pre-trained word embeddings for query term weighting

Term re-weighting using word embeddings [Zheng and Callan, 2015]: learning to map query terms to query term weights.

- Construct the feature vector $\vec{x}_{t_q}$ for term $t_q$ using its embedding and the embeddings of the other terms in the same query q:

$$\vec{x}_{t_q} = \vec{v}_{t_q} - \frac{1}{|q|} \sum_{t'_q \in q} \vec{v}_{t'_q}$$

- $\vec{x}_{t_q}$ measures the semantic difference of a term to the whole query.
- Learn a model that maps the feature vectors to the defined target term weights.
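A minimal sketch of computing these feature vectors is given below; `vectors` is an assumed term-to-vector mapping, and the regression model that maps features to weights is left out.

```python
# Sketch: per-term feature vector = term embedding minus the query centroid.
import numpy as np

def term_features(query_terms, vectors):
    known = [t for t in query_terms if t in vectors]
    centroid = np.mean([vectors[t] for t in known], axis=0)
    return {t: vectors[t] - centroid for t in known}

# A separate regression model would then map each feature vector to a term weight.
```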


51

Semantic matching
Pre-trained word embeddings for query expansion

- Identify expansion terms using word2vec cosine similarity [Roy et al., 2016].
  - Pre-retrieval: take the nearest neighbours of the query terms as expansion terms (sketched after this list).
  - Post-retrieval: use a set of pseudo-relevant documents to restrict the search domain for the candidate expansion terms.
  - Pre-retrieval incremental: use an iterative process of reordering and pruning terms from the nearest-neighbour list; reorder the terms in decreasing order of similarity with the previously selected term.

- Works better than having no query expansion, but does not beat non-neural query expansion methods.
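The pre-retrieval variant can be sketched as below; `vectors` is a gensim KeyedVectors object and the weighting of the expansion terms is omitted.

```python
# Sketch of pre-retrieval expansion: add the word2vec nearest neighbours of each query term.
def expand_query(query_terms, vectors, k=5):
    expansion = set()
    for t in query_terms:
        if t in vectors:
            expansion.update(w for w, _ in vectors.most_similar(t, topn=k))
    return list(query_terms) + sorted(expansion - set(query_terms))

# expanded = expand_query(["neural", "retrieval"], vectors)
```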


52

Semantic matching
Pre-trained word embeddings for query expansion

- Embedding-based Query Expansion [Zamani and Croft, 2016a]. Main goal: estimating a better language model for the query using embeddings.
- Embedding-based Relevance Model. Main goal: semantic similarity in addition to term matching for PRF.


53

Semantic matching
Pre-trained word embeddings for query expansion

Query expansion with locally-trained word embeddings [Diaz et al., 2016].

- Main idea: embeddings should be learned on topically-constrained corpora, instead of large topically-unconstrained corpora.
- Train word2vec on documents from a first round of retrieval.
- Provides fine-grained word sense disambiguation.
- A large number of embedding spaces can be cached in practice.


54

Outline

Morning program
- Preliminaries
- Semantic matching
  - Using pre-trained unsupervised representations for semantic matching
  - Learning unsupervised representations for semantic matching
  - Learning to match models
  - Learning to match using pseudo relevance
  - Toolkits
- Learning to rank
- Entities

Afternoon program
- Modeling user behavior
- Generating responses
- Recommender systems
- Industry insights
- Q&A


55

Semantic matching
Learning unsupervised representations for semantic matching

Pre-trained word embeddings can be used to obtain

- a query/document representation through compositionality, or
- a similarity signal to integrate within IR frameworks.

Can we learn unsupervised query/document representations directly for IR tasks?


56

Semantic matching
LSI, pLSI and LDA

History of latent document representations: latent representations of documents that are learned from scratch have been around since the early 1990s.

- Latent Semantic Indexing [Deerwester et al., 1990],
- Probabilistic Latent Semantic Indexing [Hofmann, 1999], and
- Latent Dirichlet Allocation [Blei et al., 2003].

These representations provide a semantic matching signal that is complementary to a lexical matching signal.


57

Semantic matching
Semantic Hashing

Salakhutdinov and Hinton [2009] propose Semantic Hashing for document similarity.

- Auto-encoder trained on frequency vectors.
- Documents are mapped to memory addresses in such a way that semantically similar documents are located at nearby bit addresses.
- Documents similar to a query document can then be found by accessing addresses that differ by only a few bits from the query document's address.

Schematic representation of Semantic Hashing. Taken from Salakhutdinov and Hinton [2009].


58

Semantic matching
Distributed Representations of Documents [Le and Mikolov, 2014]

- Learn document representations based on the words contained within each document.
- Reported to work well on a document similarity task.
- Attempts have been made to integrate the learned representations into standard retrieval models [Ai et al., 2016a,b].

Overview of the Distributed Memory document vector model. Taken from Le and Mikolov [2014].
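As a concrete illustration, the sketch below trains a doc2vec model with gensim on a toy corpus and infers a vector for an unseen query; the corpus and hyperparameters are placeholders (in older gensim versions the document vectors live under `model.docvecs` instead of `model.dv`).

```python
# Sketch: doc2vec (Distributed Memory) with gensim, plus query inference.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words=["neural", "networks", "for", "ir"], tags=["d1"]),
    TaggedDocument(words=["semantic", "matching", "of", "queries"], tags=["d2"]),
]
model = Doc2Vec(corpus, vector_size=100, window=5, min_count=1, dm=1, epochs=40)

# Infer a vector for an unseen "query" and rank documents by similarity to it.
query_vec = model.infer_vector(["semantic", "matching"])
print(model.dv.most_similar([query_vec], topn=2))
```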


59

Semantic matching
Two Doc2Vec Architectures [Le and Mikolov, 2014]

Overview of the Distributed Memory document vector model. Taken from Le and Mikolov [2014].

Overview of the Distributed Bag of Words document vector model. Taken from Le and Mikolov [2014].


60

Semantic matching
Neural Vector Spaces for Unsupervised IR [Van Gysel et al., 2018]

- Learns query (term) and document representations directly from the document collection.
- Outperforms existing latent vector space models and provides a semantic matching signal complementary to lexical retrieval models.
- Learns a notion of term specificity.
- Luhn significance: mid-frequency words are more important for retrieval than infrequent and frequent words.

Relation between the L2-norm of a query term representation within NVSM and its collection frequency. Taken from [Van Gysel et al., 2018].


61

Outline

Morning program
- Preliminaries
- Semantic matching
  - Using pre-trained unsupervised representations for semantic matching
  - Learning unsupervised representations for semantic matching
  - Learning to match models
  - Learning to match using pseudo relevance
  - Toolkits
- Learning to rank
- Entities

Afternoon program
- Modeling user behavior
- Generating responses
- Recommender systems
- Industry insights
- Q&A


62

Semantic matching
Text matching as a supervised objective

Text matching is often formulated as a supervised objective where pairs of relevant or paraphrased texts are given.

In the next few slides, we’ll go over different architectures introduced for supervised text matching. Note that this is a mix of models originally introduced for (i) relevance ranking, (ii) paraphrase identification, and (iii) question answering, among others.


63

Semantic matching

Representation-based models

Representation-based models construct a fixed-dimensional vector representation for each text separately and then perform matching within the latent space.


64

Semantic matching
(C)DSSM [Huang et al., 2013, Shen et al., 2014]

- Siamese network between query and document, operating on character trigrams (the trigram hashing step is sketched below).
- Originally introduced for learning from implicit feedback.
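The character-trigram ("word hashing") input layer can be sketched as follows; the trigram vocabulary construction shown here is purely illustrative, and the neural towers on top are omitted.

```python
# Sketch of DSSM-style word hashing: map tokens to bag-of-character-trigram vectors.
from collections import Counter

def char_trigrams(token):
    padded = f"#{token}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def build_trigram_index(vocab_tokens):
    tris = sorted({tri for tok in vocab_tokens for tri in char_trigrams(tok)})
    return {tri: i for i, tri in enumerate(tris)}

def hash_text(tokens, trigram_index):
    counts = Counter(tri for tok in tokens for tri in char_trigrams(tok))
    vec = [0] * len(trigram_index)
    for tri, c in counts.items():
        if tri in trigram_index:
            vec[trigram_index[tri]] = c
    return vec  # sparse input to the query/document towers of the Siamese network
```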


65

Semantic matching
ARC-I [Hu et al., 2014]

- Similar to DSSM; performs 1D convolutions on the text representations separately.
- Originally introduced for a paraphrasing task.


66

Semantic matching

Interaction-based models

Interaction-based models compute the interaction between each individual term of both texts. An interaction can be identity or syntactic/semantic similarity.

The interaction matrix is subsequently summarized into a matching score.


67

Semantic matching
DRMM [Guo et al., 2016]

- Compute term/document interactions and matching histograms using different strategies (count, relative count, log-count); a histogram sketch follows below.
- Pass the histograms through a feed-forward network for every query term.
- A gating network produces an attention weight for every query term; the per-term scores are then aggregated into a relevance score using the attention weights.
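A rough sketch of the log-count matching histogram for a single query term is shown below; the number of bins and the similarity range are illustrative choices rather than the paper's exact configuration.

```python
# Sketch of a DRMM-style log-count matching histogram for one query term.
import numpy as np

def matching_histogram(q_vec, doc_vecs, n_bins=30):
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    q = q_vec / np.linalg.norm(q_vec)
    sims = d @ q                                   # cosine similarities in [-1, 1]
    counts, _ = np.histogram(sims, bins=n_bins, range=(-1.0, 1.0))
    return np.log1p(counts)                        # log-count variant

# One histogram per query term is fed into the per-term feed-forward network.
```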


68

Semantic matching
MatchPyramid [Pang et al., 2016]

- Interaction matrix between query/document terms, followed by convolutional layers.
- After the convolutions, feed-forward layers determine the matching score.
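The input to the convolutional stack can be sketched as a cosine-interaction matrix between the two term sequences; `q_vecs` and `d_vecs` are assumed embedding matrices and the CNN itself is omitted.

```python
# Sketch: MatchPyramid-style interaction "image" built from term embeddings.
import numpy as np

def interaction_matrix(q_vecs, d_vecs):
    q = q_vecs / np.linalg.norm(q_vecs, axis=1, keepdims=True)
    d = d_vecs / np.linalg.norm(d_vecs, axis=1, keepdims=True)
    return q @ d.T  # |q| x |d| matrix treated like an image by the conv layers
```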


69

Semantic matching
aNMM [Yang et al., 2016]

- Compute the word interaction matrix.
- Aggregate similarities by running multiple kernels.
- Every kernel assigns a different weight to a particular similarity range.
- Similarities are aggregated into the kernel output by weighting them according to which bin they fall in.


70

Semantic matching
Match-SRNN [Wan et al., 2016b]

- Word interaction layer, followed by a spatial recurrent NN.
- The RNN hidden state is updated using the current interaction coefficient and the hidden state of the prefix.


71

Semantic matching
K-NRM [Xiong et al., 2017b]

- Compute the word-interaction matrix and apply k kernels to every query term row in the interaction matrix (kernel pooling is sketched below).
- This results in a k-dimensional vector per query term.
- Aggregate the query term vectors into a fixed-dimensional query representation.
- Later extended to convolutional networks [Dai et al., 2018] (hybrid).
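Kernel pooling itself is compact enough to sketch; the kernel means and width below are illustrative rather than the paper's exact settings, and log1p is used only to avoid log(0).

```python
# Sketch of K-NRM kernel pooling over a |q| x |d| cosine-similarity matrix.
import numpy as np

def kernel_pooling(sim_matrix, mus=np.linspace(-0.9, 1.0, 11), sigma=0.1):
    feats = []
    for row in sim_matrix:                                            # one query term
        k = np.exp(-((row[:, None] - mus) ** 2) / (2 * sigma ** 2))   # |d| x K kernels
        feats.append(np.log1p(k.sum(axis=0)))                         # soft-TF per kernel
    return np.sum(feats, axis=0)                                      # K-dim query feature

# A learned linear layer maps this K-dimensional vector to the final ranking score.
```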


72

Semantic matching

Hybrid models

Hybrid models consist of (i) a representation component that combines a sequence of words (e.g., a whole text, a window of words) into a fixed-dimensional representation, and (ii) an interaction component.

These two components can occur (1) in serial or (2) in parallel.


73

Semantic matching
ARC-II [Hu et al., 2014]

- Cascade approach where word representations are generated from context.
- Interaction matrix between sliding windows, where the interaction activation is computed using a non-linear mapping.
- Originally introduced for a paraphrasing task.


74

Semantic matching
MV-LSTM [Wan et al., 2016a]

- Cascade approach where the input representations for the interaction matrix are generated using a bi-directional LSTM.
- Differs from pure interaction-based approaches in that the LSTM builds a representation of the context, rather than using the representation of a word.
- Obtains a fixed-dimensional representation by max-pooling over query/document, followed by a feed-forward network.


75

Semantic matching
Duet [Mitra et al., 2017]

- The model has an interaction-based and a representation-based component.
- The interaction-based component consists of an indicator matrix showing where query terms occur in the document, followed by convolution layers.
- The representation-based component is similar to DSSM/ARC-I, but uses a feed-forward network to compute the similarity signal rather than cosine similarity.
- Both are combined at the end using a linear combination of the scores.


76

Semantic matching
DeepRank [Pang et al., 2017]

- Focuses only on exact term occurrences in the document.
- Computes the interaction between the query and a window surrounding each term occurrence.
- An RNN or CNN then combines the per-window features (query representation, context representations and the interaction between query/document terms) into a matching score.


77

Outline

Morning program
- Preliminaries
- Semantic matching
  - Using pre-trained unsupervised representations for semantic matching
  - Learning unsupervised representations for semantic matching
  - Learning to match models
  - Learning to match using pseudo relevance
  - Toolkits
- Learning to rank
- Entities

Afternoon program
- Modeling user behavior
- Generating responses
- Recommender systems
- Industry insights
- Q&A


78

Semantic matching
Beyond supervised signals: semi-supervised learning

The architectures we presented for learning to match all require labels. Typically, these labels are obtained from domain experts.

However, in information retrieval there is the concept of pseudo relevance, which gives us a supervision signal derived from unlabeled data collections.


79

Semantic matching
Pseudo test/training collections

Given a source of pseudo relevance, we can build pseudo collections for training retrieval models [Asadi et al., 2011, Berendsen et al., 2013].

Sources of pseudo relevance: typically given by external knowledge about the retrieval domain, such as hyperlinks, query logs, social tags, ...


80

Semantic matching
Training neural networks using pseudo relevance

Training a neural ranker using weak supervision [Dehghani et al., 2017].

Main idea: annotate a large amount of unlabeled data using a weak annotator (pseudo-labeling) and design a model that can be trained on this weak supervision signal.

- Function approximation (re-inventing BM25?).
- Beating BM25 using BM25!
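A minimal sketch of the weak-annotation step is shown below, using BM25 scores from the rank_bm25 package as pseudo-labels; the corpus, query, and cutoff are placeholders, and the neural ranker trained on these targets is omitted.

```python
# Sketch: generate weak supervision targets for a neural ranker from BM25.
from rank_bm25 import BM25Okapi

corpus = ["neural networks for ir", "semantic matching of queries"]  # placeholder docs
bm25 = BM25Okapi([doc.split() for doc in corpus])

def weak_labels(query, top_k=10):
    scores = bm25.get_scores(query.split())
    # (doc_id, bm25_score) pairs act as pseudo-labels / target scores for training.
    return sorted(enumerate(scores), key=lambda x: -x[1])[:top_k]

print(weak_labels("semantic matching"))
```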


81

Semantic matching
Training neural networks using pseudo relevance

Generating weak supervision training data for training a neural IR model [MacAvaney et al., 2017].

- Use a news corpus with article headlines acting as pseudo-queries and article content as pseudo-documents.
- Problems:
  - Hard negatives.
  - Mismatched interactions (example: “When Bird Flies In”, a sports article about basketball player Larry Bird).
- Solutions:
  - Ranking filter: top-ranked pseudo-documents are considered as negative samples; only pseudo-queries that are able to retrieve their pseudo-relevant documents are used as positive samples.
  - Interaction filter: build interaction embeddings for each pair and filter out pairs based on their similarity to the template query-document pairs.


82

Semantic matching
Query expansion using neural word embeddings based on pseudo relevance

Locally trained word embeddings [Diaz et al., 2016]

- Perform topic-specific training on a set of topic-specific documents that are collected based on their relevance to a query.

Relevance-based Word Embedding [Zamani and Croft, 2017].

- Relevance is not necessarily the same as semantic or syntactic similarity:
  - e.g., “united state” as expansion terms for “Indian American museum”.
- Main idea: defining the “context”. Use the relevance model distribution for the given query to define the context, so the objective is to predict the words observed in the documents relevant to a particular information need.
- The neural network is constrained by the weights given by RM3 when learning the word embeddings.


83

Outline

Morning program
- Preliminaries
- Semantic matching
  - Using pre-trained unsupervised representations for semantic matching
  - Learning unsupervised representations for semantic matching
  - Learning to match models
  - Learning to match using pseudo relevance
  - Toolkits
- Learning to rank
- Entities

Afternoon program
- Modeling user behavior
- Generating responses
- Recommender systems
- Industry insights
- Q&A


84

Semantic matching
Document & entity representation learning toolkits

gensim : https://github.com/RaRe-Technologies/gensim [Rehurek and Sojka, 2010]

SERT : http://www.github.com/cvangysel/SERT [Van Gysel et al., 2017a]

cuNVSM : http://www.github.com/cvangysel/cuNVSM [Van Gysel et al., 2018]

HEM : https://ciir.cs.umass.edu/downloads/HEM [Ai et al., 2017]

MatchZoo : https://github.com/faneshion/MatchZoo [Fan et al., 2017]

K-NRM : https://github.com/AdeDZY/K-NRM [Xiong et al., 2017b]