Combining Word Semantics within Complex Hilbert Space for Information Retrieval


Upload: peter-wittek

Post on 11-May-2015


DESCRIPTION

Complex numbers are a fundamental aspect of the mathematical formalism of quantum physics. Quantum-like models developed outside physics have often overlooked the role of complex numbers. Specifically, previous models in Information Retrieval (IR) ignored complex numbers. We argue that to advance the use of quantum models of IR, one has to lift the constraint of real-valued representations of the information space and package more information within the representation by means of complex numbers. As a first attempt, we propose a complex-valued representation for IR that explicitly uses complex-valued Hilbert spaces, in which terms, documents, and queries are represented as complex-valued vectors. The proposal integrates distributional semantic evidence within the real component of a term vector, whereas ontological information is encoded in the imaginary component. Our proposal has the merit of lifting the role of complex numbers from a computational byproduct of the model to the very mathematical texture that unifies different levels of semantic information. An empirical instantiation of our proposal is tested in the TREC Medical Records Track task of retrieving cohorts for clinical studies.

TRANSCRIPT

Combining Word Semantics within Complex Hilbert Space for Information Retrieval

Peter Wittek, Bevan Koopman, Guido Zuccon, and Sándor Darányi

KEY POINTS

• We combine distributional and conceptual representations;

• Two random indices merge to create a complex space;

• Retrieval effectiveness improves;

• The phase of the complex vector carries additional information.

MOTIVATION

Combining term frequency and inverse document frequency in a complex space provided poor retrieval effectiveness, and the representational power of this approach was questioned [1].

In concept-based indexing, documents are represented by concepts rather than by terms, as in traditional term-based representations. This approach is highly effective in medical information retrieval [2].

We encode both distributional information and higher-level conceptual information in the components of the complex numbers. An L2 space-based representation embedded the two kinds of semantics by a seriation of the feature space [3]. Seriation is slow, whereas the only preprocessing our approach needs is mapping the terms to concepts.

TOY EXAMPLE

Consider the following documents and query:

• Document D1 = “kidney stones”

• Document D2 = “kidney”

• Document D3 = “renal calculi”

• Query Q = “kidney stones”

A term-based retrieval system processing Q would return only the documents D1 and D2 (in this order). However, renal calculi in D3 is a synonym of the query kidney stones, while D2 is not relevant. Therefore, D3 should be ranked higher than D2.

Using our method, when a concept representation is included, the phrases kidney stones and renal calculi both map to the same SNOMED CT concept, 155868000. Thus, our ranking approach would retrieve D1 in the first rank position because of the contribution of both term and concept weighting, and D3 above D2 because of the inclusion of the concept weighting.
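The ranking behaviour in the toy example can be reproduced with a few lines of Python. This is only an illustrative sketch: the scoring functions (fraction of query terms matched, plus 1 for a shared concept) and the `PHRASE_TO_CONCEPT` lookup are assumptions made for the example and are not the weighting used in our experiments. Only the concept identifier 155868000 comes from the example itself.

```python
# Toy corpus and query from the example above.
docs = {"D1": "kidney stones", "D2": "kidney", "D3": "renal calculi"}
query = "kidney stones"

# Hypothetical phrase-to-concept lookup standing in for a real concept mapper.
PHRASE_TO_CONCEPT = {"kidney stones": "155868000", "renal calculi": "155868000"}


def term_score(q, d):
    """Fraction of distinct query terms that also occur in the document."""
    q_terms = set(q.split())
    return len(q_terms & set(d.split())) / len(q_terms)


def concept_score(q, d):
    """1.0 if query and document map to the same concept, otherwise 0.0."""
    cq, cd = PHRASE_TO_CONCEPT.get(q), PHRASE_TO_CONCEPT.get(d)
    return 1.0 if cq is not None and cq == cd else 0.0


def rank(score):
    """Return the IDs of documents with a non-zero score, best first."""
    scored = [(score(query, text), doc_id) for doc_id, text in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in scored if s > 0]


term_ranking = rank(term_score)
combined_ranking = rank(lambda q, d: term_score(q, d) + concept_score(q, d))
print(term_ranking)      # term-only ranking
print(combined_ranking)  # ranking with term and concept evidence
```

With these assumed scores, the term-only ranking contains D1 and D2, while the combined ranking places D1 first, then D3, then D2.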

CONSTRUCTING THE SPACE

(Figure: pipeline diagram with the labels Terms, MetaMap, Concepts, Random Index (Terms), and Random Index (Concepts).)

To construct the complex space, we rely on SemanticVectors to generate two random indices: one based on the term representation and one based on a concept representation. The two random indices are merged into one complex space.
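A minimal NumPy sketch of the construction follows. It does not use SemanticVectors: `random_index` is a simplified, hypothetical stand-in for random indexing with sparse ternary elemental vectors, and the dimension, sparsity, and seeds are arbitrary choices. Placing the term-based vectors in the real part and the concept-based vectors in the imaginary part follows the description of the proposal; the concept label `C_KIDNEY` is a made-up placeholder.

```python
import numpy as np


def random_index(doc_tokens, dim=64, nonzeros=4, seed=0):
    """Document vectors as sums of sparse ternary random vectors, one per token."""
    rng = np.random.default_rng(seed)
    vocab = sorted({tok for toks in doc_tokens for tok in toks})
    elemental = {}
    for tok in vocab:
        vec = np.zeros(dim)
        positions = rng.choice(dim, size=nonzeros, replace=False)
        vec[positions] = rng.choice([-1.0, 1.0], size=nonzeros)
        elemental[tok] = vec
    doc_vectors = np.zeros((len(doc_tokens), dim))
    for i, toks in enumerate(doc_tokens):
        for tok in toks:
            doc_vectors[i] += elemental[tok]
    return doc_vectors


def merge_complex(term_vectors, concept_vectors):
    """Real part: term-based index. Imaginary part: concept-based index."""
    return term_vectors + 1j * concept_vectors


# Toy documents as term lists and as concept lists.
term_docs = [["kidney", "stones"], ["kidney"], ["renal", "calculi"]]
concept_docs = [["155868000"], ["C_KIDNEY"], ["155868000"]]  # C_KIDNEY is hypothetical

term_index = random_index(term_docs, seed=1)
concept_index = random_index(concept_docs, seed=2)
complex_docs = merge_complex(term_index, concept_index)
print(complex_docs.shape, complex_docs.dtype)
```

In this sketch, documents that share a concept have identical imaginary parts even when their real parts differ, which is how the concept evidence can bring D1 and D3 together.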

RESULTS

11-point precision-recall graph (figure: Precision, 0 to 0.5, plotted against Recall, 0 to 1, for the Term, Concept, and Complex runs)
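For readers unfamiliar with the 11-point graph, the sketch below shows the standard interpolation: at each recall level 0.0, 0.1, ..., 1.0, the interpolated precision is the highest precision observed at any recall greater than or equal to that level. The function name and the sample relevance list are illustrative assumptions, not part of our evaluation code.

```python
def eleven_point_precision(ranked_relevance, total_relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0.

    ranked_relevance: list of booleans, True where the document at that
    rank is relevant. total_relevant: number of relevant documents overall.
    """
    points = []  # (recall, precision) measured after each rank
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / total_relevant, hits / rank))
    levels = [i / 10 for i in range(11)]
    return [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]


# Hypothetical ranking: relevant documents at ranks 1 and 3, two relevant in total.
curve = eleven_point_precision([True, False, True, False], total_relevant=2)
print(curve)
```

Averaging such curves over all queries gives one line of the graph per retrieval method.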

MAP and P@10 for term retrieval, concept retrieval, and our method

Measure   Term-based   Concept-based   Complex
MAP       0.0886       0.1084          0.1245
P@10      0.1593       0.1963          0.2235
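The relative gains implied by the MAP and P@10 figures can be worked out directly. The snippet below only does arithmetic on the reported values; the variable and function names are illustrative.

```python
# MAP and P@10 values as reported for the three methods.
results = {
    "MAP": {"term": 0.0886, "concept": 0.1084, "complex": 0.1245},
    "P@10": {"term": 0.1593, "concept": 0.1963, "complex": 0.2235},
}


def relative_gain(new, baseline):
    """Percentage improvement of `new` over `baseline`."""
    return 100.0 * (new - baseline) / baseline


gains = {
    measure: {
        baseline: relative_gain(scores["complex"], scores[baseline])
        for baseline in ("term", "concept")
    }
    for measure, scores in results.items()
}

for measure, by_baseline in gains.items():
    for baseline, gain in by_baseline.items():
        print(f"{measure}: complex vs {baseline}: +{gain:.1f}%")
```

On both measures, the complex representation improves on the term-based run by roughly 40% and on the concept-based run by roughly 14 to 15%.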

Distribution of angles between real and imaginary components of complex vectors representing documents from the TREC Medical Records Track collection. Angles were expressed in radians and were normalised in the range [0, 2π[. (Histogram: Frequency, 100 to 700, against Angle, 1.60 to 1.70, with the 90-degree position marked.)

Mean value and extrema of phase

Mean absolute phase   Lowest Phase   Highest Phase
6.9520                -39.9163       28.6791
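One way to read "angle between real and imaginary components" is as the angle between the real-part vector and the imaginary-part vector of a document. The sketch below computes that angle under this assumption; it is not necessarily the exact procedure behind the reported distribution, and the random test vector is hypothetical.

```python
import numpy as np


def real_imag_angle(z):
    """Angle in radians between the real and imaginary parts of a complex vector."""
    re, im = z.real, z.imag
    cosine = np.dot(re, im) / (np.linalg.norm(re) * np.linalg.norm(im))
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.arccos(np.clip(cosine, -1.0, 1.0)))


# Exactly orthogonal real and imaginary parts give pi/2 (90 degrees).
orthogonal = real_imag_angle(np.array([1 + 0j, 0 + 1j]))

# Hypothetical high-dimensional vector with independent random parts.
rng = np.random.default_rng(0)
z = rng.standard_normal(2000) + 1j * rng.standard_normal(2000)
angle = real_imag_angle(z)
print(orthogonal, angle)
```

Independently generated high-dimensional random vectors are nearly orthogonal, which may be why the reported angles lie close to 90 degrees.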

REFERENCES

[1] Zuccon, G., Piwowarski, B., Azzopardi, L.: On the use of complex numbers in quantum models for information retrieval. In: Proceedings of ICTIR-11, 3rd International Conference on the Theory of Information Retrieval, Bertinoro, Italy (September 2011)

[2] Koopman, B., Zuccon, G., Bruza, P., Sitbon, L., Lawley, M.: Graph-based concept weighting for medical information retrieval. In: Proceedings of ADCS-12, 17th Australasian Document Computing Symposium, Dunedin, New Zealand (December 2012) 80–87

[3] Wittek, P., Tan, C.L.: Compactly supported basis functions as support vector kernels for classification. Transactions on Pattern Analysis and Machine Intelligence 33(10) (2011) 2039–2050

MORE INFORMATION

Source code and steps to reproduce the results are available here: http://peterwittek.com/2013/07/merging-in-complex-space/