Combining Word Semantics within Complex Hilbert Space for Information Retrieval
DESCRIPTION
Complex numbers are a fundamental aspect of the mathematical formalism of quantum physics. Quantum-like models developed outside physics often overlooked the role of complex numbers. Specifically, previous models in Information Retrieval (IR) ignored complex numbers. We argue that to advance the use of quantum models of IR, one has to lift the constraint of real-valued representations of the information space, and package more information within the representation by means of complex numbers. As a first attempt, we propose a complex-valued representation for IR, which explicitly uses complex-valued Hilbert spaces, and in which terms, documents and queries are represented as complex-valued vectors. The proposal consists of integrating distributional semantics evidence within the real component of a term vector, whereas ontological information is encoded in the imaginary component. Our proposal has the merit of lifting the role of complex numbers from a computational byproduct of the model to the very mathematical texture that unifies different levels of semantic information. An empirical instantiation of our proposal is tested in the TREC Medical Record task of retrieving cohorts for clinical studies.
TRANSCRIPT
Combining Word Semantics within Complex Hilbert Space for Information Retrieval
Peter Wittek, Bevan Koopman, Guido Zuccon, and Sándor Darányi
KEY POINTS
• We combine distributional and conceptual representation;
• Two random indices merge to create a complex space;
• Effectiveness of information retrieval improves;
• The phase of the complex vector carries additional information.
MOTIVATION
Combining term frequency and inverse document frequency in a complex space provided poor retrieval effectiveness, and the power of the representation was questioned [1].
In concept-based indexing, documents are represented by concepts rather than by terms, as in traditional term-based representations. It is highly efficient in medical information retrieval [2].
We encode both distributional information and higher-level conceptual information in the components of the complex numbers. An L2 space-based representation embedded the two kinds of semantics by a seriation of the feature space [3]. Seriation is slow; the only preprocessing our approach needs is mapping the terms to concepts.
TOY EXAMPLE
Consider the following documents and query:
• Document D1 = “kidney stones”
• Document D2 = “kidney”
• Document D3 = “renal calculi”
• Query Q = “kidney stones”
A term-based retrieval system processing Q would return only the documents D1, D2 (in this order). However, “renal calculi” in D3 is actually a synonym of the query “kidney stones”, while D2 is actually not relevant. Therefore, D3 should be ranked higher than D2.
Using our method, when a concept representation is included, the phrases “kidney stones” and “renal calculi” both map to the same SNOMED CT concept 155868000. Thus, our ranking approach would retrieve D1 in the first rank position because of the contribution of both term and concept weighting; and D3 above D2 because of the inclusion of the concept weighting.
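The ranking D1, D3, D2 can be reproduced with a minimal sketch in which the real part of each vector holds term evidence and the imaginary part holds concept evidence. Several choices here are assumptions made for illustration and are not taken from the poster: one-hot vectors stand in for random indices, each part is normalised separately, the score is the real part of the Hermitian inner product, and the concept identifier for “kidney” alone is a hypothetical placeholder.

```python
import math

TERMS = ["kidney", "stones", "renal", "calculi"]
# "155868000" comes from the toy example; the second id is a hypothetical placeholder.
CONCEPTS = ["155868000", "KIDNEY_PLACEHOLDER"]
DIM = 4  # both parts must share one dimensionality to form complex numbers


def unit(v):
    """Scale a real vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def encode(terms, concepts):
    """Real part: term evidence. Imaginary part: concept evidence."""
    real = [0.0] * DIM
    imag = [0.0] * DIM
    for t in terms:
        real[TERMS.index(t)] += 1.0
    for c in concepts:
        imag[CONCEPTS.index(c)] += 1.0
    return [complex(r, i) for r, i in zip(unit(real), unit(imag))]


def score(q, d):
    """Real part of the Hermitian inner product: term match plus concept match."""
    return sum((a.conjugate() * b).real for a, b in zip(q, d))


docs = {
    "D1": encode(["kidney", "stones"], ["155868000"]),
    "D2": encode(["kidney"], ["KIDNEY_PLACEHOLDER"]),
    "D3": encode(["renal", "calculi"], ["155868000"]),
}
query = encode(["kidney", "stones"], ["155868000"])

ranking = sorted(docs, key=lambda name: score(query, docs[name]), reverse=True)
print(ranking)
```

Under these assumptions D1 scores on both parts, D3 scores on the concept part only, and D2 scores on a partial term match only, which yields the order described above.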
CONSTRUCTING THE SPACE
[Diagram: Terms are mapped to Concepts by MetaMap; one Random Index is built from Terms and one from Concepts.]
To construct the complex space, we rely on SemanticVectors to generate two random indices: one based on the term representation, one based on a concept representation. The two random indices are merged into one complex space.
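The merging step can be sketched as follows. This is a simplified illustration in plain Python and does not use the SemanticVectors API: the dimensionality, the number of non-zero entries, the seeds, and all function names are hypothetical, and a document vector is taken to be the plain sum of the index vectors of its items.

```python
import random

DIM = 16      # hypothetical; practical random indices use far more dimensions
NONZERO = 4   # assumed number of +1/-1 entries per index vector


def index_vector(rng):
    """Sparse ternary vector: NONZERO random positions set to +1 or -1."""
    v = [0.0] * DIM
    for pos in rng.sample(range(DIM), NONZERO):
        v[pos] = rng.choice([-1.0, 1.0])
    return v


def build_index(items, seed):
    """Assign one random index vector to every term (or concept)."""
    rng = random.Random(seed)
    return {item: index_vector(rng) for item in sorted(items)}


def doc_vector(items, index):
    """Sum the index vectors of the items occurring in a document."""
    v = [0.0] * DIM
    for item in items:
        for k, x in enumerate(index[item]):
            v[k] += x
    return v


def merge(term_vec, concept_vec):
    """Term evidence becomes the real part, concept evidence the imaginary part."""
    return [complex(t, c) for t, c in zip(term_vec, concept_vec)]


term_index = build_index({"kidney", "stones", "renal", "calculi"}, seed=1)
concept_index = build_index({"155868000"}, seed=2)

d3 = merge(doc_vector(["renal", "calculi"], term_index),
           doc_vector(["155868000"], concept_index))
```

Because both indices share the same dimensionality, the merge is a component-wise pairing and needs no further alignment of the two feature spaces.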
RESULTS
[Figure: 11-point precision-recall graph. x-axis: Recall (0 to 1); y-axis: Precision (0 to 0.5); series: Term, Concept, Complex.]
MAP and P@10 for term retrieval, concept retrieval, and our method
Measure   Term-based   Concept-based   Complex
MAP       0.0886       0.1084          0.1245
P@10      0.1593       0.1963          0.2235
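For readers unfamiliar with the two measures, the following sketch shows the standard definitions of precision at rank k and average precision for a single query; MAP is the mean of average precision over all queries. The function names and the small example ranking are illustrative only and are not taken from the evaluation behind the figures above.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


relevant = {"D1", "D3"}
term_only = ["D1", "D2", "D3"]      # illustrative ranking
with_concepts = ["D1", "D3", "D2"]  # illustrative ranking

ap_term = average_precision(term_only, relevant)
ap_concepts = average_precision(with_concepts, relevant)
p2_term = precision_at_k(term_only, relevant, 2)
p2_concepts = precision_at_k(with_concepts, relevant, 2)
```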
[Figure: histogram. x-axis: Angle (ticks at 1.60, 1.65, 1.70; 90 degrees marked); y-axis: Frequency (100 to 700).]
Distribution of angles between real and imaginary components of complex vectors representing documents from the TREC Medical Records Track collection. Angles were expressed in radians and were normalised in the range [0, 2π).
Mean value and extrema of phase
Mean absolute phase   Lowest Phase   Highest Phase
6.9520                -39.9163       28.6791
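One plausible way to obtain the angle between the real and imaginary components of a complex vector is to treat the two components as real vectors and take the arccosine of their cosine similarity. This is an assumption about the computation, not a description of the code behind the histogram, and the function name is hypothetical. An angle near π/2 (about 1.57 radians, or 90 degrees) means the term part and the concept part are nearly orthogonal.

```python
import math


def component_angle(vec):
    """Angle in radians between the real part and the imaginary part of vec."""
    re = [z.real for z in vec]
    im = [z.imag for z in vec]
    dot = sum(a * b for a, b in zip(re, im))
    norm = math.sqrt(sum(a * a for a in re)) * math.sqrt(sum(b * b for b in im))
    if norm == 0.0:
        raise ValueError("angle is undefined when either component is zero")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.acos(cosine)


orthogonal = component_angle([complex(1, 0), complex(0, 1)])
aligned = component_angle([complex(1, 1), complex(0, 0)])
```

The arccosine returns values in [0, π], which already lies inside the range [0, 2π) used for the histogram.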
REFERENCES
[1] Zuccon, G., Piwowarski, B., Azzopardi, L.: On the use of complex numbers in quantum models for information retrieval. In: Proceedings of ICTIR-11, 3rd International Conference on the Theory of Information Retrieval, Bertinoro, Italy (September 2011)
[2] Koopman, B., Zuccon, G., Bruza, P., Sitbon, L., Lawley, M.: Graph-based concept weighting for medical information retrieval. In: Proceedings of ADCS-12, 17th Australasian Document Computing Symposium, Dunedin, New Zealand (December 2012) 80–87
[3] Wittek, P., Tan, C.L.: Compactly supported basis functions as support vector kernels for classification. Transactions on Pattern Analysis and Machine Intelligence 33(10) (2011) 2039–2050
MORE INFORMATION
Source code and steps to reproduce the results are available here: http://peterwittek.com/2013/07/merging-in-complex-space/