[ieee 2007 2nd international conference on pervasive computing and applications - birmingham, uk...

Specific Academic Area based Automatic Query Expansion

Yuan Yuan, Yong Zhang, Chunxiao Xing
Research Institute of Information Technology, Tsinghua University, China
yuan-y02@mails.tsinghua.edu.cn, {zhangyong05, xingcx}@tsinghua.edu.cn

Abstract

Most current search engines focus on content-based searching and therefore require users to input accurate query terms. Unfortunately, most people input just a few words when searching, and the words themselves may have diverse meanings, which badly degrades search performance. Previous automatic query expansion algorithms usually use documents from all fields to calculate the similarity between terms, and they do not consider the non-correlation between terms when expanding the query. Because the interests of users are often limited to a relatively narrow branch, the global relationships among terms usually introduce irrelevant noise. Therefore, if users can choose a relevant area and then do the search, the results can largely fulfill their requirements. In this paper, we choose a specific academic area and add non-correlation into query expansion. Our experiments demonstrate the effectiveness of category-based search.

Keywords: Automatic Query Expansion, Similarity, Category-based Search, Non-Correlation

1. Introduction

With the rapid development of the Internet, search engines have become more and more important for people to access information. However, because search terms have diverse meanings and synonyms, the search results are inevitably affected by imprecise search terms. This problem is called the "dictionary problem", first mentioned in [1], in which an experiment shows that the probability of different people using the same words to express the same idea is less than 20%. Moreover, according to [2], 49% of users use only one term in a search, 33% use two, and the average number of search terms is 1.4. Consequently, search efficiency degrades, as most current search engines are implemented on the term vector model [10]. In the worst case, people have to change the search terms frequently to get the information they want.

To solve this problem, researchers [5, 6] have proposed many methods that mainly rely on a global dictionary constructed from the terms in documents from all fields. Because the interests of users are often limited to a relatively narrow branch, the global relationships among terms usually introduce irrelevant noise. Others [8, 9] propose methods that lower the irrelevancy by using the results of the previous search to obtain the correlation of the terms to be expanded in the repeated search. However, the overhead of the online calculation is too heavy. Therefore, in this paper we train the correlation of different terms offline in a certain area (the academic area of personalization is selected for illustration) and build up the correlation graph for automatic query expansion. Experiments on the ACM database demonstrate that this method can improve the accuracy of the correlation between terms and avoid the overhead of online calculation.

The rest of this paper is organized as follows. The correlation graph and automatic query expansion are introduced in Section 2. Section 3 presents a new method of automatic query expansion that limits the documents to a specific academic area and introduces non-correlation into the expansion. The experimental study is presented in Section 4. We conclude in Section 5.

2. Preliminaries

2.1 The correlation graph

The correlation graph was used in [3, 4] to portray the interests of users for database queries. The basic elements of the graph are the nodes and the directed edges between the nodes, where the nodes represent the terms in the user interest vectors and the edges represent the relationships among terms. There are four kinds of edges: conjunction edges, disjunction edges, negation edges and substitution edges. The semantics of these four kinds of edges correspond to the four relationships in electronic logic. The database system reconstructs the query according to the interest graph before executing it. Moreover, each edge has a single value that the system uses to find an optimal expansion path for the query. In our research, we modify the correlation graph according to our needs to express the correlation of different terms.

1-4244-0971-3/07/$25.00 © 2007 IEEE.

2.2 Automatic query expansion

Automatic query expansion refers to the automatic reconstruction of the original query by the underlying machine (including adding new terms to the original query or substituting the original query with other terms) according to the similarity between terms, rather than relying on users resubmitting queries to the search engine, in order to improve the performance of the search engine. Many methods exist for automatic query expansion. According to the information used for training the correlation of the terms, these methods can be classified into two main categories, namely local analysis and global analysis.

Global analysis uses the documents from all fields to calculate the similarities between terms and build up a global dictionary. When a new query comes, it uses the most similar term in the dictionary to expand the original query. Commonly used global analyses include latent semantic indexing [5] and the global similarity dictionary [6]. There are also semi-automatic expansion methods, which need the participation of the users. For example, the SNAKET system [7] communicates with the users and lets them make the final decision on which terms will be selected for the expansion. This method is more precise in calculating the similarity between terms, but takes more response time because it needs user feedback. However, the conspicuous disadvantage of these methods is that if the documents range widely, the correlation between terms may not reflect the users' interests accurately.

To avoid the inherent problems of global analysis, local analysis uses just part of the whole document collection. It is developed on the basis of relevance feedback [8], which is also semi-automatic. This method communicates with the users to determine which documents in the original search results are the most relevant and then uses these documents to calculate the similarity of terms for the expansion. In contrast, pseudo feedback [9] does not need the participation of the users. It takes the first N entries in the original results as the most relevant ones and uses them in the training. However, this method can also be problematic, especially when the first N documents are not that relevant.

As both kinds of analysis can introduce unrelated noise into the results, we propose a method based on a certain specific area; that is, we calculate the similarity only from documents that are limited to the specific area. This correlation between terms is comparatively stable, so it can be calculated offline. Moreover, in our expansion, we also introduce the new concept of non-correlation between terms to further improve the performance. The detailed discussion is presented in the next section.

3. Specific academic area based automatic query expansion

In this section, we modify the correlation graph by introducing non-correlation, and elaborate the method of automatic query expansion based on documents limited to a specific academic area.

3.1 The correlation graph between terms

We use a directed graph G(V, E), where V is the set of nodes and E is the set of edges, to represent the correlations among different terms. The nodes are the terms characterizing a certain specific area. Here, we use $t_l$ to denote the starting node of an edge and $t_r$ to denote the corresponding end node. There are three types of edges, as shown in Fig. 1:

1) Conjunction edges, which mean that when someone uses the term $t_l$, the system automatically expands the original $t_l$ with $t_r$ to form the reconstructed query "$t_l$ & $t_r$". They are directed edges.
2) Non-correlation edges, which let us exclude unrelated documents from the search results: when someone uses the term $t_l$ for a query, the system automatically expands the query to "$t_l$ & ¬$t_r$". They are also directed edges.
3) Substitution edges, which mean that we can use the term $t_r$ to substitute for the original term $t_l$. This kind of edge avoids frequent changes of the query term to get the desired information. Substitution edges are undirected.

Figure 1. Three types of edges (substitution edge, non-correlation edge, conjunction edge)

Each edge has a value ranging from 0 to 1 that reflects the extent of similarity between the two terms of the edge, that is, the correlation degree or the non-correlation degree. These values are calculated from the documents in a certain area, as discussed in the next section. A typical graph is given in Fig. 2.

Figure 2. A typical correlation graph (legible node labels include recommender, clustering, SOA web service and personalization)
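One possible in-memory representation of such a graph is sketched below. It is only an illustration under stated assumptions: the class names `EdgeKind`, `Edge` and `CorrelationGraph`, the method `out_edges`, and all edge values are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum


class EdgeKind(Enum):
    CONJUNCTION = "conjunction"          # directed: t_l -> "t_l & t_r"
    NON_CORRELATION = "non-correlation"  # directed: t_l -> "t_l & NOT t_r"
    SUBSTITUTION = "substitution"        # undirected: t_l <-> t_r


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    kind: EdgeKind
    value: float  # correlation or non-correlation degree in [0, 1]


@dataclass
class CorrelationGraph:
    edges: list = field(default_factory=list)

    def add_edge(self, source, target, kind, value):
        if not 0.0 <= value <= 1.0:
            raise ValueError("edge value must lie in [0, 1]")
        self.edges.append(Edge(source, target, kind, value))

    def out_edges(self, term, kind):
        """Edges of one kind that can expand `term`, strongest first."""
        found = []
        for e in self.edges:
            if e.kind is not kind:
                continue
            if e.source == term:
                found.append(e)
            elif kind is EdgeKind.SUBSTITUTION and e.target == term:
                # Substitution edges are undirected, so present the edge
                # as if it were leaving `term`.
                found.append(Edge(term, e.source, kind, e.value))
        return sorted(found, key=lambda e: e.value, reverse=True)


# Illustrative values only; they are not taken from the experiments.
graph = CorrelationGraph()
graph.add_edge("collaborative filtering", "information filtering",
               EdgeKind.CONJUNCTION, 0.5)
graph.add_edge("collaborative filtering", "recommender system",
               EdgeKind.CONJUNCTION, 0.6)
graph.add_edge("collaborative filtering", "SOA web service",
               EdgeKind.NON_CORRELATION, 0.7)
graph.add_edge("user profile", "user model", EdgeKind.SUBSTITUTION, 0.85)
```

Sorting the outgoing edges by value mirrors the descending order in which the expansion strategies consider candidate terms.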

3.2 The construction of the correlation graph

The construction of a correlation graph consists of three steps:

a). Get the term set V. First, we need to get the terms of the correlation graph. As the expansion is based on a certain academic area, we select the terms that can represent the content of documents related to this area (for example, we choose personalization). We adopt the Document Frequency algorithm for the extraction of the feature terms. Then we use the TFIDF algorithm [11] on the document set, and the most representative terms of each document are picked to compose the term set $T = \{t_1, t_2, \ldots, t_n\}$. Below is the normalization formula used to calculate the weights of the terms, that is, the degree to which each term represents the corresponding document.

$$\omega_i = \frac{tf_i \cdot \ln(N/n_i + 0.1)}{\sqrt{\sum_{k} tf_k^2 \cdot \ln^2(N/n_k + 0.1)}}$$

$\omega_i$: the weight of term i for representing a document;
N: the number of all the documents;
$n_i$: the number of documents in which term i appears;
$tf_i$: the frequency of term i in that document.

b). Build up the conjunction and substitution edges between terms. We use the terms in the set as query terms to search the ACM database for training purposes. For each query term $t_i$, we record the first 50 entries (if there are fewer than 50 entries, we record them all) as the result set $r_i = \{p_{i1}, p_{i2}, \ldots, p_{in}\}$ ($n \le 50$), where $p_{ij}$ denotes the j-th document in the search result of term i. The set of all documents is $R = \bigcup_{i=1}^{M} r_i$, where M is the number of query terms and m is the number of documents in R. Based on this set, we can express each term as a 0/1 vector $t_i = \langle (p_1, w_{1i}), (p_2, w_{2i}), \ldots, (p_m, w_{mi}) \rangle$, where $1 \le i \le M$ and $w_{ji} \in \{0, 1\}$. If $w_{ji} = 1$, the result set of term i contains the j-th document in R; otherwise, $w_{ji} = 0$. After all the terms are represented by 0/1 vectors, the similarity between term i and term j is measured by the cosine similarity as follows:

$$Sim(t_i, t_j) = \frac{\sum_{k=1}^{m} w_{ki} \cdot w_{kj}}{\sqrt{\sum_{k=1}^{m} w_{ki}^2 \cdot \sum_{k=1}^{m} w_{kj}^2}}$$
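As a minimal sketch of this similarity over 0/1 result vectors, the following code builds the vectors from toy result sets and computes their cosine. The function names and the document identifiers are hypothetical, and returning 0 for a term with no results is an assumption made here, not something the paper specifies.

```python
import math


def result_vector(term_results, all_documents):
    """0/1 vector: 1 where the document appears in the term's result set."""
    return [1 if doc in term_results else 0 for doc in all_documents]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a term without results is similar to nothing
    return dot / (norm_u * norm_v)


# Toy result sets standing in for the recorded search results of two terms.
results = {
    "t1": {"p1", "p2", "p3"},
    "t2": {"p2", "p3", "p4"},
}
all_documents = sorted(set().union(*results.values()))  # the set R
v1 = result_vector(results["t1"], all_documents)
v2 = result_vector(results["t2"], all_documents)
sim = cosine_similarity(v1, v2)
```

With 0/1 vectors the numerator is simply the number of shared result documents, so here two shared documents out of three per term give a similarity of 2/3.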

We define the correlation degree of term i with term j as:

$$D(i, j) = \lambda \cdot Sim(t_i, t_j) + (1 - \lambda) \cdot \frac{n_{i \cap j}}{n_i}, \quad \lambda \in [0, 1]$$

where $n_{i \cap j}$ is the number of result documents corresponding to the query "i & j", and $n_i$ is the number of result documents corresponding to term i.

Likewise, the correlation degree of term j with term i can be calculated. D(i, j) and D(j, i) are the values of the directed conjunction edges in the correlation graph. If either D(i, j) or D(j, i) is not smaller than 0.8, we treat the two terms as substitutes for each other, and the two directed edges between them are replaced by a single substitution edge whose value is the larger of D(i, j) and D(j, i). If both D(i, j) and D(j, i) are below 0.2, we treat the two terms as uncorrelated. Values between 0.2 and 0.8 become the values of the directed conjunction edges.
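The correlation degree and the two thresholds can be sketched as follows. The function names and the toy counts are hypothetical. The text does not say what happens when only one direction falls below 0.2, so this sketch assumes that only the directions with a value of at least 0.2 become conjunction edges.

```python
def correlation_degree(sim, n_both, n_i, lam=0.5):
    """D(i, j) = lam * Sim(t_i, t_j) + (1 - lam) * n_both / n_i."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    overlap = n_both / n_i if n_i else 0.0
    return lam * sim + (1.0 - lam) * overlap


def classify_pair(d_ij, d_ji, high=0.8, low=0.2):
    """Turn the two directed degrees into edges, using the 0.8 / 0.2 thresholds."""
    if max(d_ij, d_ji) >= high:
        # One undirected substitution edge carrying the larger value.
        return [("substitution", "i<->j", max(d_ij, d_ji))]
    edges = []
    # Assumption: a direction below `low` is dropped on its own.
    if d_ij >= low:
        edges.append(("conjunction", "i->j", d_ij))
    if d_ji >= low:
        edges.append(("conjunction", "j->i", d_ji))
    return edges  # empty list: the two terms are treated as uncorrelated


# Toy numbers: Sim = 0.6, 20 documents match "i & j", 50 match i, 25 match j.
d_ij = correlation_degree(0.6, n_both=20, n_i=50)
d_ji = correlation_degree(0.6, n_both=20, n_i=25)
edges = classify_pair(d_ij, d_ji)
```

Because the overlap is divided by the result count of the starting term, D(i, j) and D(j, i) generally differ, which is why the conjunction edges are directed.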

c). Build up the non-correlation edges. For this kind of edge we define the non-correlation degree of term i with term j as $D(i, j) = n_{i \cap \neg j} / n_i$, where $n_{i \cap \neg j}$ is the number of result documents corresponding to the query "i & ¬j".
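A small sketch of the non-correlation degree follows. It reads the degree as the ratio of the result count of "i & not j" to the result count of i, by analogy with the overlap term of the correlation degree, and it approximates the query "i & not j" by a set difference over recorded result sets. Both readings are assumptions, and the function name and document identifiers are hypothetical.

```python
def non_correlation_degree(results_i, results_j):
    """Share of term i's result documents that do not appear for term j."""
    if not results_i:
        return 0.0  # assumption: no results means no evidence of non-correlation
    return len(results_i - results_j) / len(results_i)


# Toy result sets for two terms.
results_i = {"p1", "p2", "p3", "p4"}
results_j = {"p3", "p4", "p5"}
degree_ij = non_correlation_degree(results_i, results_j)
degree_ji = non_correlation_degree(results_j, results_i)
```

Under the set-difference approximation this degree equals one minus the overlap ratio of the same pair, so a high value marks a term whose documents are mostly unrelated to the starting term.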

3.3 The automatic expansion strategies basedon the correlation graph

As mentioned above, there are three types of relationships between terms, so when we implement a query expansion, there are several strategies for different circumstances. First, we declare some variables for the convenience of the later discussion.

1) Original term: t.
2) The out conjunction edge CO(t, $t_1$), where t and $t_1$ are the nodes of the edge CO; CO(t, $t_1$) represents expansion from t to "t & $t_1$". $COS = \{CO_1, CO_2, \ldots, CO_n\}$ is the set of such edges related to t, in descending order of their values. $T_{CO} = \{t_{CO1}, t_{CO2}, \ldots, t_{COn}\}$ is the set of the corresponding terms related to the edges in COS.
3) The in conjunction edge CI(t, $t_2$), where t and $t_2$ are the nodes of the edge CI; CI(t, $t_2$) represents expansion from $t_2$ to "t & $t_2$". $CIS = \{CI_1, CI_2, \ldots, CI_n\}$ is the set of such edges related to t, in descending order of their values. $T_{CI} = \{t_{CI1}, t_{CI2}, \ldots, t_{CIn}\}$ is the set of the corresponding terms related to the edges in CIS.
4) The substitution edge S(t, $t_3$), where t and $t_3$ are the nodes of the edge S. $SS = \{S_1, S_2, \ldots, S_n\}$ is the set of such edges related to t, in descending order of their values. $T_S = \{t_{S1}, t_{S2}, \ldots, t_{Sn}\}$ is the set of the corresponding terms related to the edges in SS.
5) The out non-correlation edge NO(t, $t_4$), where t and $t_4$ are the nodes of the edge NO; NO(t, $t_4$) represents expansion from t to "t & ¬$t_4$". $NOS = \{NO_1, NO_2, \ldots, NO_n\}$ is the set of such edges related to t, in descending order of their values. $T_{NO} = \{t_{NO1}, t_{NO2}, \ldots, t_{NOn}\}$ is the set of the corresponding terms related to the edges in NOS.
6) The in non-correlation edge NI(t, $t_5$), where t and $t_5$ are the nodes of the edge NI; NI(t, $t_5$) represents expansion from $t_5$ to "¬t & $t_5$". $NIS = \{NI_1, NI_2, \ldots, NI_n\}$ is the set of such edges related to t, in descending order of their values. $T_{NI} = \{t_{NI1}, t_{NI2}, \ldots, t_{NIn}\}$ is the set of the corresponding terms related to the edges in NIS.

In the next part, we describe in detail our strategies for query expansion of the original term t under different situations. We use $E_t$ to denote the set of the edges connected with term t.

1) $E_t$ includes only one edge.
A. If the edge is a substitution edge S, we substitute the original term t with $t_3$ and expand the query according to the correlations of $t_3$.
B. If the edge is an out conjunction edge CO, we just expand the query t to "t & $t_1$".
C. If the edge is an out non-correlation edge NO, we expand the query t to "t & ¬$t_4$".
D. If the edge is an in conjunction edge CI or an in non-correlation edge NI, we do not expand the term according to this correlation graph.

2) $E_t$ includes several edges of the same kind.
E. If the edges are substitution edges, we choose the edge with the largest value and then treat it as in situation A.
F. If the edges are out conjunction edges, we simply expand the original term t to the query "t & $\alpha_1 t_{CO1}$ & $\alpha_2 t_{CO2}$ & ... & $\alpha_n t_{COn}$" ($\alpha_i$ = 0, 1), with $\alpha_1 = 1$ and $\alpha_i \ge \alpha_j$ (i < j).
G. If the edges are out non-correlation edges, we adopt the same strategy as in situation F; that is, we expand the term t to the query "t & ¬$\alpha_1 t_{NO1}$ & ... & ¬$\alpha_n t_{NOn}$" ($\alpha_i$ = 0, 1), with $\alpha_1 = 1$ and $\alpha_i \ge \alpha_j$ (i < j).
H. If the edges are in conjunction edges or in non-correlation edges, we do not expand the original term t.

3) $E_t$ includes several edges of different kinds.

First, we deal with the conjunction edges and the non-correlation edges separately in the ways discussed above, leaving the substitution edges aside. This gives the expanded queries "t & $\alpha_1 t_{CO1}$ & $\alpha_2 t_{CO2}$ & ... & $\alpha_n t_{COn}$" and "t & ¬$\alpha_1 t_{NO1}$ & ... & ¬$\alpha_n t_{NOn}$". Then, we combine these two expansions into "t & $\alpha_1 t_{CO1}$ & $\alpha_2 t_{CO2}$ & ... & $\alpha_n t_{COn}$ | t & ¬$\alpha_1 t_{NO1}$ & ... & ¬$\alpha_n t_{NOn}$" ("|" means "or") to do the search. In all these expressions, $\alpha_i$ = 0, 1. As for the substitution edge S, we substitute the term t with the term $t_3$ and expand it. Finally, we merge the respective result sets to form the final results.
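The combination of out conjunction and out non-correlation edges can be sketched as a small query builder. This is a hedged illustration: the function `expand_query`, the `layers` parameter (the number of leading coefficients set to 1), the ASCII query syntax with "-" for negation, and the edge values are all assumptions made for the example. Substitution edges are left out; they would rerun the same procedure for the substituted term and merge the result sets.

```python
def expand_query(term, conjunction, non_correlation, layers=1):
    """Build an expanded query string for `term`.

    `conjunction` and `non_correlation` are lists of (term, value) pairs for
    the out edges of `term`. Only the `layers` strongest edges of each kind
    are used, which corresponds to setting the first `layers` coefficients
    to 1 and the rest to 0.
    """
    def strongest(edges):
        ranked = sorted(edges, key=lambda e: e[1], reverse=True)
        return [t for t, _ in ranked[:layers]]

    parts = []
    positive = strongest(conjunction)
    negative = strongest(non_correlation)
    if positive:
        parts.append(" & ".join([term] + positive))
    if negative:
        parts.append(" & ".join([term] + ["-" + t for t in negative]))
    if not parts:
        return term  # nothing to expand with
    if len(parts) == 1:
        return parts[0]
    return " | ".join("(" + p + ")" for p in parts)


# Illustrative edges; the values are not taken from the experiments.
conj = [("information filtering", 0.5), ("recommender system", 0.6)]
nonc = [("SOA web service", 0.4)]
one_layer = expand_query("collaborative filtering", conj, nonc, layers=1)
two_layers = expand_query("collaborative filtering", conj, nonc, layers=2)
```

Here `one_layer` uses only the strongest conjunction term, while `two_layers` also adds the second one; the non-correlation part stays the same because only one such edge exists.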

4. Experiment and performance evaluation

As mentioned above, we chose the ACM database to get the document set used in our research. First, we downloaded about 150 related documents in a specific area, personalization. On this document set, we used the TFIDF method to get the feature terms of this area. At the same time, we identified the most commonly used query terms through a survey of the students and teachers in this area. Finally we got 12 terms in total: collaborative filtering, recommender system, information filtering, content-based filtering, user profile, user model, agent, clustering, information retrieval, personalization, recommendation and user preferences. In our experiments, we also used documents that contain some of these words but are not related to this area, in order to get unrelated feature terms that can cause unrelated results in the search. They are diversification, information extraction, web ranking, sparse data, Bayesian networks, SOA web service and data stream mining.
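The TFIDF weighting used to pick feature terms can be sketched as below. It assumes the normalized form of Section 3.2, in which each raw weight tf · ln(N/n + 0.1) is divided by the square root of the sum of the squared raw weights in the same document. The function name and the toy documents are hypothetical, and tokenization is assumed to have happened already.

```python
import math
from collections import Counter


def tfidf_weights(doc_tokens, doc_freq, n_docs):
    """Normalized TFIDF weights for the terms of one document."""
    tf = Counter(doc_tokens)
    # Raw weight: term frequency times ln(N / n_i + 0.1).
    raw = {t: f * math.log(n_docs / doc_freq[t] + 0.1) for t, f in tf.items()}
    # Normalize so that the squared weights of the document sum to 1.
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: (w / norm if norm else 0.0) for t, w in raw.items()}


# Toy corpus of three already-tokenized documents.
docs = [
    ["personalization", "user", "profile", "user"],
    ["clustering", "user"],
    ["personalization", "recommender"],
]
# Document frequency: number of documents in which each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))
weights = tfidf_weights(docs[0], doc_freq, len(docs))
top_term = max(weights, key=weights.get)
```

In this toy corpus the rarest term of the first document receives the largest weight even though another term occurs twice, which is the behavior that makes the weighting useful for picking representative terms.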

Using these terms, we ran the queries on the ACM database and collected all the result documents as our document set. Based on this document set, we designed our experiments to assess the performance of the expansion algorithms. In all experiments, we used an integrated criterion F, balancing recall and precision, to evaluate the effectiveness of our expansions.

1) Recall (RA): defined as $N_e/N_o$, where $N_e$ is the number of documents returned by the expanded query and $N_o$ is the number of documents returned by the original query.
2) Correlation (Cor): defined as $R_e/R_{total}$, where $R_{total}$ is the number of result documents we assessed (in our experiments, fixed at the first 50 results) and $R_e$ is the number of documents related to this academic area among the first 50 documents.
3) $F = 2 \cdot RA \cdot Cor / (RA + Cor)$.
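The three measures can be computed as in the following sketch. The function names and the counts in the example are hypothetical, and returning 0 when both inputs are 0 is an assumption added to avoid division by zero.

```python
def recall_ratio(n_expanded, n_original):
    """RA: documents returned by the expanded query over those of the original."""
    return n_expanded / n_original


def correlation(n_related, n_assessed=50):
    """Cor: related documents among the assessed top results."""
    return n_related / n_assessed


def combined_measure(ra, cor):
    """Harmonic-mean style combination 2 * RA * Cor / (RA + Cor)."""
    if ra + cor == 0:
        return 0.0  # assumption: define the measure as 0 when both inputs are 0
    return 2 * ra * cor / (ra + cor)


# Toy counts: the expanded query returns 40 documents, the original 80,
# and 30 of the first 50 expanded results are related to the area.
ra = recall_ratio(40, 80)
cor = correlation(30)
score = combined_measure(ra, cor)
```

Because the combination is a harmonic mean, the score stays low unless both RA and Cor are reasonably high, which penalizes expansions that gain relevance only by returning very few documents.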

We used the term "collaborative filtering" as the original query. For different values of λ, we built the corresponding correlation graphs and expanded the query. Below is part of our correlation graph when λ = 0.5.

Figure 3. Part of the correlation graph when λ = 0.5

4.1 The expansion performance with varying degree of similarity

To investigate how the similarity of the expanding terms influences the expansion performance, we conduct the following experiment. We use the correlation graph (λ = 0.5) to illustrate the relationship between the degree of similarity and the expansion performance. For simplicity, we only use the conjunction edges for expansion in this experiment. Other kinds of expansion show similar results.

Figure 4. The performance of expansion with term "Collaborative Filtering" (x-axis: degree of similarity, with points at 0.83, 0.58, 0.51 and 0.46; y-axis ticks from 0 to 0.7)

Figure 5. The performance of expansion with term "Information Filtering" (x-axis: degree of similarity, with points at 0.51, 0.49, 0.33 and 0.24; y-axis ticks from 0.29 to 0.38)

From the two graphs (Fig. 4 and Fig. 5), which depict how the expansion performance changes for two different original terms, we can see that as the degree of similarity between the original term and the expanding term decreases, the performance of the expanded query also decreases. This finding supports our strategy of choosing the term with the maximum degree of similarity to expand the original term.

4.2 The expansion performance with varying layer numbers

Another equally important setting in expansion is how many terms should be included. To study its influence, we again use the term "collaborative filtering" and the term "information filtering" for illustration. As described above, we can determine the number of layers for expansion by controlling the values of $\alpha_i$. In our experiments, we did the expansion based on the conjunction edges of each term, and the corresponding performance is shown in Fig. 6.

Figure 6. The comparison of expansion with different layers (x-axis: layers of the expansion, 0 to 3; y-axis ticks from 0 to 0.7)

From Fig. 6 we can conclude that the best number of expansion layers is one, so we use just one of the conjunction edges for the expansion to get the best performance.

However, we would also like to mention that little work has been done on this topic, so it is relatively hard to conduct a fair and effective comparative experiment. Nevertheless, user experience with our system reveals that our approach can indeed improve search performance to a large extent.

5. Conclusions and Future Work

Although automatic query expansion improves the performance of search engines to some extent, the methods for implementing the expansion still have a long way to go to become more efficient. Because the interests of users are often limited to a relatively narrow branch, we confine the expansion to a certain area to avoid the irrelevant noise that might be introduced by the previous methods.


In addition, the previous view of expansion is to add the new terms that are most similar to the original term: the more similar words are added, the better the results. However, sometimes excluding the irrelevant documents is easier and more useful for getting better results. Therefore, in this paper we propose the concept of non-correlation in the expansion and describe in detail the expansion strategies that use this relationship. The experiments demonstrate the efficiency of this method.

Despite the work above, some problems still need further study. In our experiments, we have demonstrated the effectiveness of our correlation graph and have found the best number of expansion layers for the original terms. Another piece of work that we have started but not yet finished is to determine the influence of different values of λ on the expansion performance.

Acknowledgement

This work is supported by the Support Program of the National '11th Five-Year-Plan' of China under Grant No. 2006BAH02AOO; the National High Technology Development Program of China under Grant No. 2006AA010101; and Tsinghua University FIT Funding under Grant No. 041506021.

References

[1]. G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais, "The vocabulary problem in human-system communication", Communications of the ACM, 1987, 30(11), 964-971.

[2]. R. Wen, J. Y. Nie, and H. J. Zhang, "Clustering user queries of a search engine", Proceedings of the 10th International World Wide Web Conference (WWW10), New York, USA, May 01-05, 2001, pp. 162-168.

[3]. G. Koutrika and Y. Ioannidis, "A Unified User Profile Framework for Query Disambiguation and Personalization", Proceedings of the Workshop on New Technologies for Personalized Information Access (PIA 2005) in conjunction with the 10th Int. Conf. on User Modeling (UM'05), Edinburgh, UK, July 24-25, 2005, pp. 44-53.

[4]. G. Koutrika and Y. Ioannidis, "Constrained Optimalities in Query Personalization", SIGMOD 2005, Baltimore, Maryland, USA, June 14-16, 2005, pp. 73-84.

[5]. S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis", Journal of ACM Transactions on Information Systems, 2000, 18(1), 79-112.

[6]. Y. Qiu and H. Frei, "Concept based query expansion", Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, USA, Jun 27-Jul 01, 1993, pp. 160-169.

[7]. P. Ferragina and A. Gulli, "A Personalized Search Engine Based on Web-Snippet Hierarchical Clustering", WWW'2005, Chiba, Japan, May 10-14, 2005, pp. 801-810.

[8]. R. Attar and A. S. Fraenkel, "Local feedback in full-text retrieval systems", Journal of the ACM, 1977, 24(3), 397-417.

[9]. C. Buckley, G. Salton, J. Allan, and A. Singhal, "Automatic query expansion using SMART", Technical Report, TREC-3, 1995, pp. 69-80.

[10]. C. Chang, C. Hsu, "Enabling concept-based relevance feedback for information retrieval on the WWW", IEEE Transactions on Knowledge and Data Engineering, 1999, 11(4), pp. 595-609.

[11]. D. Boley, M. Gini, R. Gross, E. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore, "Partitioning-based clustering for Web document categorization", Decision Support Systems, 1999, 27(3), pp. 329-341.
