keyword query expansion paradigm based on recommendation...

Research ArticleKeyword Query Expansion Paradigm Based onRecommendation and Interpretation in Relational Databases

Yingqi Wang NianbinWang and Lianke Zhou

College of Computer Science and Technology Harbin Engineering University Harbin China

Correspondence should be addressed to Nianbin Wang wangnianbinhrbeueducn

Received 15 January 2017 Revised 17 April 2017 Accepted 3 May 2017 Published 29 May 2017

Academic Editor Michele Risi

Copyright copy 2017 Yingqi Wang et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Due to the ambiguity and impreciseness of keyword query in relational databases the research on keyword query expansion hasattracted wide attention Existing query expansion methods expose usersrsquo query intention to a certain extent but most of themcannot balance the precision and recall To address this problem a novel two-step query expansion approach is proposed based onquery recommendation and query interpretation First a probabilistic recommendation algorithm is put forward by constructinga term similarity matrix and Viterbi model Second by using the translation algorithm of triples and construction algorithm ofquery subgraphs query keywords are translated to query subgraphswith structural and semantic information Finally experimentalresults on a real-world dataset demonstrate the effectiveness and rationality of the proposed method

1 Introduction

In recent years keyword query in relational databases hasbeen widely applied due to its simplicity and ease of use[1 2] Although this method does not require users to beknowledgeable about the underlying structure of databasesand structured query language (eg SQL) its semanticfuzziness and its expressive power are limited due to the lackof structure [3] In addition ordinary users are usually unableto specify exact keywords to describe their query intentionwhich makes it harder to return adequate results throughkeyword query [4 5] For these reasons the precise and recallof keyword query methods cannot be effectively guaranteed[6] Therefore query expansion has even been an importantresearch branchwhich can completely andprecisely interpretthe query and then improve the recall and precision of queryresults [7ndash10]

Problem andMotivation Query expansion is to provide moredescriptions for the information requirements to improve thequery performance First we formally define the problem ofquery expansion as follows

Definition 1 (query expansion) Given an input keywordquery 119876 = 1198961 1198962 119896119898 over a relational database query

expansion is to find 119896 related queries 119876119894 (119894 = 1 119896)with 119896 largest score(119876 119876119894) where score(119876 119876119894) evaluates therelevance between 119876 and 119876119894

Then we will illustrate the research motivation of thispaper by the following examples and analysis Althoughnumerous query expansion methods can be found in thescientific literature unfortunately these studies suffer fromtwo main limitations First existing approaches do notconsider the relationship between keywords which causesthe low query precision Second most of the existing workneglects the similar words or related items of query keywordsleading to the low query recall On the one hand Meng etal propose a semantic approximate keyword query methodbased on keyword and query coupling relationships [11]Thismethod partly solves the problems of semantic fuzziness andlimitations of expression but it analyzes the keyword andquery coupling relationships through query history Whenthe query history of database is incomplete and evenmissingthe method will not be able to conduct semantic analysisnormally Additionally the queries obtained by this methodcontain only content information related to the keywordsrather than the structure information between keywordsthus affecting the query precision For example suppose auser issues the keyword query ldquoMachine learning Arthur

HindawiScientific ProgrammingVolume 2017 Article ID 7613026 12 pageshttpsdoiorg10115520177613026

2 Scientific Programming

SIGMODMachinelearning

ArthurSamuel

Markov

Figure 1 An expanded query

Samuelrdquo in DBLP to retrieve the paper ldquoMachine learningrdquopublished by Arthur Samuel The query expansion containsldquoMachine learningArthur Samuel SIGMODrdquo using the abovequery expansion method The expansion extends the querywith the content but lacks structure information Conse-quently all tuples containing ldquoMachine learningrdquo ldquoArthurSamuelrdquo or ldquoSIGMODrdquo will be returned Obviously suchresults are not precise enough and many of them may not beuseful Themethod proposed in this paper extends the initialquery with content and structure related to the keywords Anexpanded query is as shown in Figure 1

It can express the potential semantic and structuralinformation of original keyword query find out the paperldquoMachine learningrdquo published in SIGMOD and authored byArthur Samuel simultaneously find out the paper ldquoMarkovrdquowhich cites paper ldquoMachine learningrdquo and authored byArthur Samuel In contrast to the query expansion in [11]themethod proposed in this paper extends the original queryto query subgraphs with underlying structure of databasesand thus further improves the query precision On the otherhand Ganti et al [12] translates keyword queries into SQLbased on the materialized mappings Bergamaschi et al [3]propose aMetadatamethod to translate keyword queries intoSQL based on Munkres algorithm Though these methodsdescribe usersrsquo query intention to a degree they do not takeinto account similar words and related items extension Thusthese methods have relative high query precision but theirrecall needs to be further improved For example assumethat a user hopes to study the methods of data analysisBecause of his knowledge limitation the user accesses thedatabase DBLP and submits a keyword query ldquoMachinelearning Arthur Samuelrdquo and the expanded query obtainedvia the above methods is as follows select R3Title fromAuthor R1 PaperAuthor R2 Paper R3 where R2Pid=R3Pidand R2Aid=R1Aid and match(R3Title) against (ldquoMachinelearningrdquo in Boolean mode) and match(R1Name) against(ldquoArthur Samuelrdquo in Boolean mode) while the user may alsobe intending to retrieve papers which have similar or relevanttopics to ldquoMachine learningrdquo such as paper ldquodata miningrdquopublished by Micheline Kamber Since paper ldquodata miningrdquocites paper ldquoMachine learningrdquo and they are much relatedto each other the user is also interested in it The resultsof the query expansion method proposed in this paper areas shown in Figure 2 Compared with the previous resultthese results not only have structural relationship betweenkeywords but also contain the related or similar queries withquery keywords

The analysis and examples of the query expansion meth-ods in Section 1 illustrate that the key challenge here is todevelop an approach which balances both query precision


ArthurSamuel

Markov

SIGIRData

miningMichelineKamber

Machinelearning

ArthurSamuel

Figure 2 An expanded query of ReInterpretQE

and recall In this paper we focus on the problem of howto tackle the above two limitations and then improve theperformance of keyword query expansion approach Wepropose a novel query expansion method ReInterpretQEbased on query recommendation and interpretation whichextends a keyword query to a list of query subgraphs Thesesubgraphs could better capture usersrsquo information need andthe possible semantics of the keyword query and then guideusers to explore related tuples in the relational database Firstwe construct a term similarity matrix based on the tupleinformation contained in relational databases Second wemake the query recommendation using the similarity matrixand dynamic programming to get a query list This query listconsists of top-119896 queries related to the initial query Finallywe transform the keyword query into query subgraphsThrough the query recommendation and interpretation thequery expansionmethod improves the recall and precision ofquery results

The main contributions of this paper are summarized asfollows

(1) We present a keyword query expansion paradigmReInterpretQE in relational databases which is basedon query recommendation and interpretation

(2) We design a probabilistic recommendation algorithmbased on similarity matrix and dynamic program-ming

(3) We propose a keyword query interpretation methodIt uses statistical information and schema graph ofdatabase for translating a keyword query to querysubgraphs

(4) We conduct extensive experiments on the DBLPdataset and experimental results demonstrate theeffectiveness and feasibility of the proposed method

The paper is organized as follows Section 1 introducesthe research motivation main contribution and structureof this paper Section 2 reviews the related work Section 3describes the architecture of approach ReInterpretQE andthen provides details of algorithms in query recommendationand query interpretation Section 4 conducts the experimentson a real dataset and compares the experimental resultsSection 5 concludes this paper and prospects the study infuture

Scientific Programming 3

2 Related Work

This section discusses the related research work It mainlyincludes the following two parts keyword query in relationaldatabases and keyword query expansion

21 Keyword Query in Relational Databases Recently key-word query methods in relational databases have been stud-ied extensively [2 13ndash15] According to the differentmodelingmethods of databases the existing query methods can bedivided into twomain categories schema graph-basedmeth-ods and data graph-based methods In [16ndash21] the databaseis modeled as a schema graph where nodes represent tablesand edges represent primary-foreign-key relationshipsThesemethods enumerate all possible CNs (Candidate Networks)based on the schema graph Although these methods storethe abstract structural information and take little memorythe generation process of CNs needs to take a huge amountof time Correspondingly in [22ndash27] the database is modeledas a data graph where nodes represent tuples and edgesrepresent primary-foreign-key relationships between tuplesThe methods identify the minimum connection trees thatcontain the keywords on the data graph Amajor challenge ofthemethods is themaintenance of data graphThewhole datagraph needs to be reconstructed when the data in databasechanges which is often a time-consuming exerciseThereforeit is important to study the dynamic update of databases anddesign incremental query methods based on the dynamicconstruction of data graph

22 Keyword Query Expansion Paradigm In the field ofkeyword query in relational databases the majority of exist-ing studies have focused on how to improve the efficiencyof query algorithms while effective query preprocessinghas yet to be investigated Since keyword queries lack thestructural information and users tend to select inappropriatekeywords semantic fuzziness becomes an urgent problemto be solved The method in [28] expands the originalquery by using query log mining techniques but it is notapplicable in the relational databases Reference [7] selectsthe related query expression using user feedback The resultsobtained by this query expression are more accordant withusersrsquo query intention However it needs the interaction ofhumans so its efficiency is low In [3] the keyword query istransformed to a SQL statement based on Munkres whichprovides possible semantic descriptions This method isuseful for identifying usersrsquo query intentionNevertheless theapproach does not consider the multiple connection betweenkeywords (ie there are various explanations for a keywordquery) In addition the above approach does not take intoaccount the similar words and related items which can wellexpress the intention of users An analysis model aboutcoupling relationship is presented in [11] It extracts semanticrelations based on query history However this model alsohas great limitations When the query logs are missing oruserrsquos preference often changes the model cannot be appliedeffectively Therefore this paper proposes a query expan-sion method ReInterpretQE which consists of two stepsquery recommendation and query interpretation First we

construct similarity matrix based on the structure and con-tent information in databases and put forward a probabilisticrecommendation algorithm using dynamic programmingSecond we come up with a keyword query interpretationmethod to transform the keywords to subgraphs based onthe statistics and schema graph of database Experimentalresults show that both the recall and precision of query resultshave been improved by using the proposed query expansionmethod

3 Query Expansion Approach

31 Overview of ReInterpretQE To solve the two problemsof semantic fuzziness and limitations of expression thispaper comes up with a novel two-step query expansionmethod ReInterpretQE This method is based on queryrecommendation and query interpretationThe goal of queryrecommendation is to extend the initial query 119876 to a listof keyword queries related to it so that the query resultsare more comprehensive and better to meet the demandsof users The query interpretation is to translate the list ofkeyword queries into query subgraphs which can lock thequery results more precisely It is designed to improve therecall and precision of query results Figure 3 shows the archi-tecture diagram of ReInterpretQE which is divided into twomain phases query recommendation and query interpreta-tion

Phase 1 (query recommendation) In the process of queryrecommendation the intrasimilarity and intersimilaritybetween terms are calculated using the structure informa-tion content information and words cooccurrence and asimilarity matrix is constructed based on the above twosimilarities Then the idea of dynamic programming isused to build Viterbi model thus a keyword query 119876 =1198961 1198962 119896119898 can be extended to a keyword query list1198761 =(11990211 11990212 1199021119898) sdot sdot sdot 119876119896 = (1199021198961 1199021198962 119902119896119898) The query list pro-duced by the query recommendation process is semanticallyrelated to the original query

Phase 2 (query interpretation) In the process of queryinterpretation an algorithm is put forward for translationfrom keywords to triples Then query subgraphs are builtfor each query in query list using the schema graph ofdatabase The implementation detail of the algorithm will beintroduced in the subsequent sections

First Section 32 describes how to recommend querylist with the same or close similarity of meaning accord-ing to structure information and content information indatabases This problem can be solved effectively by theconstruction of term similarity matrix and probabilisticrecommendation algorithm The method makes the queryresults contain more information that users want to obtainThus the query recall can be further improved Section 33puts forward a two-step query interpretation method Thekeyword query can be translated into query subgraphswith potential structural information through the followingsteps Step 1 translation from keywords to triples Step 2construction of query subgraphs With the above process of


Database

Construction ofterm similaritymatrix

Query recommendation

Query listQuery

interpretation

Querysubgraphs

Translation totriplesm

atrix MTerm

similarity

Keywordquery Q

Results

Subgraphsconstruction

Figure 3 Architecture of ReInterpretQE

query interpretation the query precision is improved effect-ively

32 Query Recommendation It is difficult to specify properkeywords to express query intention for a common userTherefore a lot of information related to query cannot bereturned as results This section presents a query recommen-dation method to extend the original query and derives aseries of queries which have a similar semantic to the originalquery Thus the recall of the query results can be improvedAssume that a user submits a query 119876 = 1198961 1198962 119896119898First Section 321 constructs the term similarity matrix tofind the top-119899 keywords 1198961119894 1198962119894 119896119899119894 related to any keyword119896119894 in original query 119876 Second Section 322 proposes aprobabilistic recommendation algorithm It uses dynamicprogramming to build Viterbi model for translating theoriginal query keyword 119876 = 1198961 1198962 119896119898 to a query list1198761 = (11990211 11990212 1199021119898) sdot sdot sdot 119876119896 = (1199021198961 1199021198962 119902119896119898)

321 Construction of Term Similarity Matrix In the con-struction phase of term similarity matrix from two aspectsof intrasimilarity and intersimilarity this paper calculatesthe similarity between keywords based on the structureinformation and content information

(a) Intrasimilarity Normally in information retrieval if twokeywords often appear in the same documents thus key-words are regarded as semantically related In relationaldatabases tuples are generally taken as virtual documentsSimilarly the higher the cooccurrence of two keywords inthe same tuples the higher the degree of similarity between

Table 1 Relational database DBLP

Pid Title1198751 Database query structured information1198752 Query statistical analysis machine learning1198753 Probability machine learning data mining1198754 Structured information database data mining

two keywords The intrasimilarity between two keywords ismeasured using Jaccard similarity coefficient as shown in

simiin (119896119894 119896119895)

=

119869 (119896119894 119896119895) =10038161003816100381610038161003816TS (119896119894) cap TS (119896119895)1003816100381610038161003816100381610038161003816100381610038161003816TS (119896119894) cup TS (119896119895)10038161003816100381610038161003816

119894 = 1198951 119894 = 119895

(1)

where TS(119896119894) and TS(119896119895) are tuple sets containing 119896119894 and 119896119895respectively

We can obtain the intrasimilarity between any two key-words in databases by formula (1) Example 2 will furthershow the calculation process above

Example 2 To facilitate explanation of the approach wesimplify the structure and content of database as shown inTable 1 There is a data table with four tuples 1198751 1198752 1198753 and1198754 in DBLP database We use da qe si sa ml pr and dm torepresent database query structured information statistical


analysis machine learning probability and data miningTheintrasimilarity between keywords da and qe is as follows

simiin (da qe) = 119869 (119896119894 119896119895) =10038161003816100381610038161198751 1198754 cap 1198751 1198752100381610038161003816100381610038161003816100381610038161198751 1198754 cap 1198751 11987521003816100381610038161003816

= 03(2)

Similarly we can calculate the intrasimilarity between anytwo keywords and get the following intrasimilarity matrixMin

Min =

[[[[[[[[[[[[[[[[[[

da qe si sa ml pr dmda 1 03 1 0 0 0 03qe 03 1 03 05 03 0 0si 1 03 1 0 0 0 03sa 0 05 0 1 05 0 0ml 0 03 0 05 1 05 03pr 0 0 0 0 05 1 05dm 03 0 03 0 03 05 1

]]]]]]]]]]]]]]]]]]

(3)

(b) Intersimilarity The intrasimilarity reflects the directsemantic similarity according to the cooccurrence of key-words Additionally there is also indirect semantic similaritybetween keywords For example although da and sa do notappear in a tuple at the same time they have the indirectsimilarity via keyword qe because da and qe appear in thesame tuple 1198751 while sa and qe appear in the same tuple1198752 Keyword qe is called the semantic associative term thuskeywords da and sa have the indirect semantic similarityNext we will detail the computing process of intersimilarity

Suppose keywords 119896119894 and 119896119895 belong to different tuples andthe set of semantic associative terms is 119879 If 119896119898 is any elementin119879 then the intersimilarity between 119896119894 and 119896119895 via associativeterm 119896119898 is as shown in

simiout (119896119894 119896119895 | 119896119898)= min simiin (119896119894 119896119898) simiin (119896119895 119896119898)

(4)

The intersimilarity between 119896119894 and 119896119895 is as shown in

simiout (119896119894 119896119895)

=

sumforall119896119898isin119879 simiout (119896119894 119896119895 | 119896119898)

|119879| 119894 = 1198951 119894 = 119895

(5)

Example 3 We take the DBLP database in Table 1 forexample as well Example 2 leads us to know that theintrasimilarity between keywords si andml is simiin(siml) =0 They have the intersimilarity via keywords qe and dm

where simiin(si qe) gt 0 simiin(ml qe) gt 0 simiin(si dm) gt0 and simiin(ml dm) gt 0 Thus the intersimilarity betweensi and ml via qe is as follows

simiout (siml | qe)= min simiin (si qe) simiin (ml qe) = 03 (6)

The intersimilarity between si andml via dm is as follows

simiout (siml | dm)= min simiin (si dm) simiin (ml dm) = 03 (7)

In conclusion the intersimilarity between si and ml is asfollows

simiout (siml) = sumforall119896119898isinqedm simiout (siml | 119896119898)1003816100381610038161003816qe dm1003816100381610038161003816

= 03(8)

The intersimilarity matrix of DBLP database in Table 1 isas follows

Mout =

[[[[[[[[[[[[[[[[[[


]]]]]]]]]]]]]]]]]]

(9)

(c) Construction of Term Similarity Matrix Formula (10)integrates the intrasimilarity and intersimilarity to calculatethe similarity between any two keywords

simi (119896119894 119896119895) = 120572 simiin (119896119894 119896119895)+ (1 minus 120572) simiout (119896119894 119896119895)

(10)

where 120572 isin [0 1] is the balance factor to adjust the contri-bution of two similarities to the final results From the resultof parameter setting experiment in Section 42 the precisionof term similarity calculation reaches the maximum when 120572equals 05 So we can get the following term similarity matrixof DBLP database in Table 1

M =

[[[[[[[[[[[[[[[[[[


]]]]]]]]]]]]]]]]]]

(11)


Input Keyword query 119876 = 1198961 1198962 119896119898 list of related keywords 1198961119894 1198962119894 119896119899119894 1 le 119894 le 119898Out 1198761 1198762 119876119896 where 119876119894 = (1199021198941 1199021198942 119902119894119898)Method(1) Delete the redundant related keywords

11989610158401 11989610158402 1198961015840119892 larr 11989611 11989621 1198961198991 cup 11989612 11989622 1198961198992 cup sdot sdot sdot cup 1198961119898 1198962119898 119896119899119898 (119892 le 119898 sdot 119899)(2) Build Viterbi model 120582 = (119860 119861 120587)

119860119892119892 = 119886119894119895 1 le 119894 119895 le 119892 119886119894119895 = simi(1198961015840119894 1198961015840119895)119861119892119898 = 119887119894119895 1 le 119894 le 119892 1 le 119895 le 119898 119887119894119895 = simi(1198961015840119894 119896119895)120587119894 = 119875(1198961015840119894 ) = simi(1198961015840119894 1198961)(3) InitializeVariable 120575119905(119894) is the maximum probability in all paths whose state is 119894 at time 119905Variable 120595119905(119894) the 119894 minus 1th node of path with the maximum probability in state 119894 at time 1199051205751 (119894) = 120587119894119887119894 (1198961) 119894 = 1 2 1198921205951(119894) = 0 119894 = 1 2 119892

(4) LoopFor 119905 = 2 3 119898120575119905 (119894) = max

1le119895le119892[120575119905minus1 (119895) 119886119895119894] 119887119894 (119896119905) 119894 = 1 2 119892

120595119905 (119894) = arg max1le119895le119892

[120575119905minus1 (119895) 119886119895119894] 119894 = 1 2 119892(5) End

Set 1198751198751 1198752 119875119896 = top-1198961le119894le119892120575119898(119894)Set 1199021198981199021119898 1199022119898 119902119896119898 = arg top-1198961le119894le119892[120575119898(119894)](6) Paths backtrackingFor 119905 = 119898 minus 1119898 minus 2 11199021119905 = 120595119905+1(1199021119905+1) sdot sdot sdot 119902119896119905 = 120595119905+1(119902119896119905+1)(7) Return top-119896 keyword queries1198761 = (11990211 11990212 1199021119898) sdot sdot sdot 119876119896 = (1199021198961 1199021198962 119902119896119898)

Algorithm 1 Probabilistic recommendation algorithm

When a user submits a query119876 = 1198961 1198962 119896119898 we canget the keyword lists 1198961119894 1198962119894 119896119899119894 1 le 119894 le 119898 related to theoriginal query according to the term similarity matrixM

322 Probabilistic Recommendation Algorithm This sectioncomes up with a probabilistic recommendation algorithmWe build the Viterbi model using dynamic programming andgenerate the query list related to query input as shown inAlgorithm 1

33 Query Interpretation As a fuzzy querymethod keywordquery cannot reflect query intention accurately This sectiontranslates the keyword query 119876 = (1199021 1199022 119902119898) to a set ofquery subgraphs1198761198661198781198761198661 1198761198662 119876119866119899 by the translationalgorithm of triples and construction algorithm of query sub-graphs where 119876119866119894 is the schema subgraph Compared withthe keyword query the query subgraph not only contains thecontent information but also carries structural and semanticinformation Thus it can more accurately reflect the usersrsquoquery intention

331 Translation from Keywords to Triples In order toidentify usersrsquo query intention we should know exactly therole of a keyword in databases that is to know whether itis Metadata or content data Thus each keyword should beextended to include the table name attribute names andattribute values of the table where the keyword is located

Table 2 Table names statistics

Keywords Table namespaper Papersdot sdot sdot sdot sdot sdot

Table 3 Attribute names statistics

Keywords Attribute namesname PaperNamesdot sdot sdot sdot sdot sdot

Table 4 Attribute values statistics

Keywords Attribute valuesdatabase PaperName1198751sdot sdot sdot sdot sdot sdot

The paper proposes the triple structure 119879119903119894(119879119873119860119873119860119881)to describe the above information Before introducing thetranslation algorithm of triples we first define the followingthree statistics tables statistics table 119879119873 of table namesstatistics table 119860119873 of attribute names and statistics table 119860119881of attribute values They make statistical analysis on the tablenames attribute names and attribute values respectivelyThese three statistics tables have similar structures as shownin Tables 2 3 and 4


By analyzing the statistics tables it is obvious that everykeyword in query 119876 may correspond to several differentsemantic interpretations namely keyword 119896119894 can be trans-lated to a set of triples Query 119876 is also translated to severalsets of triples In most relational databases the numbers oftable names and attribute names are far less than the numberof attribute values so most ambiguities occur during thetranslation process of attribute values Therefore when weconduct the translation we first try to match the keywordsto the table names and attribute names Then we match theremaining keywords to the attribute values to get the final setsof triples Algorithm 2 shows the details of this translationprocess

Lines (1)ndash(9) (10)ndash(24) and (25)ndash(39) in Algorithm 2respectively match the keywords in query 119876 to statistics TNof table names statistics AN of attribute names and statisticsAVof attribute values Triples corresponding to each keywordin query119876 can be obtained As each keywordmay correspondto more than one attribute name or attribute value lines(40)ndash(46) and (47)ndash(52) handle these cases respectivelyIt makes the different attribute names or attribute valuescorresponding to the same keyword translated to multiplesets of triples All possible sets of triples are generated by thealgorithm

332 Query Subgraphs Construction Before the construc-tion of query subgraphs we should combine the triples in119879119903119894119878[119894] = (1198791199031198941198941 1198791199031198941198942 119879119903119894119894119898) The paper makes thefollowing merger rules

We assume that 119879119903119894119878[119894] is the set of triples correspondingto query 119876 119879119903119894119894119895 isin 119879119903119894119878[119894] will be added to a new group if119879119903119894119894119895 satisfied the following three cases

Case 1 The table name of 119879119903119894119894119895 is different from all the tablenames of 119879119903119894119894119896 isin 119879119903119894119878[119894] 119896 isin [1 119895 minus 1]Case 2 The table name of 119879119903119894119894119895 is the same as the one of119879119903119894119894119896 isin 119879119903119894119878[119894] 119896 isin [1 119895 minus 1] but the attribute name andattribute value of 119879119903119894119894119895 are nullCase 3 The table name and attribute name of 119879119903119894119894119895 are thesame as the ones of 119879119903119894119894119896 isin 119879119903119894119878[119894] 119896 isin [1 119895 minus 1] but theattribute of 119879119903119894119894119895 is single-valued

These merger rules are intended to add the triplescorresponding to different entities into different groups sothat the interpretation process of triples is further refinedAfter the translation from keywords to triples and themergers of triples this paper translates the merged triples119879119903119894119866[119894](1198921198941 1198921198942 119892119894119899) to query subgraphs where forall119892119894119895 isin119879119903119894119866[119894] contains all the triples belonging to the same entityAlgorithm 3 describes the construction process of querysubgraphs in detail

Lines (1)-(2) of Algorithm 3 initialize variables Lines(3)ndash(7) create node V1015840119895 for the triple group 119892119894119895 in triple queries119879119903119894119866[119894](1198921198941 1198921198942 119892119894119899) and add it to the query subgraph119876119866 Line (8) traverses the schema graph 119866 to match theminimal subgraph 119878119866 corresponding to the above nodesLines (9)-(10) and (11)-(12) add the intermediate nodes

Table 5 Description of attributes in datasets

Datasets Number of tuples Number of relationshipsDBLP 43M 165M

and edges of 119878119866 to query subgraph 119876119866 respectively Lines(13)-(14) return the set 119876119866119878 of query subgraphs

4 Results and Discussion

Our experiments are conducted on the real dataset DBLP[29] The experiments mainly deal with the selection ofbalance factor 120572 in the calculation process of term similarityand the performance evaluation of our query expansionmethod ReInterpretQE Section 41 introduces the datasetquery sets and experimental environment used in theexperiment Section 42 compares the precisions of the termsimilarity calculation with different parameters to choose theoptimal value of 120572 Section 43 gives contrast experimentsusing Metadata [3] and 119870-coupling [11] as the baselines Theperformance of these algorithms is evaluated respectivelyPerformance metrics include precision recall and 119865-scoreThe experiment is used to verify the performance of themethod ReInterpretQE

41 Experimental Setup

411 Dataset The paper uses the DBLP dataset releasedin March 2015 [29] The address of downloading ishttpdblpuni-trierde The main statistics of the dataset areas shown in Table 5 DBLP is a computer bibliography datasetwidely used for query expansion in relational databasesThe dataset records the information about papers publishedby scholars Its original form is XML and we use the JavaSAX API to parse the XML file Then we can obtain fivedata tables where tables Author Paper and Conferencecontain information about scholars papers and conferencesrespectively Tables Cite and Write are relationship tablesThe former specifies the reference relationships and thelatter contains the writing relationships between scholarsand papers Figure 4 shows a sample of five tables fromthe DBLP dataset The database DBLP contains a numberof tuples They have semantic relevance in content andprimary-foreign key relationships in structure So the DBLPdataset is very appropriate for testing the performance of ourquery expansion method

412 Query Sets In the experiment we invite researchersto choose keywords from DBLP dataset and then build thekeyword query they want to perform By this method 6 setsof querieswith length ranging from 1 to 6 are obtained to formthe query sets Each set contains 10 queries According to theabove method the researchers constantly submit queries inan extensive scope andwe collect 600 queries from researchesas the query history

413 Experimental Environment Our experiments are per-formed on a computer running Windows 10 with Intel(R)


Input 119876 = (1199021 1199022 119902119898)Output 119879119903119894119878[1] 119879119903119894119878[2] 119879119903119894119878[119894] 119879119903119894119878[119894] = (1198791199031198941198941 1198791199031198941198942 119879119903119894119894119898)Method(1) if (119876 is not Null) then(2) for 119894 = 1 to |119876| do(3) scan Table 119879119873(4) if (matched(119902119894) == TRUE) then(5) (119902119894 119899119906119897119897 119899119906119897119897)add To(119879119873119878)(6) 119876remove(119902119894)(7) end if(8) end for(9) end if(10) if (119876 is not Null) then(11) for 119895 = 1 to |119876| do(12) scan Table 119860119873(13) IsAN = FLASE(14) foreach ANi in AN do(15) if (matched(119902119895) == TRUE) then(16) (119902119895119879119873 119902119895 119899119906119897119897)add To(119860119873119878119895)(17) IsAN = TRUE(18) end if(19) end foreach(20) if (IsAN == TRUE) then(21) 119876remove(119902119895)(22) end if(23) end for(24) end if(25) if (119876 is not Null) then(26) for119898 = 1 to |119876| do(27) scan Table AV (28) IsAV = FLASE(29) foreach AVi in AV do(30) if (matched(119902119898) == TRUE) then(31) (119902119898119879119873 119902119898119860119873 119902119898)add To(AVSj)(32) IsAN = TRUE(33) end if(34) end foreach(35) if (IsAN == TRUE) then(36) 119876remove(119902119898)(37) end if(38) end for(39) end if(40) 119879119903119894119878[1] = (1198791198731198781 1198791198731198782 119879119873119878|119879119873119878| 1 1 1)(41) for 119896 = 1 to |119860119873119878| do(42) Lth = Length(119879119903119894119878[ ])(43) for 119899 = 0 to |119860119873119878119896| minus 1(44) 119879119903119894119878[119899 lowast Lth + 1] 119879119903119894119878[119899 lowast Lth + 2] 119879119903119894119878[119899 lowast Lth + Lth] =

(119879119903119894119878[1] 119879119903119894119878[2] 119879119903119894119878[Lth]) lowast (1 1 1 119860119873119878119896119899 1 1)T119860119873119878119896119899 is the (|119879119873119878| + 119896)th(45) end for(46) end for(47) for 119896 = 1 to |119860119881119878| do(48) Lth = Length(119879119903119894119878[ ])(49) for 119899 = 0 to |119860119881119878119896| minus 1(50) 119879119903119894119878[119899 lowast Lth + 1] 119879119903119894119878[119899 lowast Lth + 2] 119879119903119894119878[119899 lowast Lth + Lth] =

(119879119903119894119878[119899 lowast Lth + 1] 119879119903119894119878[119899 lowast Lth + 2] 119879119903119894119878[119899 lowast Lth + Lth]) lowast (1 1 1 119860119881119878119896119899 1 1)T119860119881119878119896119899 is the (|119879119873119878| + |119860119873119878| + 119896)th(51) end for(52) end for(53) Result 119879119903119894119878[1] 119879119903119894119878[2] 119879119903119894119878[119894]

Algorithm 2 Translation algorithm of triples


a1a2a2a3a4

Aidp1p1p2p2p3

Pid

a1a2a3a4a5

AidMiller

Tim LeekJosh Hodosh

A BalminSunil

Name

c1c2c3c4c5

CidACMSIGIRVLDBICDE

SIGMOD

Name

p1p3

Pidp4p5

p1

CitedPid

p2p3p4p5

PidA hidden Markov model information retrieval system

Tappan Zee (north) bridge mining memory accesses for introspectionKeyword proximity search on XML graphs

Schema-free XQueryFormal models for expert finding in enterprise corpora

19992013200320062004

Yearc2c1

Formal models for expert finding in enterprise corpora

c4c1c3

Cid

Figure 4 Example of relational database DBLP

Input Merged triples 119879119903119894119866[119894](1198921198941 1198921198942 119892119894119899) Schemagraph 119866Output Set of Query Subgraphs 119876119866119878Method(1) 119876119866119878 larr 120601 1198611198731015840 larr 120601 119861119873 larr 120601(2) Let 119876119866 be a query subgraph(3) for 119895 = 1 to |119879119903119894119866[119894]| do(4) Create a node V1015840119895 for group of 119892119894119895(5) Let V1015840119895 corresponds to V119895 in 119866(6) 1198611198731015840 = 1198611198731015840 cup V1015840119895 119861119873 = 119861119873 cup V119895(7) Insert V1015840119895 into 119876119866(8) 119878119866 = 119891119894119899119889119878119906119887119892119903119886119901ℎ(119861119873119866)(9) foreach intermediate node 119894V in 119878119866 do(10) Create a node 119894V1015840 and insert it into 119876119866(11) foreach edge 119890(119894V119909 119894V119910) in 119878119866 do(12) Create an edge 119890(119894V1015840119909 119894V1015840119910) in 119876119866(13) Add 119876119866 into 119876119866119878(14) Return 119876119866119878

Algorithm 3 Query subgraph construction algorithm

Core(TM) i5-4570 CPU 320GHz 4GB of RAM and 1 TBDisk All the algorithms are implemented in Java

42 Parameter Setting The experiment in this section is toevaluate the impact of120572 on the precision of calculation resultsand provides us guidelines in choosing a good value of 120572First we randomly select 8 keywords from the query sets inSection 41 For each keyword 119896119894 we can get the correspond-ing top-6 related keywords by formula (10) Further we canget 11 different sets of results by adjusting the parameter 120572from 0 to 1 Second we integrate the related keywords inthe above results to get the set 119870119894 (set size le 66) Finallywe calculate the cooccurrence rate of keywords in set 119870119894and 119896119894 Mark the keywords ranked top-10 as real set relatedto 119896119894 For keyword 119896119894 and based on real set the precisionsunder different 120572 are as follows the ratio between the numberof keywords related to 119896119894 through formula (10) and thetotal number of the received keywords (13) The precisions

0

02

04

06

08

1

Prec

ision

01 02 03 04 05 06 07 08 09 10

Figure 5 Precision of results for different values of 120572

corresponding to 8 keywords are summed and averaged toget the final precision Figure 5 illustrates how the precisionis adjusted by the parameter 120572

As shown from Figure 5 the precision of the termssimilarity calculation is the maximum 087 when 120572 equals05 So we choose the parameter 120572 = 05 for the followingexperiments

43 Performance Study In this subsection we report theperformance of the query expansion method ReInterpretQEin comparisonwith the state-of-the-art approachesMetadata[3] and 119870-coupling [11] The performance is measured bythree evaluation metrics precision recall and 119865-score Acorresponding SQL statement is generated for each query inthe query set The results obtained through SQL statementare perceived as real query results and added in the test setTest Given the result set Result and real result set Test theprecision recall and 119865-score can be calculated as follows

precision = Result cap TestResult

(12)

recall = Result cap TestTest

(13)

119865-score = 2 times precision times recallprecision + recall

(14)


MetadataK-coupling

ReInterpretQE

0

02

04

06

08

1

Prec

ision

2 3 4 5 61Number of keywords

Figure 6 Precision comparison on DBLP database

0

02

04

06

08

1

Reca

ll


MetadataK-coupling

ReInterpretQE

Figure 7 Recall comparison on DBLP database

Figures 6 7 and 8 show the comparisons in terms ofprecision recall and 119865-score Each data point on the 119909-axisof Figures 6 7 and 8 corresponds to the number of keywordsThe 119910-axis presents the corresponding precision recall and119865-score

As we can see in Figure 6 the query expansion methodReInterpretQE proposed in this paper achieves much higherprecision than the methods such as Metadata and 119870-coupling significantly For example the average precision ofReInterpretQE reaches 081 209 and 266 higher thanMetadata and119870-coupling respectively Specifically when thenumber of keywords is 4 the precision of ReInterpretQEis 076 while the precision of Metadata and 119870-couplingis 064 and 062 The ReInterpretQE method increases theprecision by 188 and 226 compared with Metadata and119870-coupling respectively Overall the precision of Metadatamethod is little higher than 119870-coupling And the precisionof ReInterpretQE is improved obviously compared with theother two This comparison shows the significance of ourproposed query interpretation which can help to describethe semantics of keyword queries and thus significantlyimprove the query precision More specifically the reasonfor the poor performance of 119870-coupling as compared toMetadata and ReInterpretQE is that 119870-coupling methodfocuses on identifying a set of keyword queries related to

0

02

04

06

08

1

F-sc

ore


MetadataK-coupling

ReInterpretQE

Figure 8 119865-score comparison on DBLP database

the given keyword query The expanded queries obtained bythe119870-coupling method are still keyword queries without thestructure information between keywords Yet the inherentambiguity of keyword queries may directly affect the queryprecision Methods Metadata and ReInterpretQE transformthe initial keyword query into SQL and query subgraphsrespectively The expansions of both methods contain struc-tural information which is helpful to improve the queryprecision The reason for effective and improved expansionof ReInterpretQE over Metadata is that Metadata does notconsider themultiple connection between keywords and var-ious explanations for a keyword query while ReInterpretQEtranslates the keyword query to a set of query subgraphs withstructural and semantic information through the translationalgorithm of triples and construction algorithm of subgraphsThe subgraphs can locate the query results more precisely

To further evaluate the performance of our method wevary different numbers of input keywords and compare thecorresponding recalls Figure 7 plots the average recall of thedifferent query expansion methods The evaluation resultsshow that ReInterpretQE generally produces expansion ofhigher recall compared to the other two suggesting thatquery recommendation is crucial to obtain good perfor-manceMore precisely when the number of keywords is 4 therecall of Metadata and 119870-coupling is 063 and 073 respec-tively while the one of ReInterpretQE is 080 As expectedthe recall of ReInterpretQE is approximately 96higher thanthat of the 119870-coupling method and it is even higher whencompared with the other method Metadata We observe thatthe methods 119870-coupling and ReInterpretQE beat Metadatasignificantly This is so because the Metadata method doesnot take similar words and related items into account whilethe methods 119870-coupling and ReInterpretQE deal with theproblem accordingly which can help to progressively andefficiently make query expansion and thus lead to higherquery recall ReInterpretQE always generates better resultsthan 119870-coupling For example ReInterpretQE achieves 082average recall which leads to about 51 over 119870-couplingThe reason for this phenomenon is that 119870-coupling onlyuses keyword coupling relationship matrix to analyze theoriginal query while ReInterpretQE conducts the similar


words and related items expansion for every keyword andexpands the query to a query list using probabilistic recom-mendation algorithm ReInterpretQE builds Viterbi modelusing dynamic programming and generates the query listrelated to query input After this operation the query resultscan include more complete and comprehensive informationAnd the recall of results has been further improved

Figure 8 further illustrates that the query expansionmethod ReInterpretQE outperforms the baseline methodsthrough the comparison analysis of these methods on 119865-score The overall trend is clear At all thresholds we evalu-ated the results produced by ReInterpretQE are significantlybetter than the other two methods For example whenthe number of keywords is 4 the 119865-score of Metadataand 119870-coupling is 063 and 067 respectively while the 119865-score of ReInterpretQE is 078 The ReInterpretQE methodincreases the 119865-score by 238 and 164 compared withMetadata and 119870-coupling We investigated the reasons thatReInterpretQE has higher performances than the baselinemethods On the one hand ReInterpretQE calculates theterm similarity considering both intrasimilarity and inter-similarity Thus the similarity calculation between terms ismore reasonableThenAlgorithm 1 probabilistic recommen-dation algorithm is proposed based on the term similarity Itconstructs the Viterbi model using dynamic programmingwhich can improve the query recall On the other handReInterpretQE designs Algorithm 2 translation algorithm oftriples to perform the translation from keywords to triplesThen Algorithm 3 query subgraph construction algorithmis used to transform the triples to query subgraphs In theconstruction of the query subgraphs ReInterpretQE not onlyconsiders the expansion in the structure and content ofkeyword query but also considers the various explanationsfor a keyword query So the query precision is furtherimproved

Summary Based on this observation we realize that ReInter-pretQE indeed boosts the performance of query expansionand has a clear positive effect on quality of query resultsThusReInterpretQE can be considered as a quite effective andpracticable algorithm for query expansion

5 Conclusions

Aiming at addressing the problems of semantic fuzziness andexpression limits the paper proposes a novel two-step queryexpansion method ReInterpretQE The method translatesthe keyword query to query subgraphs with potential struc-tural and semantic information Compared with the tradi-tional methods the method completes the query expansionand analysis only depending on the structure and contentinformation of databases without the requirement of querylogs In addition the method uses query recommendationand query interpretation to balance the precision and recall ofquery results Finally experimental results on DBLP datasetverify the effectiveness of the proposed method There aremany open questions in the research of query expansion inrelational databases In the future work we will make furtherresearch and discussion on it For instance we will take into

account the influence of feedback on the performance ofquery expansion

Conflicts of Interest

The authors declare that they have no conflicts of interest

Authorsrsquo Contributions

Yingqi Wang and Nianbin Wang conceived and designedthe experiments Lianke Zhou performed the experimentsLianke Zhou analyzed the data YingqiWangwrote the paper

Acknowledgments

Thiswork is sponsored by theNationalNatural Science Foun-dation of China under Grant nos 61272185 and 61502037the Fundamental Research Funds for the Central Univer-sities (no HEUCF160602) the Natural Science Founda-tion of Heilongjiang Province of China under Grant nosF201340 and F201238 and the Basic Research Project (nosJCKY2016206B001 and JCKY2015206C002)

References

[1] S Bergamaschi F Guerra and G Simonini ldquoKeyword searchover relational databases issues approaches and open chal-lengesrdquo in Proceedings of the 2013 PROMISE Winter SchoolBridging Between Information Retrieval and Databases pp 54ndash73 Bressanone Italy

[2] J Park and S-G Lee ldquoKeyword search in relational databasesrdquoKnowledge and Information Systems vol 26 no 2 pp 175ndash1932011

[3] S Bergamaschi E Domnori F Guerra R Trillo Lado andY Velegrakis ldquoKeyword search over relational databases Ametadata approachrdquo in Proceedings of 2011 ACM SIGMOD and30th PODS 2011 Conference pp 565ndash576 Athens Greece June2011

[4] Z Zeng Z BaoM L Lee andTW Ling ldquoA semantic approachto keyword search over relational databasesrdquo in Proceedings ofthe 32nd International Conference on Conceptual Modeling pp241ndash254 Hong Kong 2013

[5] M Kargar A An N Cercone P Godfrey J Szlichta and XYu ldquoMeaningful keyword search in relational databases withlarge and complex schemardquo in Proceedings of 2015 31st IEEEInternational Conference on Data Engineering (ICDE rsquo15) pp411ndash422 Seoul Republic of Korea April 2015

[6] L Zhang T Tran and A Rettinger ldquoProbabilistic query rewrit-ing for efficient and effective keyword search on graph datardquoProceedings of theVLDBEndowment vol 6 pp 1642ndash1653 2013

[7] Z Zeng Z Bao T Wang L Mong and L Lee ldquoISearch aninterpretation based framework for keyword search in rela-tional databasesrdquo in Proceedings of 3rd International Workshopon Keyword Search on Structured Data (KEYS rsquo12) pp 3ndash10Scottsdale Ariz USA May 2012

[8] D Yang D-R Shen G Yu Y Kou and T-Z Nie ldquoQueryintent disambiguation of keyword-based semantic entity searchin dataspacesrdquo Journal of Computer Science and Technology vol28 no 2 pp 382ndash393 2013

[9] J Zhai D Shen Y Kou and T Nie ldquoRichly semantical keywordsearching over relational databasesrdquo in Proceedings of 9th Web


Information Systems and Applications Conference (WISA rsquo12)pp 211ndash216 Haikou China November 2012

[10] J Yao B Cui L Hua and Y Huang ldquoKeyword query refor-mulation on structured datardquo in Proceedings of IEEE 28thInternational Conference on Data Engineering (ICDE rsquo12) pp953ndash964 Arlington Va USA April 2012

[11] X Meng L Cao and J Shao ldquoSemantic approximate keywordquery based on keyword and query coupling relationshipanalysisrdquo in Proceedings of 23rd ACM International Conferenceon Information and Knowledge Management (CIKM rsquo14) pp529ndash538 Shanghai China November 2014

[12] V Ganti Y He and D Xin ldquoKeyword++ a framework toimprove keyword search over entity databasesrdquo Proceedings ofthe VLDB Endowment vol 3 pp 711ndash722 2010

[13] J X Yu L Qin and L Chang ldquoKeyword search in relationaldatabases a surveyrdquo IEEE Data Engineering Bulletin vol 33 pp67ndash78 2010

[14] Y ChenWWang Z Liu andX Lin ldquoKeyword search on struc-tured and semi-structured datardquo in Proceedings of InternationalConference on Management of Data and 28th Symposium onPrinciples of Database Systems (SIGMOD-PODS rsquo09) pp 1005ndash1009 Providence RI USA July 2009

[15] Y Chen W Wang and Z Liu ldquoKeyword-based search andexploration on databasesrdquo in Proceedings of 2011 IEEE 27thInternational Conference on Data Engineering (ICDE rsquo11) pp1380ndash1383 Hannover Germany April 2011

[16] S Agrawal S Chaudhuri and G Das ldquoDBXplorer a system forkeyword-based search over relational databasesrdquo in Proceedingsof 18th International Conference on Data Engineering pp 5ndash16San Jose Calif USA March 2002

[17] V Hristidis and Y Papakonstantinou ldquoDiscover keywordsearch in relational databasesrdquo in Proceedings of the 28th inter-national conference on Very Large Data Bases pp 670ndash681Hong Kong 2002

[18] LQin J X Yu and L Chang ldquoKeyword search in databases thepower of RDBMSrdquo in Proceedings of International Conferenceon Management of Data and 28th Symposium on Principlesof Database Systems (SIGMOD-PODSrsquo09) pp 681ndash693 Provi-dence RI USA July 2009

[19] G Li J Feng and L Zhou ldquoRetune retrieving and material-izing tuple units for effective keyword search over relationaldatabasesrdquo in Proceedings of the 27th International Conferenceon Conceptual Modeling pp 469ndash483 Barcelona Spain 2008

[20] Y Luo X Lin W Wang and X Zhou ldquoSpark top-k keywordquery in relational databasesrdquo in Proceedings of SIGMOD 2007ACM SIGMOD International Conference on Management ofData pp 115ndash126 Beijing China June 2007

[21] F Liu C Yu W Meng and A Chowdhury ldquoEffective keywordsearch in relational databasesrdquo in Proceedings of 2006 ACMSIGMOD International Conference on Management of Data pp563ndash574 Chicago Ill USA June 2006

[22] G Bhalotia A Hulgeri C Nakhe S Chakrabarti and S Sudar-shan ldquoKeyword searching and browsing in databases usingBANKSrdquo in Proceedings of the 18th International Conference onData Engineering pp 431ndash440 San Jose Calif USA March2002

[23] HHeHWang J Yang and P S Yu ldquoBLINKS ranked keywordsearches on graphsrdquo in Proceedings of the ACM SIGMOD Inter-national Conference onManagement of Data (SIGMOD rsquo07) pp305ndash316 ACM Beijing China June 2007

[24] BDing J X Yu SWang LQin X Zhang andX Lin ldquoFindingtop-k min-cost connected trees in databasesrdquo in Proceedings of

23rd International Conference on Data Engineering (ICDE rsquo07)pp 836ndash845 Istanbul Turkey April 2007

[25] B Kimelfeld and Y Sagiv ldquoEfficiently enumerating results ofkeyword search over data graphsrdquo Information Systems vol 33no 4-5 pp 335ndash359 2008

[26] V Hristidis H Hwang and Y Papakonstantinou ldquoAuthority-based keyword search in databasesrdquo ACM Transactions onDatabase Systems vol 33 no 1 article no 1 pp 24ndash63 2008

[27] X Yu and H Shi ldquoCI-rank ranking keyword search resultsbased on collective importancerdquo in Proceedings of IEEE 28thInternational Conference on Data Engineering (ICDE rsquo12) pp78ndash89 Arlington Va USA April 2012

[28] R Jones B Rey O Madani andW Greiner ldquoGenerating querysubstitutionsrdquo in Proceedings of the 15th International Confer-ence on World Wide Web pp 387ndash396 Edinburgh UK May2006

[29] DBLP httpdblpuni-trierde

Submit your manuscripts athttpswwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Distributed Sensor Networks


Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014


ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014


Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Electrical and Computer Engineering

Journal of

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia


Biomedical Imaging


Advances in


RoboticsJournal of



Computational Intelligence and Neuroscience

Industrial EngineeringJournal of


Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014


Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in




ArthurSamuel

Markov

Figure 1 An expanded query

Samuelrdquo in DBLP to retrieve the paper ldquoMachine learningrdquopublished by Arthur Samuel The query expansion containsldquoMachine learningArthur Samuel SIGMODrdquo using the abovequery expansion method The expansion extends the querywith the content but lacks structure information Conse-quently all tuples containing ldquoMachine learningrdquo ldquoArthurSamuelrdquo or ldquoSIGMODrdquo will be returned Obviously suchresults are not precise enough and many of them may not beuseful Themethod proposed in this paper extends the initialquery with content and structure related to the keywords Anexpanded query is as shown in Figure 1

It can express the potential semantic and structuralinformation of original keyword query find out the paperldquoMachine learningrdquo published in SIGMOD and authored byArthur Samuel simultaneously find out the paper ldquoMarkovrdquowhich cites paper ldquoMachine learningrdquo and authored byArthur Samuel In contrast to the query expansion in [11]themethod proposed in this paper extends the original queryto query subgraphs with underlying structure of databasesand thus further improves the query precision On the otherhand Ganti et al [12] translates keyword queries into SQLbased on the materialized mappings Bergamaschi et al [3]propose aMetadatamethod to translate keyword queries intoSQL based on Munkres algorithm Though these methodsdescribe usersrsquo query intention to a degree they do not takeinto account similar words and related items extension Thusthese methods have relative high query precision but theirrecall needs to be further improved For example assumethat a user hopes to study the methods of data analysisBecause of his knowledge limitation the user accesses thedatabase DBLP and submits a keyword query ldquoMachinelearning Arthur Samuelrdquo and the expanded query obtainedvia the above methods is as follows select R3Title fromAuthor R1 PaperAuthor R2 Paper R3 where R2Pid=R3Pidand R2Aid=R1Aid and match(R3Title) against (ldquoMachinelearningrdquo in Boolean mode) and match(R1Name) against(ldquoArthur Samuelrdquo in Boolean mode) while the user may alsobe intending to retrieve papers which have similar or relevanttopics to ldquoMachine learningrdquo such as paper ldquodata miningrdquopublished by Micheline Kamber Since paper ldquodata miningrdquocites paper ldquoMachine learningrdquo and they are much relatedto each other the user is also interested in it The resultsof the query expansion method proposed in this paper areas shown in Figure 2 Compared with the previous resultthese results not only have structural relationship betweenkeywords but also contain the related or similar queries withquery keywords

The analysis and examples of the query expansion meth-ods in Section 1 illustrate that the key challenge here is todevelop an approach which balances both query precision


ArthurSamuel

Markov

SIGIRData

miningMichelineKamber

Machinelearning

ArthurSamuel

Figure 2 An expanded query of ReInterpretQE

and recall In this paper we focus on the problem of howto tackle the above two limitations and then improve theperformance of keyword query expansion approach Wepropose a novel query expansion method ReInterpretQEbased on query recommendation and interpretation whichextends a keyword query to a list of query subgraphs Thesesubgraphs could better capture usersrsquo information need andthe possible semantics of the keyword query and then guideusers to explore related tuples in the relational database Firstwe construct a term similarity matrix based on the tupleinformation contained in relational databases Second wemake the query recommendation using the similarity matrixand dynamic programming to get a query list This query listconsists of top-119896 queries related to the initial query Finallywe transform the keyword query into query subgraphsThrough the query recommendation and interpretation thequery expansionmethod improves the recall and precision ofquery results

The main contributions of this paper are summarized asfollows

(1) We present a keyword query expansion paradigmReInterpretQE in relational databases which is basedon query recommendation and interpretation

(2) We design a probabilistic recommendation algorithmbased on similarity matrix and dynamic program-ming

(3) We propose a keyword query interpretation methodIt uses statistical information and schema graph ofdatabase for translating a keyword query to querysubgraphs

(4) We conduct extensive experiments on the DBLPdataset and experimental results demonstrate theeffectiveness and feasibility of the proposed method

The paper is organized as follows Section 1 introducesthe research motivation main contribution and structureof this paper Section 2 reviews the related work Section 3describes the architecture of approach ReInterpretQE andthen provides details of algorithms in query recommendationand query interpretation Section 4 conducts the experimentson a real dataset and compares the experimental resultsSection 5 concludes this paper and prospects the study infuture


2 Related Work











Database



Query listQuery

interpretation

Querysubgraphs


atrix MTerm

similarity

Keywordquery Q

Results










simiin (119896119894 119896119895)

=

119869 (119896119894 119896119895) =10038161003816100381610038161003816TS (119896119894) cap TS (119896119895)1003816100381610038161003816100381610038161003816100381610038161003816TS (119896119894) cup TS (119896119895)10038161003816100381610038161003816

119894 = 1198951 119894 = 119895

(1)







= 03(2)


Min =

[[[[[[[[[[[[[[[[[[


]]]]]]]]]]]]]]]]]]

(3)




(4)


simiout (119896119894 119896119895)

=


|119879| 119894 = 1198951 119894 = 119895

(5)








= 03(8)


Mout =

[[[[[[[[[[[[[[[[[[


]]]]]]]]]]]]]]]]]]

(9)



(10)


M =

[[[[[[[[[[[[[[[[[[


]]]]]]]]]]]]]]]]]]

(11)





(4) LoopFor 119905 = 2 3 119898120575119905 (119894) = max

1le119895le119892[120575119905minus1 (119895) 119886119895119894] 119887119894 (119896119905) 119894 = 1 2 119892

120595119905 (119894) = arg max1le119895le119892

[120575119905minus1 (119895) 119886119895119894] 119894 = 1 2 119892(5) End





































a1a2a2a3a4

Aidp1p1p2p2p3

Pid

a1a2a3a4a5

AidMiller

Tim LeekJosh Hodosh

A BalminSunil

Name

c1c2c3c4c5

CidACMSIGIRVLDBICDE

SIGMOD

Name

p1p3

Pidp4p5

p1

CitedPid

p2p3p4p5




19992013200320062004

Yearc2c1


c4c1c3

Cid






0

02

04

06

08

1

Prec

ision

01 02 03 04 05 06 07 08 09 10






(12)


(13)


(14)


MetadataK-coupling

ReInterpretQE

0

02

04

06

08

1

Prec

ision



0

02

04

06

08

1

Reca

ll


MetadataK-coupling

ReInterpretQE




0

02

04

06

08

1

F-sc

ore


MetadataK-coupling

ReInterpretQE








5 Conclusions







Acknowledgments


References








































Advances in

FuzzySystems


Volume 2014












Journal of



Advances in

Multimedia


Biomedical Imaging


Advances in


RoboticsJournal of










Advances in




2 Related Work











Database



Query listQuery

interpretation

Querysubgraphs


atrix MTerm

similarity

Keywordquery Q

Results










simiin (119896119894 119896119895)

=

119869 (119896119894 119896119895) =10038161003816100381610038161003816TS (119896119894) cap TS (119896119895)1003816100381610038161003816100381610038161003816100381610038161003816TS (119896119894) cup TS (119896119895)10038161003816100381610038161003816

119894 = 1198951 119894 = 119895

(1)







= 03(2)


Min =

[[[[[[[[[[[[[[[[[[


]]]]]]]]]]]]]]]]]]

(3)




(4)


simiout (119896119894 119896119895)

=


|119879| 119894 = 1198951 119894 = 119895

(5)








= 03(8)


Mout =

[[[[[[[[[[[[[[[[[[


]]]]]]]]]]]]]]]]]]

(9)



(10)


M =

[[[[[[[[[[[[[[[[[[


]]]]]]]]]]]]]]]]]]

(11)





(4) LoopFor 119905 = 2 3 119898120575119905 (119894) = max

1le119895le119892[120575119905minus1 (119895) 119886119895119894] 119887119894 (119896119905) 119894 = 1 2 119892

120595119905 (119894) = arg max1le119895le119892

[120575119905minus1 (119895) 119886119895119894] 119894 = 1 2 119892(5) End





































a1a2a2a3a4

Aidp1p1p2p2p3

Pid

a1a2a3a4a5

AidMiller

Tim LeekJosh Hodosh

A BalminSunil

Name

c1c2c3c4c5

CidACMSIGIRVLDBICDE

SIGMOD

Name

p1p3

Pidp4p5

p1

CitedPid

p2p3p4p5




19992013200320062004

Yearc2c1


c4c1c3

Cid






0

02

04

06

08

1

Prec

ision

01 02 03 04 05 06 07 08 09 10






(12)


(13)


(14)


MetadataK-coupling

ReInterpretQE

0

02

04

06

08

1

Prec

ision



0

02

04

06

08

1

Reca

ll


MetadataK-coupling

ReInterpretQE




0

02

04

06

08

1

F-sc

ore


MetadataK-coupling

ReInterpretQE








5 Conclusions







Acknowledgments


References








































Advances in

FuzzySystems


Volume 2014












Journal of



Advances in

Multimedia


Biomedical Imaging


Advances in


RoboticsJournal of










Advances in




Database



Query listQuery

interpretation

Querysubgraphs


atrix MTerm

similarity

Keywordquery Q

Results










simiin (119896119894 119896119895)

=

119869 (119896119894 119896119895) =10038161003816100381610038161003816TS (119896119894) cap TS (119896119895)1003816100381610038161003816100381610038161003816100381610038161003816TS (119896119894) cup TS (119896119895)10038161003816100381610038161003816

119894 = 1198951 119894 = 119895

(1)







= 03(2)


Min =

[[[[[[[[[[[[[[[[[[


]]]]]]]]]]]]]]]]]]

(3)




(4)


simiout (119896119894 119896119895)

=


|119879| 119894 = 1198951 119894 = 119895

(5)








= 03(8)


Mout =

[[[[[[[[[[[[[[[[[[


]]]]]]]]]]]]]]]]]]

(9)



(10)


M =

[[[[[[[[[[[[[[[[[[


]]]]]]]]]]]]]]]]]]

(11)





(4) LoopFor 119905 = 2 3 119898120575119905 (119894) = max

1le119895le119892[120575119905minus1 (119895) 119886119895119894] 119887119894 (119896119905) 119894 = 1 2 119892

120595119905 (119894) = arg max1le119895le119892

[120575119905minus1 (119895) 119886119895119894] 119894 = 1 2 119892(5) End





































a1a2a2a3a4

Aidp1p1p2p2p3

Pid

a1a2a3a4a5

AidMiller

Tim LeekJosh Hodosh

A BalminSunil

Name

c1c2c3c4c5

CidACMSIGIRVLDBICDE

SIGMOD

Name

p1p3

Pidp4p5

p1

CitedPid

p2p3p4p5




19992013200320062004

Yearc2c1


c4c1c3

Cid






0

02

04

06

08

1

Prec

ision

01 02 03 04 05 06 07 08 09 10






(12)


(13)


(14)


MetadataK-coupling

ReInterpretQE

0

02

04

06

08

1

Prec

ision



0

02

04

06

08

1

Reca

ll


MetadataK-coupling

ReInterpretQE




0

02

04

06

08

1

F-sc

ore


MetadataK-coupling

ReInterpretQE








5 Conclusions







Acknowledgments


References








































Advances in

FuzzySystems


Volume 2014












Journal of



Advances in

Multimedia


Biomedical Imaging


Advances in


RoboticsJournal of










Advances in






= 03(2)


Min =

[[[[[[[[[[[[[[[[[[


]]]]]]]]]]]]]]]]]]

(3)




(4)


simiout (119896119894 119896119895)

=


|119879| 119894 = 1198951 119894 = 119895

(5)








= 03(8)


Mout =

[[[[[[[[[[[[[[[[[[


]]]]]]]]]]]]]]]]]]

(9)



(10)


M =

[[[[[[[[[[[[[[[[[[


]]]]]]]]]]]]]]]]]]

(11)





(4) LoopFor 119905 = 2 3 119898120575119905 (119894) = max

1le119895le119892[120575119905minus1 (119895) 119886119895119894] 119887119894 (119896119905) 119894 = 1 2 119892

120595119905 (119894) = arg max1le119895le119892

[120575119905minus1 (119895) 119886119895119894] 119894 = 1 2 119892(5) End





































a1a2a2a3a4

Aidp1p1p2p2p3

Pid

a1a2a3a4a5

AidMiller

Tim LeekJosh Hodosh

A BalminSunil

Name

c1c2c3c4c5

CidACMSIGIRVLDBICDE

SIGMOD

Name

p1p3

Pidp4p5

p1

CitedPid

p2p3p4p5




19992013200320062004

Yearc2c1


c4c1c3

Cid






0

02

04

06

08

1

Prec

ision

01 02 03 04 05 06 07 08 09 10






(12)


(13)


(14)


MetadataK-coupling

ReInterpretQE

0

02

04

06

08

1

Prec

ision



0

02

04

06

08

1

Reca

ll


MetadataK-coupling

ReInterpretQE




0

02

04

06

08

1

F-sc

ore


MetadataK-coupling

ReInterpretQE








5 Conclusions







Acknowledgments


References








































Advances in

FuzzySystems


Volume 2014












Journal of



Advances in

Multimedia


Biomedical Imaging


Advances in


RoboticsJournal of










Advances in







(4) LoopFor 119905 = 2 3 119898120575119905 (119894) = max

1le119895le119892[120575119905minus1 (119895) 119886119895119894] 119887119894 (119896119905) 119894 = 1 2 119892

120595119905 (119894) = arg max1le119895le119892

[120575119905minus1 (119895) 119886119895119894] 119894 = 1 2 119892(5) End





































a1a2a2a3a4

Aidp1p1p2p2p3

Pid

a1a2a3a4a5

AidMiller

Tim LeekJosh Hodosh

A BalminSunil

Name

c1c2c3c4c5

CidACMSIGIRVLDBICDE

SIGMOD

Name

p1p3

Pidp4p5

p1

CitedPid

p2p3p4p5




19992013200320062004

Yearc2c1


c4c1c3

Cid






0

02

04

06

08

1

Prec

ision

01 02 03 04 05 06 07 08 09 10






(12)


(13)


(14)


MetadataK-coupling

ReInterpretQE

0

02

04

06

08

1

Prec

ision



0

02

04

06

08

1

Reca

ll


MetadataK-coupling

ReInterpretQE




0

02

04

06

08

1

F-sc

ore


MetadataK-coupling

ReInterpretQE








5 Conclusions







Acknowledgments


References








































Advances in

FuzzySystems


Volume 2014












Journal of



Advances in

Multimedia


Biomedical Imaging


Advances in


RoboticsJournal of










Advances in


























a1a2a2a3a4

Aidp1p1p2p2p3

Pid

a1a2a3a4a5

AidMiller

Tim LeekJosh Hodosh

A BalminSunil

Name

c1c2c3c4c5

CidACMSIGIRVLDBICDE

SIGMOD

Name

p1p3

Pidp4p5

p1

CitedPid

p2p3p4p5




19992013200320062004

Yearc2c1


c4c1c3

Cid






0

02

04

06

08

1

Prec

ision

01 02 03 04 05 06 07 08 09 10






(12)


(13)


(14)


MetadataK-coupling

ReInterpretQE

0

02

04

06

08

1

Prec

ision



0

02

04

06

08

1

Reca

ll


MetadataK-coupling

ReInterpretQE




0

02

04

06

08

1

F-sc

ore


MetadataK-coupling

ReInterpretQE








5 Conclusions







Acknowledgments


References








































Advances in

FuzzySystems


Volume 2014












Journal of



Advances in

Multimedia


Biomedical Imaging


Advances in


RoboticsJournal of










Advances in




MetadataK-coupling

ReInterpretQE

0

02

04

06

08

1

Prec

ision



0

02

04

06

08

1

Reca

ll


MetadataK-coupling

ReInterpretQE




0

02

04

06

08

1

F-sc

ore


MetadataK-coupling

ReInterpretQE








5 Conclusions







Acknowledgments


References








































Advances in

FuzzySystems


Volume 2014












Journal of



Advances in

Multimedia


Biomedical Imaging


Advances in


RoboticsJournal of










Advances in







5 Conclusions







Acknowledgments


References








































Advances in

FuzzySystems


Volume 2014












Journal of



Advances in

Multimedia


Biomedical Imaging


Advances in


RoboticsJournal of










Advances in

































Advances in

FuzzySystems


Volume 2014












Journal of



Advances in

Multimedia


Biomedical Imaging


Advances in


RoboticsJournal of










Advances in



keyword query expansion paradigm based on recommendation...

Documents