Graphinder Semantic Search Relational Keyword Search over Data Graphs Thanh Tran , Lei Zhang, Veli Bicer, Yongtao Ma Researcher: www.sites.google.com/site/kimducthanh Co-Founder: www.graphinder.com

Author: etta

Post on 24-Feb-2016


TRANSCRIPT


Graphinder Semantic Search: Relational Keyword Search over Data Graphs
Thanh Tran, Lei Zhang, Veli Bicer, Yongtao Ma
Researcher: www.sites.google.com/site/kimducthanh
Co-Founder: www.graphinder.com

Thanks for the introduction. During my time in Europe I enjoyed working with colleagues and friends from Yahoo! Research Barcelona a lot. Now I am very glad to have the opportunity to learn about Yahoo research activities in Santa Clara. Today, I give this talk to provide a somewhat high-level overview of my semantic search research, and I hope that it provides some ideas or raises some questions. I just moved to the Bay Area and literally live around the block, so it would be very convenient for me to meet and follow up on any questions you might have.

Bio: Thanh Tran worked as a consultant and software engineer for IBM and Capgemini and served as assistant professor at Karlsruhe Institute of Technology (KIT) and visiting assistant professor at Stanford University. His research is focused on Semantic Data Management & Search. He has helped to establish a now visible international Semantic Search community through benchmarking activities, tutorials and the series of workshops called SemSearch. His interdisciplinary work in this field has been published in numerous top-level conference proceedings and journals and has earned prizes and a best paper award at top-tier Semantic Web conferences. Currently, he is a Computer Science faculty member at San Jose State University and director of Semsolute, a semantic search technologies company he co-founded with researchers from his previous KIT semantic search team.

GRAFinder Semantic Search: Relational Keyword Search over Data Graphs. Semantic search technologies use the meaning of entities and relationships explicitly given in structured data to provide relevant and concise answers to complex queries. With the increasing availability of structured data in the past few years, many semantic search applications have been introduced to enable users to directly search for entities such as people, places and products. Based on a manually specified grammar that is optimized for Facebook's data, the newly launched search engine, called Graph Search, not only supports entity search but also more complex relational queries that involve relationships between entities. In this talk, we discuss the research challenges behind building such a relational search engine, called GRAFinder. It operates in a more generic open-domain setting where information needs vary greatly and customized grammars acting as query templates cannot be assumed. The two main search concepts it supports are semantic auto-completion and query translation.

As the user types, it suggests queries that are not only syntactically correct but also meaningful, i.e., can be understood in terms of entities and relationships in the data. The often highly ambiguous keyword query chosen by the user is then automatically translated to formal relational queries that can be unambiguously processed by the underlying query engine to compute results. This talk covers the search space GRAFinder derives from the data to capture all possible translations, the ranking scheme it uses to determine relevant candidates, and the top-k procedure it employs for computing the few best ones. In greater detail, the probabilistic framework used for semantic auto-completion will be discussed.

1 Agenda
Introduction
Graphinder: Overview
Keyword Query Translation
Keyword Query Result Ranking
Keyword Query Rewriting: suggesting correct and meaningful queries, auto-complete as the user types

In this talk, we discuss the research challenges behind building such a relational search engine, called GRAFinder. We present an overview of the main ideas and search capabilities GRAFinder provides. Then, we discuss the main steps of our approach to semantic search: translation, ranking, and the query rewriting needed for semantic auto-completion.
2 Introduction
Let me start with a story about a particular kind of data!

3

Motivation: lots of structured data

Search over structured data is a relevant topic, both in research and commercially.
Embedded data: descriptions of entities, such as people, locations and restaurants, embedded as structured data in Web documents (RDFa, Microformat, Microdata). Through extensive usage and promotion by Web search engines (Google Rich Snippets), it can be expected that there will be more and more of this data.
Structured datasets published on the Web: as Linked Data, for instance; the cloud comprises hundreds of datasets containing information from various domains. Through data registries such as CKAN, published datasets can be found; they are accessible through SPARQL endpoints.
Web-accessible APIs: Facebook, LinkedIn.

These data bear great potential for search technologies: all of them can be exploited to improve the search experience and address information needs that could not be supported before. On the other hand, search technologies are critical to untapping the value of these data; only by making them available for effective search can their value be fully exploited (the main business argument for our startup project: untapping the value of big heterogeneous data in enterprises).

http://blog.programmableweb.com/2010/01/24/32-apis-used-in-7-days-amazon-bing-foursquare-google-linkedin-myspace-twitter-and-yelp/
4 Semantic Search: use information about entities and relationships explicitly given in structured data to provide relevant answers for complex questions asked using intuitive interfaces

MusicBrainz / DBpedia / Links
single written by freddie queen

singles written by freddie, who is member of the band queen
[Data graph: Freddie Mercury, Brian May, Queen, Queen Elizabeth 1, Liar, 1971, single; types Person, Artist, Single; edges member, producer, writer, formed in, marital status]
The notion of semantic search has various meanings. In communities that primarily deal with structured data, e.g. the Semantic Web and database communities, semantic search is about using structured data.

In structured data, entities are represented through unique IDs and captured in terms of attributes and attribute values. Queen: formed in: 1971, marital status: single. Thus, every entity can be conceived as a flat list of attribute-value pairs. But beyond that, structured data also captures relationships between entities; these relationships may form very complex structures that bear lots of valuable information. Member, producer.

Entities and the structure information captured through relationships can be used to answer complex questions: songs written by ...

With mainstream technologies for managing structured data, such as data warehouses and databases, complex questions can be supported, but they have to be formulated as structured queries, for instance as a conjunction of query predicates: one needs to know the specific query language syntax and semantics, as well as the data. Knowing the data is not easy, especially in the bigger data scenario with several datasets: one needs to know the schemas and the links in the data, such as sameAs links, to get information about Freddie from both DBpedia and MusicBrainz.
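The kind of structured query the notes describe, a conjunction of triple-pattern predicates in the SPARQL style, can be sketched over a toy triple store. The data and helper functions below are illustrative stand-ins, not the actual DBpedia/MusicBrainz sets or Graphinder's query engine:

```python
# Minimal sketch: evaluating a conjunctive triple-pattern query over a toy
# RDF-style data graph (invented data mirroring the talk's running example).

DATA = {  # (subject, predicate, object)
    ("Freddie Mercury", "member", "Queen"),
    ("Brian May", "member", "Queen"),
    ("Liar", "writer", "Freddie Mercury"),
    ("Liar", "type", "Single"),
    ("Queen", "formed in", "1971"),
}

def match(pattern, bindings):
    """Yield extended variable bindings for one triple pattern ('?x' = variable)."""
    out = []
    for triple in DATA:
        b = dict(bindings)
        ok = True
        for p, v in zip(pattern, triple):
            if p.startswith("?"):
                if b.get(p, v) != v:   # clash with an earlier binding
                    ok = False
                    break
                b[p] = v
            elif p != v:               # constant mismatch
                ok = False
                break
        if ok:
            out.append(b)
    return out

def answer(patterns):
    """Join all triple patterns left to right; return the surviving bindings."""
    bindings = [{}]
    for pat in patterns:
        bindings = [b2 for b in bindings for b2 in match(pat, b)]
    return bindings

# "singles written by freddie, who is member of the band queen"
query = [
    ("?s", "type", "Single"),
    ("?s", "writer", "?p"),
    ("?p", "member", "Queen"),
]
results = answer(query)
# expect a single binding: ?s = "Liar", ?p = "Freddie Mercury"
```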

Thus, an intuitive interface is needed: NL questions or keywords.

5 Entity Semantic Search: find the relevant entity, return a structured data summary, facts, related entities

Semantic Search has not only been a hot topic in research; in recent years, much of the insights and results from this line of research have been successfully transferred to high-profile commercial applications. All major web search engines today not only search for Web pages but also make use of structured data and employ semantic search technologies for searching these data.

With structured data, we can now directly search for entities, not just documents.

Given keywords or NL questions, search engines find the relevant entity. 6

Relational Semantic Search: find relevant entities involved in a relationship, return entity summaries

Besides addressing information needs that could not be supported before, structured data and the semantics it explicitly captures can be used to enhance the search experience in other ways: for better presentation of the results, and for query construction: semantic auto-completion (not only based on similar strings but also based on the semantics of entities and relationships that can be inferred from the keywords).
7 Semantic Search Problem: understand user inputs as entities and relationships and find relevant answers. single written by freddie queen / singles written by freddie, who is member of the band queen.
Query Translation: What are possible connections (schema-level) between recognized entities and relationships? 1)

2) Query Answering: What are actual connections (data-level) between recognized entities and relationships? 1)

2)
[Data graph: Freddie Mercury, Brian May, Queen, Queen Elizabeth 1, Liar, 1971, single; types Person, Artist, Single; edges member, producer, writer, formed in, marital status]
What are the challenges / main problems in providing such semantic search capabilities? Problem 1: finding entities; essentially an IR problem, but IR approaches need to be adapted to the structured data setting. Problem 2: two directions explored in research.
8 Relational Semantic Search at Facebook: recognizes entities and relationships via LMs, uses a manually specified template (grammar) to find possible connections between them, and computes answers via the resulting translated queries.
my friends, who is member of queen
[Parse tree: {band} [id:Queen1] queen; [member-of-v] is member of, member(); [member-vp] is member of [id:1], member(x, Queen1); [who] who; [user-filter] who is member of [id:1], member(x, Queen1); [user-head] my friends, friends(x, me); [start] my friends, who is member of [id:Queen1], friends(x, me), member(x, Queen1)]
Grammar: set of production rules, capturing all possible connections, i.e. the search space of all parse trees

[start] → [users]
[users] → my friends : friends(x, me)
[...] → is member of [bands] : member(x, $1)
[bands] → {band} : $1
Grammar-based Query Translation: which combination of production rules results in a parse tree that connects the recognized entities and relationships?
How does Facebook solve these challenges? We look at Graph Search, recently launched. It is very encouraging to see that many of our ideas published in the past 5 years can be found in Graph Search: in particular, it is based on query translation over data graphs and uses an internal query engine to compute results; and also, just like our previously presented prototypes, it uses an LM-based ranking model for solving the first problem.

The difference is its approach to the second problem: it uses a grammar. Intuitively, a grammar is a set of production rules capturing all possible connections between entities and relationships (words), i.e. the search space, consisting of all possible parse trees, each representing a query. Every node of the produced parse tree is associated with display text and a formal meaning given as predicates: in this way the produced parse tree can be used both as an internal query and as a query for the user.

Goal: among all possible parse trees captured by this search space, find the parse tree that represents the intended query.

1) When parsing the query, query elements are matched against terminal symbols of the grammar (leaf nodes of the parse trees). 2) Find the production rules that produce the tree.

The grammar can be seen as a template; using the manually specified templates, query translations are computed. Such a template-based approach requires effort in formulating and maintaining the grammar. This is a very practical solution for Facebook, since it focuses on the particular domain of social network data; thus the grammar needs to cover only a limited number of entity types and relationships.

NB: we use the SPARQL-like syntax with triple patterns for consistency; FB queries are similar, conjunctions of predicates where every predicate can also be rewritten as a triple pattern. Requires specification of predicates!
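A toy sketch of this grammar/template-based translation: each production rule pairs surface text with a predicate, so a completed parse is both a readable query and a formal conjunctive query. The rules and the entity id "Queen1" mirror the slide but are invented for illustration; a real grammar would be far larger:

```python
# Sketch of Graph Search-style template translation (invented rules/ids).

ENTITIES = {"queen": "Queen1"}  # {band} terminals

def translate(query):
    """Greedy left-to-right parse: [start] -> [user-head] [user-filter]."""
    preds = []
    text = query.lower().replace(",", " ").split()
    # rule: "my friends" => friends(x, me)
    if text[:2] == ["my", "friends"]:
        preds.append("friends(x, me)")
        text = text[2:]
    # rule: "who is member of {band}" => member(x, $1)
    if text[:4] == ["who", "is", "member", "of"]:
        band = ENTITIES[" ".join(text[4:])]
        preds.append(f"member(x, {band})")
    return preds
```

Running `translate("my friends, who is member of queen")` yields the conjunction `["friends(x, me)", "member(x, Queen1)"]`, i.e. the display text and the formal query are produced by the same parse.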

9 Overview
Finding substructures matching keyword nodes. Different result semantics for different types of data. Commonly used results: Steiner Tree, Steiner Graph. Connect keyword matching elements. AND-semantics: contain one keyword matching element for every query keyword. Minimal substructure heuristic: prefer closely connected keyword nodes / compact results.

Finding a Steiner Tree, Group Steiner Tree, or Connecting Subgraph (Steiner Graph) is NP-hard. Keywords might produce a large number of matching elements in the graph. The graph might be large in size. Large number of (irrelevant) results. Efficiency of finding top-k results. Effectiveness of ranking results.


Closely related: based on the proximity / minimal distance assumption.

Different result semantics for different types of data: textual data (Web pages connected via hyperlinks), DB (tuples connected via foreign keys), XML (elements connected via parent-child edges), RDF graphs, hybrid data graphs.

Different semantics: XML. In an XML tree, every two nodes are connected through their LCA. Not all connected trees are relevant, even if the size is small. The focus is defining query results to prune irrelevant subtrees.

Steiner graph: a connected tree in G that spans a set of nodes Si. The Si are collectively relevant to the query (keyword matching elements). There is one keyword element for every keyword. Group Steiner tree.

Group Steiner Tree [Li et al, WWW01]: spanning one node from each group; each group comprises the elements matching a particular keyword.

Efficiency is a problem / hard: no structure constraints; keywords might produce a large number of matching elements in the data graph; the data graph might be large in size; search complexity increases substantially with the size of the graph; large number of results.
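Because the exact Group Steiner problem mentioned above is NP-hard, a common heuristic is to pick the connecting node that minimizes the summed shortest-path distance to the closest member of each keyword group, via BFS from every group member. This is a generic approximation sketch, not Graphinder's algorithm, and the graph is invented:

```python
# Heuristic Group Steiner sketch: choose the node minimizing the sum, over
# keyword groups, of the distance to that group's nearest member.
from collections import deque

def bfs_dist(graph, src):
    """Unweighted shortest-path distances from src."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in graph.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def best_connecting_node(graph, groups):
    """Return (node, cost) minimizing the summed per-group distance."""
    per_group = []
    for group in groups:
        d = {}
        for member in group:
            for node, dd in bfs_dist(graph, member).items():
                d[node] = min(d.get(node, float("inf")), dd)
        per_group.append(d)
    best = None
    for node in graph:
        if all(node in d for d in per_group):   # reachable from every group
            cost = sum(d[node] for d in per_group)
            if best is None or cost < best[1]:
                best = (node, cost)
    return best

# Invented toy graph echoing the running example (undirected adjacency).
GRAPH = {
    "Liar": ["Freddie", "Single"],
    "Freddie": ["Liar", "Queen"],
    "Queen": ["Freddie", "Brian"],
    "Brian": ["Queen"],
    "Single": ["Liar"],
}
GROUPS = [["Freddie"], ["Queen"], ["Single"]]
```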

Data can generally be regarded as graphs

Different return node semantics [Amer-Yahia et al, VLDB05]

Root of tree. Analyzing keyword patterns. Analyzing data semantics (entity, attributes). Result presentation. Keywords can specify predicates or return nodes. Q1: SIGMOD, Beijing. Q2: SIGMOD, location.

Return nodes may also be implicit. Q1: SIGMOD, Beijing → return node = conf.

Information (subtrees) of return nodes is potentially interesting and is considered as relevant non-matches. Explicit return nodes: analyzing keyword match patterns. Implicit return nodes: analyzing data semantics (entity, attribute) [Kimelfeld et al. SIGMOD 09 (demo)].

10

Graphinder Semantic Search: a translation-based approach for relational keyword search over data graphs

Sem. Auto-completion: Entity + Relationships, Multi-source, Domain-independent, Low manual effort

[Data graph: Freddie Mercury, Brian May, Queen, Queen Elizabeth 1, Liar, 1971, single; types Person, Artist, Single; edges member, producer, writer, formed in, marital status]
Query Translation. Our approach is also a query translation approach. In particular, it has the following two main features. Semantic auto-completion: suggest valid & meaningful queries as the user types. Query translation: translate keyword queries to structured queries.

Entity and relational. Multi-source: translation considers multiple datasets and the links between these datasets; much of our current research in fact is focused on computing duplicates in different datasets and establishing links between datasets on the fly. Domain-independent: does not assume a specific domain. In particular, it does not require templates that have to be specified upfront or customized to a particular domain. Thus, it can be used for more heterogeneous scenarios with multiple domains or for scenarios where datasets evolve quickly. Reduced manual effort means low cost.

Zero manual effort: does not require an expert to specify search forms (e-commerce search), structure templates, translation rules or domain adaptation (Wolfram Alpha, Watson). Interpretation of keywords and structural context, i.e. relevant relations between entities, through on-the-fly graph exploration.

Semantics of keywords are encoded: some systems use search forms in which the semantics of the keyword inputs are manually encoded, e.g. entered keyword inputs denote the name, price and other attributes of the resources to be retrieved. Other systems use structure templates and translation rules to encode these semantics of keywords and to translate them to formal structured queries. We provide a generic mechanism for the interpretation of keywords, to discover these semantics on the fly without manual effort.

11 Graphinder: selected publications
On-demand, domain-independent, relational keyword search over data graphs: Structure index for data graphs (TKDE13b); Top-k exploration of translation candidates (ICDE09); Index-based materialization of graphs (CIKM11a); Ranking results using the structured relevance model (SRM) (CIKM11b).
Multi-source: Deduplication using inferred type information: TYPifier (ICDE13), TYPimatch (WSDM13); On-the-fly deduplication using SRM (WWW11); Ranking with deduplication (ISWC13); Routing keyword queries to relevant data graphs (TKDE13a); Hermes: keyword search over heterogeneous data graphs (SIGMOD09).
Semantic auto-completion: Computing valid query rewrites for given keywords (VLDB14).

Here, we present a list of selected publications that capture the main ideas behind Grafinder (you can find details in the list of references). Today, I will mainly talk about how GRAFinder derives the search space using structure indexes instead of templates, finds the translation, ranks the results, and performs auto-completion. Recently, we have spent much time on the problem of computing duplicates, computing duplicates on the fly for a given query, and integrating duplicates into search results ranking. If there is interest, maybe we can find time to talk about these offline; you can also find detailed information in the references.
12 Query Translation


13 0) Query Translation: constructing a pseudo schema graph representing all possible connections between data elements.
Structure index for the data graph: nodes are groups of data elements that share the same structure pattern. Parameters: structure pattern with edge labels L and paths of maximum length n.
Pseudo schema: a node groups all instances that have the same set of properties; structure pattern: all properties, i.e. all outgoing paths with n = 1, L = all edge labels.
Algorithm: start with one single partition/node representing all instances; split until all nodes are stable, i.e., all contained instances share the same structure pattern.

[Data graph and its pseudo schema: Person, Artist, Thing, Single, Value nodes with member, producer, writer, marital status edges]
For finding possible connections, we compute the pseudo schema graph (PSG) as an offline step. The PSG is needed because for the data you can find on the web today, a schema graph is not available or not complete. To obtain a PSG, we extend the notion of a structure index to data graphs. Nodes are groups: all values connected with Person via marital status; all things connected to Artist via member, producer and writer. The structure pattern can be parameterized.

The node Artist groups all instances that have member, producer and writer as properties. The construction is based on the algorithm for forward bisimulation presented in [16], which essentially is an extension of Paige & Tarjan's algorithm [17] for determining the coarsest stable refinement of a partitioning. This algorithm starts with a partition consisting of one single extension that contains all nodes from the data graph. This extension is successively split into smaller extensions until the partition is stable, i.e., the graph formed by the partition is a complete bisimulation. In order to support the parameterization proposed previously, we make the following modifications to this algorithm.
14 1) Query Translation: constructing the search space representing all possible interpretations of query keywords.
[Search space: Freddie Mercury, Queen, Queen Elizabeth 1, single matched to Person, Artist, Band, Single, Literal nodes with member, producer, writer, marital status edges]
written by freddie queen single
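The partition refinement described above (forward bisimulation with n = 1 over all edge labels) can be sketched as follows. The toy edges simplify the slide's graph so that the two artists share the same property set; all names are illustrative, and this omits the paper's parameterization:

```python
# Sketch of coarsest-partition refinement for a structure index:
# start with one block containing every node, then split blocks until every
# block's nodes share the same outgoing (label, target-block) signature.

EDGES = [  # (source, label, target), invented toy data
    ("FreddieMercury", "member", "Queen"),
    ("FreddieMercury", "writer", "Liar"),
    ("BrianMay", "member", "Queen"),
    ("BrianMay", "writer", "Liar"),
    ("Queen", "formed in", "1971"),
]

def structure_index(edges):
    """Map each node to a block id; nodes with equal signatures share a block."""
    nodes = sorted({s for s, _, _ in edges} | {t for _, _, t in edges})
    block = {n: 0 for n in nodes}          # everyone starts in one block
    while True:
        sig = {n: frozenset((l, block[t]) for s, l, t in edges if s == n)
               for n in nodes}
        ids, new_block = {}, {}
        for n in nodes:                    # assign a fresh id per signature
            if sig[n] not in ids:
                ids[sig[n]] = len(ids)
            new_block[n] = ids[sig[n]]
        if new_block == block:             # stable: partition is a bisimulation
            return block
        block = new_block
```

Here the two artists end up in the same pseudo-schema node because they have identical outgoing property sets, while Queen lands in its own node.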

Data Index

Schema Index. Keyword Interpretation: use an inverted index and an LM-based ranking function to return relevant schema and data elements. Search Space Construction: augment the pseudo schema with query-specific keyword matching elements; all possible connections of predicates applicable to the recognized query keywords. Top-k Subgraph Exploration. Result Retrieval & Ranking.
15 2) Query Translation: score-directed algorithm for finding top-k subgraphs connecting keyword matching elements.
[Search space: Freddie Mercury, Queen, Queen Elizabeth 1, single; Person, Artist, Band, Single, Literal; member, producer, writer, marital status]
written by freddie queen single

Algorithm: score-directed top-k Steiner graph search.
Start: explore all distinct paths starting from the keyword elements.
Every iteration: one-step expansion of the current path with the highest score; when a connecting element is found, merge the paths and add the resulting graph to the candidate list.
Top-k termination: the lowest score in the candidate list > the highest possible score achievable with the paths in the queues yet to be explored.
Termination: all paths of maximum length d have been explored.
Final step: mapping rules translate the Steiner graph to a structured query.
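A simplified, illustrative sketch of the exploration loop above. Uniform edge costs stand in for the (inverse) scores, the depth bound plays the role of d, and the graph and node names are invented; the real algorithm's score bounds and merge bookkeeping are richer:

```python
# Sketch: expand the cheapest frontier path per iteration from each keyword
# element; when some node has been reached from every keyword, record it as a
# connecting element of a candidate Steiner graph.
import heapq

def top_k_connections(graph, keyword_nodes, k=2, d=4):
    """Return up to k (total_cost, connecting_node) pairs, cheapest first."""
    reached = {}                                  # node -> {origin: best cost}
    heap = [(0, kw, kw) for kw in keyword_nodes]  # (cost, origin, node)
    heapq.heapify(heap)
    found = []
    while heap and len(found) < k:
        cost, origin, node = heapq.heappop(heap)
        best = reached.setdefault(node, {})
        if origin in best or cost > d:            # seen cheaper, or too deep
            continue
        best[origin] = cost
        if len(best) == len(keyword_nodes):       # connecting element found
            found.append((sum(best.values()), node))
        for nb in graph.get(node, []):            # one-step expansion
            heapq.heappush(heap, (cost + 1, origin, nb))
    return sorted(found)

GRAPH = {  # invented toy search space
    "freddie": ["Liar", "Queen"],
    "Liar": ["freddie", "single"],
    "single": ["Liar"],
    "Queen": ["freddie"],
}
```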

Graph-to-query mapping: translation rules that map top-ranked graphs to structured queries (SQL, SPARQL); translation rules that map structured queries to natural language questions. Graph matching: triple index (a cover index supporting different triple patterns); various join implementations.

(score of k-ranked query graph)

16 Result Ranking

17 Ranking Using Structured LMs: the keyword query is short and ambiguous, while structured data provides rich structure information: ranking based on LMs capturing both content and structure. Structured LMs for structured results r. A structured LM for queries using structured pseudo-relevant feedback results FR (relevance model). Compute the distance between the query and result LMs.

Our work on ranking is based on the IR literature. However, we need to adapt IR approaches to the structured data setting, solve problems that are specific to it, and exploit the opportunities that arise from using structured data: while the keyword query is short ...

Instead of ad-hoc normalizations and heuristics, we propose a principled method based on language models that represent documents and queries as multinomial distributions over words of the vocabulary. In particular, we build upon the idea of the relevance model, where query and documents are assumed to be generated from a hidden model of the information need. The keyword query is short and ambiguous, while the data (and results) provide rich structure information that can be exploited! A principled approach to relevance based on language models and PRF: estimate the model from the content and structure of the PRF results. Adopt the relevance model as a fine-grained model representing both the content and structure of relevant documents and queries (the relevance class).

18Relevance Models

F Documents

Candidate Documents

Query: the term probabilities of the query model are based on documents. Ranking behaves like similarity search between pseudo-relevant feedback documents and corpus documents.

[Term clouds: freddie queen; Mercury, Brian, May, Protest, Raid, Clash, Bank, West]

[Terms: Mercury, Brian, May, Protest, Raid, Clash, Bank, West]
19 Structured Relevance Models

Query

F Results. Structured Data. The term probabilities of the query model are based on pseudo-relevant structured data. Ranking behaves like similarity search between pseudo-relevant structured results and structured result candidates.

Structured Data

queen single

[Terms: Mercury, Brian, May, Protest, Raid, Clash, Bank, West]

[Terms: Mercury, Brian, May, Protest, Raid, Clash, Bank, West] Candidate Results.
Construct the query model from structured data elements that are close to the query. Index resources in the data graph, where resources are treated as documents and attributes and attribute values are indexed as document terms; use a standard inverted index implementation and an IR search engine to retrieve resources for a given keyword query; an initial run of the query yields the F results.

20

Importance of resource r w.r.t. the query; probability of observing term v in the value of property e of resource r:

v | RM_name | RM_comment
Mercury | .091 | .01
Brian | .082 | .01
Champion | .081 | .02
Protest | .001 | .042
Raid | .006 | .014
(RM_x columns for further edges not shown)

Ranking: construct an edge-specific query model for each unique e from the feedback resources FR, an edge-specific model for every candidate r, and finally compute the distance.

v | RM_name | RM_comment
Mercury | .073 | .01
Brian | .052 | .01
(RM_x columns for further edges not shown)

For all resources r in FR. Query model: the probability of terms in the query model is estimated using the F resources: intuitively, the probability of a term is estimated as the probability of observing it in the F resources (based on the probability of observing the term in the e-value of r, and the probability of e), weighted by the importance of that resource: a resource is more important if query terms are more likely to be observed in that resource, compared to the other resources in F.

Edge-specific resource model: the probability of observing term v in the e-value of r, smoothed with the probability of observing term v in all values of r.

The score of a resource is calculated based on the cross-entropy of the edge-specific RM and the edge-specific ResM, aggregated over every e; alpha allows controlling the importance of edges.

Instead of single entities, we rank complex graphs comprising multiple entities, called Joined Result Tuples (JRTs): complex results are modeled as the geometric mean of the entity models.

Ranking aggregated JRTs: the cross-entropy between the edge-specific RM (query model) and the geometric mean of the combined edge-specific ResMs. The proposed ranking function is monotonic with respect to the individual resource scores (a necessary property for using top-k algorithms).

A language model is constructed for every attribute of the resource to capture the probability of a word being observed via repeated sampling from the content of a specific attribute of r. Lambda controls the weight of the edge-specific attribute; a small value means less emphasis on the terms of the attribute and more emphasis on the terms of the entire resource (the terms in all attributes).

P_e is the probability of observing a word v in the edge-specific attribute a. P_* is the probability of observing a word v in all attributes of r.

Consider the co-occurrences of a word and the query words in the content of a specific attribute a. The sampling process we implement is iid. IID sampling: query words and w are iid-sampled from a unigram distribution representing the content of the specific attribute a; then sample v from a, and then sample the query words k times from a distribution representing the content of all attributes of r.
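The lambda interpolation of P_e and P_* described above can be sketched directly; the resource data is invented and the maximum-likelihood estimates are a simplification:

```python
# Sketch of the two-level smoothing: the probability of word v under attribute
# (edge) e of resource r interpolates the edge-specific estimate P_e with the
# whole-resource estimate P_* via lambda.

def smoothed_prob(v, e, resource, lam=0.8):
    """lam * P_e(v) + (1 - lam) * P_*(v)."""
    edge_tokens = resource[e]
    all_tokens = [t for toks in resource.values() for t in toks]
    p_e = edge_tokens.count(v) / len(edge_tokens)      # edge-specific estimate
    p_star = all_tokens.count(v) / len(all_tokens)     # whole-resource estimate
    return lam * p_e + (1 - lam) * p_star

R = {"name": ["mercury", "freddie"], "comment": ["queen", "singer", "mercury"]}
```

With lam = 1 the model collapses to the pure attribute distribution; with lam = 0 it collapses to the whole-resource distribution, so a word absent from an attribute can still receive nonzero probability.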

21 Query Rewriting

22 Query Rewriting: find syntactically and semantically valid rewrites to suggest as the user types.
[Data graph: Freddie Mercury, Queen, Queen Elizabeth 1, single, Single, writer]
single from freddy mercury que

Data Index

Schema Index. Keyword Interpretation: imprecise / fuzzy matching; match every keyword; token rewriting via syntactic distance. Search Space Construction.
1) single from freddie mercury queen. Token rewriting via semantic distance: single writer freddie mercury queen.
[Matched elements: Freddie Mercury, Queen, Single, writer]

Data Index

Schema Index. Query segmentation: single writer freddie mercury queen. Search Space Construction. Result Retrieval & Ranking.
Keyword / key phrase interpretation: precise matching; match keywords and key phrases.
Benefits: higher selectivity of query terms (quality); reduced number of query terms (efficiency); better search experience.
Challenges: many rewrite candidates, some of which are semantically not valid in the relational setting: single (marital status) writer freddie mercury queen (the Queen of the UK).
High selectivity: exact matching produces fewer candidates; further, clean keywords help in finding more promising keyword elements. Better search experience: the user does not have to type in whole queries, receives a form of guidance, and learns the capabilities of the system.

Challenge: the tasks are to identify syntactic and semantic variants and to find phrases. In the single-entity setting, all of these queries produce results. This is not straightforward in the relational setting: keywords are not connected, so query rewrites may produce semantically odd queries that do not yield any results.

23

Token Rewriting: S is ranked high when the probability that query Q can be observed given S is high. Query Segmentation: S is ranked high when the probability that S can be observed in the data D is high. The probability that users make spelling errors or write semantically related queries is independent of the data D. The denominator is constant given query Q and data D. Based on Bayes' theorem.
[Data graph: Freddie Mercury, Brian May, Queen, Queen Elizabeth 1, Liar, 1971, single; Person, Artist, Single; member, producer, writer, formed in, marital status]
single writer freddy mercury que
1) single writer freddie mercury queen
2) single writer freddrick mercury monarch
3) song writer freddrick mercury head of state

Probabilistic Model for Query Rewriting: the rank of a query rewrite (suggestion) S is based on the probability of observing S in the data, given the query. Alternatively, we can refer to P as the probability of S being the intended query rewrite, given the user query and the data. The first term is the probability of users writing query Q given that the intended query is S and the data is D, i.e., the probability of making errors. The second term is the probability of the intended query being S, given D.

Token rewriting: intuitively, P(Q|S) is high for rewrites S that are syntactically or semantically close to the entered query; hence, syntactically and semantically close queries will be suggested to the user as a result of token rewriting. Query segmentation: intuitively, P(S|D) is high for a given S when S can be found in the data, and moreover when S can be found in the results to the query.

where P(Q|S,D) models the likelihood of observing the user query Q given that the intended query is actually S. P(S|D) and P(Q|D) are the probabilities of observing S and Q respectively, given D.

Given the query Q and the underlying data D, the probability of a query rewrite S can be calculated based on Bayes' theorem. The first term in the numerator models the likelihood of observing the user query Q given that the intended query is actually S. We assume that the probability of users making spelling errors or writing semantically related queries is independent of the underlying data, such that Q depends only on S. The second term in the numerator is the probability of observing the query rewrite S given D. The denominator is fixed given the query Q and the data D and can be treated as a constant.

Therefore, the probability of S given Q and D is determined by two important factors. The first is the probability of Q given S, which reflects the problem of token rewriting. The second is the probability of S given D, which corresponds to query segmentation.
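As a rough illustration of this decomposition (a sketch, not the system's implementation; all names and the toy models below are invented), candidate rewrites can be ranked by P(Q|S) · P(S|D), dropping the constant P(Q|D):

```python
# Rank query rewrites S for a user query Q via Bayes' theorem:
# P(S|Q,D) ∝ P(Q|S) * P(S|D); P(Q|D) is constant per query, so it is dropped.

def rank_rewrites(query, candidates, p_q_given_s, p_s_given_d):
    """Return candidate rewrites sorted by P(Q|S) * P(S|D), best first."""
    scored = [(p_q_given_s(query, s) * p_s_given_d(s), s) for s in candidates]
    scored.sort(reverse=True)
    return [s for _, s in scored]

# Toy P(S|D): how often the rewrite is observed in the data.
freq = {"single writer freddie mercury queen": 0.6,
        "single writer freddrick mercury monarch": 0.3}

# Toy P(Q|S): fraction of keywords shared with the typed query.
def toy_p_q_given_s(q, s):
    shared = len(set(q.split()) & set(s.split()))
    return shared / max(len(q.split()), len(s.split()))

ranked = rank_rewrites("single writer freddy mercury que",
                       list(freq), toy_p_q_given_s, freq.get)
# ranked[0] == "single writer freddie mercury queen"
```

The two factors plug in independently, which is exactly what lets token rewriting and query segmentation be modeled separately in the following slides.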

Token Rewriting: modeling P(Q|S)

Independence assumption: P(Q|S) = ∏_i P(w_i | t_i)

Modeling syntactic and semantic differences

single writer freddy mercury que →
1) single writer freddie mercury queen
2) single writer freddrick mercury monarch
3) single writer freddrick mercury head of state

Split: "|", Concatenate: "+"
Example: single | writer | freddie + mercury | queen
P(w|t) is high when w is syntactically and semantically close to t. A query rewrite S can be seen as a sequence of token and action pairs, where the actions are split and concatenate. Through concatenation, we can form a key phrase that consists of more than one token.
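To make the token/action representation concrete, here is a minimal illustrative helper (the function name is my own) that turns a sequence of (token, action) pairs into segments:

```python
# A rewrite S as a sequence of (token, action) pairs, where the action says
# how the token attaches to what came before:
# "|" starts a new segment (split), "+" extends the current one (concatenate).

def to_segments(pairs):
    segments = []
    for token, action in pairs:
        if action == "+" and segments:
            segments[-1] = segments[-1] + " " + token
        else:
            segments.append(token)
    return segments

s = [("single", "|"), ("writer", "|"),
     ("freddie", "|"), ("mercury", "+"), ("queen", "|")]
# to_segments(s) → ["single", "writer", "freddie mercury", "queen"]
```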

Single token rewrite: syntactic and semantic variants and corrections are only suggested for the current single term
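As a toy illustration of modeling syntactic differences, P(w|t) can be made to decay with the edit distance between the typed keyword and the intended token (a sketch under the independence assumption, not the paper's exact estimator; function names are mine, and the semantic-distance component is omitted):

```python
# Sketch: P(Q|S) = ∏ P(w_i|t_i), where P(w_i|t_i) decays with the
# edit distance between the typed keyword w_i and the intended token t_i.

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def p_keyword_given_token(w, t):
    # Inversely proportional to the syntactic difference (+1 avoids div by zero).
    return 1.0 / (1 + edit_distance(w, t))

def p_query_given_rewrite(query_tokens, rewrite_tokens):
    p = 1.0
    for w, t in zip(query_tokens, rewrite_tokens):
        p *= p_keyword_given_token(w, t)
    return p
```

With this, "que" → "queen" (distance 2) is scored higher than, say, "que" → "monarch", matching the intuition that syntactically close rewrites should be suggested first.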

For modeling token rewriting, since it is not relevant to the actions made for query segmentation, the query rewrite S can be represented as a sequence of tokens by removing the actions. In addition, we assume that query keywords are independent, such that each query keyword depends only on its corresponding token. The probability P(w_i|t_i) models the likelihood of observing a query keyword w_i given that the intended token is t_i. We distribute the probability P(w_i|t_i) inversely proportionally to the syntactic and semantic differences, measured by the edit distance and the semantic distance between query keyword w_i and token rewrite t_i, respectively.

Query Segmentation: modeling P(S|D)

Nth order Markov assumption

where P_D(α_i t_{i+1} | t_1 α_1 t_2 … α_{i-1} t_i) is shorthand for P(α_i t_{i+1} | t_1 α_1 t_2 … α_{i-1} t_i, D).

[Data graph as above; partial query "single writer freddie mercury que": for "single writer freddie", is the next action a concatenate (+) or a split (|)?]

For modeling query segmentation, the probability of a query rewrite S can be represented, using the chain rule, as the product of the probabilities of all token and action pairs given the previously generated sequence. However, for keyword queries with a large number of keywords, computing P(S|D) incurs prohibitive cost when D is large. To address this problem, we make an Nth-order Markov assumption: the probability of an action on a token is approximated as depending only on the context of the N preceding tokens and actions.

Estimating the Probability of Segmentation: maximum likelihood estimation (MLE)

where C(t_i … t_j) denotes the count of occurrences of the token sequence t_i … t_j.

Segmentation in the structured data setting:
- Concatenate two segments s_i and s_j when they co-occur in the data
- Split when s_i and s_j are connected (s_i → s_j), i.e., when the two data elements n_i and n_j mentioning s_i and s_j are connected in the data

[Data graph as above; for "single writer freddie mercury queen": is the action after "single writer freddie" a concatenate or a split?]

To estimate the probability of an action on a token given the N preceding tokens and actions, we can adopt ideas from language modeling. A typical task in language modeling is to predict the next token based on the probability of the token given the preceding context. Using maximum likelihood estimation, this probability can be estimated as the count of co-occurrences of the context and token t_{i+1}, divided by the sum of counts of all tokens that share the same context.
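The MLE estimate described above can be sketched with simple n-gram counts (an illustrative stand-in, not the system's index-based implementation; the toy data is invented):

```python
from collections import Counter

# MLE: P(t | context) = C(context, t) / sum over t' of C(context, t'),
# where counts come from token sequences observed in the data.

def ngram_counts(sequences, n):
    """Count (context, next-token) occurrences for contexts of length n."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n):
            counts[(tuple(seq[i:i + n]), seq[i + n])] += 1
    return counts

def p_next(counts, context, token):
    total = sum(c for (ctx, _), c in counts.items() if ctx == context)
    return counts[(context, token)] / total if total else 0.0

data = [["freddie", "mercury", "queen"],
        ["freddie", "mercury", "single"],
        ["freddie", "mercury", "queen"]]
counts = ngram_counts(data, 2)
# p_next(counts, ("freddie", "mercury"), "queen") → 2/3
```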

How likely is the token to co-occur with the context in the data?

Since the language model is designed for unstructured data, it cannot capture structural information. The event considered there is the occurrence of a token sequence. In the structured data setting, we also need to take the structural information into account.

The intuition is that the tokens in a segment resulting from a concatenation action are supposed to co-occur in the data. For the splitting action, the segments separated by the split are supposed to be connected.

Thus we extend the language model to take into account the connections between the segments.
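This extension can be illustrated on a toy graph: concatenation is evidenced by tokens co-occurring in the same data element, splitting by connections between the elements mentioning the segments. The graph, node contents, and function names below are invented for illustration:

```python
# Toy data: each node carries a label (a token sequence); edges connect nodes.
nodes = {1: ["freddie", "mercury"], 2: ["queen"], 3: ["single"]}
edges = {(1, 2), (1, 3)}  # Freddie Mercury -- Queen, Freddie Mercury -- single

def cooccur(t1, t2):
    """Count nodes whose label contains t1 immediately followed by t2
    (evidence for concatenating t1 and t2 into one segment)."""
    return sum(1 for lbl in nodes.values()
               for i in range(len(lbl) - 1)
               if lbl[i] == t1 and lbl[i + 1] == t2)

def connected(seg1, seg2):
    """Count edges between a node mentioning seg1 and a node mentioning seg2
    (evidence for splitting between the two segments)."""
    mentions = lambda seg: {n for n, lbl in nodes.items() if set(seg) <= set(lbl)}
    return sum(1 for a in mentions(seg1) for b in mentions(seg2)
               if (a, b) in edges or (b, a) in edges)

# "freddie mercury": concatenate (the tokens co-occur in node 1);
# "freddie mercury | queen": split (nodes 1 and 2 are connected).
```

These counts would replace the plain co-occurrence counts in the MLE estimate: C(s + t) from `cooccur`-style events, C(s | t) from `connected`-style events.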

Two cases: (1) l(s_i) ≥ N; (2) l(s_i) < N

(1) When the previously induced segment s_i has length equal to or greater than N, i.e. l(s_i) ≥ N, it suffices to focus on s_i(N), its last N tokens, to predict the next action α_i on t_{i+1}.

Estimation of probability

where C(s + t) denotes the count of co-occurrences of the sequence s t in D, and C(s | t) the count of all occurrences of token t connected to segment s.

Estimating the Probability of Segmentation, Case 1: the previous segment s_i has length equal to or greater than the context N
Example: freddie + j. + mercury | queen vs. freddie + j. + mercury + queen

Basically, there are two cases to consider. In the first case, to estimate the probability of the concatenation action on token t_{i+1}, the event we need to consider is the co-occurrence of the sequence s_i(N) and t_{i+1}. To estimate the probability of the splitting action on token t_{i+1}, the event we should consider is the occurrence of token t_{i+1} connected to segment s_i(N).


(2) When the previous segment s_i has length less than N, i.e. l(s_i) < N, the action α_i on the next token t_{i+1} depends on s_i and P_i(N), the set of segments preceding s_i that, together with s_i, contain at most N tokens in total.

Estimation of probability

where C(P | s) denotes the count of all occurrences of the segment s connected to all segments in P.

Estimating the Probability of Segmentation, Case 2: the previous segment s_i has length less than the context N
Example: single | writer | freddie + mercury

In the second case, to estimate the probability of the concatenation action on token t_{i+1}, the event we need to consider is the occurrence of the segment s_i(N) t_{i+1}, namely the segment s_{i+1}, connected to all segments in P_i(N). To estimate the probability of the splitting action on token t_{i+1}, the event we should consider is the occurrence of token t_{i+1} connected to all segments in P_i(N) as well as to s_i.

Experimental Results & Conclusions

Graphinder: a relational keyword search approach for suggesting query completions, translating queries and ranking results.

Keyword translation performance:
- Query translation and index-based approaches are at least one order of magnitude faster than online in-memory (bidirectional) search
- Query translation is comparable to index-based approaches, but requires less space
- Compared with EASE, our index-based solution (not discussed here) reduces storage requirements by up to 86% and improves performance by more than 50%

Keyword translation result quality:
- According to a recent benchmark, our ranking consistently outperforms all existing ranking systems in precision, recall and MAP (10%–30% improvement)

Effect of query rewriting:
- Better user experience
- Improves efficiency by reducing the number of query terms
- Improves the quality / selectivity of query terms
- The effect depends on the complexity of the queries and the underlying keyword search engine
- Enables tight integration of query suggestion and translation

From research prototypes to Graphinder: a powerful, flexible, low upfront-cost semantic search system.

Improves quality for dirty queries (due to token rewriting); further, query segmentation helps to improve the selectivity of query terms

The performance of PVQR is consistently better than that of the other two systems for both datasets. PVQR is about 3–4 times faster than BQR for IMDb and about 2 times faster for Wikipedia. These differences are primarily due to the pruning capability of PVQR, i.e., PVQR prunes non-valid results. Compared to PQR, the number of valid sub-query rewrites that have to be tracked is smaller. The number of partial rewrites (segments) considered by BQR is much larger still than for PQR, as it does not focus on the context but considers all possible combinations of previously obtained segments.

Thanks!

Tran Duc [email protected]
http://sites.google.com/site/kimducthanh/

References (1)

[VLDB14] Yongtao Ma, Thanh Tran. Probabilistic Query Rewriting for Efficient and Effective Keyword Search on Graph Data. In International Conference on Very Large Data Bases (VLDB'14), Hangzhou, China, September 2014.
[ISWC13] Daniel Herzig, Roi Blanco, Peter Mika, Thanh Tran. Federated Entity Search Using On-the-Fly Consolidation. In International Semantic Web Conference (ISWC'13), Sydney, Australia, October 2013.
[ICDE13] Yongtao Ma, Thanh Tran. TYPifier: Inferring the Type Semantics of Structured Data. In International Conference on Data Engineering (ICDE'13), Brisbane, Australia, April 2013.
[WSDM13] Yongtao Ma, Thanh Tran. TYPiMatch: Type-specific Unsupervised Learning of Keys and Key Values for Heterogeneous Web Data Integration. In International Conference on Web Search and Data Mining (WSDM'13), Rome, Italy, February 2013.
[TKDE12a] Thanh Tran, Günter Ladwig, Sebastian Rudolph. Managing Structured and Semi-structured RDF Data Using Structure Indexes. In Transactions on Knowledge and Data Engineering journal.
[TKDE12b] Thanh Tran, Lei Zhang. Keyword Query Routing. In Transactions on Knowledge and Data Engineering journal.

References (2)

[WWW12] Daniel Herzig, Thanh Tran. Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration. In Proceedings of the 21st International World Wide Web Conference (WWW'12), Lyon, France, April 2012.
[CIKM11a] Günter Ladwig, Thanh Tran. Index Structures and Top-k Join Algorithms for Native Keyword Search Databases. In Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM'11), Glasgow, UK, October 2011.
[CIKM11b] Veli Bicer, Thanh Tran. Ranking Support for Keyword Search on Structured Data using Relevance Models. In Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM'11), Glasgow, UK, October 2011.
[SIGIR11] Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey Pound, Henry S. Thompson, Thanh Tran Duc. Repeatable and Reliable Search System Evaluation using Crowdsourcing. In Proceedings of the 34th Annual International ACM SIGIR Conference (SIGIR'11), Beijing, China, July 2011.
[ICDE09] Duc Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano. Top-k Exploration of Query Graph Candidates for Efficient Keyword Search on RDF. In Proceedings of the 25th International Conference on Data Engineering (ICDE'09), Shanghai, China, March 2009.
[SIGMOD09] Haofen Wang, Thomas Penin, Kaifeng Xu, Junquan Chen, Xinruo Sun, Linyun Fu, Yong Yu, Thanh Tran, Peter Haase, Rudi Studer. Hermes: A Travel through Semantics in the Data Web. In Proceedings of SIGMOD Conference 2009, Providence, USA, June–July 2009.

Backup

such as RDF, RDFa and Linked Data! How can we leverage this for enhancing the search experience?
