Graphinder Semantic Search Relational Keyword Search over Data Graphs Thanh Tran , Lei Zhang, Veli Bicer, Yongtao Ma Researcher: www.sites.google.com/site/kimducthanh Co-Founder: www.graphinder.com

Author: etta

Post on 24-Feb-2016


TRANSCRIPT


Graphinder Semantic Search: Relational Keyword Search over Data Graphs
Thanh Tran, Lei Zhang, Veli Bicer, Yongtao Ma
Researcher: www.sites.google.com/site/kimducthanh
Co-Founder: www.graphinder.com

Thanks for the introduction. During my time in Europe I enjoyed working with colleagues and friends from Yahoo! Research Barcelona a lot. Now I am very glad to have the opportunity to learn about Yahoo research activities in Santa Clara. Today, I give this talk to provide a somewhat high-level overview of my semantic search research, and I hope that it provides some ideas or raises some questions. I just moved to the Bay Area and literally live around the block, so it would be very convenient for me to meet and follow up on any questions you might have.

Bio: Thanh Tran worked as a consultant and software engineer for IBM and Capgemini and served as assistant professor at Karlsruhe Institute of Technology (KIT) and visiting assistant professor at Stanford University. His research is focused on Semantic Data Management & Search. He has helped to establish a now visible international Semantic Search community through benchmarking activities, tutorials and the series of workshops called SemSearch. His interdisciplinary work in this field has been published in numerous top-level conference proceedings and journals and has earned prizes and a best paper award at top-tier Semantic Web conferences. Currently, he is a Computer Science faculty member at San Jose State University and director of Semsolute, a semantic search technologies company he co-founded with researchers from his previous KIT semantic search team.

GRAFinder Semantic Search: Relational Keyword Search over Data Graphs. Semantic search technologies use the meaning of entities and relationships explicitly given in structured data to provide relevant and concise answers to complex queries. With the increasing availability of structured data in the past few years, many semantic search applications have been introduced to enable users to directly search for entities such as people, places and products. Based on a manually specified grammar that is optimized for Facebook's data, the newly launched search engine, called Graph Search, not only supports entity search but also more complex relational queries that involve relationships between entities. In this talk, we discuss the research challenges behind building such a relational search engine, called GRAFinder. It operates in a more generic open-domain setting where information needs vary greatly and customized grammars acting as query templates cannot be assumed. The two main search concepts it supports are semantic auto-completion and query translation.

As the user types, it suggests queries that are not only syntactically correct but also meaningful, i.e., can be understood in terms of entities and relationships in the data. The often highly ambiguous keyword query chosen by the user is then automatically translated to formal relational queries that can be unambiguously processed by the underlying query engine to compute results. This talk covers the search space GRAFinder derives from the data to capture all possible translations, the ranking scheme it uses to determine relevant candidates, and the top-k procedure it employs for computing the few best ones. In greater detail, the probabilistic framework used for semantic auto-completion will be discussed.

1 Agenda
Introduction
Graphinder: Overview
Keyword Query Translation
Keyword Query Result Ranking
Keyword Query Rewriting: suggesting correct and meaningful queries, auto-complete as the user types

In this talk, we discuss the research challenges behind building such a relational search engine, called GRAFinder. We present an overview of the main ideas and search capabilities GRAFinder provides. Then, we discuss the main steps of our approach to semantic search: translation, ranking, and the query rewriting needed for semantic auto-completion.
2 Introduction
Let me start with a story about a particular kind of data!

3

Motivation: lots of structured data

Search over structured data is a relevant topic, both in research and commercially.
Embedded data: descriptions of entities, such as people, locations and restaurants, embedded as structured data in Web documents (RDFa, Microformat, Microdata). Through extensive usage and promotion by Web search engines (Google Rich Snippets), it can be expected that there will be more and more of this data.
Structured datasets published on the Web: as Linked Data, for instance; the cloud comprises hundreds of datasets containing information from various domains. Through data registries such as CKAN, published datasets can be found; they are accessible through SPARQL endpoints.
Web-accessible APIs: Facebook, LinkedIn.

These data bear great potential for search technologies: all of them can be exploited to improve the search experience and address information needs that could not be supported before. On the other hand, search technologies are critical to untapping the value of these data; only by making them available for effective search can their value be fully exploited (the main business argument for our startup project: untapping the value of big heterogeneous data in enterprises).

http://blog.programmableweb.com/2010/01/24/32-apis-used-in-7-days-amazon-bing-foursquare-google-linkedin-myspace-twitter-and-yelp/
4 Semantic Search: use information about entities and relationships explicitly given in structured data to provide relevant answers for complex questions asked using intuitive interfaces

MusicBrainz / DBpedia / Links
single written by freddie queen

singles written by freddie, who is member of the band queen
[Data graph: Freddie Mercury, Brian May, Queen, Queen Elizabeth 1, Liar, 1971, single; types Person, Artist, Single; edges member, producer, writer, formed in, marital status]
The notion of semantic search has various meanings. In communities that primarily deal with structured data, e.g. the Semantic Web and database communities, semantic search is about using structured data.

In structured data, entities are represented through unique IDs and captured in terms of attributes and attribute values. Queen: formed in: 1971, marital status: single. Thus, every entity can be conceived as a flat list of attribute-value pairs. But beyond that, structured data also captures relationships between entities; these relationships may form very complex structures that bear lots of valuable information. Member, producer.

Entities and the structure information captured through relationships can be used to answer complex questions: songs written by ...

With mainstream technologies for managing structured data, such as data warehouses and databases, complex questions can be supported, but they have to be formulated as structured queries, for instance as a conjunction of query predicates: one needs to know the specific query language syntax and semantics, as well as the data. Knowing the data is not easy, especially in the bigger data scenario with several datasets: one needs to know the schemas and the links in the data, such as sameAs links, to get information about Freddie from both DBpedia and MusicBrainz.
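The kind of structured query the notes describe, a conjunction of triple-pattern predicates in the SPARQL style, can be sketched over a toy triple store. The data and helper functions below are illustrative stand-ins, not the actual DBpedia/MusicBrainz sets or Graphinder's query engine:

```python
# Minimal sketch: evaluating a conjunctive triple-pattern query over a toy
# RDF-style data graph (invented data mirroring the talk's running example).

DATA = {  # (subject, predicate, object)
    ("Freddie Mercury", "member", "Queen"),
    ("Brian May", "member", "Queen"),
    ("Liar", "writer", "Freddie Mercury"),
    ("Liar", "type", "Single"),
    ("Queen", "formed in", "1971"),
}

def match(pattern, bindings):
    """Yield extended variable bindings for one triple pattern ('?x' = variable)."""
    out = []
    for triple in DATA:
        b = dict(bindings)
        ok = True
        for p, v in zip(pattern, triple):
            if p.startswith("?"):
                if b.get(p, v) != v:   # clash with an earlier binding
                    ok = False
                    break
                b[p] = v
            elif p != v:               # constant mismatch
                ok = False
                break
        if ok:
            out.append(b)
    return out

def answer(patterns):
    """Join all triple patterns left to right; return the surviving bindings."""
    bindings = [{}]
    for pat in patterns:
        bindings = [b2 for b in bindings for b2 in match(pat, b)]
    return bindings

# "singles written by freddie, who is member of the band queen"
query = [
    ("?s", "type", "Single"),
    ("?s", "writer", "?p"),
    ("?p", "member", "Queen"),
]
results = answer(query)
# expect a single binding: ?s = "Liar", ?p = "Freddie Mercury"
```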

Thus, an intuitive interface is needed: NL questions or keywords.

5 Entity Semantic Search: find the relevant entity, return a structured data summary, facts, related entities

Semantic Search has not only been a hot topic in research; in recent years, much of the insights and results from this line of research have been successfully transferred to high-profile commercial applications. All major web search engines today not only search for Web pages but also make use of structured data and employ semantic search technologies for searching these data.

With structured data, we can now directly search for entities, not just documents.

Given keywords or NL questions, search engines find the relevant entity. 6

Relational Semantic Search: find relevant entities involved in a relationship, return entity summaries

Besides addressing information needs that could not be supported before, structured data and the semantics it explicitly captures can be used to enhance the search experience in other ways: for better presentation of the results, and for query construction: semantic auto-completion (not only based on similar strings but also based on the semantics of entities and relationships that can be inferred from the keywords).
7 Semantic Search Problem: understand user inputs as entities and relationships and find relevant answers. single written by freddie queen / singles written by freddie, who is member of the band queen.
Query Translation: What are possible connections (schema-level) between recognized entities and relationships? 1)

2) Query Answering: What are actual connections (data-level) between recognized entities and relationships? 1)

2)
[Data graph: Freddie Mercury, Brian May, Queen, Queen Elizabeth 1, Liar, 1971, single; types Person, Artist, Single; edges member, producer, writer, formed in, marital status]
What are the challenges / main problems in providing such semantic search capabilities? Problem 1: finding entities; essentially an IR problem, but IR approaches need to be adapted to the structured data setting. Problem 2: two directions explored in research.
8 Relational Semantic Search at Facebook: recognizes entities and relationships via LMs, uses a manually specified template (grammar) to find possible connections between them, and computes answers via the resulting translated queries.
my friends, who is member of queen
[Parse tree: {band} [id:Queen1] queen; [member-of-v] is member of, member(); [member-vp] is member of [id:1], member(x, Queen1); [who] who; [user-filter] who is member of [id:1], member(x, Queen1); [user-head] my friends, friends(x, me); [start] my friends, who is member of [id:Queen1], friends(x, me), member(x, Queen1)]
Grammar: set of production rules, capturing all possible connections, i.e. the search space of all parse trees

[start] → [users]
[users] → my friends : friends(x, me)
[...] → is member of [bands] : member(x, $1)
[bands] → {band} : $1
Grammar-based Query Translation: which combination of production rules results in a parse tree that connects the recognized entities and relationships?
How does Facebook solve these challenges? We look at Graph Search, recently launched. It is very encouraging to see that many of our ideas published in the past 5 years can be found in Graph Search: in particular, it is based on query translation over data graphs and uses an internal query engine to compute results; and also, just like our previously presented prototypes, it uses an LM-based ranking model for solving the first problem.

The difference is its approach to the second problem: it uses a grammar. Intuitively, a grammar is a set of production rules capturing all possible connections between entities and relationships (words), i.e. the search space, consisting of all possible parse trees, each representing a query. Every node of the produced parse tree is associated with display text and a formal meaning given as predicates: in this way the produced parse tree can be used both as an internal query and as a query for the user.

Goal: among all possible parse trees captured by this search space, find the parse tree that represents the intended query.

1) When parsing the query, query elements are matched against terminal symbols of the grammar (leaf nodes of the parse trees). 2) Find the production rules that produce the tree.

The grammar can be seen as a template; using the manually specified templates, query translations are computed. Such a template-based approach requires effort in formulating and maintaining the grammar. This is a very practical solution for Facebook, since it focuses on the particular domain of social network data; thus the grammar needs to cover only a limited number of entity types and relationships.

NB: we use the SPARQL-like syntax with triple patterns for consistency; FB queries are similar, conjunctions of predicates where every predicate can also be rewritten as a triple pattern. Requires specification of predicates!
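A toy sketch of this grammar/template-based translation: each production rule pairs surface text with a predicate, so a completed parse is both a readable query and a formal conjunctive query. The rules and the entity id "Queen1" mirror the slide but are invented for illustration; a real grammar would be far larger:

```python
# Sketch of Graph Search-style template translation (invented rules/ids).

ENTITIES = {"queen": "Queen1"}  # {band} terminals

def translate(query):
    """Greedy left-to-right parse: [start] -> [user-head] [user-filter]."""
    preds = []
    text = query.lower().replace(",", " ").split()
    # rule: "my friends" => friends(x, me)
    if text[:2] == ["my", "friends"]:
        preds.append("friends(x, me)")
        text = text[2:]
    # rule: "who is member of {band}" => member(x, $1)
    if text[:4] == ["who", "is", "member", "of"]:
        band = ENTITIES[" ".join(text[4:])]
        preds.append(f"member(x, {band})")
    return preds
```

Running `translate("my friends, who is member of queen")` yields the conjunction `["friends(x, me)", "member(x, Queen1)"]`, i.e. the display text and the formal query are produced by the same parse.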

9 Overview
Finding substructures matching keyword nodes. Different result semantics for different types of data. Commonly used results: Steiner Tree, Steiner Graph. Connect keyword matching elements. AND-semantics: contain one keyword matching element for every query keyword. Minimal substructure heuristic: prefer closely connected keyword nodes / compact results.

Finding a Steiner Tree, Group Steiner Tree, or Connecting Subgraph (Steiner Graph) is NP-hard. Keywords might produce a large number of matching elements in the graph. The graph might be large in size. Large number of (irrelevant) results. Efficiency of finding top-k results. Effectiveness of ranking results.


Closely related: based on the proximity / minimal distance assumption.

Different result semantics for different types of data: textual data (Web pages connected via hyperlinks), DB (tuples connected via foreign keys), XML (elements connected via parent-child edges), RDF graphs, hybrid data graphs.

Different semantics: XML. In an XML tree, every two nodes are connected through their LCA. Not all connected trees are relevant, even if the size is small. The focus is defining query results to prune irrelevant subtrees.

Steiner graph: a connected tree in G that spans a set of nodes Si. The Si are collectively relevant to the query (keyword matching elements). There is one keyword element for every keyword. Group Steiner tree.

Group Steiner Tree [Li et al, WWW01]: spanning one node from each group; each group comprises the elements matching a particular keyword.

Efficiency is a problem / hard: no structure constraints; keywords might produce a large number of matching elements in the data graph; the data graph might be large in size; search complexity increases substantially with the size of the graph; large number of results.
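Because the exact Group Steiner problem mentioned above is NP-hard, a common heuristic is to pick the connecting node that minimizes the summed shortest-path distance to the closest member of each keyword group, via BFS from every group member. This is a generic approximation sketch, not Graphinder's algorithm, and the graph is invented:

```python
# Heuristic Group Steiner sketch: choose the node minimizing the sum, over
# keyword groups, of the distance to that group's nearest member.
from collections import deque

def bfs_dist(graph, src):
    """Unweighted shortest-path distances from src."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in graph.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def best_connecting_node(graph, groups):
    """Return (node, cost) minimizing the summed per-group distance."""
    per_group = []
    for group in groups:
        d = {}
        for member in group:
            for node, dd in bfs_dist(graph, member).items():
                d[node] = min(d.get(node, float("inf")), dd)
        per_group.append(d)
    best = None
    for node in graph:
        if all(node in d for d in per_group):   # reachable from every group
            cost = sum(d[node] for d in per_group)
            if best is None or cost < best[1]:
                best = (node, cost)
    return best

# Invented toy graph echoing the running example (undirected adjacency).
GRAPH = {
    "Liar": ["Freddie", "Single"],
    "Freddie": ["Liar", "Queen"],
    "Queen": ["Freddie", "Brian"],
    "Brian": ["Queen"],
    "Single": ["Liar"],
}
GROUPS = [["Freddie"], ["Queen"], ["Single"]]
```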

Data can generally be regarded as graphs

Different return node semantics [Amer-Yahia et al, VLDB05]

Root of tree. Analyzing keyword patterns. Analyzing data semantics (entity, attributes). Result presentation. Keywords can specify predicates or return nodes. Q1: SIGMOD, Beijing. Q2: SIGMOD, location.

Return nodes may also be implicit. Q1: SIGMOD, Beijing → return node = conf.

Information (subtrees) of return nodes is potentially interesting and is considered as relevant non-matches. Explicit return nodes: analyzing keyword match patterns. Implicit return nodes: analyzing data semantics (entity, attribute) [Kimelfeld et al. SIGMOD 09 (demo)].

10

Graphinder Semantic Search: a translation-based approach for relational keyword search over data graphs

Sem. Auto-completion: Entity + Relationships, Multi-source, Domain-independent, Low manual effort

[Data graph: Freddie Mercury, Brian May, Queen, Queen Elizabeth 1, Liar, 1971, single; types Person, Artist, Single; edges member, producer, writer, formed in, marital status]
Query Translation. Our approach is also a query translation approach. In particular, it has the following two main features. Semantic auto-completion: suggest valid & meaningful queries as the user types. Query translation: translate keyword queries to structured queries.

Entity and relational. Multi-source: translation considers multiple datasets and the links between these datasets; much of our current research in fact is focused on computing duplicates in different datasets and establishing links between datasets on the fly. Domain-independent: does not assume a specific domain. In particular, it does not require templates that have to be specified upfront or customized to a particular domain. Thus, it can be used for more heterogeneous scenarios with multiple domains or for scenarios where datasets evolve quickly. Reduced manual effort means low cost.

Zero manual effort: does not require an expert to specify search forms (e-commerce search), structure templates, translation rules or domain adaptation (Wolfram Alpha, Watson). Interpretation of keywords and structural context, i.e. relevant relations between entities, through on-the-fly graph exploration.

Semantics of keywords are encoded: some systems use search forms in which the semantics of the keyword inputs are manually encoded, e.g. entered keyword inputs denote the name, price and other attributes of the resources to be retrieved. Other systems use structure templates and translation rules to encode these semantics of keywords and to translate them to formal structured queries. We provide a generic mechanism for the interpretation of keywords, to discover these semantics on the fly without manual effort.

11 Graphinder: selected publications
On-demand, domain-independent, relational keyword search over data graphs: Structure index for data graphs (TKDE13b); Top-k exploration of translation candidates (ICDE09); Index-based materialization of graphs (CIKM11a); Ranking results using the structured relevance model (SRM) (CIKM11b).
Multi-source: Deduplication using inferred type information: TYPifier (ICDE13), TYPimatch (WSDM13); On-the-fly deduplication using SRM (WWW11); Ranking with deduplication (ISWC13); Routing keyword queries to relevant data graphs (TKDE13a); Hermes: keyword search over heterogeneous data graphs (SIGMOD09).
Semantic auto-completion: Computing valid query rewrites for given keywords (VLDB14).

Here, we present a list of selected publications that capture the main ideas behind Grafinder (you can find details in the list of references). Today, I will mainly talk about how GRAFinder derives the search space using structure indexes instead of templates, finds the translation, ranks the results, and performs auto-completion. Recently, we have spent much time on the problem of computing duplicates, computing duplicates on the fly for a given query, and integrating duplicates into search results ranking. If there is interest, maybe we can find time to talk about these offline; you can also find detailed information in the references.
12 Query Translation


13 0) Query Translation: constructing a pseudo schema graph representing all possible connections between data elements.
Structure index for the data graph: nodes are groups of data elements that share the same structure pattern. Parameters: structure pattern with edge labels L and paths of maximum length n.
Pseudo schema: a node groups all instances that have the same set of properties; structure pattern: all properties, i.e. all outgoing paths with n = 1, L = all edge labels.
Algorithm: start with one single partition/node representing all instances; split until all nodes are stable, i.e., all contained instances share the same structure pattern.

[Data graph and its pseudo schema: Person, Artist, Thing, Single, Value nodes with member, producer, writer, marital status edges]
For finding possible connections, we compute the pseudo schema graph (PSG) as an offline step. The PSG is needed because for the data you can find on the web today, a schema graph is not available or not complete. To obtain a PSG, we extend the notion of a structure index to data graphs. Nodes are groups: all values connected with Person via marital status; all things connected to Artist via member, producer and writer. The structure pattern can be parameterized.

The node Artist groups all instances that have member, producer and writer as properties. The construction is based on the algorithm for forward bisimulation presented in [16], which essentially is an extension of Paige & Tarjan's algorithm [17] for determining the coarsest stable refinement of a partitioning. This algorithm starts with a partition consisting of one single extension that contains all nodes from the data graph. This extension is successively split into smaller extensions until the partition is stable, i.e., the graph formed by the partition is a complete bisimulation. In order to support the parameterization proposed previously, we make the following modifications to this algorithm.
14 1) Query Translation: constructing the search space representing all possible interpretations of query keywords.
[Search space: Freddie Mercury, Queen, Queen Elizabeth 1, single matched to Person, Artist, Band, Single, Literal nodes with member, producer, writer, marital status edges]
written by freddie queen single
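The partition refinement described above (forward bisimulation with n = 1 over all edge labels) can be sketched as follows. The toy edges simplify the slide's graph so that the two artists share the same property set; all names are illustrative, and this omits the paper's parameterization:

```python
# Sketch of coarsest-partition refinement for a structure index:
# start with one block containing every node, then split blocks until every
# block's nodes share the same outgoing (label, target-block) signature.

EDGES = [  # (source, label, target), invented toy data
    ("FreddieMercury", "member", "Queen"),
    ("FreddieMercury", "writer", "Liar"),
    ("BrianMay", "member", "Queen"),
    ("BrianMay", "writer", "Liar"),
    ("Queen", "formed in", "1971"),
]

def structure_index(edges):
    """Map each node to a block id; nodes with equal signatures share a block."""
    nodes = sorted({s for s, _, _ in edges} | {t for _, _, t in edges})
    block = {n: 0 for n in nodes}          # everyone starts in one block
    while True:
        sig = {n: frozenset((l, block[t]) for s, l, t in edges if s == n)
               for n in nodes}
        ids, new_block = {}, {}
        for n in nodes:                    # assign a fresh id per signature
            if sig[n] not in ids:
                ids[sig[n]] = len(ids)
            new_block[n] = ids[sig[n]]
        if new_block == block:             # stable: partition is a bisimulation
            return block
        block = new_block
```

Here the two artists end up in the same pseudo-schema node because they have identical outgoing property sets, while Queen lands in its own node.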

Data Index

Schema Index. Keyword Interpretation: use an inverted index and an LM-based ranking function to return relevant schema and data elements. Search Space Construction: augment the pseudo schema with query-specific keyword matching elements; all possible connections of predicates applicable to the recognized query keywords. Top-k Subgraph Exploration. Result Retrieval & Ranking.
15 2) Query Translation: score-directed algorithm for finding top-k subgraphs connecting keyword matching elements.
[Search space: Freddie Mercury, Queen, Queen Elizabeth 1, single; Person, Artist, Band, Single, Literal; member, producer, writer, marital status]
written by freddie queen single

Algorithm: score-directed top-k Steiner graph search.
Start: explore all distinct paths starting from the keyword elements.
Every iteration: one-step expansion of the current path with the highest score; when a connecting element is found, merge the paths and add the resulting graph to the candidate list.
Top-k termination: the lowest score in the candidate list > the highest possible score achievable with the paths in the queues yet to be explored.
Termination: all paths of maximum length d have been explored.
Final step: mapping rules translate the Steiner graph to a structured query.
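A simplified, illustrative sketch of the exploration loop above. Uniform edge costs stand in for the (inverse) scores, the depth bound plays the role of d, and the graph and node names are invented; the real algorithm's score bounds and merge bookkeeping are richer:

```python
# Sketch: expand the cheapest frontier path per iteration from each keyword
# element; when some node has been reached from every keyword, record it as a
# connecting element of a candidate Steiner graph.
import heapq

def top_k_connections(graph, keyword_nodes, k=2, d=4):
    """Return up to k (total_cost, connecting_node) pairs, cheapest first."""
    reached = {}                                  # node -> {origin: best cost}
    heap = [(0, kw, kw) for kw in keyword_nodes]  # (cost, origin, node)
    heapq.heapify(heap)
    found = []
    while heap and len(found) < k:
        cost, origin, node = heapq.heappop(heap)
        best = reached.setdefault(node, {})
        if origin in best or cost > d:            # seen cheaper, or too deep
            continue
        best[origin] = cost
        if len(best) == len(keyword_nodes):       # connecting element found
            found.append((sum(best.values()), node))
        for nb in graph.get(node, []):            # one-step expansion
            heapq.heappush(heap, (cost + 1, origin, nb))
    return sorted(found)

GRAPH = {  # invented toy search space
    "freddie": ["Liar", "Queen"],
    "Liar": ["freddie", "single"],
    "single": ["Liar"],
    "Queen": ["freddie"],
}
```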

Graph-to-query mapping: translation rules that map top-ranked graphs to structured queries (SQL, SPARQL); translation rules that map structured queries to natural language questions. Graph matching: triple index (a cover index supporting different triple patterns); various join implementations.

(score of k-ranked query graph)

16 Result Ranking

17 Ranking Using Structured LMs: the keyword query is short and ambiguous, while structured data provides rich structure information: ranking based on LMs capturing both content and structure. Structured LMs for structured results r. A structured LM for queries using structured pseudo-relevant feedback results FR (relevance model). Compute the distance between the query and result LMs.

Our work on ranking is based on the IR literature. However, we need to adapt IR approaches to the structured data setting, solve problems that are specific to it, and exploit the opportunities that arise from using structured data: while the keyword query is short ...

Instead of ad-hoc normalizations and heuristics, we propose a principled method based on language models that represent documents and queries as multinomial distributions over words of the vocabulary. In particular, we build upon the idea of the relevance model, where query and documents are assumed to be generated from a hidden model of the information need. The keyword query is short and ambiguous, while the data (and results) provide rich structure information that can be exploited! A principled approach to relevance based on language models and PRF: estimate the model from the content and structure of the PRF results. Adopt the relevance model as a fine-grained model representing both the content and structure of relevant documents and queries (the relevance class).

18Relevance Models

F Documents

Candidate Documents

Query: the term probabilities of the query model are based on documents. Ranking behaves like similarity search between pseudo-relevant feedback documents and corpus documents.

[Term clouds: freddie queen; Mercury, Brian, May, Protest, Raid, Clash, Bank, West]

[Terms: Mercury, Brian, May, Protest, Raid, Clash, Bank, West]
19 Structured Relevance Models

Query

F Results. Structured Data. The term probabilities of the query model are based on pseudo-relevant structured data. Ranking behaves like similarity search between pseudo-relevant structured results and structured result candidates.

Structured Data

queen single

[Terms: Mercury, Brian, May, Protest, Raid, Clash, Bank, West]

[Terms: Mercury, Brian, May, Protest, Raid, Clash, Bank, West] Candidate Results.
Construct the query model from structured data elements that are close to the query. Index resources in the data graph, where resources are treated as documents and attributes and attribute values are indexed as document terms; use a standard inverted index implementation and an IR search engine to retrieve resources for a given keyword query; an initial run of the query yields the F results.

20

Importance of resource r w.r.t. the query; probability of observing term v in the value of property e of resource r:

v | RM_name | RM_comment
Mercury | .091 | .01
Brian | .082 | .01
Champion | .081 | .02
Protest | .001 | .042
Raid | .006 | .014
(RM_x columns for further edges not shown)

Ranking: construct an edge-specific query model for each unique e from the feedback resources FR, an edge-specific model for every candidate r, and finally compute the distance.

v | RM_name | RM_comment
Mercury | .073 | .01
Brian | .052 | .01
(RM_x columns for further edges not shown)

For all resources r in FR. Query model: the probability of terms in the query model is estimated using the F resources: intuitively, the probability of a term is estimated as the probability of observing it in the F resources (based on the probability of observing the term in the e-value of r, and the probability of e), weighted by the importance of that resource: a resource is more important if query terms are more likely to be observed in that resource, compared to the other resources in F.

Edge-specific resource model: the probability of observing term v in the e-value of r, smoothed with the probability of observing term v in all values of r.

The score of a resource is calculated based on the cross-entropy of the edge-specific RM and the edge-specific ResM, aggregated over every e; alpha allows controlling the importance of edges.

Instead of single entities, we rank complex graphs comprising multiple entities, called Joined Result Tuples (JRTs): complex results are modeled as the geometric mean of the entity models.

Ranking aggregated JRTs: the cross-entropy between the edge-specific RM (query model) and the geometric mean of the combined edge-specific ResMs. The proposed ranking function is monotonic with respect to the individual resource scores (a necessary property for using top-k algorithms).

A language model is constructed for every attribute of the resource to capture the probability of a word being observed via repeated sampling from the content of a specific attribute of r. Lambda controls the weight of the edge-specific attribute; a small value means less emphasis on the terms of the attribute and more emphasis on the terms of the entire resource (the terms in all attributes).

P_e is the probability of observing a word v in the edge-specific attribute a. P_* is the probability of observing a word v in all attributes of r.

Consider the co-occurrences of a word and the query words in the content of a specific attribute a. The sampling process we implement is iid. IID sampling: query words and w are iid-sampled from a unigram distribution representing the content of the specific attribute a; then sample v from a, and then sample the query words k times from a distribution representing the content of all attributes of r.
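The lambda interpolation of P_e and P_* described above can be sketched directly; the resource data is invented and the maximum-likelihood estimates are a simplification:

```python
# Sketch of the two-level smoothing: the probability of word v under attribute
# (edge) e of resource r interpolates the edge-specific estimate P_e with the
# whole-resource estimate P_* via lambda.

def smoothed_prob(v, e, resource, lam=0.8):
    """lam * P_e(v) + (1 - lam) * P_*(v)."""
    edge_tokens = resource[e]
    all_tokens = [t for toks in resource.values() for t in toks]
    p_e = edge_tokens.count(v) / len(edge_tokens)      # edge-specific estimate
    p_star = all_tokens.count(v) / len(all_tokens)     # whole-resource estimate
    return lam * p_e + (1 - lam) * p_star

R = {"name": ["mercury", "freddie"], "comment": ["queen", "singer", "mercury"]}
```

With lam = 1 the model collapses to the pure attribute distribution; with lam = 0 it collapses to the whole-resource distribution, so a word absent from an attribute can still receive nonzero probability.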

21 Query Rewriting

22 Query Rewriting: find syntactically and semantically valid rewrites to suggest as the user types.
[Data graph: Freddie Mercury, Queen, Queen Elizabeth 1, single, Single, writer]
single from freddy mercury que

Data Index

Schema Index. Keyword Interpretation: imprecise / fuzzy matching; match every keyword; token rewriting via syntactic distance. Search Space Construction.
1) single from freddie mercury queen. Token rewriting via semantic distance: single writer freddie mercury queen.
[Matched elements: Freddie Mercury, Queen, Single, writer]

Data Index

Schema Index. Query segmentation: single writer freddie mercury queen. Search Space Construction. Result Retrieval & Ranking.
Keyword / key phrase interpretation: precise matching; match keywords and key phrases.
Benefits: higher selectivity of query terms (quality); reduced number of query terms (efficiency); better search experience.
Challenges: many rewrite candidates, some of which are semantically not valid in the relational setting: single (marital status) writer freddie mercury queen (the Queen of the UK).
High selectivity: exact matching produces fewer candidates; further, clean keywords help in finding more promising keyword elements. Better search experience: the user does not have to type in whole queries, receives a form of guidance, and learns the capabilities of the system.

Challenge: the tasks are to identify syntactic and semantic variants and to find phrases. In the single-entity setting, all of these queries produce results. This is not straightforward in the relational setting: keywords are not connected, so query rewrites may produce semantically odd queries that do not yield any results.

23

Token Rewriting: S is ranked high when the probability that query Q can be observed given S is high. Query Segmentation: S is ranked high when the probability that S can be observed in the data D is high. The probability that users make spelling errors or write semantically related queries is independent of the data D. The denominator is constant given query Q and data D. Based on Bayes' theorem.
[Data graph: Freddie Mercury, Brian May, Queen, Queen Elizabeth 1, Liar, 1971, single; Person, Artist, Single; member, producer, writer, formed in, marital status]
single writer freddy mercury que
1) single writer freddie mercury queen
2) single writer freddrick mercury monarch
3) song writer freddrick mercury head of state

Probabilistic Model for Query Rewriting: the rank of a query rewrite (suggestion) S is based on the probability of observing S in the data, given the query. Alternatively, we can refer to P as the probability of S being the intended query rewrite, given the user query and the data. The first term is the probability of users writing query Q given that the intended query is S and the data is D, i.e., the probability of making errors. The second term is the probability of the intended query being S, given D.

Token rewriting: intuitively, P(Q|S) is high for rewrites S that are syntactically or semantically close to the entered query; hence, syntactically and semantically close queries will be suggested to the user as a result of token rewriting. Query segmentation: intuitively, P(S|D) is high for a given S when S can be found in the data, and moreover when S can be found in the results to the query.

where P(Q|S,D) models the likelihood of observing the user query Q given that the intended query is actually S. P(S|D) and P(Q|D) are the probabilities of observing S and Q respectively, given D.

Given the query Q and the underlying data D, the probability of a query rewrite S can be calculated based on Bayes' theorem. The first term in the numerator models the likelihood of observing the user query Q given that the intended query is actually S. We assume that the probability of users making spelling errors or writing semantically related queries is independent of the underlying data, such that Q depends only on S. The second term in the numerator is the probability of observing the query rewrite S given D. The denominator is fixed given the query Q and the data D and can be treated as a constant.

Therefore, the probability of S given Q and D is determined by two important factors. The first is the probability of Q given S, which reflects the problem of token rewriting. The second is the probability of S given D, which corresponds to query segmentation.
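As a rough illustration of this decomposition (a sketch, not the system's implementation; all names and the toy models below are invented), candidate rewrites can be ranked by P(Q|S) · P(S|D), dropping the constant P(Q|D):

```python
# Rank query rewrites S for a user query Q via Bayes' theorem:
# P(S|Q,D) ∝ P(Q|S) * P(S|D); P(Q|D) is constant per query, so it is dropped.

def rank_rewrites(query, candidates, p_q_given_s, p_s_given_d):
    """Return candidate rewrites sorted by P(Q|S) * P(S|D), best first."""
    scored = [(p_q_given_s(query, s) * p_s_given_d(s), s) for s in candidates]
    scored.sort(reverse=True)
    return [s for _, s in scored]

# Toy P(S|D): how often the rewrite is observed in the data.
freq = {"single writer freddie mercury queen": 0.6,
        "single writer freddrick mercury monarch": 0.3}

# Toy P(Q|S): fraction of keywords shared with the typed query.
def toy_p_q_given_s(q, s):
    shared = len(set(q.split()) & set(s.split()))
    return shared / max(len(q.split()), len(s.split()))

ranked = rank_rewrites("single writer freddy mercury que",
                       list(freq), toy_p_q_given_s, freq.get)
# ranked[0] == "single writer freddie mercury queen"
```

The two factors plug in independently, which is exactly what lets token rewriting and query segmentation be modeled separately in the following slides.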

Token Rewriting: modeling P(Q|S)

Independence assumption: P(Q|S) = ∏_i P(w_i | t_i)

Modeling syntactic and semantic differences

single writer freddy mercury que →
1) single writer freddie mercury queen
2) single writer freddrick mercury monarch
3) single writer freddrick mercury head of state

Split: "|", Concatenate: "+"
Example: single | writer | freddie + mercury | queen
P(w|t) is high when w is syntactically and semantically close to t. A query rewrite S can be seen as a sequence of token and action pairs, where the actions are split and concatenate. Through concatenation, we can form a key phrase that consists of more than one token.
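To make the token/action representation concrete, here is a minimal illustrative helper (the function name is my own) that turns a sequence of (token, action) pairs into segments:

```python
# A rewrite S as a sequence of (token, action) pairs, where the action says
# how the token attaches to what came before:
# "|" starts a new segment (split), "+" extends the current one (concatenate).

def to_segments(pairs):
    segments = []
    for token, action in pairs:
        if action == "+" and segments:
            segments[-1] = segments[-1] + " " + token
        else:
            segments.append(token)
    return segments

s = [("single", "|"), ("writer", "|"),
     ("freddie", "|"), ("mercury", "+"), ("queen", "|")]
# to_segments(s) → ["single", "writer", "freddie mercury", "queen"]
```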

Single token rewrite: syntactic and semantic variants and corrections are only suggested for the current single term
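As a toy illustration of modeling syntactic differences, P(w|t) can be made to decay with the edit distance between the typed keyword and the intended token (a sketch under the independence assumption, not the paper's exact estimator; function names are mine, and the semantic-distance component is omitted):

```python
# Sketch: P(Q|S) = ∏ P(w_i|t_i), where P(w_i|t_i) decays with the
# edit distance between the typed keyword w_i and the intended token t_i.

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def p_keyword_given_token(w, t):
    # Inversely proportional to the syntactic difference (+1 avoids div by zero).
    return 1.0 / (1 + edit_distance(w, t))

def p_query_given_rewrite(query_tokens, rewrite_tokens):
    p = 1.0
    for w, t in zip(query_tokens, rewrite_tokens):
        p *= p_keyword_given_token(w, t)
    return p
```

With this, "que" → "queen" (distance 2) is scored higher than, say, "que" → "monarch", matching the intuition that syntactically close rewrites should be suggested first.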

For modeling token rewriting, since it is not relevant to the actions made for query segmentation, the query rewrite S can be represented as a sequence of tokens by removing the actions. In addition, we assume that query keywords are independent, such that each query keyword depends only on its corresponding token. The probability P(w_i|t_i) models the likelihood of observing a query keyword w_i given that the intended token is t_i. We distribute the probability P(w_i|t_i) inversely proportionally to the syntactic and semantic differences, measured by the edit distance and the semantic distance between query keyword w_i and token rewrite t_i, respectively.

Query Segmentation: modeling P(S|D)

Nth order Markov assumption

where P_D(α_i t_{i+1} | t_1 α_1 t_2 … α_{i-1} t_i) is shorthand for P(α_i t_{i+1} | t_1 α_1 t_2 … α_{i-1} t_i, D).

[Data graph as above; partial query "single writer freddie mercury que": for "single writer freddie", is the next action a concatenate (+) or a split (|)?]

For modeling query segmentation, the probability of a query rewrite S can be represented, using the chain rule, as the product of the probabilities of all token and action pairs given the previously generated sequence. However, for keyword queries with a large number of keywords, computing P(S|D) incurs prohibitive cost when D is large. To address this problem, we make an Nth-order Markov assumption: the probability of an action on a token is approximated as depending only on the context of the N preceding tokens and actions.

Estimating the Probability of Segmentation: maximum likelihood estimation (MLE)

where C(t_i … t_j) denotes the count of occurrences of the token sequence t_i … t_j.

Segmentation in the structured data setting:
- Concatenate two segments s_i and s_j when they co-occur in the data
- Split when s_i and s_j are connected (s_i → s_j), i.e., when the two data elements n_i and n_j mentioning s_i and s_j are connected in the data

[Data graph as above; for "single writer freddie mercury queen": is the action after "single writer freddie" a concatenate or a split?]

To estimate the probability of an action on a token given the N preceding tokens and actions, we can adopt ideas from language modeling. A typical task in language modeling is to predict the next token based on the probability of the token given the preceding context. Using maximum likelihood estimation, this probability can be estimated as the count of co-occurrences of the context and token t_{i+1}, divided by the sum of counts of all tokens that share the same context.
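The MLE estimate described above can be sketched with simple n-gram counts (an illustrative stand-in, not the system's index-based implementation; the toy data is invented):

```python
from collections import Counter

# MLE: P(t | context) = C(context, t) / sum over t' of C(context, t'),
# where counts come from token sequences observed in the data.

def ngram_counts(sequences, n):
    """Count (context, next-token) occurrences for contexts of length n."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n):
            counts[(tuple(seq[i:i + n]), seq[i + n])] += 1
    return counts

def p_next(counts, context, token):
    total = sum(c for (ctx, _), c in counts.items() if ctx == context)
    return counts[(context, token)] / total if total else 0.0

data = [["freddie", "mercury", "queen"],
        ["freddie", "mercury", "single"],
        ["freddie", "mercury", "queen"]]
counts = ngram_counts(data, 2)
# p_next(counts, ("freddie", "mercury"), "queen") → 2/3
```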

How likely is the token to co-occur with the context in the data?

Since the language model is designed for unstructured data, it cannot capture structural information. The event considered there is the occurrence of a token sequence. In the structured data setting, we also need to take the structural information into account.

The intuition is that the tokens in a segment resulting from a concatenation action are supposed to co-occur in the data. For the splitting action, the segments separated by the split are supposed to be connected.

Thus we extend the language model to take into account the connections between the segments.
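This extension can be illustrated on a toy graph: concatenation is evidenced by tokens co-occurring in the same data element, splitting by connections between the elements mentioning the segments. The graph, node contents, and function names below are invented for illustration:

```python
# Toy data: each node carries a label (a token sequence); edges connect nodes.
nodes = {1: ["freddie", "mercury"], 2: ["queen"], 3: ["single"]}
edges = {(1, 2), (1, 3)}  # Freddie Mercury -- Queen, Freddie Mercury -- single

def cooccur(t1, t2):
    """Count nodes whose label contains t1 immediately followed by t2
    (evidence for concatenating t1 and t2 into one segment)."""
    return sum(1 for lbl in nodes.values()
               for i in range(len(lbl) - 1)
               if lbl[i] == t1 and lbl[i + 1] == t2)

def connected(seg1, seg2):
    """Count edges between a node mentioning seg1 and a node mentioning seg2
    (evidence for splitting between the two segments)."""
    mentions = lambda seg: {n for n, lbl in nodes.items() if set(seg) <= set(lbl)}
    return sum(1 for a in mentions(seg1) for b in mentions(seg2)
               if (a, b) in edges or (b, a) in edges)

# "freddie mercury": concatenate (the tokens co-occur in node 1);
# "freddie mercury | queen": split (nodes 1 and 2 are connected).
```

These counts would replace the plain co-occurrence counts in the MLE estimate: C(s + t) from `cooccur`-style events, C(s | t) from `connected`-style events.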

Two cases: (1) l(s_i) ≥ N; (2) l(s_i) < N

(1) When the previously induced segment s_i has length equal to or greater than N, i.e. l(s_i) ≥ N, it suffices to focus on s_i(N), its last N tokens, to predict the next action α_i on t_{i+1}.

Estimation of probability

where C(s + t) denotes the count of co-occurrences of the sequence s t in D, and C(s | t) the count of all occurrences of token t connected to segment s.

Estimating the Probability of Segmentation, Case 1: the previous segment s_i has length equal to or greater than the context N
Example: freddie + j. + mercury | queen vs. freddie + j. + mercury + queen

Basically, there are two cases to consider. In the first case, to estimate the probability of the concatenation action on token t_{i+1}, the event we need to consider is the co-occurrence of the sequence s_i(N) and t_{i+1}. To estimate the probability of the splitting action on token t_{i+1}, the event we should consider is the occurrence of token t_{i+1} connected to segment s_i(N).


(2) When the previous segment s_i has length less than N, i.e. l(s_i) < N, the action α_i on the next token t_{i+1} depends on s_i and P_i(N), the set of segments preceding s_i that, together with s_i, contain at most N tokens in total.

Estimation of probability

where C(P | s) denotes the count of all occurrences of the segment s connected to all segments in P.

Estimating the Probability of Segmentation, Case 2: the previous segment s_i has length less than the context N
Example: single | writer | freddie + mercury

In the second case, to estimate the probability of the concatenation action on token t_{i+1}, the event we need to consider is the occurrence of the segment s_i(N) t_{i+1}, namely the segment s_{i+1}, connected to all segments in P_i(N). To estimate the probability of the splitting action on token t_{i+1}, the event we should consider is the occurrence of token t_{i+1} connected to all segments in P_i(N) as well as to s_i.

Experimental Results & Conclusions

Graphinder: a relational keyword search approach for suggesting query completions, translating queries and ranking results.

Keyword translation performance:
- Query translation and index-based approaches are at least one order of magnitude faster than online in-memory (bidirectional) search
- Query translation is comparable to index-based approaches, but requires less space
- Compared with EASE, our index-based solution (not discussed here) reduces storage requirements by up to 86% and improves performance by more than 50%

Keyword translation result quality:
- According to a recent benchmark, our ranking consistently outperforms all existing ranking systems in precision, recall and MAP (10%–30% improvement)

Effect of query rewriting:
- Better user experience
- Improves efficiency by reducing the number of query terms
- Improves the quality / selectivity of query terms
- The effect depends on the complexity of the queries and the underlying keyword search engine
- Enables tight integration of query suggestion and translation

From research prototypes to Graphinder: a powerful, flexible, low upfront-cost semantic search system.

Improves quality for dirty queries (due to token rewriting); further, query segmentation helps to improve the selectivity of query terms

The performance of PVQR is consistently better than that of the other two systems for both datasets. PVQR is about 3–4 times faster than BQR for IMDb and about 2 times faster for Wikipedia. These differences are primarily due to the pruning capability of PVQR, i.e., PVQR prunes non-valid results. Compared to PQR, the number of valid sub-query rewrites that have to be tracked is smaller. The number of partial rewrites (segments) considered by BQR is much larger still than for PQR, as it does not focus on the context but considers all possible combinations of previously obtained segments.

Thanks!

Tran Duc [email protected]
http://sites.google.com/site/kimducthanh/

References (1)

[VLDB14] Yongtao Ma, Thanh Tran. Probabilistic Query Rewriting for Efficient and Effective Keyword Search on Graph Data. In International Conference on Very Large Data Bases (VLDB'14), Hangzhou, China, September 2014.
[ISWC13] Daniel Herzig, Roi Blanco, Peter Mika, Thanh Tran. Federated Entity Search Using On-the-Fly Consolidation. In International Semantic Web Conference (ISWC'13), Sydney, Australia, October 2013.
[ICDE13] Yongtao Ma, Thanh Tran. TYPifier: Inferring the Type Semantics of Structured Data. In International Conference on Data Engineering (ICDE'13), Brisbane, Australia, April 2013.
[WSDM13] Yongtao Ma, Thanh Tran. TYPiMatch: Type-specific Unsupervised Learning of Keys and Key Values for Heterogeneous Web Data Integration. In International Conference on Web Search and Data Mining (WSDM'13), Rome, Italy, February 2013.
[TKDE12a] Thanh Tran, Günter Ladwig, Sebastian Rudolph. Managing Structured and Semi-structured RDF Data Using Structure Indexes. In Transactions on Knowledge and Data Engineering journal.
[TKDE12b] Thanh Tran, Lei Zhang. Keyword Query Routing. In Transactions on Knowledge and Data Engineering journal.

References (2)

[WWW12] Daniel Herzig, Thanh Tran. Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration. In Proceedings of the 21st International World Wide Web Conference (WWW'12), Lyon, France, April 2012.
[CIKM11a] Günter Ladwig, Thanh Tran. Index Structures and Top-k Join Algorithms for Native Keyword Search Databases. In Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM'11), Glasgow, UK, October 2011.
[CIKM11b] Veli Bicer, Thanh Tran. Ranking Support for Keyword Search on Structured Data using Relevance Models. In Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM'11), Glasgow, UK, October 2011.
[SIGIR11] Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey Pound, Henry S. Thompson, Thanh Tran Duc. Repeatable and Reliable Search System Evaluation using Crowdsourcing. In Proceedings of the 34th Annual International ACM SIGIR Conference (SIGIR'11), Beijing, China, July 2011.
[ICDE09] Duc Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano. Top-k Exploration of Query Graph Candidates for Efficient Keyword Search on RDF. In Proceedings of the 25th International Conference on Data Engineering (ICDE'09), Shanghai, China, March 2009.
[SIGMOD09] Haofen Wang, Thomas Penin, Kaifeng Xu, Junquan Chen, Xinruo Sun, Linyun Fu, Yong Yu, Thanh Tran, Peter Haase, Rudi Studer. Hermes: A Travel through Semantics in the Data Web. In Proceedings of SIGMOD Conference 2009, Providence, USA, June–July 2009.

Backup

such as RDF, RDFa and Linked Data! How can we leverage this for enhancing the search experience?
