Keyword Proximity Search in Complex Data Graphs

Konstantin Golenberg • Benny Kimelfeld • Yehoshua Sagiv

The Selim and Rachel Benin School of Engineering and Computer Science

Overview

Benny Kimelfeld, The Hebrew University, SIGMOD’08

• Keyword Proximity Search
• System Overview
• Algorithm for Answer Generation
• Ranking Answers
• Conclusions & Future Work

Schema-Free Extraction of Data

Nowadays…

• Exposure to many databases
  − Different types (relational, XML, RDF…)
  − Different schemas
• Traditional paradigms of querying (e.g., SQL, XQuery, SPARQL) are not easy to use and, moreover, they require a thorough understanding of the schema
• Goal: Enable users to instantly pose (inaccurate) queries without knowing the schema
• The natural (and popular) option: Keyword Search
  − Problem: Inherently different from standard IR

Keyword Proximity Search (KPS)

• Data have varying degrees of structure
  – Relational (w/ foreign keys), XML (w/ id-references)
  – Natural representation by a graph
  – Usually, data-centric rather than document-centric
• A query is a set of keywords
  − No structural constraints
• The Goal: Extract meaningful parts of data w.r.t. the keywords

• Agrawal et al., ICDE’02 • Hristidis et al., VLDB’02,03, ICDE’03 • Bhalotia et al., VLDB’05 • Kacholia et al., VLDB’06 • Ding et al., ICDE’07 • Liu et al., SIGMOD’06 • Wang et al., VLDB’06 • Luo et al., SIGMOD’07 …

Example: Search in RDB

search Belgium , Brussels

Cities
ID   Name        Population
22   Amsterdam   1101407
73   Brussels    951580

Organizations
ID    Name   Head Q.
135   EU     73
175   ESA    81

Memberships
Country   Org.
B         135
NL        135

Countries
Code   Name          Area    Capital
NL     Netherlands   37330   22
B      Belgium       30510   73

search Belgium , Brussels

→ Brussels is the capital city of Belgium

search Belgium , Brussels

→ Brussels hosts EU and Belgium is a member
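The two answers above can be read as two different connections in a graph whose nodes are tuples and whose edges are foreign-key references. The sketch below is only an illustration under that assumption: the node labels (`Countries:B`, `Memberships:B-135`, …) and the helper names are made up for this example, and the path enumeration is a plain depth-first search, not the algorithm of the talk.

```python
from collections import defaultdict

# Tuples of the four example tables become nodes; foreign keys become edges.
# ESA's headquarters (city 81) is omitted because that city is not in the table.
FOREIGN_KEYS = [
    ("Countries:NL", "Cities:22"),               # Capital
    ("Countries:B", "Cities:73"),                # Capital
    ("Organizations:135", "Cities:73"),          # Head Q.
    ("Memberships:B-135", "Countries:B"),
    ("Memberships:B-135", "Organizations:135"),
    ("Memberships:NL-135", "Countries:NL"),
    ("Memberships:NL-135", "Organizations:135"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def simple_paths(graph, start, goal, path=None):
    """Yield every simple path from start to goal (depth-first)."""
    path = (path or []) + [start]
    if start == goal:
        yield path
        return
    for nxt in sorted(graph[start]):
        if nxt not in path:
            yield from simple_paths(graph, nxt, goal, path)


GRAPH = build_graph(FOREIGN_KEYS)
PATHS = sorted(simple_paths(GRAPH, "Countries:B", "Cities:73"), key=len)
for p in PATHS:
    print(" - ".join(p))
```

The shorter path is the "capital" connection; the longer one passes through the membership tuple and the EU tuple, which is the "hosts EU / is a member" connection.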

Example: Search in XML

search Yannakakis , Approximation

[Figure: a DBLP XML tree]
dblp
  article
    title: On the Approximation of Maximum Satisfiability
    author: Mihalis Yannakakis
  article
    title: Improved Approximation Algorithms for MAX SAT
    author: Takao Asano
    author: David P. Williamson
    references
      cite

search Yannakakis , Approximation

→ Yannakakis wrote a paper about Approximation

search Yannakakis , Approximation

→ Yannakakis is cited by a paper about Approximation

Data Graphs

[Figure: a data graph of two companies. Structural nodes: company, supplies, supply, product, supplier, president, department, manager, hq. Keyword nodes: papers, A4, coffee, Cohen, Summers, Paris.]

• Structural and keyword nodes
• Edges and nodes may have weights
  – Weak relationships are penalized by large weights
• Each keyword has one occurrence in the data graph (technical)

Queries

Q = { Summers , Cohen , coffee }

• Queries are sets of keywords from the data graph

An Answer is a Reduced Subtree

• An answer is a subtree of the data graph
• Contains all keywords of the query
• Has no redundant edges (and nodes)
• 3 variants: directed, undirected, strong (undirected, kw’s are leaves)

[Callout on the slide: “This paper”]
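As a minimal sketch of the "no redundant edges (and nodes)" condition, the check below handles only the undirected variant and assumes that "reduced" means no proper subtree still contains all keywords. Under that assumption, an undirected tree is reduced exactly when every leaf is a keyword node. The function name and the edge-list representation are hypothetical choices for this example.

```python
from collections import defaultdict


def is_reduced_answer(edges, keywords):
    """Check the undirected variant: a tree that contains all keywords
    and whose leaves are all keyword nodes.

    edges: list of (u, v) pairs; keywords: set of node labels.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    nodes = set(adj)
    if not nodes or not set(keywords) <= nodes:
        return False
    # A tree has exactly |V| - 1 distinct edges and is connected.
    if len({frozenset(e) for e in edges}) != len(nodes) - 1:
        return False
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        x = stack.pop()
        if x not in seen:
            seen.add(x)
            stack.extend(adj[x] - seen)
    if seen != nodes:
        return False
    # A non-keyword leaf could be removed, so it would be redundant.
    return all(n in keywords for n in nodes if len(adj[n]) == 1)


TREE = [
    ("company", "supplies"), ("supplies", "supply"), ("supply", "product"),
    ("product", "coffee"), ("company", "president"), ("president", "Cohen"),
]
print(is_reduced_answer(TREE, {"coffee", "Cohen"}))
print(is_reduced_answer(TREE + [("company", "hq")], {"coffee", "Cohen"}))
```

In the second call the extra `hq` leaf carries no keyword, so the subtree is not reduced.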

Previous Solutions

• Lack of guarantees
  − Highly relevant answers might be missed, and / or
  − Inefficient algorithms
• Rather simple data sets – a (very) small number of relevant answers
  − They considered data that are essentially collections of entities, namely, DBLP, IMDB, Lyrics, etc.
  − An answer is usually within the scope of an entity
    → e.g., the keywords appear in a single movie
• Crucial problems ignored
  − In particular, the “repeated information” problem
  − Especially pervasive in complex data graphs

Contributions

A system for keyword proximity search

• An algorithm for generating answers with guarantees
  − Does not miss (valuable) answers
  − Efficient (polynomial delay)
  − Answers generated in a 2-approximate order by height
• A ranking technique that is aware of the repeated-information problem
  − Gives preference to answers with low similarity to earlier ones
• Experimentation over a highly-cyclic data graph
  − The Mondial database
  − Many “meaningful” connections among keywords

The MONDIAL Database
Institute for Informatics, Georg-August-Universität Göttingen
http://www.dbis.informatik.uni-goettingen.de/Mondial/


Challenges

• Huge no. of answers; not instantiated!
  − Not simple to generate all relevant answers, even if ranking is ignored
  − For practical ranking functions, enumerating the answers in ranked order is probably impossible
    • For example, finding the smallest answer is the intractable Steiner-tree problem
• Redundancy / repeated information
  − Many answers are very similar (altogether they provide a low amount of information)
  − Crucial in complex (highly cyclic) data graphs

We employ a two-phase architecture:

Architecture: Generator + Ranker

• User requests: search(keywords), then next k answers
• Answer Generator: generates the next M·k answers (simplified ranking function)
• Ranker: ranks all answers generated up to now (minus the printed ones)
• Output: top-k answers (relative to those that have already been printed)

Simplified ranking at first [Bhalotia et al., ICDE’02, VLDB’05]
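The generator/ranker loop described above might be sketched as follows. This is a hedged illustration, not the system's code: the class name `TwoPhaseSearch`, the `score` callback (higher is better), and the use of a plain list as the pool of unprinted answers are all assumptions made for the example.

```python
import itertools


class TwoPhaseSearch:
    """Each request pulls M*k new answers from the generator, re-ranks
    everything not yet printed, and prints the k best."""

    def __init__(self, generator, score, k, M):
        self._gen = iter(generator)   # yields answers in the simplified order
        self._score = score           # full ranking function (assumed callback)
        self._k = k
        self._M = M
        self._pool = []               # generated but not yet printed

    def next_answers(self):
        # Phase 1: the generator produces the next M*k answers.
        self._pool.extend(itertools.islice(self._gen, self._M * self._k))
        # Phase 2: the ranker orders all unprinted answers generated so far.
        self._pool.sort(key=self._score, reverse=True)
        top, self._pool = self._pool[:self._k], self._pool[self._k:]
        return top


# Toy usage: answers are numbers and the score is the number itself.
search = TwoPhaseSearch([3, 1, 4, 1, 5, 9, 2, 6], score=lambda a: a, k=2, M=2)
first = search.next_answers()
second = search.next_answers()
third = search.next_answers()
print(first, second, third)
```

Note that the second batch is ranked relative to what has already been printed: answers skipped in the first round stay in the pool and can still surface later.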


Generating the Top Answers: Not Trivial!

To demonstrate the difficulty of generating the “good” (top) answers, let’s see how existing approaches operate on a simple example:

Find the Answers in this Example!

[Figure: a small data graph with structural nodes organization, headq, name, location, country, city, and keyword nodes EU and Brussels.]

The BANKS Approach [Bhalotia et al., ICDE’02, VLDB’05]

Answers are directed subtrees

∀ nodes v (in a “good” order) and keyword occurrences:
    Generate the min-height subtree emanating from v
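To make "min-height subtree emanating from v" concrete, here is a simplified sketch in the spirit of backward search. It is not the BANKS implementation: it runs one Dijkstra search per keyword over reversed edges and then picks the root whose farthest keyword is closest. The graph encoding (dict of dicts with edge weights) and the node names `org_name` and `city_name` are assumptions for this example.

```python
import heapq


def backward_distances(graph, source):
    """Dijkstra over reversed edges: dist[v] is the length of the
    shortest directed path from v to source."""
    rev = {}
    for u, nbrs in graph.items():
        for v, w in nbrs.items():
            rev.setdefault(v, {})[u] = w
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, x = heapq.heappop(heap)
        if d > dist.get(x, float("inf")):
            continue  # stale heap entry
        for y, w in rev.get(x, {}).items():
            nd = d + w
            if nd < dist.get(y, float("inf")):
                dist[y] = nd
                heapq.heappush(heap, (nd, y))
    return dist


def best_root(graph, keywords):
    """Return (root, height) minimizing the distance to the farthest
    keyword, or None if no node reaches all keywords."""
    dists = [backward_distances(graph, k) for k in keywords]
    candidates = set(dists[0]).intersection(*dists[1:])
    if not candidates:
        return None
    root = min(candidates, key=lambda v: (max(d[v] for d in dists), v))
    return root, max(d[root] for d in dists)


# Directed toy graph loosely modeled on the example slide (unit weights).
G = {
    "organization": {"org_name": 1, "headq": 1},
    "org_name": {"EU": 1},
    "headq": {"city": 1},
    "city": {"city_name": 1},
    "city_name": {"Brussels": 1},
    "country": {"city": 1},
}
print(best_root(G, ["EU", "Brussels"]))
```

Because each root contributes only its single min-height subtree, other subtrees with the same root are never produced in this scheme.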

The BANKS Approach (cont’d)

[Figure: another answer in the example graph.]

What about this answer? Never generated!

The NUITS Approach [Ding et al., ICDE’07]

Answers are undirected subtrees

∀ nodes v (in a “good” order):
    Generate the min-weight subtree that includes v

The NUITS Approach (cont’d)

• This node is redundant: it is actually the previous answer!
• Again, the previous answer! This node is redundant

The NUITS Approach (cont’d)

What about this answer? Never generated!

Severe limit on # of generated answers! (≤ one per node)

The DISCOVER / DBXplorer Approach [Hristidis et al., VLDB’02,03, ICDE’03] [Agrawal et al. ICDE’02]

∀ possible queries Q (from the schema) in inc. size:
    Evaluate Q over the database

• All answers are generated in ranked order!
• Easy to implement! DBMS queries – no in-memory graph algorithms

The DISCOVER / DBXplorer Approach (cont’d)

• Inefficient! But many queries do not generate any answer at all!
  − Worst case: exponential in the data
• Limited Ranking! By the query (rather than the answer) weight

We Need Generators w/ Guarantees!

• All answers are generated
  − In particular, each of the “relevant” answers is produced at some point (100% recall is achievable)
• Controlled order of answers
  − For instance, increasing weight, increasing height, approximate (what is the ratio?) / heuristic order
• Efficiency
  − The top-k answers should be generated efficiently
  − Bound on time between successive answers

Order by Increasing Weight / Height

If $a$ precedes $a'$ in the enumeration, then $\text{weight}(a) \le \text{weight}(a')$ (respectively, $\text{height}(a) \le \text{height}(a')$).

The first $k$ answers generated are then the top-$k$ answers.

Approximate and Heuristic Orders

• Approximate order: There is a provable bound on the extent to which the actual order can deviate from the optimal one
• Heuristic order: Intuitively, expected to be close to the optimal order, but there is no guarantee

C-Approximate Order (inc. Weight / Height)

If $a$ precedes $a'$ in the enumeration, then $\text{weight}(a) \le C \cdot \text{weight}(a')$ (respectively, $\text{height}(a) \le C \cdot \text{height}(a')$).

The first $k$ answers generated are then a C-approximation of the top-$k$ answers [Fagin et al., PODS’01]
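A small checker can make the notion of a C-approximate order concrete. It assumes the reading that an order is C-approximate when no answer is preceded by one more than C times as large in weight (or height); the function name and the list-of-values input are choices made for this sketch.

```python
def is_c_approximate_order(values, c):
    """values[i] is the weight (or height) of the i-th generated answer.

    The order is C-approximate if, whenever one answer precedes another,
    its value is at most c times the value of the later one. It suffices
    to compare each answer with the largest value seen before it.
    """
    largest_so_far = None
    for value in values:
        if largest_so_far is not None and largest_so_far > c * value:
            return False
        if largest_so_far is None or value > largest_so_far:
            largest_so_far = value
    return True


print(is_c_approximate_order([2, 3, 4, 3], c=2))
print(is_c_approximate_order([5, 2], c=2))
print(is_c_approximate_order([1, 2, 2, 3], c=1))
```

With `c=1` the check reduces to the exact increasing order of the previous slide.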

Our Approach

• PODS’06: Enum. by (exact / approx) inc. weight
  − Problem: Repeated application of Steiner-tree alg’s
  − “Heavy” – hard to implement efficiently
• Here: Follow the basic approach of PODS’06
• But, we adopt the BANKS idea of using height (≠ weight) for the enumeration order
  − Recall: BANKS might miss highly relevant answers
• Thus, we bypass Steiner trees and obtain a much faster algorithm
• Our alg. has all 3 guarantees: answers are not missed, approximate order, poly. delay

An Overview of the Algorithm

Task: Enum. by (2-approx.) increasing height
  − Lawler / Yen method
Task: Find (a 2-approx. of) the shortest answer under constraints
  − Types of constraints:
    • Inclusion: “include edge e”
    • Exclusion: “exclude edge e”
  − The intricate part …
Task: Find the shortest answer (w/o constraints)
  − Backward-search (Dijkstra) iterators (~ BANKS)
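The Lawler / Yen scheme named above can be sketched generically: repeatedly output the best remaining solution, then split its search space into subspaces described by inclusion and exclusion constraints. The code below is an illustration only. It swaps the constrained-answer search for a brute-force solver over an explicit toy list of candidates (so the order here is exact rather than 2-approximate), and it assumes that no candidate edge set contains another, which holds for reduced subtrees. All names are hypothetical.

```python
import heapq
import itertools


def make_brute_force_solver(candidates):
    """candidates: dict mapping frozenset of edges -> cost.
    Toy stand-in for 'find the shortest answer under constraints'."""
    def best_under(include, exclude):
        feasible = [(cost, sol) for sol, cost in candidates.items()
                    if include <= sol and not (sol & exclude)]
        if not feasible:
            return None
        return min(feasible, key=lambda cs: (cs[0], sorted(cs[1])))
    return best_under


def lawler_enumerate(best_under):
    """Yield (cost, solution) pairs in increasing cost order."""
    tie = itertools.count()   # tie-breaker so the heap never compares sets
    heap = []

    def push(include, exclude):
        found = best_under(include, exclude)
        if found is not None:
            cost, sol = found
            heapq.heappush(heap, (cost, next(tie), sol, include, exclude))

    push(frozenset(), frozenset())
    while heap:
        cost, _, sol, include, exclude = heapq.heappop(heap)
        yield cost, sol
        # Partition the rest of this subspace: the i-th subproblem keeps the
        # first i-1 free edges of sol and excludes the i-th one.
        forced = set(include)
        for edge in sorted(sol - include):
            push(frozenset(forced), exclude | {edge})
            forced.add(edge)


# Toy candidates: each letter stands for an edge.
CANDIDATES = {frozenset("ab"): 2, frozenset("ac"): 3,
              frozenset("bd"): 4, frozenset("cd"): 5}
RESULT = list(lawler_enumerate(make_brute_force_solver(CANDIDATES)))
print([cost for cost, _ in RESULT])
```

The delay between successive outputs is bounded by the number of edges in the last answer times the cost of one constrained search, which is why an efficient constrained solver is the crux.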

Finding an Answer under Constraints

• Inclusion: “include edge e”
• Exclusion: “exclude edge e”

[Figure: a data graph with structural nodes organization, headq, name, location, country, city, and keyword nodes Belgium and Brussels.]

Handling exclusion constraints is easy: simply remove the excluded edges from the graph
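As a minimal sketch of that step, assuming the graph is stored as a dict of dicts with edge weights (a representation chosen for this example, not taken from the talk), exclusion constraints amount to filtering the adjacency map before the search runs:

```python
def without_edges(graph, excluded):
    """Return a copy of the directed graph with the excluded (u, v)
    edges removed; nodes are kept even if they lose all edges."""
    return {
        u: {v: w for v, w in nbrs.items() if (u, v) not in excluded}
        for u, nbrs in graph.items()
    }


TOY = {"a": {"b": 1, "c": 1}, "b": {}, "c": {}}
print(without_edges(TOY, {("a", "b")}))
```

The original graph is left untouched, so each subproblem can apply its own set of exclusions.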

Inclusion Constraints are the Problem

• Inclusion: “include edge e”
• Exclusion: “exclude edge e”

The shortest subtree that contains the kw’s and satisfies the const’s … but it is not an answer!

• Not reduced (has redundancy: a redundant edge)
• Moreover, includes a previously printed answer
• Sometimes, no answer at all!

The Correct Answer

• Inclusion: “include edge e”
• Exclusion: “exclude edge e”

Technique:

1. Generate a min-height subtree (as in the wrong solution)
2. Not an answer? → modify
   • Intricate to guarantee 2-approx.
   • Details in the proceedings

Running Times

[Figure: running time in sec (axis 0 to 500) vs. # keywords (2 to 10), with one series for 100 answers and one for 1000 answers.]

Each entry is an avg. of 4 queries

Alg. Order vs. Weight Order

How many answers are generated in order to obtain the top-k (among 1000) according to weight?

[Figure: generation rank (axis 0 to 1000) vs. weight-based rank (10 to 100), with one series per query size, from 2 kw’s to 10 kw’s.]

Each entry is an avg. of 4 queries

Effective Approx. Ratio: Height ↑

[Figure: four panels (2, 3, 4, and 5 keywords) plotting the effective approx. ratio in % against k (answers), for k from 100 to 9100.]

Effective approx. ratio: worst / best (among first k)

Effective Approx. Ratio: Weight ↑

[Figure: four panels (2, 3, 4, and 5 keywords) plotting the effective approx. ratio in % against k (answers), for k from 100 to 9100.]

Effective approx. ratio: worst / best (among first k)

Page 46: The Selim and Rachel Benin School of Engineering and Computer Science Keyword Proximity Search in Complex Data Graphs Konstantin Golenberg Benny Kimelfeld

OverviewOverview

Benny KimelfeldKeyword Proximity Search in Complex Data GraphsKeyword Proximity Search in Complex Data Graphs The Hebrew UniversitySIGMOD’08SIGMOD’08

Keyword Proximity SearchKeyword Proximity Search

System OverviewSystem Overview

Algorithm for Answer GenerationAlgorithm for Answer Generation

Ranking AnswersRanking Answers

Conclusions & Future WorkConclusions & Future Work

Page 47: The Selim and Rachel Benin School of Engineering and Computer Science Keyword Proximity Search in Complex Data Graphs Konstantin Golenberg Benny Kimelfeld

The Basic Ranking Function

$$\text{abs-rel}(a) = \frac{1}{\mathrm{weight}(a)}$$

$$\mathrm{weight}(a) = \sum_{\text{node} \in a} \mathrm{weight}(\text{node}) + \sum_{\text{edge} \in a} \mathrm{weight}(\text{edge})$$

Page 48: The Selim and Rachel Benin School of Engineering and Computer Science Keyword Proximity Search in Complex Data Graphs Konstantin Golenberg Benny Kimelfeld

Determining the Weight of an Edge

[Diagram: a data-graph fragment with organization, country, borders and capital nodes, annotated as follows.]

• Many org’s enter a country → weak connection (large weight)
• An org. enters many countries → weak connection (large weight)
• Strong connection (small weight)
• Strongest!

Page 49: The Selim and Rachel Benin School of Engineering and Computer Science Keyword Proximity Search in Complex Data Graphs Konstantin Golenberg Benny Kimelfeld

The Basic Ranking Function (cont’d)

$$\text{abs-rel}(a) = \frac{1}{\mathrm{weight}(a)}$$

$$\mathrm{weight}(a) = \sum_{\text{node} \in a} \mathrm{weight}(\text{node}) + \sum_{\text{edge} \in a} \mathrm{weight}(\text{edge})$$

weight(node) = fixed (1)

$$\mathrm{weight}(\text{edge}) = \log\bigl(1 + \alpha \cdot \mathrm{out}(v_1 \to t_2) + (1 - \alpha) \cdot \mathrm{in}(t_1 \to v_2)\bigr)$$

where edge = (v1, v2), tag(vi) = ti, and
• out(v1→t2) = # t2 nodes with edges from v1
• in(t1→v2) = # t1 nodes with edges to v2

Relevant answers, but …
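The basic ranking function above can be sketched in Python as follows. This is only an illustration under stated assumptions: the data graph is taken to be a list of directed edges plus a tag per node, and the names `edge_weight` and `abs_rel`, the toy graph, the default α = 0.5 and the natural logarithm are all hypothetical choices (the slide does not fix α or the base of the logarithm).

```python
import math

def edge_weight(edge, edges, tag, alpha=0.5):
    """weight(edge) = log(1 + alpha*out(v1->t2) + (1-alpha)*in(t1->v2))."""
    v1, v2 = edge
    t1, t2 = tag[v1], tag[v2]
    # out(v1->t2): number of t2-tagged nodes with edges from v1
    out_count = len({w for (u, w) in edges if u == v1 and tag[w] == t2})
    # in(t1->v2): number of t1-tagged nodes with edges to v2
    in_count = len({u for (u, w) in edges if w == v2 and tag[u] == t1})
    return math.log(1 + alpha * out_count + (1 - alpha) * in_count)

def abs_rel(nodes, answer_edges, edges, tag, alpha=0.5):
    """abs-rel(a) = 1 / weight(a), with a fixed weight of 1 per node."""
    weight = 1.0 * len(nodes) + sum(
        edge_weight(e, edges, tag, alpha) for e in answer_edges
    )
    return 1.0 / weight

# Hypothetical toy graph: one organization with three member countries,
# and one country with a single capital.
tag = {"org1": "organization", "c1": "country", "c2": "country",
       "c3": "country", "cap1": "capital"}
edges = [("org1", "c1"), ("org1", "c2"), ("org1", "c3"), ("c1", "cap1")]

w_member = edge_weight(("org1", "c1"), edges, tag)   # organization -> country
w_capital = edge_weight(("c1", "cap1"), edges, tag)  # country -> capital
```

In this toy graph the membership edge is heavier than the capital edge, because the organization has edges to several country nodes while the country has a single capital.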

Page 50: The Selim and Rachel Benin School of Engineering and Computer Science Keyword Proximity Search in Complex Data Graphs Konstantin Golenberg Benny Kimelfeld

Answers with High Similarity

[Diagram: four answers, each connecting the country nodes Netherlands, Belgium and France through one organization node: EU, ADB, NATO and ESA, respectively.]

Page 51: The Selim and Rachel Benin School of Engineering and Computer Science Keyword Proximity Search in Complex Data Graphs Konstantin Golenberg Benny Kimelfeld

Combinations of Connections

[Diagram: four answers that connect Netherlands, Belgium and France through different combinations of organization and borders nodes.]

But each individual answer is relevant!

Page 52: The Selim and Rachel Benin School of Engineering and Computer Science Keyword Proximity Search in Complex Data Graphs Konstantin Golenberg Benny Kimelfeld

Dynamic Ranking

[Diagram: a pool of candidate answers (organization-based and borders-based answers for Netherlands, Belgium and France) and the output list of answers printed so far.]

Next-Answer()
  a ← extract-top-candidate()
  print(a)
  for all candidates c and pairs of keywords k1, k2
    if c and a connect k1 and k2 similarly, then penalize(c)

What does “similarly” mean?

Page 53: The Selim and Rachel Benin School of Engineering and Computer Science Keyword Proximity Search in Complex Data Graphs Konstantin Golenberg Benny Kimelfeld

Two Types of “Similarity”

[Diagram: a printed answer a and two candidates c1 and c2 over Netherlands, Belgium and France, using the organizations ESA and EU and a borders node; k1, k2 = Belgium, France.]

• The same connection → Penalty: 1
• Isomorphic connection (same schema) → Penalty: p (≤ 1)

2 options:
• Sum over printed answers
• Max over printed answers
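The two kinds of similarity can be illustrated with a small Python sketch. It rests on assumptions that the slide does not spell out: a connection between k1 and k2 is modelled as the list of nodes on the path between them, "isomorphic" is read as "same sequence of tags", and dissimilar connections get a penalty of 0. The function name `connection_penalty`, the node names and the value p = 0.1 are hypothetical.

```python
def connection_penalty(path_c, path_a, tag, p):
    """Penalty for a candidate's k1-k2 connection w.r.t. one printed answer."""
    if path_c == path_a:
        return 1.0  # the same connection
    if [tag[v] for v in path_c] == [tag[v] for v in path_a]:
        return p    # isomorphic connection (same schema)
    return 0.0      # assumed: no penalty for dissimilar connections

# Hypothetical nodes for k1, k2 = Belgium, France.
tag = {"Belgium": "country", "France": "country",
       "ESA": "organization", "EU": "organization", "b1": "borders"}

printed = ["Belgium", "ESA", "France"]    # connection in the printed answer
same = ["Belgium", "ESA", "France"]       # candidate reusing that connection
isomorphic = ["Belgium", "EU", "France"]  # same schema, other organization
different = ["Belgium", "b1", "France"]   # connection through a borders node

penalties = [connection_penalty(c, printed, tag, p=0.1)
             for c in (same, isomorphic, different)]
```

With several printed answers, the per-answer penalties of a keyword pair would then be combined by either `sum` or `max`, matching the two options listed on the slide.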

Page 54: The Selim and Rachel Benin School of Engineering and Computer Science Keyword Proximity Search in Complex Data Graphs Konstantin Golenberg Benny Kimelfeld

The General Ranking Function

$$\text{abs-rel}(c) = \frac{1}{\mathrm{weight}(c)}$$

$$\text{rpt-inf}(c) = \sum_{k_1, k_2 \in \text{kw's}} \; \sum_{\text{printed answers}} (p \text{ or } 1) \quad \text{or} \quad \sum_{k_1, k_2 \in \text{kw's}} \; \max_{\text{printed answers}} (p \text{ or } 1)$$

$$\mathrm{score}(c) = \frac{1}{\dfrac{1}{\text{abs-rel}(c)} + \varepsilon \cdot \text{rpt-inf}(c)}$$
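A minimal Python sketch of how this score could drive the Next-Answer() loop. It assumes the score is read as 1 / (1/abs-rel(c) + ε · rpt-inf(c)), that is, 1 / (weight(c) + ε · rpt-inf(c)), and that a caller-supplied `penalty(c, a, k1, k2)` returns 1, p or 0. The function names, the candidate weights and the toy penalty are hypothetical and not taken from the talk.

```python
from itertools import combinations

def rpt_inf(c, printed, keywords, penalty, mode="sum"):
    """Repeated information of candidate c w.r.t. the printed answers."""
    agg = sum if mode == "sum" else max
    total = 0.0
    for k1, k2 in combinations(keywords, 2):
        per_answer = [penalty(c, a, k1, k2) for a in printed]
        total += agg(per_answer) if per_answer else 0.0
    return total

def score(weight, rpt, eps):
    # 1 / (1/abs-rel(c) + eps * rpt-inf(c)), where 1/abs-rel(c) = weight(c)
    return 1.0 / (weight + eps * rpt)

def rank(candidates, weight, keywords, penalty, eps, n, mode="sum"):
    """Repeatedly output the top-scoring candidate; scores are recomputed
    against the answers printed so far."""
    pool, printed = list(candidates), []
    while pool and len(printed) < n:
        a = max(pool, key=lambda c: score(
            weight[c], rpt_inf(c, printed, keywords, penalty, mode), eps))
        pool.remove(a)
        printed.append(a)
    return printed

# Hypothetical candidates: two organization-based answers, one borders-based.
weight = {"EU": 10.0, "ADB": 10.5, "borders": 11.0}
kind = {"EU": "org", "ADB": "org", "borders": "borders"}
keywords = ["Netherlands", "Belgium", "France"]

def penalty(c, a, k1, k2):
    # Toy rule: answers of the same kind connect every pair isomorphically (p = 0.5).
    return 0.5 if kind[c] == kind[a] else 0.0

static_order = rank(weight, weight, keywords, penalty, eps=0.0, n=3)
dynamic_order = rank(weight, weight, keywords, penalty, eps=2.0, n=3)
```

With ε = 0 the order follows weight alone; with ε = 2 the second organization-based answer is penalized after the first one is printed, so the borders-based answer moves ahead of it.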

Page 55: The Selim and Rachel Benin School of Engineering and Computer Science Keyword Proximity Search in Complex Data Graphs Konstantin Golenberg Benny Kimelfeld

Score Loss vs. Diversity

• 5 keywords
• Avg. of 4 queries
• Top-20 answers

[Figure: two stacked panels, “Sum, p=1.0” and “Max, p=0.1”. x-axis: ε, 0 to 10. y-axis: % of max., 0 to 100. Series: Score (1/weight); Connections; Connections (up to isomorphism).]

The bottom configuration is better than the top one: smaller reduction of score for similar/higher degree of diversity

Page 56: The Selim and Rachel Benin School of Engineering and Computer Science Keyword Proximity Search in Complex Data Graphs Konstantin Golenberg Benny Kimelfeld

Overview

Keyword Proximity Search

System Overview

Algorithm for Answer Generation

Ranking Answers

Conclusions & Future Work

Page 57: The Selim and Rachel Benin School of Engineering and Computer Science Keyword Proximity Search in Complex Data Graphs Konstantin Golenberg Benny Kimelfeld

Conclusions

• KPS in complex data graphs has inherent problems that are ignored in existing systems
• 2-component architecture: answer generator & ranker
• 1st component: enumeration algorithm with guarantees
  − Efficient, correct (no missed answers), 2-approximate order by height
  − In the paper: extension to OR semantics (exact order)
• 2nd component: dynamically ranks candidates by penalizing them for repeated information
  − Our experiments over Mondial suggest a tuning of the parameters that gives the best tradeoff between information gain and score loss

Page 58: The Selim and Rachel Benin School of Engineering and Computer Science Keyword Proximity Search in Complex Data Graphs Konstantin Golenberg Benny Kimelfeld

Current & Future Research

• Improve / optimize the answer generator
  − Successful: Parallelism
  − Concurrent queries?
• Implement different answer generators
  − E.g., by (approx.) increasing weight [KS-PODS’06]
• Assessment by humans
  − Relevancy / repeated information
  − Methodology example: [Zhang et al., SIGIR’02]
• Other aspects
  − Answer presentation →

Page 59: The Selim and Rachel Benin School of Engineering and Computer Science Keyword Proximity Search in Complex Data Graphs Konstantin Golenberg Benny Kimelfeld

Answer Presentation

• On the Web, we instantly get the meaning of an answer (Web page) by the <title>, URL and, possibly, a snippet of the text
• In KPS, understanding the meaning of a subtree is not straightforward: we need to derive the semantics from the graphical presentation

Page 60: The Selim and Rachel Benin School of Engineering and Computer Science Keyword Proximity Search in Complex Data Graphs Konstantin Golenberg Benny Kimelfeld

What’s the Meaning of this Answer?

[Figure: a snapshot of the BANKS demo (http://www.cse.iitb.ac.in/banks/); label: IMDB]

Harder in XML!
• No division into relations (everything is an element / attribute)
• What information is needed to describe a node?

Page 61: The Selim and Rachel Benin School of Engineering and Computer Science Keyword Proximity Search in Complex Data Graphs Konstantin Golenberg Benny Kimelfeld

Answer Presentation

• On the Web, we instantly understand the meaning of an answer (Web page) by reading the <title> element, the URL and, possibly, a snippet of the text
• In KPS, understanding the meaning of a subtree is cumbersome, since we need to derive the semantics from the presentation

Solution (under development):
• Graphical presentation is based on restructuring answers in terms of entities, properties and relationships
• Apply heuristics for determining the minimal set of properties required for each entity

Page 62: The Selim and Rachel Benin School of Engineering and Computer Science Keyword Proximity Search in Complex Data Graphs Konstantin Golenberg Benny Kimelfeld
Page 63: The Selim and Rachel Benin School of Engineering and Computer Science Keyword Proximity Search in Complex Data Graphs Konstantin Golenberg Benny Kimelfeld

Thank you!

Questions?