keyword search on structured data using relevance models

Keyword Search on Structured Data using Relevance Models*. Veli Bicer, FZI Research Center for Information Technology (FZI Forschungszentrum Informatik), Karlsruhe, Germany. Joint work with Thanh Tran from the Semantic Search Group, AIFB Institute, KIT. * Based on the papers at the 20th ACM Conference on Information and Knowledge Management (CIKM’11) and the 10th International Semantic Web Conference (ISWC’11).

Upload: thanh-tran

Post on 11-May-2015


TRANSCRIPT

Page 1: Keyword Search on Structured Data using Relevance Models

© FZI Forschungszentrum Informatik 1


Keyword Search on Structured Data using Relevance Models*

Veli Bicer, FZI Research Center for Information Technology, Karlsruhe, Germany

Joint work with Thanh Tran from Semantic Search Group, AIFB Institute, KIT

* based on the papers @ 20th ACM Conference on Information and Knowledge Management (CIKM’11) and @ 10th International Semantic Web Conference (ISWC’11)

Page 2: Keyword Search on Structured Data using Relevance Models


About the presenter

Veli Bicer
Research Scientist at FZI Research Center for Information Technology, Karlsruhe, Germany
Associated Researcher at Karlsruhe Service Research Institute (KSRI); KSRI was founded by IBM Germany

Research interests
Semantic data management and search
Relational learning
Software engineering (for services)

Projects
German Internet Research Programme THESEUS: KOIOS semantic search in the Core Technology Cluster; TEXO Internet-of-Services use case
Previously: EU ICT Artemis, Satine, Saphire and Ride

Page 3: Keyword Search on Structured Data using Relevance Models


Agenda

Introduction
Keyword search on structured data
Relevance models

Approach
Ranking scheme using relevance models
Top-k query processing

Experiments

Application
Search on environmental data

Conclusion

Page 4: Keyword Search on Structured Data using Relevance Models


Introduction

Page 5: Keyword Search on Structured Data using Relevance Models


Keyword Search on Structured Data

Rationale
4 billion web searches daily
Data-driven websites have a relational database backend
Predefined search forms constrain retrieval
SQL is difficult to learn

Goal: simplify data retrieval by not requiring SQL

Page 6: Keyword Search on Structured Data using Relevance Models


Keyword Search on Structured Data

Example Who is the character played by Audrey Hepburn in Roman Holiday?

Person
id | name
p1 | Audrey Hepburn
p3 | Kate Winslet
… | …

Movie
id | title | plot
m1 | Roman Holiday | Princess Ann is a royal princess of unknow of an …
m2 | The Holiday | Iris swaps her cottage for the holiday along the next two …
m3 | The Aviator | Hughes and Hepburn go to a holiday and fly together ..
… | … | …

Character
id | name | pid | mid
c1 | Princess Ann | p1 | m1
c3 | Iris Simpkins | p3 | m2
… | … | … | …

Query result: a tree of tuples that is reduced with respect to the query.

Which would you rather write?

SELECT Character.name
FROM Person, Character, Movie
WHERE Person.id = Character.pid
  AND Character.mid = Movie.id
  AND Person.name = 'Audrey Hepburn'
  AND Movie.title = 'Roman Holiday';

or “Hepburn Holiday”
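To make the comparison concrete, the structured query can be run against a tiny in-memory copy of the three example tables. This is only a sketch: the use of SQLite, the helper names `build_demo_db` and `character_played`, and the shortened plot texts are assumptions for illustration, not part of the talk.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database holding a few rows of the slide's example."""
    conn = sqlite3.connect(":memory:")
    conn.executescript('''
        CREATE TABLE Person (id TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE Movie (id TEXT PRIMARY KEY, title TEXT, plot TEXT);
        CREATE TABLE "Character" (id TEXT PRIMARY KEY, name TEXT, pid TEXT, mid TEXT);
        INSERT INTO Person VALUES ('p1', 'Audrey Hepburn');
        INSERT INTO Person VALUES ('p3', 'Kate Winslet');
        INSERT INTO Movie VALUES ('m1', 'Roman Holiday', 'Princess Ann is a royal princess');
        INSERT INTO Movie VALUES ('m2', 'The Holiday', 'Iris swaps her cottage for the holiday');
        INSERT INTO "Character" VALUES ('c1', 'Princess Ann', 'p1', 'm1');
        INSERT INTO "Character" VALUES ('c3', 'Iris Simpkins', 'p3', 'm2');
    ''')
    return conn

def character_played(conn, actor, title):
    """Return the names of characters played by `actor` in the movie `title`."""
    rows = conn.execute(
        '''
        SELECT c.name
        FROM Person p
        JOIN "Character" c ON p.id = c.pid
        JOIN Movie m ON c.mid = m.id
        WHERE p.name = ? AND m.title = ?
        ''',
        (actor, title),
    ).fetchall()
    return [row[0] for row in rows]

if __name__ == "__main__":
    print(character_played(build_demo_db(), "Audrey Hepburn", "Roman Holiday"))
```

The user has to know the schema, the join path and the exact attribute values to write this query; the keyword query “Hepburn Holiday” needs none of that.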

Page 7: Keyword Search on Structured Data using Relevance Models


Keyword Search on Structured Data

Many approaches have been proposed recently
Focus on performance
Less consideration of ranking

Recent study (Coffman and Weaver, CIKM 2010)
Effectiveness of previous work is below expectations
The problem lies in the ranking strategies, not in performance

Two major types of ranking schemes:
IR-inspired TF-IDF ranking (Liu et al, 2006) (SPARK, 2007)
Proximity-based approaches (BANKS, 2002) (Bidirectional, 2005)

Problem: a robust and principled approach is missing!

Page 8: Keyword Search on Structured Data using Relevance Models


Relevance Models

Proposed by Lavrenko and Croft (SIGIR 01)

Assumes that queries and documents are samples from a hidden representation space and are generated from the same generative model

The initial representation of relevance is unknown; it is estimated from the query

[Figure: how query Q and document D are related in the classical model, the language model and the relevance model; R denotes relevance]
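As an illustration of estimating relevance from the query, the sketch below follows the commonly used estimate P(w|R) ≈ Σ_D P(w|D) · P(D|Q), with P(D|Q) proportional to the query likelihood under a uniform document prior. The toy documents, the additive smoothing constant `mu` and the function names are assumptions made for this example; they are not taken from the talk.

```python
from collections import Counter

def doc_model(text, vocab, mu=0.5):
    """Additively smoothed unigram model P(w|D) over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: (counts[w] + mu) / (total + mu * len(vocab)) for w in vocab}

def relevance_model(query, docs):
    """Estimate P(w|R) as the query-likelihood-weighted mixture of document models."""
    terms = [q.lower() for q in query]
    vocab = sorted({w for d in docs for w in d.lower().split()} | set(terms))
    models = [doc_model(d, vocab) for d in docs]
    likelihoods = []
    for model in models:
        likelihood = 1.0
        for q in terms:
            likelihood *= model[q]  # P(Q|D) = product of P(q|D)
        likelihoods.append(likelihood)
    norm = sum(likelihoods)
    return {
        w: sum(m[w] * l for m, l in zip(models, likelihoods)) / norm
        for w in vocab
    }

DOCS = [
    "audrey hepburn roman holiday",
    "katharine hepburn the aviator",
    "kate winslet the holiday",
]

if __name__ == "__main__":
    rm = relevance_model(["hepburn", "holiday"], DOCS)
    for word, p in sorted(rm.items(), key=lambda item: -item[1]):
        print(f"{word:10s} {p:.3f}")
```

Words that co-occur with both query terms (here "audrey" and "roman") receive more probability mass than words from documents that match only one term, which is the expansion effect the relevance model is meant to capture.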

Page 9: Keyword Search on Structured Data using Relevance Models


Approach

Page 10: Keyword Search on Structured Data using Relevance Models


Overview of Approach

[Figure: pipeline of the approach in nine steps: (1) Query, (2) PRF, (3) Query RM, (4) Res. RM, (5) Res. Score D(RMQ||RMR), (6) Query Generation, (7) Structured Queries, (8) Top-k Query Proc., (9) Result Ranking]

Query (1)
words | p
hepburn | 0.5
holiday | 0.5

Query RM (3)
words | p
hepburn | 0.21
holiday | 0.15
audrey | 0.13
katharine | 0.09
princess | 0.01
roman | 0.01
… | …

Res. RM (4)
words | p
hepburn | 0.12
holiday | 0.18
audrey | 0.11
katharine | 0.05
princess | 0.00
roman | 0.06
… | …

Result Ranking (9)
Title | Name
Roman Holiday | Audrey Hepburn
Breakfast at Tiff. | Audrey Hepburn
The Aviator | Katharine Hepburn
The Holiday | Kate Winslet

Page 11: Keyword Search on Structured Data using Relevance Models


Data Model

Different kinds of data, e.g. relational, XML and RDF data

Data graph of nodes and edges, G = (V, E)
Resource nodes and attribute nodes
Every resource is typed
Resources have unique ids (e.g. primary keys)
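A minimal sketch of such a data graph is shown below, assuming relational rows are turned into typed resource nodes with attribute edges and foreign-key edges between resources. The class name `DataGraph` and its methods are hypothetical and only illustrate the model described on this slide.

```python
from dataclasses import dataclass, field

@dataclass
class DataGraph:
    """Typed resources with attribute edges and resource-to-resource edges."""
    types: dict = field(default_factory=dict)       # resource id -> type
    attributes: list = field(default_factory=list)  # (resource id, edge label, text)
    relations: list = field(default_factory=list)   # (source id, edge label, target id)

    def add_resource(self, rid, rtype, **attrs):
        """Add a resource node with a unique id, a type and attribute nodes."""
        if rid in self.types:
            raise ValueError(f"duplicate resource id: {rid}")
        self.types[rid] = rtype
        for label, value in attrs.items():
            self.attributes.append((rid, label, value))

    def link(self, source, label, target):
        """Add an edge between two existing resources (e.g. a foreign key)."""
        if source not in self.types or target not in self.types:
            raise KeyError("both resources must exist before linking")
        self.relations.append((source, label, target))

def example_graph():
    """Encode one Person, Movie and Character row of the running example."""
    g = DataGraph()
    g.add_resource("p1", "Person", name="Audrey Hepburn")
    g.add_resource("m1", "Movie", title="Roman Holiday")
    g.add_resource("c1", "Character", name="Princess Ann")
    g.link("c1", "pid_fk", "p1")
    g.link("c1", "mid_fk", "m1")
    return g
```

The same structure could hold XML elements or RDF resources, since only ids, types, attribute values and links are assumed.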

Page 12: Keyword Search on Structured Data using Relevance Models


Edge-Specific Relevance Models

A set of feedback resources FR is retrieved from an inverted keyword index:
E.g. Q = {Hepburn, Holiday}, FR = {m1, p1, p4, m2, c2, m3}

An edge-specific relevance model is built for each unique edge e. It combines two factors:
Importance of a resource w.r.t. the query
Probability of a word at the resource

Inverted Index (steps 1 to 3)
princess → m1, c1
breakfast → m3
hepburn → m3, p1, p4, c2
melbourne → p2
iris → c3
holiday → m1, m2, m3
ann → m1, c2
… → …

[Figure: the feedback resources FR and their edge-specific relevance models. Example resources: p1 (name: Audrey Hepburn; birthplace: Ixelles Belgium) and m3 (title: The Holiday; plot: Iris swaps her cottage for the holiday along the next two …)]
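The two factors named above can be sketched as follows. The exact formula is not preserved in this transcript, so the code assumes a simple form: for each edge e, the word probabilities of each feedback resource at e are weighted by that resource's (smoothed) query likelihood and then renormalized. The toy resources, the constant `eps` and all function names are illustrative assumptions.

```python
from collections import Counter, defaultdict

RESOURCES = {
    "p1": {"name": "Audrey Hepburn", "birthplace": "Ixelles Belgium"},
    "p4": {"name": "Katharine Hepburn", "birthplace": "Connecticut USA"},
    "m1": {"title": "Roman Holiday", "plot": "Princess Ann is a royal princess"},
    "m2": {"title": "The Holiday", "plot": "Iris swaps her cottage for the holiday"},
}

def tokens(text):
    return text.lower().split()

def build_index(resources):
    """Inverted index: word -> set of resource ids containing it."""
    index = defaultdict(set)
    for rid, attrs in resources.items():
        for text in attrs.values():
            for word in tokens(text):
                index[word].add(rid)
    return index

def feedback_resources(index, query):
    """FR: every resource that matches at least one query keyword."""
    fr = set()
    for q in query:
        fr |= index.get(q.lower(), set())
    return fr

def query_likelihood(attrs, query, eps=0.01):
    """Importance of a resource w.r.t. the query (ad hoc additive smoothing)."""
    words = [w for text in attrs.values() for w in tokens(text)]
    counts = Counter(words)
    likelihood = 1.0
    for q in query:
        likelihood *= (counts[q.lower()] + eps) / (len(words) + eps)
    return likelihood

def edge_relevance_models(resources, query):
    """One word distribution per edge label, estimated from FR."""
    fr = feedback_resources(build_index(resources), query)
    weights = {rid: query_likelihood(resources[rid], query) for rid in fr}
    models = defaultdict(lambda: defaultdict(float))
    mass = defaultdict(float)
    for rid in fr:
        for edge, text in resources[rid].items():
            counts = Counter(tokens(text))
            total = sum(counts.values())
            for word, count in counts.items():
                # probability of the word at the resource, weighted by importance
                models[edge][word] += weights[rid] * count / total
            mass[edge] += weights[rid]
    return {e: {w: p / mass[e] for w, p in m.items()} for e, m in models.items()}
```

Keeping one model per edge means that "hepburn" in a name and "hepburn" in a plot contribute to different distributions, which is the point of making the relevance model edge-specific.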

Page 13: Keyword Search on Structured Data using Relevance Models


Edge Specific Resource Models

Each resource (a tuple) is also represented as an RM, since final results (joined tuples) are obtained by combining resources

Edge-specific resource model (ResM) (step 4)

Score of a resource (step 5): cross-entropy of the edge-specific RM and the ResM
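A small sketch of cross-entropy scoring is given below. It assumes that both models are dictionaries of word probabilities per edge, that words missing from the resource model get a small probability `floor`, and that the score is the negated cross-entropy summed over shared edges so that higher is better. These choices are assumptions for illustration; the slide's own formula is not preserved in the transcript.

```python
import math

def cross_entropy(query_rm, resource_rm, floor=1e-6):
    """H(RM, ResM) = -sum_w RM(w) * log ResM(w), with a floor for unseen words."""
    return -sum(
        p * math.log(max(resource_rm.get(word, 0.0), floor))
        for word, p in query_rm.items()
        if p > 0
    )

def resource_score(query_rms, resource_rms):
    """Negated cross-entropy summed over the edges that both sides share."""
    shared = [edge for edge in resource_rms if edge in query_rms]
    if not shared:
        return float("-inf")
    return -sum(cross_entropy(query_rms[e], resource_rms[e]) for e in shared)

# Toy models for the "name" edge only (hypothetical numbers).
QUERY_RM = {"name": {"audrey": 0.5, "hepburn": 0.5}}
AUDREY = {"name": {"audrey": 0.5, "hepburn": 0.5}}
KATHARINE = {"name": {"katharine": 0.5, "hepburn": 0.5}}

if __name__ == "__main__":
    print(resource_score(QUERY_RM, AUDREY))
    print(resource_score(QUERY_RM, KATHARINE))
```

A resource whose word distribution is close to the query's relevance model has a low cross-entropy and therefore ranks higher.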

Page 14: Keyword Search on Structured Data using Relevance Models


Smoothing

Well-known technique to address data sparseness and improve the accuracy of RMs (and LMs)
The word probability at an attribute is the core probability for both the query RM and the resource RM

Local smoothing

The neighborhood of an attribute a consists of other attributes a’ such that:
a and a’ share the same resource
the resources of a and a’ are of the same type
the resources of a and a’ are connected over a FK

[Figure: neighborhood of a]

Page 15: Keyword Search on Structured Data using Relevance Models


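Smoothing

[Figure: smoothing example with resources p1 (type Person; name: Audrey Hepburn; birthplace: Ixelles Belgium), p4 (type Person; name: Katharine Hepburn; birthplace: Connecticut USA) and c1 (type Character; name: Princess Ann; linked via pid_fk). The word probabilities for the name attribute of p1 are shown unsmoothed and after each type of smoothing.]

words | unsmoothed | same resource | same type | FK
audrey | 0.5 | 0.4 | 0.37 | 0.36
hepburn | 0.5 | 0.4 | 0.39 | 0.38
ixelles | | 0.1 | 0.09 | 0.08
belgium | | 0.1 | 0.09 | 0.08
katharine | | | 0.02 | 0.01
connecticut | | | 0.02 | 0.01
usa | | | 0.02 | 0.01
princess | | | | 0.035
ann | | | | 0.035

Smoothing of each type is controlled by weights γ1, γ2, γ3, which are control parameters set in experiments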
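The weighting can be sketched as a linear mixture. The slide's formula is not preserved in this transcript, so the form below is an assumption: the attribute's own distribution keeps weight 1 − Σγ and each neighborhood type contributes its distribution with weight γ. The function name `smooth` and the choice γ = 0.2 are illustrative.

```python
def smooth(own, neighbors):
    """Mix an attribute's word distribution with its neighbors' distributions.

    own: {word: probability} for the attribute itself.
    neighbors: list of (gamma, {word: probability}) pairs, one per neighborhood type.
    """
    total_gamma = sum(gamma for gamma, _ in neighbors)
    if not 0.0 <= total_gamma <= 1.0:
        raise ValueError("gammas must sum to a value in [0, 1]")
    mixed = {word: (1.0 - total_gamma) * p for word, p in own.items()}
    for gamma, dist in neighbors:
        for word, p in dist.items():
            mixed[word] = mixed.get(word, 0.0) + gamma * p
    return mixed

# name and birthplace of p1 share the same resource (first neighborhood type)
NAME_P1 = {"audrey": 0.5, "hepburn": 0.5}
BIRTHPLACE_P1 = {"ixelles": 0.5, "belgium": 0.5}
SMOOTHED = smooth(NAME_P1, [(0.2, BIRTHPLACE_P1)])

if __name__ == "__main__":
    print(SMOOTHED)
```

With γ = 0.2 the name attribute keeps 0.4 for each of its own words and assigns 0.1 to each birthplace word, so a query mentioning "belgium" no longer gets zero probability at the name edge.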

Page 16: Keyword Search on Structured Data using Relevance Models


Ranking JRTs

Ranking aggregated JRTs (step 9): cross-entropy between the edge-specific RM (query model) and the geometric mean of the combined edge-specific ResMs

The proposed score is monotonic w.r.t. the individual resource scores, a desired property for most top-k algorithms
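Why the geometric mean gives monotonicity can be seen in a short derivation. It is a sketch under simplifying assumptions that are not stated on the slide: a single edge, n combined resources r_1, …, r_n with resource models P_{r_i}, and an unnormalized geometric mean written as \bar{P}.

```latex
\begin{align*}
\bar{P}(w) &= \prod_{i=1}^{n} P_{r_i}(w)^{1/n} \\
H(RM, \bar{P}) &= -\sum_{w} RM(w) \log \bar{P}(w) \\
               &= -\sum_{w} RM(w) \cdot \frac{1}{n} \sum_{i=1}^{n} \log P_{r_i}(w) \\
               &= \frac{1}{n} \sum_{i=1}^{n} H(RM, P_{r_i})
\end{align*}
```

Under these assumptions the cross-entropy of the combined result is the average of the cross-entropies of its resources, so improving the score of any single resource can never worsen the score of the joined result.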

Page 17: Keyword Search on Structured Data using Relevance Models


Query Translation*

Mapping of keywords to data elements
Results in a set of keyword elements

Data graph exploration
Search for substructures (query graphs) connecting keyword elements
Bi-directional exploration of query graphs operates on a summary of the data graph only

Top-k computation
Search guided by a scoring function to output only the top-k queries

Query graphs to be processed
Free vs. non-free variables

*[Tran et al. ICDE’09]

[Figure (steps 6 and 7): keyword elements (p1 and p4 with name Hepburn, type Person; m1 and m3 with title Holiday, type Movie) are located in a summary graph whose nodes include Person, Character, Movie, Location, Studio and Producer and whose edges include name, title, type, bornIn, pid_fk, mid_fk, hasDist, hasLoc, worksFor and is-a. The resulting query graph has the variables ?p (type Person, name Hepburn), ?c (type Character) and ?m (type Movie, title Holiday), connected by pid_fk and mid_fk.]
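The exploration step can be illustrated with a much simplified sketch: a plain breadth-first search that connects two keyword elements over a small summary graph. This is not the bi-directional, score-guided top-k exploration of the cited work; the edge list `SUMMARY_EDGES` and the function names are assumptions chosen to mirror the figure.

```python
from collections import deque

# (source class, edge label, target class); a hand-picked part of the summary graph
SUMMARY_EDGES = [
    ("Character", "pid_fk", "Person"),
    ("Character", "mid_fk", "Movie"),
    ("Person", "bornIn", "Location"),
]

def neighbors(node):
    """Yield (label, other node) for every edge touching node, ignoring direction."""
    for source, label, target in SUMMARY_EDGES:
        if source == node:
            yield label, target
        elif target == node:
            yield label, source

def connect(start, goal):
    """Return the shortest alternating node/edge path from start to goal, or None."""
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for label, nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [label, nxt]))
    return None

if __name__ == "__main__":
    # "Hepburn" maps to Person.name and "Holiday" to Movie.title
    print(connect("Person", "Movie"))
```

The path found between the two keyword elements corresponds to the join structure of the query graph: Person and Movie are connected through Character via pid_fk and mid_fk.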

Page 18: Keyword Search on Structured Data using Relevance Models


Top-k Query Processing

Top-k query processing (TQP) is highly common in Web-accessible databases
Return the K highest-ranked answers
Avoid unnecessary accesses to the database

TQP assumes
Scoring function and attribute values are known a priori (e.g. RankJoin)
Attribute values are combined by an aggregation function
Sorted access (SA) and random access (RA) probes

How to adapt TQP to return the top-k relevant results?
Results are joined sets of resources
Scores are query-dependent, so no indexing is possible

Idea (step 8):
Retrieve resources for non-free variables and rank them
Use SA on those initially retrieved resources
Use RA to find other resources

Page 19: Keyword Search on Structured Data using Relevance Models


Top-k Query Processing

Result candidate c = <(x1,…,xk), score>
Complete when all variables are bound to some resources
xi = * indicates an unbound variable

Binding operator c’ = (c, xi → ri)

Threshold determines an upper bound for unseen resources
Scheduling between SA and RA
A tight bound is desired

Person
id | name | S(r)
p1 | Audrey Hepburn | 0.20
p3 | Katharine Hepburn | 0.18
p5 | Philip Hepburn | 0.13
p6 | Anna Hepburn | 0.12

Movie
id | title | S(r)
m2 | The Holiday | 0.19
m1 | Roman Holiday | 0.18
m3 | Holiday Blues | 0.09
m4 | Family Holiday | 0.08

Character (scores not yet known)
id | name | S(r)
c1 | Princess Ann |
c2 | Katharine Hepburn |
c3 | Iris Simpkins |
c4 | Louise |

Query graph: ?p (type Person, name Hepburn), ?c (type Character), ?m (type Movie, title Holiday), joined via pid_fk and mid_fk

Bound on unseen Character scores: 0.11
Threshold: 0.50
Priority queue: <(p1,*,*),0.50>, <(*,*,m2),0.50>
Output (K=1): empty

Page 20: Keyword Search on Structured Data using Relevance Models


Top-k Query Processing

Definitions and the Person, Movie and Character tables are as on the previous slide; only the processing state changes.

Bound on unseen Character scores: 0.11
Threshold: 0.48
Priority queue: <(p1,*,*),0.50>, <(*,*,m2),0.50>, <(p3,*,*),0.48>
Output (K=1): empty

Page 21: Keyword Search on Structured Data using Relevance Models


Top-k Query Processing

Definitions and the Person and Movie tables are as on the previous slide. In the Character table, c1 (Princess Ann) now has S(r) = 0.10.

Bound on unseen Character scores: 0.10
Threshold: 0.47
Priority queue: <(p1,c1,*),0.49>, <(*,*,m2),0.50>, <(p3,*,*),0.48>
Output (K=1): empty

Page 22: Keyword Search on Structured Data using Relevance Models


Top-k Query Processing

Definitions and the Person and Movie tables are as on the previous slide. In the Character table, c1 (Princess Ann) has S(r) = 0.10 and c3 (Iris Simpkins) now has S(r) = 0.05.

Bound on unseen Character scores: 0.09
Threshold: 0.46
Priority queue: <(p1,c1,*),0.49>, <(*,c3,m2),0.44>, <(p3,*,*),0.48>
Output (K=1): empty

Page 23: Keyword Search on Structured Data using Relevance Models


Top-k Query Processing

Definitions and all three tables are as on the previous slide (c1 = 0.10, c3 = 0.05).

Bound on unseen Character scores: 0.09
Threshold: 0.46
Priority queue: <(*,c3,m2),0.44>, <(p3,*,*),0.48>
Complete candidate: <(p1,c1,m1),0.48>
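The numbers in this walkthrough can be reproduced with a few lines of code. The sketch assumes that a candidate's score is the sum of the scores of its bound resources plus an upper bound for each unbound variable, and that the threshold is the sum of the current per-variable bounds. This additive form is inferred from the figures on the slides (for example 0.20 + 0.11 + 0.19 = 0.50); the function names are hypothetical.

```python
SCORES = {
    "p": {"p1": 0.20, "p3": 0.18, "p5": 0.13, "p6": 0.12},
    "c": {"c1": 0.10, "c3": 0.05},
    "m": {"m2": 0.19, "m1": 0.18, "m3": 0.09, "m4": 0.08},
}

def upper_bound(candidate, scores, bounds):
    """Best score a (partial) candidate can still reach.

    candidate: {variable: resource id or None for an unbound variable}.
    bounds: {variable: highest score an unbound variable could contribute}.
    """
    total = 0.0
    for variable, resource in candidate.items():
        if resource is None:
            total += bounds[variable]
        else:
            total += scores[variable][resource]
    return total

def threshold(bounds):
    """Upper bound on the score of any candidate built from unseen resources."""
    return sum(bounds.values())

if __name__ == "__main__":
    first = upper_bound({"p": "p1", "c": None, "m": None}, SCORES,
                        {"p": 0.20, "c": 0.11, "m": 0.19})
    partial = upper_bound({"p": "p1", "c": "c1", "m": None}, SCORES,
                          {"p": 0.18, "c": 0.10, "m": 0.19})
    complete = upper_bound({"p": "p1", "c": "c1", "m": "m1"}, SCORES, {})
    final_threshold = threshold({"p": 0.18, "c": 0.09, "m": 0.19})
    print(round(first, 2), round(partial, 2), round(complete, 2), round(final_threshold, 2))
    # the complete candidate can be reported once it reaches the threshold
    print(complete >= final_threshold)
```

The complete candidate (p1, c1, m1) can be output as the top-1 result as soon as its score is at least the threshold, without binding the remaining partial candidates.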

Page 24: Keyword Search on Structured Data using Relevance Models


Experiments

Page 25: Keyword Search on Structured Data using Relevance Models


Experiments

Datasets: subsets of the Wikipedia, IMDB and Mondial Web databases

Queries: 50 queries for each dataset, including “TREC-style” queries and “single-resource” queries

Metrics: (1) the number of top-1 relevant results, (2) reciprocal rank and (3) mean average precision (MAP)

Baselines: BANKS, Bidirectional (proximity), Efficient, SPARK, CoveredDensity (TF-IDF)

RM-S: our approach
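For reference, the two rank-based metrics listed above can be computed as in the sketch below. It uses the standard textbook definitions and assumes each ranked result list is given as a sequence of 0/1 relevance judgments; the evaluation scripts used in the experiments are not part of this transcript.

```python
def reciprocal_rank(relevance):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, relevant in enumerate(relevance, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0

def average_precision(relevance, total_relevant=None):
    """Mean of precision@k over the ranks k at which a relevant result appears.

    total_relevant defaults to the number of relevant results in the list.
    """
    hits = 0
    precisions = []
    for rank, relevant in enumerate(relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    denominator = total_relevant if total_relevant is not None else hits
    return sum(precisions) / denominator if denominator else 0.0

def mean_average_precision(runs):
    """Average of average_precision over all queries."""
    if not runs:
        return 0.0
    return sum(average_precision(run) for run in runs) / len(runs)
```

Reciprocal rank suits the single-resource queries, where only one result is correct, while MAP rewards placing all relevant results early and therefore suits the TREC-style queries.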

Page 26: Keyword Search on Structured Data using Relevance Models


Experiments

MAP scores for all queries

Reciprocal rank for single resource queries

Page 27: Keyword Search on Structured Data using Relevance Models


Experiments

Precision-recall for TREC-style queries on Wikipedia

Page 28: Keyword Search on Structured Data using Relevance Models


Application

Page 29: Keyword Search on Structured Data using Relevance Models

Large amount of environmental data

Environmental issues stir public interest
Increase transparency, awareness, responsibility, protection

Growing amount of data
Public access through EU directive 2003/4/EC
PortalU (Germany) http://www.portalu.de/
EDP (UK) http://www.edp.nerc.ac.uk
Envirofacts (USA) http://www.epa.gov/enviro/index.html

Linking data in an international context
Local government databases of environmental data as part of the LOD cloud
Linked environment data for the life sciences


Page 30: Keyword Search on Structured Data using Relevance Models

Opportunity: mass dissemination and consumption of environmental data
The percentage of people who actively look for environmental information is significantly lower than the percentage of those with frequent access to it!

Complex results
CO emission values around the Karlsruhe area in Germany

Analytics
CO emission values around the Karlsruhe area in Germany: sorted by year, bar chart
Emission values of the US and Germany: compare averages, timeline visualization


Page 31: Keyword Search on Structured Data using Relevance Models

KOIOS – Overview

A semantic search system
Exploits semantics in the data for keyword interpretation, to hide the complexity of query languages and data representation
Keyword search for searching structured data
Lowers access barriers while enabling the richness of the data to be fully harnessed

Contribution
Transfer of research results to a commercial EIS
Selector mechanism

Process
Input: keywords
Facet-based refinement
Selector (result and view template) initialization
Output: query results embedded in specific views


Page 32: Keyword Search on Structured Data using Relevance Models

KOIOS – Architecture


Page 33: Keyword Search on Structured Data using Relevance Models

Facets generation

Derive facets from query results (not from the query!) for refinement
Attributes serve as facet categories
Attribute values serve as facet values

E.g. for ?s
Statistics.description: “CO-Emission, PKW”, “CO-Emission, LKW”, …
Value.year: 2005, 2006, …
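Facet derivation as described here can be sketched in a few lines, assuming each query result is a dictionary that maps attribute names to values. The function name `generate_facets` and the sample rows are illustrative and not taken from KOIOS itself.

```python
from collections import defaultdict

def generate_facets(results):
    """Collect, per attribute (facet category), the distinct values seen in the results."""
    facets = defaultdict(set)
    for row in results:
        for attribute, value in row.items():
            facets[attribute].add(value)
    # sort by string form so mixed value types still give a stable order
    return {attribute: sorted(values, key=str) for attribute, values in facets.items()}

SAMPLE_RESULTS = [
    {"Statistics.description": "CO-Emission, PKW", "Value.year": 2005},
    {"Statistics.description": "CO-Emission, LKW", "Value.year": 2005},
    {"Statistics.description": "CO-Emission, PKW", "Value.year": 2006},
]

if __name__ == "__main__":
    for category, values in generate_facets(SAMPLE_RESULTS).items():
        print(category, values)
```

Because the facets come from the results rather than from the query, every facet value offered to the user is guaranteed to lead to a non-empty refinement.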


Page 34: Keyword Search on Structured Data using Relevance Models

Selectors

Selector: parameterized, predefined result and view templates
Data parameters: specify the scope of the information need; initialized to particular values based on facet categories and values
Query parameters: additional data processing for analysis tasks (GROUP-BY, SORT, MIN, MAX, AVERAGE etc.)
Presentation parameters: visualization types (data value, data series, data table, map-based, specific diagram type, etc.)


Page 35: Keyword Search on Structured Data using Relevance Models

Selector initialization

Selectors capture templates for information needs and the presentation of their results

Map facets to selectors and initialize them
Applicable selectors: those that cover the facet categories
Initialize selectors based on facet values
Initialized values are captured in the WHERE clause
Non-initialized parameters are included in the SELECT clause
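The last two rules can be sketched as a small query builder. The sketch assumes that a selector is described by a table name and a list of data parameters, and that the chosen facet values arrive as a dictionary; the table and column names are hypothetical and must come from trusted templates, since identifiers are inserted into the SQL text directly.

```python
def initialize_selector(table, parameters, facet_values):
    """Build a parameterized SQL query from a selector template.

    Parameters with a chosen facet value go into the WHERE clause;
    the remaining parameters are returned by the SELECT clause.
    """
    bound = {p: facet_values[p] for p in parameters if p in facet_values}
    free = [p for p in parameters if p not in bound]
    select = ", ".join(free) if free else "*"
    sql = f"SELECT {select} FROM {table}"
    if bound:
        sql += " WHERE " + " AND ".join(f"{p} = ?" for p in bound)
    return sql, list(bound.values())

if __name__ == "__main__":
    sql, args = initialize_selector(
        "emissions",                       # hypothetical table name
        ["description", "year", "value"],  # data parameters of the selector
        {"description": "CO-Emission, PKW"},
    )
    print(sql)
    print(args)
```

Values are passed separately as query arguments rather than spliced into the string, which keeps user-selected facet values out of the SQL text.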


Page 36: Keyword Search on Structured Data using Relevance Models

Deployment

Hippolytos project (Theseus)
Easy access to a spatial data warehouse (disy Cadenza) built for the domain of environmental administration

Data about emissions and waste
From the state of Baden-Württemberg
Provided by: Umweltinformationssystem (UIS) Baden-Württemberg, Landesamt für Geoinformation und Landentwicklung (LGL) Baden-Württemberg and Statistisches Landesamt Baden-Württemberg


Page 37: Keyword Search on Structured Data using Relevance Models

Facets and selectors


Page 38: Keyword Search on Structured Data using Relevance Models

Chart-based visualization

Page 39: Keyword Search on Structured Data using Relevance Models

Map-based visualization

Page 40: Keyword Search on Structured Data using Relevance Models


Conclusions

Keyword search on structured data is a popular problem for which various solutions exist.

We focus on the aspect of result ranking, providing a principled approach that employs relevance models.

Experiments show that RMs are promising for searching structured data.

Top-k query processing is proposed to retrieve only the most relevant results.

The application to environmental data enables intuitive access, visualization and analysis of environmental information!

Page 41: Keyword Search on Structured Data using Relevance Models


Thank you for your attention!

Questions?

Page 42: Keyword Search on Structured Data using Relevance Models

Opportunity: mass dissemination and consumption of environmental data

Increase transparency, awareness, responsibility, protection


Page 43: Keyword Search on Structured Data using Relevance Models

Challenges: intuitive access and visualization of structured environmental data and analytics

The percentage of people who actively look for environmental information is significantly lower than the percentage of those with frequent access to it!

Complex structured queries: knowledge of the underlying data / query language is required

Complex structured data: heterogeneity and distribution of environmental data are overwhelming

Complex structured results: understanding results and extracting relevant information / analytics are difficult tasks


Page 44: Keyword Search on Structured Data using Relevance Models

KOIOS

Semantic search system, KOIOS, for intuitive access, analysis, and visualization of structured environmental information

Overview and architecture

Structured query generation from keywords

Facet-based browsing and refinement

Selector initialization for final result and view construction

Implementation and deployment

Conclusions


Page 45: Keyword Search on Structured Data using Relevance Models

Conclusions

Replace predefined forms and hard-coded visualization

Semantic search using lightweight semantics in data and schema to dynamically
Translate keywords to queries
Generate facets for results
Initialize result and presentation templates

Enables intuitive access, visualization and analysis of environmental information!


Page 46: Keyword Search on Structured Data using Relevance Models

Inverted Index
princess → m1, c1
breakfast → m3
hepburn → m3, p1, p4, c2
melbourne → p2
iris → c3
holiday → m1, m2, m3
ann → m1, c2
… → …

Page 47: Keyword Search on Structured Data using Relevance Models

Ranking Schemes

Proximity between keyword nodes
EASE:
XRank: w is the smallest text window in n that contains all search keywords

Source: SIGMOD09 Tutorial

Page 48: Keyword Search on Structured Data using Relevance Models

Ranking Schemes

Based on graph structure
BANKS
Nodes:
Edges:

PageRank-like methods
XRank [Guo et al, SIGMOD03]
ObjectRank [Balmin et al, VLDB04]: considers both Global ObjectRank and Keyword-specific ObjectRank

Page 49: Keyword Search on Structured Data using Relevance Models

Ranking Schemes

TF*IDF based: Discover/EASE [Liu et al, SIGMOD06]

$$Score(n, Q) = \sum_{w \in Q \cap n} \frac{1 + \ln(1 + \ln(tf))}{(1 - s) + s \cdot dl / avdl} \cdot \ln\frac{N}{df + 1}$$

SPARK but not at the node level

Page 50: Keyword Search on Structured Data using Relevance Models

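Relevance Models

[Figure: the query words q1 q2 q3 = “palestinian israeli raids” and an unknown next word w (“???”) are treated as samples from the same relevance model M]

Sample probabilities
P(w|Q) | w
.077 | palestinian
.055 | israel
.034 | jerusalem
.033 | protest
.027 | raid
.011 | clash
.010 | bank
.010 | west
.010 | troop

$$P(w \mid q_1 \ldots q_k) = \frac{P(w)\, P(q_1 \ldots q_k \mid w)}{P(q_1 \ldots q_k)}$$

$$P(q_1 \ldots q_k \mid w) = \prod_{i=1}^{k} P(q_i \mid w)$$

$$P(q \mid w) = \sum_{M} P(q \mid M)\, P(M \mid w)$$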