
© FZI Forschungszentrum Informatik 1

FZI FORSCHUNGSZENTRUM INFORMATIK

Keyword Search on Structured Data using Relevance Models*

Veli Bicer
FZI Research Center for Information Technology
Karlsruhe, Germany

Joint work with Thanh Tran from Semantic Search Group, AIFB Institute, KIT

* Based on papers at the 20th ACM Conference on Information and Knowledge Management (CIKM’11) and the 10th International Semantic Web Conference (ISWC’11)


About the presenter

Veli Bicer
  Research Scientist at FZI Research Center for Information Technology, Karlsruhe, Germany
  Associated Researcher at Karlsruhe Service Research Institute (KSRI)
    KSRI founded by IBM Germany

Research Interests
  Semantic Data Management/Search
  Relational Learning
  Software Engineering (for Services)

Projects
  German Internet Research Programme THESEUS
    KOIOS Semantic Search in Core Technology Cluster
    TEXO Internet-of-Services Use-case
  Previously, EU ICT Artemis, Satine, Saphire and Ride


Agenda

Introduction
  Keyword search on structured data
  Relevance models

Approach
  Ranking scheme using relevance models
  Top-k query processing

Experiments

Application
  Search on environmental data

Conclusion



Introduction


Keyword Search on Structured Data

Rationale
  4 billion web searches daily
  Data-driven websites have a relational database backend
    Predefined search forms constrain retrieval
    SQL is difficult to learn
  Keyword search simplifies data retrieval by not requiring SQL



Example: Who is the character played by Audrey Hepburn in Roman Holiday?

Person
id   name
p1   Audrey Hepburn
p3   Kate Winslet
…    …

Movie
id   title           plot
m1   Roman Holiday   Princess Ann is a royal princess of unknow of an …
m2   The Holiday     Iris swaps her cottage for the holiday along the next two …
m3   The Aviator     Hughes and Hepburn go to a holiday and fly together …
…    …               …

Character
id   name            pid   mid
c1   Princess Ann    p1    m1
c3   Iris Simpkins   p3    m2
…    …

Query result: a tree of tuples that is reduced with respect to the query.

Which would you rather write?

  SELECT Character.name
  FROM Person, Character, Movie
  WHERE Person.id = Character.pid
    AND Character.mid = Movie.id
    AND Person.name = 'Audrey Hepburn'
    AND Movie.title = 'Roman Holiday';

or “Hepburn Holiday”
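To make the comparison concrete, the structured query can be run against a tiny in-memory copy of the example tables. This is only a sketch: the use of SQLite, the function name `character_played_by`, and the reduced column set are assumptions made for illustration and are not part of the talk.

```python
import sqlite3


def character_played_by(actor, title):
    """Run the three-way join from the slide on a toy copy of the example data."""
    con = sqlite3.connect(":memory:")
    # "Character" is quoted defensively because it is a type keyword in some SQL dialects.
    con.executescript("""
        CREATE TABLE Person (id TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE Movie (id TEXT PRIMARY KEY, title TEXT);
        CREATE TABLE "Character" (id TEXT PRIMARY KEY, name TEXT, pid TEXT, mid TEXT);
        INSERT INTO Person VALUES ('p1', 'Audrey Hepburn');
        INSERT INTO Person VALUES ('p3', 'Kate Winslet');
        INSERT INTO Movie VALUES ('m1', 'Roman Holiday');
        INSERT INTO Movie VALUES ('m2', 'The Holiday');
        INSERT INTO "Character" VALUES ('c1', 'Princess Ann', 'p1', 'm1');
        INSERT INTO "Character" VALUES ('c3', 'Iris Simpkins', 'p3', 'm2');
    """)
    rows = con.execute(
        'SELECT "Character".name FROM Person, "Character", Movie '
        'WHERE Person.id = "Character".pid AND "Character".mid = Movie.id '
        'AND Person.name = ? AND Movie.title = ?',
        (actor, title),
    ).fetchall()
    con.close()
    return [row[0] for row in rows]


print(character_played_by("Audrey Hepburn", "Roman Holiday"))
```

The keyword query “Hepburn Holiday” is meant to reach the same tuple without the user knowing the schema or the join path.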



Many approaches have been proposed recently
  Focus on performance
  Less consideration of ranking

Recent study (Coffman and Weaver, CIKM 2010)
  Effectiveness of previous work is below expectations
  The problem lies in the ranking strategies, not in performance

Two major types of ranking schemes:
  IR-inspired TF-IDF ranking (Liu et al, 2006) (SPARK, 2007)
  Proximity-based approaches (Banks, 2002) (Bidirectional, 2005)

Problem: a robust and principled approach is missing!


Relevance Models

Proposed by Lavrenko and Croft (SIGIR 01)

Assumes that queries and documents are samples from a hidden representation space and are generated from the same generative model

The initial representation of relevance is unknown
  Estimated from the query

[Figure: three diagrams relating query Q, document D and relevance R: Classical Model, Language Model, Relevance Model]
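As a rough illustration of estimating relevance from the query, the sketch below builds a word distribution from a handful of feedback documents, weighting each document by how likely it is to generate the query. It follows the general i.i.d. estimate associated with relevance models; the function names, the uniform document prior, and the `eps` floor for unseen words are assumptions of this sketch.

```python
from collections import Counter


def language_model(text):
    """Maximum-likelihood unigram model of one document."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def relevance_model(query, feedback_docs, eps=1e-6):
    """P(w|R) ~ sum over documents of P(w|D) * P(Q|D), normalized over documents."""
    models = [language_model(doc) for doc in feedback_docs]
    weights = []
    for model in models:
        likelihood = 1.0
        for keyword in query:
            # eps stands in for proper smoothing of unseen query words (assumption)
            likelihood *= model.get(keyword, eps)
        weights.append(likelihood)
    norm = sum(weights)
    rm = Counter()
    for model, weight in zip(models, weights):
        for word, prob in model.items():
            rm[word] += prob * weight / norm
    return dict(rm)


rm = relevance_model(
    ["hepburn", "holiday"],
    ["audrey hepburn roman holiday", "the holiday kate winslet"],
)
print(sorted(rm.items(), key=lambda item: -item[1])[:4])
```

Words that co-occur with both query keywords (here `audrey` and `roman`) receive probability mass even though they are not part of the query.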


FZI F

ORS

CHUN

GSZ

ENTR

UMIN

FORM

ATIK

Approach


Overview of Approach

[Figure: overview of the approach in nine steps]

Steps: (1) Query, (2) PRF, (3) Query RM, (4) Res. RM, (5) Res. Score, (6) Query Generation, (7) Structured Queries, (8) Top-k Query Proc., (9) Result Ranking

Query
words       p
hepburn     0.5
holiday     0.5

Query RM
words       p
hepburn     0.21
holiday     0.15
audrey      0.13
katharine   0.09
princess    0.01
roman       0.01
…           …

Res. RM
words       p
hepburn     0.12
holiday     0.18
audrey      0.11
katharine   0.05
princess    0.00
roman       0.06
…           …

Res. Score: D(RM_Q || RM_R)

Results
Title                Name
Roman Holiday        Audrey Hepburn
Breakfast at Tiff.   Audrey Hepburn
The Aviator          Katharine Hepburn
The Holiday          Kate Winslet
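The resource score D(RM_Q || RM_R) can be illustrated with a small divergence computation. The sketch below uses only the six words listed on the slide and renormalizes them, because the remaining entries are elided; the helper names and the `eps` floor for zero probabilities are assumptions.

```python
import math

# Values as listed on the slide; entries hidden behind "…" are left out.
QUERY_RM = {"hepburn": 0.21, "holiday": 0.15, "audrey": 0.13,
            "katharine": 0.09, "princess": 0.01, "roman": 0.01}
RES_RM = {"hepburn": 0.12, "holiday": 0.18, "audrey": 0.11,
          "katharine": 0.05, "princess": 0.00, "roman": 0.06}


def normalize(dist):
    """Rescale the listed probabilities so that they sum to one."""
    total = sum(dist.values())
    return {word: prob / total for word, prob in dist.items()}


def kl_divergence(p, q, eps=1e-9):
    """D(p || q); eps avoids a division by zero for words missing in q."""
    return sum(
        pw * math.log(pw / max(q.get(word, 0.0), eps))
        for word, pw in p.items()
        if pw > 0
    )


print(kl_divergence(normalize(QUERY_RM), normalize(RES_RM)))
```

A smaller divergence means that the resource model is closer to the query model, so resources can be ranked by increasing divergence.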


Data Model

Different kinds of data
  e.g. relational, XML and RDF data

Data graph of nodes and edges, G = (V, E)
  Resource nodes and attribute nodes
  Every resource is typed
  Resources have unique ids (e.g. primary keys)
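One possible way to hold such a data graph in memory is sketched below for the relational case: every tuple becomes a typed resource node, text columns become attribute edges, and foreign keys become edges between resources. The class name `DataGraph` and its fields are hypothetical and only serve this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataGraph:
    types: dict = field(default_factory=dict)        # resource id -> type
    attributes: list = field(default_factory=list)   # (resource id, attribute, value)
    relations: list = field(default_factory=list)    # (resource id, edge label, resource id)

    def add_tuple(self, table, row, key="id", foreign_keys=()):
        """Add one relational tuple as a typed resource with its edges."""
        rid = row[key]
        self.types[rid] = table
        for column, value in row.items():
            if column == key:
                continue
            if column in foreign_keys:
                self.relations.append((rid, column, value))
            else:
                self.attributes.append((rid, column, value))


graph = DataGraph()
graph.add_tuple("Person", {"id": "p1", "name": "Audrey Hepburn"})
graph.add_tuple("Movie", {"id": "m1", "title": "Roman Holiday"})
graph.add_tuple(
    "Character",
    {"id": "c1", "name": "Princess Ann", "pid": "p1", "mid": "m1"},
    foreign_keys=("pid", "mid"),
)
print(graph.types, graph.relations)
```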



Edge-Specific Relevance Models

A set of feedback resources FR is retrieved from an inverted keyword index:
  E.g. Q = {Hepburn, Holiday}, FR = {m1, p1, p4, m2, c2, m3}

Edge-specific relevance model for each unique edge e, combining
  the importance of the resource w.r.t. the query
  the probability of the word at the resource

[Figure: steps 1 to 3 of the overview. The query keywords are looked up in the inverted index, the matching resources form FR, and edge-specific relevance models are built from FR.]

Inverted Index
princess    → m1, c1
breakfast   → m3
hepburn     → m3, p1, p4, c2
melbourne   → p2
iris        → c3
holiday     → m1, m2, m3
ann         → m1, c2
…

Example feedback resources
  p1: name “Audrey Hepburn”, birthplace “Ixelles Belgium”
  m3: title “The Holiday”, plot “Iris swaps her cottage for the holiday along the next two …”
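The retrieval of feedback resources and the construction of one word distribution per edge can be sketched as follows. The weighting used here (the fraction of query keywords a resource contains, multiplied by the relative frequency of the word in the attribute value) is a simplified stand-in chosen for this sketch and is not the estimator from the papers; the attribute list is a small excerpt in the spirit of the example data.

```python
from collections import Counter, defaultdict

# (resource id, attribute edge, text value); an illustrative excerpt only
ATTRIBUTES = [
    ("p1", "name", "Audrey Hepburn"),
    ("p1", "birthplace", "Ixelles Belgium"),
    ("p4", "name", "Katharine Hepburn"),
    ("m1", "title", "Roman Holiday"),
    ("m2", "title", "The Holiday"),
]


def build_index(attributes):
    """Inverted keyword index: word -> set of resource ids."""
    index = defaultdict(set)
    for rid, _, text in attributes:
        for word in text.lower().split():
            index[word].add(rid)
    return index


def feedback_resources(index, query):
    """FR: all resources matching at least one query keyword."""
    fr = set()
    for keyword in query:
        fr |= index.get(keyword.lower(), set())
    return fr


def edge_specific_rms(attributes, fr, query):
    """One normalized word distribution per attribute edge, built from FR."""
    words_of = defaultdict(set)
    for rid, _, text in attributes:
        words_of[rid] |= set(text.lower().split())
    # importance of a resource w.r.t. the query (simplified assumption)
    importance = {
        rid: sum(q.lower() in words_of[rid] for q in query) / len(query)
        for rid in fr
    }
    raw = defaultdict(Counter)
    for rid, edge, text in attributes:
        if rid not in fr:
            continue
        tokens = text.lower().split()
        for word in tokens:
            # probability of the word at the resource, weighted by importance
            raw[edge][word] += importance[rid] / len(tokens)
    return {
        edge: {w: c / sum(counts.values()) for w, c in counts.items()}
        for edge, counts in raw.items()
    }


index = build_index(ATTRIBUTES)
fr = feedback_resources(index, ["Hepburn", "Holiday"])
print(sorted(fr))
print(edge_specific_rms(ATTRIBUTES, fr, ["Hepburn", "Holiday"]))
```

Keeping a separate distribution per edge (name, title, birthplace) preserves the structure of the data instead of mixing all words of a resource into one bag.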


Edge Specific Resource Models

Each resource (a tuple) is also represented as a relevance model
  …as final results (joined tuples) are obtained by combining resources

An edge-specific resource model (ResM) is defined for each resource

The score of a resource is the cross-entropy of the edge-specific RM and the ResM

(Steps 4 and 5 of the overview)
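A cross-entropy based resource score can be sketched as below. Summing over edges, negating the result so that higher is better, and the `eps` floor for unseen words are assumptions of this illustration; the example distributions are made up.

```python
import math


def cross_entropy(rm, resm, eps=1e-9):
    """H(rm, resm) = -sum_w rm(w) * log resm(w); eps replaces missing words."""
    return -sum(p * math.log(max(resm.get(w, 0.0), eps)) for w, p in rm.items())


def resource_score(edge_rms, edge_resms):
    """Negated cross-entropy summed over the edges of the query model."""
    return -sum(
        cross_entropy(rm, edge_resms.get(edge, {}))
        for edge, rm in edge_rms.items()
    )


query_rms = {"name": {"audrey": 0.25, "hepburn": 0.5, "katharine": 0.25}}
p1 = {"name": {"audrey": 0.5, "hepburn": 0.5}}
c1 = {"name": {"princess": 0.5, "ann": 0.5}}
print(resource_score(query_rms, p1), resource_score(query_rms, c1))
```

In this toy setting p1 scores higher than c1 because its name attribute shares words with the query model, which is where smoothing of the resource models becomes important.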


Smoothing

A well-known technique to address data sparseness and improve the accuracy of RMs (and LMs)
  The smoothed word probability is the core probability for both the query RM and the resource RM

Local smoothing

The neighborhood of attribute a contains another attribute a’ if
  a and a’ share the same resource
  the resources of a and a’ are of the same type
  the resources of a and a’ are connected over a FK


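Smoothing

[Figure: smoothing the word distribution of the name attribute of p1 over its neighborhood]

Resources in the example
  p1 (type Person): name “Audrey Hepburn”, birthplace “Ixelles Belgium”
  p4: name “Katharine Hepburn”, birthplace “Connecticut USA”
  c1 (type Character): name “Princess Ann”, connected via pid_fk

Word probabilities as each neighborhood is added
words         unsmoothed   same resource   same type   FK
audrey        0.5          0.4             0.37        0.36
hepburn       0.5          0.4             0.39        0.38
ixelles       -            0.1             0.09        0.08
belgium       -            0.1             0.09        0.08
katharine     -            -               0.02        0.01
connecticut   -            -               0.02        0.01
usa           -            -               0.02        0.01
princess      -            -               -           0.035
ann           -            -               -           0.035

Smoothing of each type is controlled by weights γ1, γ2, γ3, which are control parameters set in the experiments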
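The effect of the three weights can be illustrated with a simple linear interpolation between the attribute's own word distribution and the distributions of its three neighborhoods. The interpolation form, the function name `smooth`, and the concrete weight values are assumptions of this sketch; the talk only states that the weights are set in the experiments.

```python
def smooth(base, same_resource, same_type, fk_linked, gammas=(0.2, 0.1, 0.1)):
    """Mix the base distribution with its neighborhoods using weights g1, g2, g3."""
    g1, g2, g3 = gammas
    parts = [
        (1.0 - g1 - g2 - g3, base),   # the attribute's own words
        (g1, same_resource),          # other attributes of the same resource
        (g2, same_type),              # attributes of resources with the same type
        (g3, fk_linked),              # attributes of resources linked over a FK
    ]
    smoothed = {}
    for weight, dist in parts:
        for word, prob in dist.items():
            smoothed[word] = smoothed.get(word, 0.0) + weight * prob
    return smoothed


name_p1 = {"audrey": 0.5, "hepburn": 0.5}
birthplace_p1 = {"ixelles": 0.5, "belgium": 0.5}
person_p4 = {"katharine": 0.25, "hepburn": 0.25, "connecticut": 0.25, "usa": 0.25}
character_c1 = {"princess": 0.5, "ann": 0.5}
print(smooth(name_p1, birthplace_p1, person_p4, character_c1))
```

As in the figure, `hepburn` ends up slightly more probable than `audrey` because it also occurs in a neighbor of the same type.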


Ranking JRTs

Ranking aggregated JRTs: cross-entropy between the edge-specific RM (query model) and the geometric mean of the combined edge-specific ResMs

The proposed score is monotonic w.r.t. the individual resource scores
  …a desired property for most top-k algorithms

(Step 9 of the overview)
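Why a geometric mean yields a monotonic score can be checked numerically: the logarithm of a geometric mean is the mean of the logarithms, so the cross-entropy against the combined model equals the average of the cross-entropies against the individual resource models. The sketch assumes a word-by-word geometric mean without renormalization and a single edge; the function names and the `eps` floor are illustrative.

```python
import math


def cross_entropy(rm, resm, eps=1e-9):
    return -sum(p * math.log(max(resm.get(w, 0.0), eps)) for w, p in rm.items())


def geometric_mean(models, vocabulary, eps=1e-9):
    """Word-by-word geometric mean of several resource models (not renormalized)."""
    n = len(models)
    return {
        w: math.exp(sum(math.log(max(m.get(w, 0.0), eps)) for m in models) / n)
        for w in vocabulary
    }


def jrt_score(rm, resms):
    """Negated cross-entropy between the query model and the combined model."""
    return -cross_entropy(rm, geometric_mean(resms, rm.keys()))


rm = {"hepburn": 0.5, "holiday": 0.5}
resms = [{"hepburn": 0.6, "holiday": 0.1}, {"hepburn": 0.2, "holiday": 0.7}]
combined = jrt_score(rm, resms)
average = -sum(cross_entropy(rm, m) for m in resms) / len(resms)
print(combined, average)
```

Because the combined score is an average of the resource scores, raising any single resource score can never lower the score of the joined result.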


Query Translation*

Mapping of keywords to data elements
  Results in a set of keyword elements

Data graph exploration
  Search for substructures (query graphs) connecting keyword elements
  Bi-directional exploration of query graphs
  Operates on a summary of the data graph only

Top-k computation
  Search guided by a scoring function to output only the top-k queries

Query graphs to be processed
  Free vs. non-free variables

*[Tran et al. ICDE’09]

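[Figure: steps 6 and 7 of the overview. The keyword elements (name “Hepburn” at p1 and p4, title “Holiday” at m1 and m3) are connected over the summary graph, which contains the classes Person, Character, Movie, Location, Studio and Producer and the edges name, title, type, bornIn, pid_fk, mid_fk, hasDist, hasLoc, worksFor and is-a.]

Resulting query graph
  ?p name “Hepburn”
  ?p type Person
  ?c type Character
  ?c pid_fk ?p
  ?c mid_fk ?m
  ?m type Movie
  ?m title “Holiday”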
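Connecting keyword elements over the summary graph can be illustrated with a plain breadth-first search. Note that this swaps in an unscored BFS for the score-guided, bi-directional top-k exploration named in the talk, and that the edge list contains only those summary edges whose endpoints are clear from the example; the function name `connect` is hypothetical.

```python
from collections import deque

# (source class, edge label, target class); partial summary graph (assumption)
SUMMARY_EDGES = [
    ("Character", "pid_fk", "Person"),
    ("Character", "mid_fk", "Movie"),
    ("Person", "bornIn", "Location"),
]


def connect(edges, start, goal):
    """Shortest undirected path between two classes of the summary graph."""
    adjacency = {}
    for source, _, target in edges:
        adjacency.setdefault(source, []).append(target)
        adjacency.setdefault(target, []).append(source)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in adjacency.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# "Hepburn" maps to Person.name and "Holiday" to Movie.title in the example.
print(connect(SUMMARY_EDGES, "Person", "Movie"))
```

The path through Character corresponds to the join over pid_fk and mid_fk in the query graph.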


Top-k Query Processing

Top-k query processing (TQP) is highly common in Web-accessible databases
  Return the K highest-ranked answers
  Avoid unnecessary accesses to the database

TQP assumes
  Scoring function and attribute values are known a priori (e.g. RankJoin)
  Attribute values are combined by an aggregation function
  Sorted access (SA) and random access (RA) probes

How to adapt TQP to return the top-k relevant results?
  Results are joined sets of resources
  Scores are query-dependent
    No indexing is possible

Idea:
  Retrieve resources for non-free variables and rank them
  Use SA on those initially retrieved resources
  Use RA to find other resources

(Step 8 of the overview)



Result candidate c = <(x1,…,xk), score>
  Complete when all variables are bound to some resources
  xi = * indicates an unbound variable

Binding operator c’ = (c, xi → ri)

Threshold
  Determines an upper bound for unseen resources
  Scheduling between SA and RA
  A tight bound is desired
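The bookkeeping for partial result candidates can be sketched with a priority queue ordered by upper bounds. The sketch assumes an additive aggregation, in which a bound variable contributes its own score and an unbound one (`*`) the best score still possible for that position; the resource scores are taken from the walk-through example of the talk, and the helper names are hypothetical.

```python
import heapq

# Resource scores S(r) from the walk-through example
SCORES = {"p1": 0.20, "p3": 0.18, "m2": 0.19, "m1": 0.18, "c1": 0.10, "c3": 0.05}
# Best score still possible per variable position (?p, ?c, ?m); 0.11 is the
# value shown for the not yet scored Character resources (assumption)
BEST = [0.20, 0.11, 0.19]


def upper_bound(candidate, scores, best_unbound):
    """Upper bound on the final score of a (partial) result candidate."""
    return sum(
        best_unbound[i] if resource == "*" else scores[resource]
        for i, resource in enumerate(candidate)
    )


def order_candidates(candidates, scores, best_unbound):
    """Return the candidates in priority-queue order, highest bound first."""
    heap = [(-upper_bound(c, scores, best_unbound), c) for c in candidates]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


for cand in [("p1", "*", "*"), ("p3", "*", "*"), ("p1", "c1", "*"), ("p1", "c1", "m1")]:
    print(cand, round(upper_bound(cand, SCORES, BEST), 2))
```

A complete candidate can be reported once no queued candidate and no unseen resource can still exceed its score, which is why a tight threshold saves database accesses.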

[Figure: animation of top-k query processing for K = 1, shown in five frames]

Query graph
  ?p name “Hepburn”
  ?p type Person
  ?c type Character
  ?c pid_fk ?p
  ?c mid_fk ?m
  ?m type Movie
  ?m title “Holiday”

Person
id   name                S(r)
p1   Audrey Hepburn      0.20
p3   Katharine Hepburn   0.18
p5   Philip Hepburn      0.13
p6   Anna Hepburn        0.12

Movie
id   title            S(r)
m2   The Holiday      0.19
m1   Roman Holiday    0.18
m3   Holiday Blues    0.09
m4   Family Holiday   0.08

Character (scores are filled in during processing)
id   name                S(r)
c1   Princess Ann        0.10
c2   Katharine Hepburn
c3   Iris Simpkins       0.05
c4   Louise

Frames (Output K=1)
frame   Character bound   Threshold   Candidates shown
1       0.11              0.50        <(p1,*,*),0.50>  <(*,*,m2),0.50>
2       0.11              0.48        <(p1,*,*),0.50>  <(*,*,m2),0.50>  <(p3,*,*),0.48>
3       0.10              0.47        <(p1,c1,*),0.49>  <(*,*,m2),0.50>  <(p3,*,*),0.48>
4       0.09              0.46        <(p1,c1,*),0.49>  <(*,c3,m2),0.44>  <(p3,*,*),0.48>
5       0.09              0.46        <(*,c3,m2),0.44>  <(p3,*,*),0.48>  <(p1,c1,m1),0.48>


Experiments


Experiments

Datasets: Subsets of Wikipedia, IMDB and Mondial Web databases

Queries: 50 queries for each dataset including “TREC style” queries and “single resource” queries

Metrics: (1) the number of top-1 relevant results, (2) reciprocal rank and (3) Mean Average Precision (MAP)

Baselines: BANKS, Bidirectional (proximity), Efficient, SPARK, CoveredDensity (TF-IDF)

RM-S: Our approach
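For reference, reciprocal rank and MAP can be computed from ranked result lists as in the following sketch. These are the standard definitions; the function names and the toy data are illustrative and unrelated to the reported experiments.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, result in enumerate(ranked, start=1):
        if result in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks of the relevant results."""
    hits = 0
    total = 0.0
    for rank, result in enumerate(ranked, start=1):
        if result in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP over a list of (ranked results, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


runs = [(["a", "b", "c"], {"b", "c"}), (["x", "y"], {"x"})]
print(reciprocal_rank(*runs[0]), mean_average_precision(runs))
```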


Experiments

MAP scores for all queries

Reciprocal rank for single resource queries


Experiments

Precision-recall for TREC-style queries on Wikipedia



Application

Large amount of environmental data

Environmental issues stir public interest
  Increase transparency, awareness, responsibility, protection

Growing amount of data
  Public access through EU directive 2003/4/EC
  PortalU (Germany) http://www.portalu.de/
  EDP (UK) http://www.edp.nerc.ac.uk
  Envirofacts (USA) http://www.epa.gov/enviro/index.html

Linking data in international context
  Local government environmental databases as part of the LOD cloud
  Linked environment data for the life sciences



Opportunity: mass dissemination and consumption of environmental data
  The percentage of people who actively look for environmental information is significantly lower than the percentage of those who have frequent access to it!

Complex results
  CO emission values around the Karlsruhe area in Germany

Analytics
  CO emission values around the Karlsruhe area in Germany
    Sorted by year
    Bar chart
  Emission values of the US and Germany
    Compare averages
    Timeline visualization


KOIOS – Overview

A semantic search system
  Exploits semantics in the data for keyword interpretation to hide the complexity of query languages and data representation
  Keyword search for searching structured data
  Lowers access barriers while enabling the richness of the data to be fully harnessed

Contribution
  Transfer research results to commercial EIS
  Selector mechanism

Process
  Input: keywords
  Facet-based refinement
  Selector (result and view template) initialization
  Output: query results embedded in specific views


KOIOS – Architecture


Facets generation

Derive facets from query results (not from the query!) for refinement
  Attributes serve as facet categories
  Attribute values serve as facet values

E.g. for ?s
  Statistics.description: “CO-Emission, PKW”, “CO-Emission, LKW”, …
  Value.year: 2005, 2006, …
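Deriving facets from query results can be sketched as grouping the attribute values that occur in the result rows. The row layout (one dictionary per result, keyed by attribute name) and the function name `generate_facets` are assumptions; the sample values follow the example on the slide.

```python
from collections import defaultdict


def generate_facets(results):
    """Attributes become facet categories, their distinct values facet values."""
    facets = defaultdict(set)
    for row in results:
        for attribute, value in row.items():
            facets[attribute].add(value)
    return {attribute: sorted(values) for attribute, values in facets.items()}


results = [
    {"Statistics.description": "CO-Emission, PKW", "Value.year": 2005},
    {"Statistics.description": "CO-Emission, LKW", "Value.year": 2005},
    {"Statistics.description": "CO-Emission, PKW", "Value.year": 2006},
]
print(generate_facets(results))
```

Because the facets come from the results rather than from the query, only values that actually lead to non-empty refinements are offered.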


Selectors

Selector: parameterized, predefined result and view templates
  Data parameters: specify the scope of the information need, initialized to particular values based on facet categories and values
  Query parameters: additional data processing for analysis tasks (GROUP-BY, SORT, MIN, MAX, AVERAGE etc.)
  Presentation parameters: visualization types (data value, data series, data table, map-based, specific diagram type, etc.)


Selector initialization

Selectors capture templates for information needs and the presentation of their results

Map facets to selectors and initialize them
  Applicable selectors: cover the facet categories
  Initialize selectors based on facet values
  Initialized values are captured in the WHERE clause
  Non-initialized parameters are included in the SELECT clause
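The split between WHERE and SELECT can be illustrated with a small query builder. The function `initialize_selector`, the table name `Emission`, and the parameter names are hypothetical; the sketch only shows the rule that initialized parameters become conditions and the remaining ones become output columns.

```python
def initialize_selector(table, parameters, facet_values):
    """Build a parameterized SQL query from a selector and the chosen facet values."""
    # Non-initialized parameters are returned as columns ...
    select_columns = [p for p in parameters if p not in facet_values]
    # ... while initialized ones restrict the result.
    conditions = [(p, facet_values[p]) for p in parameters if p in facet_values]
    sql = f"SELECT {', '.join(select_columns) or '*'} FROM {table}"
    if conditions:
        sql += " WHERE " + " AND ".join(f"{column} = ?" for column, _ in conditions)
    return sql, [value for _, value in conditions]


sql, args = initialize_selector(
    "Emission", ["description", "year", "value"], {"year": 2005}
)
print(sql, args)
```

Passing the facet values as bound arguments instead of pasting them into the string keeps the generated query safe to execute.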


Deployment

Hippolytos project (Theseus)
  Easy access to a spatial data warehouse (disy Cadenza) built for the domain of environmental administration

Data about emissions and waste
  From Baden-Württemberg
  Provided by:
    Umweltinformationssystem (UIS) Baden-Württemberg
    Landesamt für Geoinformation und Landentwicklung (LGL) Baden-Württemberg
    Statistisches Landesamt Baden-Württemberg


Facets and selectors


Chart-based visualization

Map-based visualization


Conclusions

Keyword search on structured data is a popular problem for which various solutions exist.

We focus on the aspect of result ranking, providing a principled approach that employs relevance models.

Experiments show that RMs are promising for searching structured data.

Top-k query processing is proposed to retrieve only the most relevant results.

Application on environmental data enables intuitive access, visualization and analysis of environmental information!


Thank you for your attention!

Questions?

Opportunity: mass dissemination and consumption of environmental data

Increase transparency, awareness, responsibility, protection


Challenges: intuitive access and visualization of structured environmental data and analytics

The percentage of people who actively look for environmental information is significantly lower than the percentage of those who have frequent access to it!

Complex structured queries
  Knowledge of the underlying data / query language

Complex structured data
  Heterogeneity and distribution of environmental data is overwhelming

Complex structured results
  Understanding results and extracting relevant information / analytics are difficult tasks


KOIOS

Semantic search system, KOIOS, for intuitive access, analysis, and visualization of structured environmental information

Overview and architecture

Structured query generation from keywords

Facet-based browsing and refinement

Selector initialization for final result and view construction

Implementation and deployment

Conclusions


Conclusions

Replace predefined forms and hard-coded visualization

Semantic search using lightweight semantics in data and schema to dynamically
  Translate keywords to queries
  Generate facets for results
  Initialize result and presentation templates

Enables intuitive access, visualization and analysis of environmental information!



Ranking Schemes

Proximity between keyword nodes
  EASE:
  XRank: w is the smallest text window in n that contains all search keywords

SIGMOD09 Tutorial

Ranking Schemes

Based on graph structure
  BANKS
    Nodes:
    Edges:
  PageRank-like methods
    XRank [Guo et al, SIGMOD03]
    ObjectRank [Balmin et al, VLDB04]: considers both Global ObjectRank and Keyword-specific ObjectRank


Ranking Schemes

TF*IDF based:
  Discover/EASE [Liu et al, SIGMOD06]

  $$\mathrm{Score}(n, Q) = \sum_{w \in Q \cap n} \frac{1 + \ln(1 + \ln(tf))}{(1 - s) + s \cdot dl / avdl} \cdot \ln\frac{N + 1}{df}$$

  SPARK but not at the node level

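Relevance Models

[Figure: the relevance model M as the source from which the query words q1, q2, q3 (“palestinian”, “israeli”, “raids”) and the unknown word w (“???”) are sampled]

Sample probabilities
P(w|Q)   w
.077     palestinian
.055     israel
.034     jerusalem
.033     protest
.027     raid
.011     clash
.010     bank
.010     west
.010     troop
…

$$P(w \mid q_1 \ldots q_k) = \frac{P(w)\, P(q_1 \ldots q_k \mid w)}{P(q_1 \ldots q_k)}$$

$$P(q_1 \ldots q_k \mid w) = \prod_{q} P(q \mid w), \qquad P(q \mid w) = \sum_{M} P(q \mid M)\, P(M \mid w)$$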
