© FZI Forschungszentrum Informatik 1
Keyword Search on Structured Data using Relevance Models*
Veli Bicer
FZI Research Center for Information Technology
Karlsruhe, Germany
Joint work with Thanh Tran from Semantic Search Group, AIFB Institute, KIT
* Based on the papers at the 20th ACM Conference on Information and Knowledge Management (CIKM’11) and the 10th International Semantic Web Conference (ISWC’11)
About the presenter
Veli Bicer
  Research Scientist at FZI Research Center for Information Technology, Karlsruhe, Germany
  Associated Researcher at Karlsruhe Service Research Institute (KSRI)
    KSRI founded by IBM Germany
Research Interests
  Semantic Data Management/Search
  Relational Learning
  Software Engineering (for Services)
Projects
  German Internet Research Programme THESEUS
    KOIOS Semantic Search in Core Technology Cluster
    TEXO Internet-of-Services Use-case
  Previously, EU ICT Artemis, Satine, Saphire and Ride
Agenda
Introduction
  Keyword search on structured data
  Relevance models
Approach
  Ranking scheme using relevance models
  Top-k query processing
Experiments
Application
  Search on environmental data
Conclusion
Introduction
Keyword Search on Structured Data
Rationale
  4 billion web searches daily
  Data-driven websites have a relational database backend
  Predefined search forms constrain retrieval
  SQL is difficult to learn
Goal: simplify data retrieval by not requiring SQL
Keyword Search on Structured Data
Example Who is the character played by Audrey Hepburn in Roman Holiday?
Person
id | name
p1 | Audrey Hepburn
p3 | Kate Winslet
… | …
Movie
id | title | plot
m1 | Roman Holiday | Princess Ann is a royal princess of unknow of an …
m2 | The Holiday | Iris swaps her cottage for the holiday along the next two …
m3 | The Aviator | Hughes and Hepburn go to a holiday and fly together …
… | … | …
Character
id | name | pid | mid
c1 | Princess Ann | p1 | m1
c3 | Iris Simpkins | p3 | m2
… | … | … | …
Query result: a tree of tuples that is reduced with respect to the query.
Which would you rather write?
SELECT Character.name
FROM Person, Character, Movie
WHERE Person.id = Character.pid
  AND Character.mid = Movie.id
  AND Person.name = 'Audrey Hepburn'
  AND Movie.title = 'Roman Holiday';
or “Hepburn Holiday”
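To make the comparison concrete, the following minimal sketch loads the three example tables into an in-memory database and runs the structured query. Using SQLite here is an assumption made only for illustration; the table and column names follow the slide, and only the rows shown on the slide are inserted.

```python
import sqlite3

# In-memory database holding the example rows from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE Movie (id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE "Character" (id TEXT PRIMARY KEY, name TEXT, pid TEXT, mid TEXT);
INSERT INTO Person VALUES ('p1', 'Audrey Hepburn'), ('p3', 'Kate Winslet');
INSERT INTO Movie VALUES ('m1', 'Roman Holiday'), ('m2', 'The Holiday'), ('m3', 'The Aviator');
INSERT INTO "Character" VALUES ('c1', 'Princess Ann', 'p1', 'm1'), ('c3', 'Iris Simpkins', 'p3', 'm2');
""")

# The structured query a user would otherwise have to write by hand.
SQL = """
SELECT "Character".name
FROM Person, "Character", Movie
WHERE Person.id = "Character".pid
  AND "Character".mid = Movie.id
  AND Person.name = 'Audrey Hepburn'
  AND Movie.title = 'Roman Holiday'
"""
rows = conn.execute(SQL).fetchall()
print(rows)
```

The keyword query “Hepburn Holiday” is meant to reach the same tuple without the user knowing the schema or the join path.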
Keyword Search on Structured Data
Many approaches have been proposed recently
  Focus on performance
  Less consideration of ranking
Recent study (Coffman and Weaver, CIKM 2010)
  Effectiveness of previous work is below expectations
  The problem lies in the ranking strategies, not in performance
Two major types of ranking schemes:
  IR-inspired TF-IDF ranking (Liu et al, 2006) (SPARK, 2007)
  Proximity-based approaches (BANKS, 2002) (Bidirectional, 2005)
Problem: a robust and principled approach is missing!
Relevance Models
Proposed by Lavrenko and Croft (SIGIR ’01)
Assumes that queries and documents are samples from a hidden representation space and are generated from the same generative model
Initial representation of relevance is unknown
  Estimated from the query
[Figure: three diagrams relating query Q and document D: Classical Model, Language Model, Relevance Model; relevance R appears in two of them]
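As a rough illustration of "estimated from the query", here is a minimal sketch of one common relevance model estimator: word probabilities of each document are averaged, weighted by how likely the document is to generate the query. The toy documents, the additive smoothing constant `mu`, and the uniform document prior are assumptions for this example and are not taken from the slides.

```python
from collections import Counter

def language_model(text, vocab, mu=0.1):
    """Additively smoothed unigram model P(w|D) over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + mu * len(vocab)
    return {w: (counts[w] + mu) / total for w in vocab}

def relevance_model(query, docs):
    """P(w|R) approximated as sum over D of P(w|D) * P(D|Q), uniform prior on D."""
    vocab = sorted({w for d in docs for w in d.lower().split()} | set(query))
    models = [language_model(d, vocab) for d in docs]
    likelihoods = []
    for m in models:
        p = 1.0
        for q in query:
            p *= m[q]  # query likelihood P(Q|D)
        likelihoods.append(p)
    z = sum(likelihoods)
    posteriors = [l / z for l in likelihoods]
    return {w: sum(p * m[w] for p, m in zip(posteriors, models)) for w in vocab}

docs = [
    "Audrey Hepburn Roman Holiday princess",
    "Katharine Hepburn The Aviator",
    "Kate Winslet The Holiday",
]
rm = relevance_model(["hepburn", "holiday"], docs)
for word, prob in sorted(rm.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{word:10s} {prob:.3f}")
```

Words that co-occur with both keywords, such as "audrey", receive more mass than words from documents that match only one keyword.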
Approach
Overview of Approach
Steps: (1) Query, (2) PRF, (3) Query RM, (4) Res. RM, (5) Res. Score D(RMQ||RMR), (6) Query Generation, (7) Structured Queries, (8) Top-k Query Proc., (9) Result Ranking
Query
words | p
hepburn | 0.5
holiday | 0.5
Query RM
words | p
hepburn | 0.21
holiday | 0.15
audrey | 0.13
katharine | 0.09
princess | 0.01
roman | 0.01
… | …
Res. RM
words | p
hepburn | 0.12
holiday | 0.18
audrey | 0.11
katharine | 0.05
princess | 0.00
roman | 0.06
… | …
Result Ranking
Title | Name
Roman Holiday | Audrey Hepburn
Breakfast at Tiff. | Audrey Hepburn
The Aviator | Katharine Hepburn
The Holiday | Kate Winslet
Data Model
Different kinds of data
  e.g. relational, XML and RDF data
Data graph of nodes and edges (G = (V, E))
  Resource nodes, attribute nodes
  Every resource is typed
  Resources have unique ids (e.g. primary keys)
Edge-Specific Relevance Models
(Steps 1, 2 and 3 of the overview)
A set of feedback resources FR is retrieved from an inverted keyword index:
  E.g. Q = {Hepburn, Holiday}, FR = {m1, p1, p4, m2, c2, m3}
Edge-specific relevance model for each unique edge e:
  [Formula not preserved in the transcript; its two factors are labeled "importance of resource w.r.t. query" and "probability of word at resource"]
Inverted Index
princess → m1, c1
breakfast → m3
hepburn → m3, p1, p4, c2
melbourne → p2
iris → c3
holiday → m1, m2, m3
ann → m1, c2
… → …
Example resources in FR
  p1: name = Audrey Hepburn, birthplace = Ixelles Belgium
  m3: title = The Holiday, plot = Iris swaps her cottage for the holiday along the next two …
FR → Edge-specific Relevance Models
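The two factors named on this slide can be sketched as follows. This is a minimal illustration, not the estimator from the papers: the exact formula is not preserved in the transcript, so the weighting (resource importance times word probability at the resource, normalized per edge), the smoothing constant `eps`, and the toy resources are all assumptions.

```python
from collections import Counter, defaultdict

# Toy resources: id -> {edge label: text}. Values are illustrative only.
RESOURCES = {
    "p1": {"name": "Audrey Hepburn", "birthplace": "Ixelles Belgium"},
    "p4": {"name": "Katharine Hepburn", "birthplace": "Connecticut USA"},
    "m1": {"title": "Roman Holiday", "plot": "Princess Ann is a royal princess"},
    "m2": {"title": "The Holiday", "plot": "Iris swaps her cottage for the holiday"},
}

def tokens(text):
    return text.lower().split()

def build_index(resources):
    """Inverted keyword index: word -> set of resource ids."""
    index = defaultdict(set)
    for rid, attrs in resources.items():
        for text in attrs.values():
            for w in tokens(text):
                index[w].add(rid)
    return index

def feedback_resources(query, index):
    """FR: every resource that contains at least one query keyword."""
    fr = set()
    for q in query:
        fr |= index.get(q, set())
    return fr

def resource_importance(query, attrs, eps=0.01):
    """Assumed importance of a resource w.r.t. the query (smoothed keyword match)."""
    words = [w for text in attrs.values() for w in tokens(text)]
    counts = Counter(words)
    score = 1.0
    for q in query:
        score *= (counts[q] + eps) / (len(words) + eps)
    return score

def edge_specific_rms(query, resources, index):
    """One word distribution per edge label, built from the feedback resources."""
    rms = defaultdict(Counter)
    for rid in feedback_resources(query, index):
        weight = resource_importance(query, resources[rid])
        for edge, text in resources[rid].items():
            words = tokens(text)
            for w, c in Counter(words).items():
                rms[edge][w] += weight * c / len(words)  # P(word at resource)
    return {e: {w: v / sum(m.values()) for w, v in m.items()} for e, m in rms.items()}

query = ["hepburn", "holiday"]
index = build_index(RESOURCES)
fr = feedback_resources(query, index)
rms = edge_specific_rms(query, RESOURCES, index)
print(sorted(fr))
print(rms["name"])
```

Keeping one model per edge label means that a match on `name` and a match on `plot` are scored against different word distributions.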
Edge Specific Resource Models
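(Steps 4 and 5 of the overview)
Each resource (a tuple) is also represented as an RM
  … as final results (joined tuples) are obtained by combining resources
Edge-specific resource model: [formula not preserved in the transcript]
The score of a resource: cross-entropy of the edge-specific RM and ResM: [formula not preserved in the transcript]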
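Since the scoring formula itself is missing from the transcript, here is a minimal sketch of a cross-entropy score between a query model and a resource model. The toy distributions and the `floor` value used for unseen words are assumptions; the edge-specific decomposition is left out for brevity.

```python
import math

def cross_entropy(query_rm, resource_rm, floor=1e-6):
    """H(RM_Q, RM_R) = -sum_w P(w|RM_Q) * log P(w|RM_R); lower means closer."""
    return -sum(p * math.log(resource_rm.get(w, floor)) for w, p in query_rm.items())

# Illustrative distributions, not values from the slides.
query_rm = {"hepburn": 0.5, "audrey": 0.3, "holiday": 0.2}
audrey = {"hepburn": 0.5, "audrey": 0.5}
katharine = {"hepburn": 0.5, "katharine": 0.5}

print(cross_entropy(query_rm, audrey))
print(cross_entropy(query_rm, katharine))
```

The overview slide writes the score as D(RMQ||RMR). For a fixed query model, the KL divergence and the cross-entropy differ only by the entropy of the query model, so both induce the same ranking of resources.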
Smoothing
Well-known technique to address data sparseness and improve the accuracy of RMs (and LMs)
The probability of a word at a resource (symbol not preserved in the transcript) is the core probability for both the query RM and the resource RM
Local smoothing
Neighborhood of attribute a is another attribute a’ such that:
  a and a’ share the same resources
  resources of a and a’ are of the same type
  resources of a and a’ are connected over a FK
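Smoothing
Example: word probabilities for attribute name of resource p1
Resources in the example
  p1: type Person, name = Audrey Hepburn, birthplace = Ixelles Belgium
  p4: type Person, name = Katharine Hepburn, birthplace = Connecticut USA
  c1: type Character, name = Princess Ann, connected via pid_fk
words | unsmoothed | + same resource | + same type | + FK
audrey | 0.5 | 0.4 | 0.37 | 0.36
hepburn | 0.5 | 0.4 | 0.39 | 0.38
ixelles | | 0.1 | 0.09 | 0.08
belgium | | 0.1 | 0.09 | 0.08
katharine | | | 0.02 | 0.01
connecticut | | | 0.02 | 0.01
usa | | | 0.02 | 0.01
princess | | | | 0.035
ann | | | | 0.035
Smoothing of each type is controlled by weights: [formula not preserved in the transcript]
  where γ1, γ2, γ3 are control parameters set in experiments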
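The smoothing formula is not preserved in the transcript, so the sketch below assumes plain linear interpolation applied once per neighborhood type. The weights `GAMMA1`, `GAMMA2`, `GAMMA3` are hypothetical values chosen for illustration, not the parameters used in the experiments.

```python
def mix(base, neighbor, gamma):
    """Linear interpolation: (1 - gamma) * base + gamma * neighbor."""
    words = set(base) | set(neighbor)
    return {w: (1 - gamma) * base.get(w, 0.0) + gamma * neighbor.get(w, 0.0) for w in words}

def uniform(words):
    """Maximum-likelihood model of a short text in which every word occurs once."""
    return {w: 1.0 / len(words) for w in words}

# Attribute 'name' of p1 and its three kinds of neighbors from the example.
name_p1 = uniform(["audrey", "hepburn"])
same_resource = uniform(["ixelles", "belgium"])                      # birthplace of p1
same_type = uniform(["katharine", "hepburn", "connecticut", "usa"])  # attributes of p4
fk_neighbor = uniform(["princess", "ann"])                           # name of c1

GAMMA1, GAMMA2, GAMMA3 = 0.2, 0.08, 0.07  # hypothetical weights

step1 = mix(name_p1, same_resource, GAMMA1)
step2 = mix(step1, same_type, GAMMA2)
step3 = mix(step2, fk_neighbor, GAMMA3)

for word, prob in sorted(step3.items(), key=lambda kv: -kv[1]):
    print(f"{word:12s} {prob:.3f}")
```

With the assumed weight 0.2, the first step yields 0.4 for audrey and hepburn and 0.1 for ixelles and belgium. Each step moves a little probability mass from the words of the attribute itself to words seen in its neighborhood, while the distribution still sums to one.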
Ranking JRTs
(Step 9 of the overview)
Ranking aggregated JRTs:
  Cross-entropy between the edge-specific RM (query model) and the geometric mean of the combined edge-specific ResMs: [formula not preserved in the transcript]
The proposed score is monotonic w.r.t. individual resource scores
  … a desired property for most top-k algorithms
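A small sketch can show why a geometric mean gives a monotonic score. It assumes an unnormalized geometric mean and ignores the edge-specific decomposition, so it illustrates the property rather than reproducing the formula from the paper; the distributions and the `floor` value are made up.

```python
import math

def cross_entropy(query_rm, model, floor=1e-6):
    """-sum_w P(w|RM_Q) * log P(w|model), with a floor for unseen words."""
    return -sum(p * math.log(model.get(w, floor)) for w, p in query_rm.items())

def geometric_mean(models, floor=1e-6):
    """Word-wise (unnormalized) geometric mean of several resource models."""
    words = set().union(*models)
    n = len(models)
    return {w: math.exp(sum(math.log(m.get(w, floor)) for m in models) / n) for w in words}

def jrt_score(query_rm, resource_models):
    """Score of a joined result: cross-entropy against the combined model."""
    return cross_entropy(query_rm, geometric_mean(resource_models))

query_rm = {"hepburn": 0.5, "holiday": 0.3, "audrey": 0.2}
person = {"audrey": 0.5, "hepburn": 0.5}
movie = {"roman": 0.5, "holiday": 0.5}
character = {"princess": 0.5, "ann": 0.5}

combined = jrt_score(query_rm, [person, movie, character])
average = sum(cross_entropy(query_rm, m) for m in (person, movie, character)) / 3
print(combined, average)
```

Because the logarithm of a geometric mean is the average of the logarithms, the combined score equals the average of the individual resource scores. Improving any single resource score can therefore never make the joined result worse, which is the monotonicity that top-k algorithms rely on.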
Query Translation*
(Steps 6 and 7 of the overview)
Mapping of keywords to data elements
  Results in a set of keyword elements
Data graph exploration
  Search for substructures (query graphs) connecting keyword elements
  Bi-directional exploration of query graphs operates on a summary of the data graph only
Top-k computation
  Search guided by a scoring function to output only the top-k queries
Query graphs to be processed
  Free vs. non-free variables
*[Tran et al. ICDE’09]
[Figure: summary graph with classes Person, Character, Movie, Location, Studio, Producer; edges name, title, type, bornIn, pid_fk, mid_fk, hasDist, hasLoc, worksFor, is-a; keyword elements Hepburn (name of p1, p4) and Holiday (title of m1, m3)]
[Figure: resulting query graph: ?p name Hepburn, ?p type Person, ?c type Character, ?c pid_fk ?p, ?c mid_fk ?m, ?m type Movie, ?m title Holiday]
Top-k Query Processing
(Step 8 of the overview)
Top-k query processing (TQP) is highly common in Web-accessible databases
  Return the K highest-ranked answers
  Avoid unnecessary accesses to the database
TQP assumes
  Scoring function and attribute values are known a priori (e.g. RankJoin)
  Attribute values are combined by an aggregation function
  Sorted access (SA) and random access (RA) probes
How to adapt TQP to return the top-k relevant results?
  Results are joined sets of resources
  Scores are query-dependent
  No indexing is possible
Idea:
  Retrieve resources for non-free variables and rank them
  Use SA on those initially retrieved resources
  Use RA to find other resources
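The idea can be sketched as a best-first search over partial candidates. This is a simplified illustration, not the algorithm from the paper: it performs sorted access only on Person and reaches Character and Movie by random access. The Person and Movie scores and the score of c1 are taken from the walkthrough on the following slides; the remaining Character scores, the foreign keys of c2, c3 and c4, and the bound `ub_c` as a function argument are assumptions. The candidate scores in that walkthrough are consistent with summing resource scores and substituting upper bounds for unbound variables (0.20 + 0.11 + 0.19 = 0.50), which is what the sketch does.

```python
import heapq

PERSON = {"p1": 0.20, "p3": 0.18, "p5": 0.13, "p6": 0.12}
MOVIE = {"m2": 0.19, "m1": 0.18, "m3": 0.09, "m4": 0.08}
# id -> (score, pid, mid); everything except c1 is hypothetical.
CHARACTER = {
    "c1": (0.10, "p1", "m1"),
    "c2": (0.04, "p3", "m3"),
    "c3": (0.05, "p5", "m2"),
    "c4": (0.03, "p6", "m4"),
}

def top_k(k, person, movie, character, ub_c):
    """Best-first search; the queue is ordered by upper-bound score."""
    sorted_persons = sorted(person.items(), key=lambda kv: -kv[1])
    ub_m = max(movie.values())
    heap, results, next_sa = [], [], 0

    def push_next_person():
        # Sorted access: fetch the next-best Person and bound its completions.
        nonlocal next_sa
        if next_sa < len(sorted_persons):
            pid, score = sorted_persons[next_sa]
            next_sa += 1
            heapq.heappush(heap, (-(score + ub_c + ub_m), (pid, "*", "*")))

    push_next_person()
    while heap and len(results) < k:
        neg_bound, cand = heapq.heappop(heap)
        pid, cid, _ = cand
        if "*" not in cand:
            # Complete candidate on top of the queue: nothing unseen can beat it.
            results.append((cand, round(-neg_bound, 2)))
        elif cid == "*":
            push_next_person()
            # Random access: characters that reference this person.
            for c, (s_c, c_pid, _) in character.items():
                if c_pid == pid:
                    heapq.heappush(heap, (-(person[pid] + s_c + ub_m), (pid, c, "*")))
        else:
            # Random access: the movie referenced by the bound character.
            s_c, _, c_mid = character[cid]
            heapq.heappush(heap, (-(person[pid] + s_c + movie[c_mid]), (pid, cid, c_mid)))
    return results

results = top_k(1, PERSON, MOVIE, CHARACTER, ub_c=0.11)
print(results)
```

Because every partial candidate carries an upper bound on the score of any of its completions, the first complete candidate that reaches the top of the queue can be returned without touching the remaining resources.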
Top-k Query Processing
Result candidate c = <(x1,…,xk), score>
  Complete when all variables are bound to some resources
  xi = * indicates an unbound variable
Binding operator c’ = (c, xi → ri)
Threshold determines the upper bound for unseen resources
  Scheduling between SA and RA
  A tight bound is desired
Query graph: ?p name Hepburn, ?p type Person, ?c type Character, ?c pid_fk ?p, ?c mid_fk ?m, ?m type Movie, ?m title Holiday
Person
id | name | S(r)
p1 | Audrey Hepburn | 0.20
p3 | Katharine Hepburn | 0.18
p5 | Philip Hepburn | 0.13
p6 | Anna Hepburn | 0.12
Movie
id | title | S(r)
m2 | The Holiday | 0.19
m1 | Roman Holiday | 0.18
m3 | Holiday Blues | 0.09
m4 | Family Holiday | 0.08
Character (scores not yet known)
id | name | S(r)
c1 | Princess Ann |
c2 | Katharine Hepburn |
c3 | Iris Simpkins |
c4 | Louise |
Upper bound for unseen Character scores: 0.11
Threshold: 0.50
Output K=1
Priority Queue
<(p1,*,*),0.50>
<(*,*,m2),0.50>
Top-k Query Processing (walkthrough, next step)
Definitions, query graph, and the Person, Movie and Character tables are unchanged from the slide that introduces the result candidate c
Upper bound for unseen Character scores: 0.11
Threshold: 0.48
Output K=1
Priority Queue
<(p1,*,*),0.50>
<(*,*,m2),0.50>
<(p3,*,*),0.48>
Top-k Query Processing (walkthrough, next step)
Definitions, query graph, and the Person and Movie tables are unchanged from the slide that introduces the result candidate c
Character: c1 Princess Ann now has S(r) = 0.10; c2, c3 and c4 are still unscored
Upper bound for unseen Character scores: 0.10
Threshold: 0.47
Output K=1
Priority Queue
<(p1,c1,*),0.49>
<(*,*,m2),0.50>
<(p3,*,*),0.48>
Top-k Query Processing (walkthrough, next step)
Definitions, query graph, and the Person and Movie tables are unchanged from the slide that introduces the result candidate c
Character: c1 Princess Ann has S(r) = 0.10 and c3 Iris Simpkins now has S(r) = 0.05; c2 and c4 are still unscored
Upper bound for unseen Character scores: 0.09
Threshold: 0.46
Output K=1
Priority Queue
<(p1,c1,*),0.49>
<(*,c3,m2),0.44>
<(p3,*,*),0.48>
Top-k Query Processing (walkthrough, final step)
Definitions, query graph, and the Person and Movie tables are unchanged from the slide that introduces the result candidate c
Character: c1 Princess Ann has S(r) = 0.10 and c3 Iris Simpkins has S(r) = 0.05; c2 and c4 are still unscored
Upper bound for unseen Character scores: 0.09
Threshold: 0.46
Output K=1
Priority Queue
<(*,c3,m2),0.44>
<(p3,*,*),0.48>
<(p1,c1,m1),0.48>
Experiments
Experiments
Datasets: Subsets of Wikipedia, IMDB and Mondial Web databases
Queries: 50 queries for each dataset including “TREC style” queries and “single resource” queries
Metrics: (1) the number of top-1 relevant results, (2) reciprocal rank, and (3) mean average precision (MAP)
Baselines: BANKS, Bidirectional (proximity), Efficient, SPARK, CoveredDensity (TF-IDF)
RM-S: our approach
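For reference, the two rank-based metrics can be computed as in the sketch below, using their standard definitions. The function names and the toy rankings are illustrative and not taken from the evaluation.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none is returned."""
    for i, result in enumerate(ranked, start=1):
        if result in relevant:
            return 1.0 / i
    return 0.0

def average_precision(ranked, relevant):
    """Mean of precision@i over the ranks i at which relevant results appear."""
    hits, total = 0, 0.0
    for i, result in enumerate(ranked, start=1):
        if result in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over (ranking, relevant set) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["x", "y", "z"], {"y"}),            # AP = 1/2
]
print(reciprocal_rank(["x", "y", "z"], {"y"}))
print(mean_average_precision(runs))
```

Reciprocal rank suits the "single resource" queries, where exactly one answer is expected, while MAP also rewards rankings that place several relevant results early.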
Experiments
MAP scores for all queries
Reciprocal rank for single resource queries
Experiments
Precision-recall for TREC-style queries on Wikipedia
Application
Large amount of environmental data
Environmental issues stir public interest
  Increase transparency, awareness, responsibility, protection
Growing amount of data
  Public access through EU directive 2003/4/EC
  PortalU (Germany) http://www.portalu.de/
  EDP (UK) http://www.edp.nerc.ac.uk
  Envirofacts (USA) http://www.epa.gov/enviro/index.html
Linking data in international context
  Local government databases of environmental part of LOD cloud
  Linked environment data for the life sciences
Opportunity: mass dissemination and consumption of environmental data
The percentage of people who actively find environmental information is significantly lower than the percentage of those with frequent access to it!
Complex results
  CO emission values around the Karlsruhe area in Germany
Analytics
  CO emission values around the Karlsruhe area in Germany
    Sorted by year
    Bar chart
  Emission values of the US and Germany
    Compare average
    Timeline visualization
KOIOS – Overview
A semantic search system
  Exploits semantics in the data for keyword interpretation to hide the complexity of query languages and data representation
  Keyword search for searching structured data
  Lowers access barriers while enabling the richness of the data to be fully harnessed
Contribution
  Transfer research results to commercial EIS
  Selector mechanism
Process
  Input: keywords
  Facet-based refinement
  Selector (result and view template) initialization
  Output: query results embedded in specific views
KOIOS – Architecture
Facets generation
Derive facets from query results (not from the query!) for refinement
  Attributes serve as facet categories
  Attribute values serve as facet values
E.g. for ?s
  Statistics.description: “CO-Emission, PKW”, “CO-Emission, LKW”, …
  Value.year: 2005, 2006, …
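Facet derivation from result rows can be sketched in a few lines. The row layout (one dictionary per result, keyed by attribute) and the function name `derive_facets` are assumptions for this example; the attribute names and values follow the slide.

```python
from collections import defaultdict

def derive_facets(results):
    """Attributes become facet categories, their distinct values facet values."""
    facets = defaultdict(set)
    for row in results:
        for attribute, value in row.items():
            facets[attribute].add(value)
    return {attribute: sorted(values) for attribute, values in facets.items()}

# Illustrative result rows for the variable ?s.
results = [
    {"Statistics.description": "CO-Emission, PKW", "Value.year": 2005},
    {"Statistics.description": "CO-Emission, PKW", "Value.year": 2006},
    {"Statistics.description": "CO-Emission, LKW", "Value.year": 2005},
]
facets = derive_facets(results)
for category, values in facets.items():
    print(category, values)
```

Because the facets come from the results rather than from the query, every facet value offered for refinement is guaranteed to match at least one result.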
Selectors
Selector: parameterized, predefined result and view templates
  Data parameters: specify the scope of the information need; initialized to particular values based on facet categories and values
  Query parameters: additional data processing for analysis tasks (GROUP-BY, SORT, MIN, MAX, AVERAGE, etc.)
  Presentation parameters: visualization types (data value, data series, data table, map-based, specific diagram type, etc.)
Selector initialization
Selectors capture templates for information needs and the presentation of their results
Map facets to selectors and initialize them
  Applicable selectors: cover facet categories
  Initialize selectors based on facet values
  Initialized values are captured in the WHERE clause
  Non-initialized parameters are included in the SELECT clause
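The mapping from facets to an initialized selector can be sketched as follows. The selector layout (a dictionary with a `table` and a list of `parameters`), the table name `emission_values`, and the use of SQL placeholders are hypothetical; the slides do not show how KOIOS represents selectors internally.

```python
def is_applicable(selector, facet_values):
    """A selector is applicable if its parameters cover all facet categories."""
    return set(facet_values) <= set(selector["parameters"])

def initialize(selector, facet_values):
    """Initialized parameters go to WHERE, non-initialized ones to SELECT."""
    if not is_applicable(selector, facet_values):
        raise ValueError("selector does not cover all facet categories")
    bound = [p for p in selector["parameters"] if p in facet_values]
    free = [p for p in selector["parameters"] if p not in facet_values]
    sql = f"SELECT {', '.join(free) if free else '*'} FROM {selector['table']}"
    if bound:
        sql += " WHERE " + " AND ".join(f"{p} = ?" for p in bound)
    return sql, [facet_values[p] for p in bound]

# Hypothetical selector for emission statistics.
selector = {"table": "emission_values", "parameters": ["description", "year", "value"]}
sql, params = initialize(selector, {"description": "CO-Emission, PKW"})
print(sql)
print(params)
```

In this example the chosen facet value fixes `description`, so the query returns the remaining parameters `year` and `value`, which a presentation template could then render as a bar chart or timeline.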
Deployment
Hippolytos project (Theseus)
  Easy access to a spatial data warehouse (disy Cadenza) built for the domain of environmental administration
Data about emissions and waste
  From Baden-Württemberg
  Provided by: Umweltinformationssystem (UIS) Baden-Württemberg, Landesamt für Geoinformation und Landentwicklung (LGL) Baden-Württemberg, and Statistisches Landesamt Baden-Württemberg
Facets and selectors
Chart-based visualization
Map-based visualization
Conclusions
Keyword search on structured data is a popular problem for which various solutions exist.
We focus on the aspect of result ranking, providing a principled approach that employs relevance models.
Experiments show that RMs are promising for searching structured data.
Top-k query processing is proposed to retrieve only the most relevant results.
Application to environmental data enables intuitive access, visualization, and analysis of environmental information!
Thank you for your attention!
Questions?
Opportunity: mass dissemination and consumption of environmental data
Increase transparency, awareness, responsibility, protection
Challenges: intuitive access and visualization of structured environmental data and analytics
The percentage of people who actively find environmental information is significantly lower than the percentage of those with frequent access to it!
Complex structured queries
  Knowledge of the underlying data / query language
Complex structured data
  Heterogeneity and distribution of environmental data is overwhelming
Complex structured results
  Understanding results and extracting relevant information / analytics are difficult tasks
KOIOS
Semantic search system, KOIOS, for intuitive access, analysis, and visualization of structured environmental information
Overview and architecture
Structured query generation from keywords
Facet-based browsing and refinement
Selector initialization for final result and view construction
Implementation and deployment
Conclusions
Conclusions
Replace predefined forms and hard-coded visualization
Semantic search using lightweight semantics in data and schema to dynamically
  Translate keywords to queries
  Generate facets for results
  Initialize result and presentation templates
Enables intuitive access, visualization, and analysis of environmental information!
Ranking Schemes
Proximity between keyword nodes
  EASE:
  XRank: w is the smallest text window in n that contains all search keywords
SIGMOD09 Tutorial
Ranking Schemes
Based on graph structure
  BANKS
    Nodes:
    Edges:
  PageRank-like methods
    XRank [Guo et al, SIGMOD03]
    ObjectRank [Balmin et al, VLDB04]: considers both Global ObjectRank and Keyword-specific ObjectRank
Ranking Schemes
TF*IDF based:
  Discover/EASE [Liu et al, SIGMOD06]
    $$\mathrm{Score}(n, Q) = \sum_{w \in Q \cap n} \frac{1 + \ln(1 + \ln(tf))}{(1 - s) + s \cdot dl / avdl} \cdot \ln\frac{N + 1}{df}$$
  SPARK: but not at the node level
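A direct transcription of a pivoted-normalization TF*IDF score of this form is sketched below: a dampened term frequency, divided by a length normalization, times an inverse document frequency. The function signature, the default `s=0.2`, and the toy statistics are assumptions for illustration.

```python
import math

def tfidf_score(query, node_tokens, df, n_nodes, avdl, s=0.2):
    """Sum over query words in the node of ntf / ndl * idf."""
    dl = len(node_tokens)
    ndl = (1 - s) + s * dl / avdl            # pivoted length normalization
    score = 0.0
    for w in set(query):
        tf = node_tokens.count(w)
        if tf == 0:
            continue                          # only words in both Q and n count
        ntf = 1 + math.log(1 + math.log(tf))  # dampened term frequency
        idf = math.log((n_nodes + 1) / df[w])
        score += ntf / ndl * idf
    return score

# Toy statistics: 9 nodes, average length 2, 'hepburn' occurs in 1 node.
node = ["audrey", "hepburn"]
score = tfidf_score(["hepburn", "holiday"], node, {"hepburn": 1, "holiday": 3}, 9, 2.0)
print(score)
```

For a node of average length that contains a keyword exactly once, both normalizations equal 1 and the score reduces to the IDF term, here ln(10).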
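Relevance Models
[Figure: a relevance model M generates the query words q1, q2, q3 = palestinian, israeli, raids; the next sampled word w is unknown (???)]
P(w|Q) | w
.077 | palestinian
.055 | israel
.034 | jerusalem
.033 | protest
.027 | raid
.011 | clash
.010 | bank
.010 | west
.010 | troop
… | …
Sample probabilities:
$$P(w \mid q_1 \ldots q_k) = \frac{P(w)\, P(q_1 \ldots q_k \mid w)}{P(q_1 \ldots q_k)}$$
$$P(q_1 \ldots q_k \mid w) = \prod_{i=1}^{k} P(q_i \mid w), \qquad P(q \mid w) = \sum_{M} P(q \mid M)\, P(M \mid w)$$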