Databases & Information Retrieval
Maya Ramanath
(Further Reading:
Combining Database and Information-Retrieval Techniques for Knowledge Discovery. G. Weikum, G. Kasneci, M. Ramanath and F.M. Suchanek, CACM, April 2009.
DB & IR: Both Sides Now. G. Weikum, Keynote at SIGMOD 2007.)
DB and IR: Different Motivations
• Both deal with large amounts of information, but…
               DB                              IR
Applications   online reservation, banking     libraries
Emphasis       data consistency, efficiency    result quality, user satisfaction
Data           structured records              unstructured text
Queries        precise                         interpretations vary
Results        exact match/all results         ranked/top-k results
Why Combine Now?
• The applications drive the need
– The need to manage both structured and unstructured data in an integrated manner
• Healthcare example
– Find young patients in central Europe who have been reported, in the last two weeks, to have symptoms of tropical virus diseases and an indication of anomalies.
• Newspaper archives, product catalogues, etc.
Integrating DB & IR
[Figure: DB systems and IR systems positioned by query type and data type]
Queries: structured queries / Boolean match results (SQL) vs. unstructured queries / ranked results (keywords/top-k)
Data: structured data (relational) vs. unstructured data (text)
• DB systems
• IR systems
• Top-k processing, keyword search on graphs
• Extracting entities and relationships, ranking for entities
• Query processing for text search, effective query interfaces, ranking for structured data
Modules
1. Top-k Processing
2. Query Processing and Interfaces
3. Keyword Search on Graphs
4. Entity and Relationship Extraction
5. Ranking and Structured Data
1. Top-k Processing (1/2)
• Structured data, with scores in multiple dimensions
• Return the top-k “objects”
Car            Color
BMW X1         0.9
Honda City     0.8
Maruti Swift   0.6
Tata Nano      0.1

Car            Mileage
Honda City     0.8
Maruti Swift   0.6
Tata Nano      0.3
BMW X1         0.1

Car            Service
Tata Nano      0.7
Maruti Swift   0.6
Honda City     0.3
BMW X1         0.1
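
One way to read these three sorted lists is as input to a threshold-style top-k algorithm. The slide does not name an algorithm or a scoring function, so the following is only a sketch under two assumptions: the overall score of a car is the sum of its three scores, and the lists can be read both in sorted order and by direct lookup. The function name `threshold_topk` is illustrative.

```python
import heapq

# Scores copied from the three lists above.
lists = {
    "color":   {"BMW X1": 0.9, "Honda City": 0.8, "Maruti Swift": 0.6, "Tata Nano": 0.1},
    "mileage": {"Honda City": 0.8, "Maruti Swift": 0.6, "Tata Nano": 0.3, "BMW X1": 0.1},
    "service": {"Tata Nano": 0.7, "Maruti Swift": 0.6, "Honda City": 0.3, "BMW X1": 0.1},
}


def threshold_topk(lists, k):
    """Return the k objects with the highest summed score (assumed aggregation).

    Reads all lists in parallel in descending score order and stops as soon
    as the k-th best object seen so far is at least as good as the threshold,
    i.e. the sum of the scores at the current read depth.
    """
    lookups = list(lists.values())
    sorted_lists = [sorted(d.items(), key=lambda kv: -kv[1]) for d in lookups]
    seen = {}  # object -> full aggregated score
    max_depth = max(len(s) for s in sorted_lists)

    for depth in range(max_depth):
        threshold = 0.0
        for entries in sorted_lists:
            if depth < len(entries):
                obj, score = entries[depth]
                threshold += score
                if obj not in seen:
                    # "Random access": look up the object's score in every list.
                    seen[obj] = sum(d.get(obj, 0.0) for d in lookups)
        best = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(best) == k and best[-1][1] >= threshold:
            return best  # no unseen object can beat the current top-k
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])


result = threshold_topk(lists, k=2)
names = [name for name, _ in result]
print(names)  # ['Honda City', 'Maruti Swift']
```

With these numbers the loop stops after reading three rows of each list, without ever reading the last row.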
1. Top-k Processing (2/2)
• Top-k Joins
– Example: Return the best house-school pair

Houses   Rating   Location
H1       0.9      L1
H2       0.8      L2
H3       0.6      L3
H4       0.1      L3

Schools  Rating   Location
S1       0.4      L2
S2       0.2      L2
S3       0.8      L3
S4       0.1      L3
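
As a baseline for the house-school example, the sketch below joins the two tables on location and ranks the resulting pairs. It assumes that a pair's score is the sum of the two ratings, which the slide does not state. It also materializes every joining pair, which is exactly the work that dedicated top-k join algorithms try to avoid; `best_pairs` is an illustrative name.

```python
import heapq

# Rows copied from the two tables above: (id, rating, location).
houses = [("H1", 0.9, "L1"), ("H2", 0.8, "L2"), ("H3", 0.6, "L3"), ("H4", 0.1, "L3")]
schools = [("S1", 0.4, "L2"), ("S2", 0.2, "L2"), ("S3", 0.8, "L3"), ("S4", 0.1, "L3")]


def best_pairs(houses, schools, k=1):
    """Naive top-k join: build all pairs sharing a location, keep the k best."""
    schools_by_location = {}
    for school_id, rating, location in schools:
        schools_by_location.setdefault(location, []).append((school_id, rating))

    pairs = (
        (house_rating + school_rating, house_id, school_id)  # assumed: sum of ratings
        for house_id, house_rating, location in houses
        for school_id, school_rating in schools_by_location.get(location, [])
    )
    return heapq.nlargest(k, pairs)


top = best_pairs(houses, schools, k=1)
_, house, school = top[0]
print(house, school)  # H3 S3
```

Note that H1 has the highest house rating but no school at its location, so it never appears in a result.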
2. Query Processing and Interfaces (1/3)
• Given: Database of text documents and a text-centric task.
– Extract information about disease outbreaks
• Strategies
– Scan all documents: very expensive
– Filter promising documents: affects recall
• Develop cost models and execution strategies appropriate for this setting
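
The trade-off between the two strategies can be illustrated on a toy collection. Everything in this sketch is hypothetical: the three documents, the regular expression standing in for an extractor, and the single filter keyword. It only shows how filtering lowers the number of documents processed at the price of recall.

```python
import re

# Hypothetical extractor: finds "outbreak of X" or "epidemic of X".
PATTERN = re.compile(r"(?:outbreak|epidemic) of (\w+)", re.IGNORECASE)

# Hypothetical document collection.
docs = {
    "d1": "An outbreak of cholera was reported in the region.",
    "d2": "Officials confirmed an epidemic of dengue.",
    "d3": "The city council approved a new budget.",
}


def extract(text):
    """Return the disease name mentioned in the text, or None."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


def scan_all(docs):
    """Run the extractor on every document; cost = number of documents."""
    found = {}
    for doc_id, text in docs.items():
        disease = extract(text)
        if disease:
            found[doc_id] = disease
    return found, len(docs)


def filter_then_extract(docs, keyword):
    """Run the extractor only on documents containing the filter keyword."""
    candidates = {d: t for d, t in docs.items() if keyword in t.lower()}
    return scan_all(candidates)


full, full_cost = scan_all(docs)
filtered, filtered_cost = filter_then_extract(docs, "outbreak")
recall = len(filtered) / len(full)
print(full_cost, filtered_cost, recall)  # 3 1 0.5
```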
2. Query Processing and Interfaces (2/3)
Querying with “typed” keywords
• Keyword querying: Easy to use
• Structured queries: Precise
Find the middle ground…

Instead of
– “german has won nobel award”
– q(x) :- GERMAN(x), hasWonPrize(x,y), NOBEL_PRIZE(y)
use
– “german, has won (nobel award)”
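
To make the structured form concrete, the sketch below evaluates a conjunctive query of the shape GERMAN(x), hasWonPrize(x,y), NOBEL_PRIZE(y) over a tiny fact base. The entity names (`person_a`, `prize_1`, and so on) are placeholders invented for illustration, not real data, and the representation as Python sets is an assumption.

```python
# Hypothetical type assertions: type name -> set of entities of that type.
types = {
    "GERMAN": {"person_a", "person_b"},
    "NOBEL_PRIZE": {"prize_1"},
}

# Hypothetical hasWonPrize(x, y) facts.
has_won_prize = {
    ("person_a", "prize_1"),  # German, Nobel prize: matches
    ("person_b", "prize_2"),  # German, but prize_2 is not a Nobel prize
    ("person_c", "prize_1"),  # Nobel prize, but person_c is not typed GERMAN
}


def q(types, has_won_prize):
    """Return every x with GERMAN(x), hasWonPrize(x, y) and NOBEL_PRIZE(y)."""
    return {
        x
        for x, y in has_won_prize
        if x in types["GERMAN"] and y in types["NOBEL_PRIZE"]
    }


answers = q(types, has_won_prize)
print(answers)  # {'person_a'}
```

A typed-keyword interface would have to map "german" and "nobel award" to the two types and "has won" to the relation before a query like this can run.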
2. Query Processing and Interfaces (3/3)
• Does the output have to be a boring list of ranked results?
• Nope!
3. Keyword Search on Graphs (1/3)
• Lots of graphs around
– Relational DB (tuples + foreign keys)
– XML data (elements/sub-elements/id/idrefs)
– RDF (graph-structured knowledge bases)
• Easy to query with keywords, instead of SQL/XQuery/SPARQL
• Results are the top-k interconnections between the keywords
3. Keyword Search on Graphs (2/3)
3. Keyword Search on Graphs (3/3)
Query: “Einstein”, “Bohr”
[Figure: graph with nodes Einstein, Bohr, Nobel Prize, vegetarian, Tom Cruise and 1962, and edges labeled won, won, isa, isa, bornIn and diedIn]
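
A connection between the two query keywords can be found with a breadth-first search over the graph. The figure did not survive in the transcript, so the edge list below is an assumed reconstruction from the node and edge labels. The sketch returns only the single shortest connection, whereas the systems discussed here rank the top-k connections; `shortest_connection` is an illustrative name.

```python
from collections import deque

# Assumed reconstruction of the figure: (subject, edge label, object).
edges = [
    ("Einstein", "won", "Nobel Prize"),
    ("Bohr", "won", "Nobel Prize"),
    ("Einstein", "isa", "vegetarian"),
    ("Tom Cruise", "isa", "vegetarian"),
    ("Tom Cruise", "bornIn", "1962"),
    ("Bohr", "diedIn", "1962"),
]


def build_adjacency(edges):
    """Treat edges as undirected so a connection may traverse them either way."""
    adjacency = {}
    for subject, label, obj in edges:
        adjacency.setdefault(subject, []).append((label, obj))
        adjacency.setdefault(obj, []).append((label, subject))
    return adjacency


def shortest_connection(edges, source, target):
    """Return the shortest path as [node, label, node, ...], or None."""
    adjacency = build_adjacency(edges)
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for label, neighbor in adjacency.get(node, []):
            if neighbor not in parent:
                parent[neighbor] = (node, label)
                queue.append(neighbor)
    if target not in parent:
        return None

    # Walk back from the target to the source, then reverse.
    path = [target]
    node = target
    while parent[node] is not None:
        previous, label = parent[node]
        path.extend([label, previous])
        node = previous
    return path[::-1]


connection = shortest_connection(edges, "Einstein", "Bohr")
print(connection)  # ['Einstein', 'won', 'Nobel Prize', 'won', 'Bohr']
```

Under the assumed edges, the longer connection through vegetarian, Tom Cruise and 1962 also links the two keywords, which is why results need to be ranked.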
4. Entity and Relationship Extraction (1/2)
Information Extraction (or Knowledge Harvesting)
Bill Gates was the founder of Microsoft and later its CEO.
Apple was established on April 1, 1976 by Steve Jobs, Steve Wozniak, and Ronald Wayne.
Infosys was founded on 2 July 1981 by seven entrepreneurs: N. R. Narayana Murthy, Nandan Nilekani, …
Company     Founder
Microsoft   Bill Gates
Apple       Steve Jobs
Apple       Steve Wozniak
Infosys     N. R. Narayana Murthy
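
A very small rule-based extractor shows how sentences like the first two above could be turned into (Company, Founder) rows. The two regular expressions are hand-written assumptions that fit only these sentence shapes; they are not the rules of any particular extraction system, and the Infosys sentence is left out because its list of founders is elided in the slide.

```python
import re

sentences = [
    "Bill Gates was the founder of Microsoft and later its CEO.",
    "Apple was established on April 1, 1976 by Steve Jobs, Steve Wozniak, and Ronald Wayne.",
]

# Assumed pattern 1: "<founder> was the founder of <company> ..."
FOUNDER_OF = re.compile(r"^(?P<founder>.+?) was the founder of (?P<company>[A-Z]\w*)")

# Assumed pattern 2: "<company> was established|founded on <date> by <founders>"
FOUNDED_BY = re.compile(
    r"^(?P<company>[A-Z]\w*) was (?:established|founded) on .+? by (?P<founders>.+)$"
)

# Splits "A, B, and C" or "A and B" into individual names.
NAME_SEPARATOR = re.compile(r",\s*(?:and\s+)?|\s+and\s+")


def extract_founders(sentence):
    """Return a list of (company, founder) pairs found in one sentence."""
    text = sentence.rstrip(".")
    match = FOUNDER_OF.match(text)
    if match:
        return [(match.group("company"), match.group("founder"))]
    match = FOUNDED_BY.match(text)
    if match:
        names = NAME_SEPARATOR.split(match.group("founders"))
        return [(match.group("company"), name) for name in names if name]
    return []


facts = [pair for sentence in sentences for pair in extract_founders(sentence)]
for company, founder in facts:
    print(company, "|", founder)
```

The rules also yield (Apple, Ronald Wayne), which the sentence supports although the table above does not list it.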
4. Entity and Relationship Extraction (2/2)
• How to build a knowledge-base of facts?
– Structurize Wikipedia
– Construct rules for extraction
• How do I acquire all the facts in the world?
– Extract “everything”
– Don’t stop extracting
5. Ranking and Structured Data
• Not the same as top-k processing
• Given: Data with structure in it
– Relational tables (flat)
– XML (trees/graphs)
– Text documents consisting of entities
• Task: Rank the query results
– SQL/XQuery/“typed” keywords
QUESTIONS?