# Chapter 16: Keyword Search


Slides for Chapter 16, "Keyword Search," of *Principles of Data Integration* by AnHai Doan, Alon Halevy, and Zachary Ives.

## Principles of Data Integration

## Keyword Search over Structured Data

- Anyone who has used a computer knows how to use keyword search
- No need to understand logic or query languages
- No need to understand (or even have) structure in the data
- Database-style queries are more precise, but:
  - they are more difficult for users to specify
  - they require a schema to query over!
- Constructing a mediated, queryable schema is one of the major challenges in getting a data integration system deployed
- Can we use keyword search to help?

## The Foundations

Keyword search was studied in the database context before being extended to data integration.

We'll start with these foundations before looking at what is different in the integration context:

- How we model a database and the keyword search problem
- How we process keyword searches and efficiently return the top-scoring (top-k) results

## Outline

- Basic concepts
  - Data graph
  - Keyword matching and scoring models
- Algorithms for ranked results
- Keyword search for data integration

## The Data Graph

Captures relationships, and their strengths, among data and metadata items.

- Nodes
  - Classes, tables, attributes, field values
  - May be weighted, representing authoritativeness, quality, correctness, etc.
- Edges
  - is-a and has-a relationships, foreign keys, hyperlinks, record links, schema alignments, possible joins, etc.
  - May be weighted, representing strength of the connection, probability of a match, etc.
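As a concrete data structure, the data graph can be sketched as an adjacency map with weighted nodes and edges. This is a minimal illustrative sketch, not the chapter's implementation; the class and the relation/value names below are assumptions chosen to match the running example.

```python
# A minimal data graph: weighted nodes (authoritativeness, quality, ...)
# and weighted edges (connection strength); lower weight = stronger/cheaper.

class DataGraph:
    def __init__(self):
        self.node_weight = {}   # node id -> node weight
        self.edges = {}         # node id -> {neighbor id: edge weight}

    def add_node(self, node, weight=0.0):
        self.node_weight.setdefault(node, weight)
        self.edges.setdefault(node, {})

    def add_edge(self, u, v, weight):
        # Store edges in both directions to simplify exploration.
        self.add_node(u)
        self.add_node(v)
        self.edges[u][v] = weight
        self.edges[v][u] = weight

g = DataGraph()
g.add_edge("Term", "Term2Ontology", 0.1)     # foreign key
g.add_edge("Term2Ontology", "Entry2Pub", 0.2)
g.add_edge("Entry2Pub", "Pubs", 0.1)
g.add_edge("Term", "plasma membrane", 0.0)   # attribute -> field value
```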

## Querying the Data Graph

Queries are expressed as sets of keywords.

We match keywords to nodes, then seek a tree that connects the matched nodes.

- The lowest-cost tree connecting a set of nodes is called a *Steiner tree*
- Formally, we want the top-k Steiner trees
- However, this problem is NP-hard in the size of the graph!

## Data Graph Example: Gene Terms, Classifications, Publications

- Blue nodes represent tables: genetic terms, record link to ontology, record link to publications, etc.
- Pink nodes represent attributes (columns)
- Brown rectangles represent field values
- Edges represent foreign keys, membership, etc.

## Querying the Data Graph

Example keywords: *membrane*, *publication*, *title*.

- Relational query 1 tree: Term, Term2Ontology, Entry2Pub, Pubs
- Relational query 2 tree: Term, Term2Ontology, Entry, Pubs

(The figure also shows an index into the tables, which is not part of the results.)

## Trees to Ranked Results

Each query Steiner tree becomes a conjunctive query that returns the matching attributes and the keys of the matching relations:

- Nodes → relation atoms, variables, bound values
- Edges → join predicates, inclusion, etc.
- Keyword matches to value nodes → selection predicates

Query tree 1 becomes:

q1(A, P, T) :- Term(A, "plasma membrane"), Term2Ontology(A, E), Entry2Pub(E, P), Pubs(P, T)

Computing and executing this query yields results. Assign a score to each, based on the weights in the query and similarity scores from approximate joins or matches.
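One way to execute such a conjunctive query is to translate the tree into SQL. The sketch below is an illustrative assumption, not the chapter's method; the attribute names (`acc`, `entry_ac`, `pub_id`, `title`) follow the example schema, and the generic translation helper is hypothetical.

```python
# Turn a query tree into SQL: atoms become FROM entries, edges become
# join predicates, keyword matches become selection predicates.

def tree_to_sql(atoms, joins, selections, projections):
    """atoms: list of (table, alias); joins/selections: SQL predicate strings."""
    where = joins + selections
    return ("SELECT " + ", ".join(projections) +
            " FROM " + ", ".join(f"{t} {a}" for t, a in atoms) +
            (" WHERE " + " AND ".join(where) if where else ""))

# Query tree 1 from the slides, under assumed column names:
sql = tree_to_sql(
    atoms=[("Term", "t"), ("Term2Ontology", "o"),
           ("Entry2Pub", "ep"), ("Pubs", "p")],
    joins=["t.acc = o.acc", "o.entry_ac = ep.entry_ac",
           "ep.pub_id = p.pub_id"],
    selections=["t.name = 'plasma membrane'"],
    projections=["t.acc", "p.pub_id", "p.title"],
)
```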

## Where Do Weights Come From?

- Node weights:
  - Expert scores
  - PageRank and other authoritativeness scores
  - Data quality metrics
- Edge weights:
  - String similarity metrics (edit distance, TF*IDF, etc.)
  - Schema matching scores
  - Probabilistic matches

In some systems the weights are all learned.
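As one concrete illustration of a string-similarity edge weight, normalized edit distance can be used so that identical strings get weight 0 (the cheapest edge). The normalization choice here is an assumption, not the chapter's definition:

```python
# Levenshtein edit distance via dynamic programming, then normalized
# by the longer string's length to yield an edge weight in [0, 1].

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def edge_weight(a: str, b: str) -> float:
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))
```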

## Scoring Query Results

The next issue: how to compose the scores in a query tree. Weights are treated as costs or dissimilarities, so we want the k lowest-cost trees.

Two common scoring models exist:

1. Sum the edge weights in the query tree
   - The tree may have a required root (in some models), or not
   - If there are node weights, move them onto extra edges (see text)
2. Sum the costs of the root-to-leaf paths
   - This is for trees with required roots
   - There may be multiple overlapping root-to-leaf paths; certain edges get double-counted, but each path's cost is computed independently
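The two scoring models can be contrasted on a toy rooted tree (the tree and weights below are an assumed example, not from the chapter):

```python
# A rooted tree A -> B -> {C, D}, stored as child -> (parent, edge weight).
parent = {"B": ("A", 1.0), "C": ("B", 2.0), "D": ("B", 3.0)}
leaves = ["C", "D"]

def edge_sum_score(parent):
    # Model 1: each edge counted exactly once.
    return sum(w for (_, w) in parent.values())

def path_sum_score(parent, leaves):
    # Model 2: sum root-to-leaf path costs; edges shared near the
    # root (here A-B) are counted once per path.
    total = 0.0
    for leaf in leaves:
        node = leaf
        while node in parent:       # walk up to the root
            p, w = parent[node]
            total += w
            node = p
    return total

print(edge_sum_score(parent))            # 1 + 2 + 3 = 6.0
print(path_sum_score(parent, leaves))    # (1+2) + (1+3) = 7.0
```

The difference (7.0 vs. 6.0) is exactly the double-counted shared edge A-B.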

## Outline

- Basic concepts
- Algorithms for ranked results
- Keyword search for data integration

## Top-k Answers

The challenge: efficiently computing the top-k scoring answers, at scale.

Two general classes of algorithms:

- Graph expansion -- the score is based on edge weights
  - Model data + schema as a single graph
  - Use a heuristic search strategy to explore from keyword matches to find trees
- Threshold-based merging -- the score is a function of field values
  - Given a scoring function that depends on multiple attributes, how do we merge the results?

Often combinations of the two are used.

## Graph Expansion

Basic process:

1. Use an inverted index to find matches between keywords and graph nodes
2. Iteratively search from the matches until we find trees

(Figure: tables Term(acc, name, ...), Term2Ontology(go_id, entry_ac), Entry2Pub(entry_ac, pub_id), and Pubs(pub_id, ..., title); the keyword *membrane* matches the field value "GO:00059 plasma membrane" and *title* matches the title attribute.)

## What Is the Expansion Process?

Assumptions here:

- The query result will be a rooted tree; the root is determined by the direction of foreign keys
- The scoring model is the sum of edge weights (see text for other cases)

Two main heuristics:

- Backwards expansion
  - Create a cluster for each leaf (keyword-match) node
  - Expand each cluster by following foreign keys backwards, lowest-cost-first
  - Repeat until the clusters intersect
- Bidirectional expansion
  - Additionally maintain a cluster for the root node
  - Expand the clusters in a prioritized way
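Backwards expansion can be sketched as a lowest-cost-first (Dijkstra-style) search from each keyword-match cluster over reversed foreign-key edges, stopping when some node has been reached by every cluster. This is a simplified toy, not the chapter's exact algorithm, and the graph below is an assumed fragment of the running example:

```python
import heapq

def backwards_expand(rev_edges, keyword_nodes):
    """rev_edges: node -> [(neighbor, weight)], following FKs backwards.
    Returns (root, total cost) of the first tree found, or None."""
    n = len(keyword_nodes)
    dist = {}       # (cluster id, node) -> best cost found
    reached = {}    # node -> set of cluster ids that reached it
    pq = [(0.0, i, start) for i, start in enumerate(keyword_nodes)]
    while pq:
        d, i, u = heapq.heappop(pq)   # cheapest frontier entry overall
        if (i, u) in dist:
            continue
        dist[(i, u)] = d
        reached.setdefault(u, set()).add(i)
        if len(reached[u]) == n:      # all clusters intersect: candidate root
            return u, sum(dist[(j, u)] for j in range(n))
        for v, w in rev_edges.get(u, []):
            if (i, v) not in dist:
                heapq.heappush(pq, (d + w, i, v))
    return None

rev = {
    "plasma membrane": [("Term", 0.0)],
    "title": [("Pubs", 0.0)],
    "Term": [("Term2Ontology", 1.0)],
    "Pubs": [("Entry2Pub", 1.0)],
    "Term2Ontology": [("Entry2Pub", 1.0)],
}
print(backwards_expand(rev, ["plasma membrane", "title"]))
```

A full system would continue the search to enumerate the top-k trees rather than stopping at the first intersection.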

## Querying the Data Graph

(Figure: the data graph again, showing the expansion from the keyword matches *membrane*, *publication*, and *title*.)

## Graph vs. Attribute-Based Scores

The previous strategy focuses on finding different subgraphs to identify the tuples to return:

- It assumes the costs are defined by edge weights
- It uses prioritized exploration to find connections

But part of the score may be defined in terms of the values of specific attributes in the query:

score = ... + weight1 * T1.attrib1 + weight2 * T2.attrib2 + ...

Assume we have an index of partial tuples in sort order of each such attribute, and a way of computing the remaining results, e.g., by joining the partial tuples with others.

## Threshold-Based Merging with Random Access

Given multiple sorted indices L1, ..., Lm over the same stream of tuples, try to return the k best-cost tuples with the fewest I/Os.

- Assume the cost function t(x1, x2, ..., xm) is monotone, i.e., t(x1, x2, ..., xm) ≤ t(x1′, x2′, ..., xm′) whenever xi ≤ xi′ for every i
- Assume we can retrieve/compute the tuples associated with each xi

(Figure: indices L1 on x1 through Lm on xm feed a threshold-based merge, which produces the k best-ranked results with cost = t(x1, x2, ..., xm).)

## The Basic Thresholding Algorithm with Random Access (Sketch)

1. In parallel, read each of the indices Li in sorted order
2. For each value xi retrieved from Li, retrieve the tuple R
3. Obtain the full set of tuples R′ containing R (this may involve computing a join query with R)
4. Compute the score t(R″) for each tuple R″ in R′
5. If t(R″) is one of the k best scores seen so far, remember R″ and t(R″), breaking ties arbitrarily
6. For each index Li, let x̄i be the lowest value of xi read so far from that index; set the threshold τ = t(x̄1, x̄2, ..., x̄m)
7. Once we have seen k objects whose score is at least τ, halt and return the k highest-scoring tuples that have been remembered

## An Example: Tables & Indices

Full data:

| name               | location              | rating | price |
|--------------------|-----------------------|--------|-------|
| Alma de Cuba       | 1523 Walnut St.       | 4      | 3     |
| Moshulu            | 401 S. Columbus Blvd. | 4      | 4     |
| Sotto Varalli      | 231 S. Broad St.      | 3.5    | 3     |
| McGillin's         | 1310 Drury St.        | 4      | 2     |
| Di Nardo's Seafood | 312 Race St.          | 3      | 2     |

Lrating: index by rating, descending:

| rating | name               |
|--------|--------------------|
| 4      | Alma de Cuba       |
| 4      | Moshulu            |
| 4      | McGillin's         |
| 3.5    | Sotto Varalli      |
| 3      | Di Nardo's Seafood |

Lprice: index by (5 - price), descending:

| 5 - price | name               |
|-----------|--------------------|
| 3         | McGillin's         |
| 3         | Di Nardo's Seafood |
| 2         | Alma de Cuba       |
| 2         | Sotto Varalli      |
| 1         | Moshulu            |

## Reading and Merging Results

Cost formula: t(rating, price) = 0.5 * rating + 0.5 * (5 - price)

Round 1: read the top entry of each index.

- From Lrating: Alma de Cuba (rating 4); t_alma = 0.5*4 + 0.5*2 = 3
- From Lprice: McGillin's (5 - price = 3); t_mcgillins = 0.5*4 + 0.5*3 = 3.5
- Threshold τ = 0.5*4 + 0.5*3 = 3.5; no tuples score above τ, so we continue

Round 2: read the next entry of each index.

- From Lrating: Moshulu; t_moshulu = 0.5*4 + 0.5*1 = 2.5
- From Lprice: Di Nardo's Seafood; t_dinardos = 0.5*3 + 0.5*3 = 2.5
- τ is still 0.5*4 + 0.5*3 = 3.5

Round 3: the next entries (McGillin's in Lrating, Alma de Cuba in Lprice) have already been read, but the threshold drops to τ = 0.5*4 + 0.5*2 = 3.

Round 4: read Sotto Varalli from both indices.

- t_sotto = 0.5*3.5 + 0.5*2 = 2.75
- τ = 0.5*3.5 + 0.5*2 = 2.75; now 3 tuples are at or above the threshold (McGillin's at 3.5, Alma de Cuba at 3, Sotto Varalli at 2.75), so for k ≤ 3 we can halt
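The walkthrough above can be run end to end. The data comes from the example tables; the implementation details (round-robin reads over Python lists, random access via a dict) are illustrative assumptions, not the chapter's code:

```python
import heapq

data = {  # name -> (rating, 5 - price)
    "Alma de Cuba": (4.0, 2.0),
    "Moshulu": (4.0, 1.0),
    "Sotto Varalli": (3.5, 2.0),
    "McGillin's": (4.0, 3.0),
    "Di Nardo's Seafood": (3.0, 3.0),
}

def t(rating, inv_price):
    return 0.5 * rating + 0.5 * inv_price

def threshold_algorithm(k):
    # Sorted-access lists, descending by each score component.
    L_rating = sorted(data, key=lambda n: -data[n][0])
    L_price = sorted(data, key=lambda n: -data[n][1])
    seen = {}
    for r_name, p_name in zip(L_rating, L_price):
        # Random access: fetch the full tuple for whatever we just read.
        for name in (r_name, p_name):
            seen[name] = t(*data[name])
        # Threshold from the lowest component values read so far.
        tau = t(data[r_name][0], data[p_name][1])
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= tau:
            return top          # k tuples score at least tau: halt early
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

print(threshold_algorithm(2))
```

For k = 2 this halts after the third round (τ = 3), returning McGillin's (3.5) and a tuple scoring 3.0, without ever reading Moshulu's price entry at the bottom of Lprice via sorted access alone.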

## Summary of Top-k Algorithms

- Algorithms for producing top-k results seek to minimize the amount of computation and I/O
- Graph-based methods start with the leaf (and possibly root) nodes and perform a prioritized search
- Threshold-based algorithms seek to minimize the amount of full computation that needs to happen
  - They require a way of accessing subresults by each score component, in decreasing order of that component
- These are the main building blocks of keyword search over databases, and they are sometimes used in combination

## Outline

- Basic concepts
- Algorithms for ranked results
- Keyword search for data integration