Chapter 16: Keyword Search


From Principles of Data Integration, by AnHai Doan, Alon Halevy, and Zachary Ives. Slide transcript.


  • PRINCIPLES OF DATA INTEGRATION

  • Keyword Search over Structured Data
    - Anyone who has used a computer knows how to use keyword search
    - No need to understand logic or query languages
    - No need to understand (or have) structure in the data
    - Database-style queries are more precise, but:
      - They are more difficult for users to specify
      - They require a schema to query over!
    - Constructing a mediated, queriable schema is one of the major challenges in getting a data integration system deployed
    - Can we use keyword search to help?

  • The Foundations
    - Keyword search was studied in the database context before being extended to data integration
    - We'll start with these foundations before looking at what is different in the integration context:
      - How we model a database and the keyword search problem
      - How we process keyword searches and efficiently return the top-scoring (top-k) results

  • Outline
    - Basic concepts
      - Data graph
      - Keyword matching and scoring models
    - Algorithms for ranked results
    - Keyword search for data integration

  • The Data Graph
    - Captures relationships, and their strengths, among data and metadata items
    - Nodes: classes, tables, attributes, field values
      - May be weighted, representing authoritativeness, quality, correctness, etc.
    - Edges: is-a and has-a relationships, foreign keys, hyperlinks, record links, schema alignments, possible joins, ...
      - May be weighted, representing strength of the connection, probability of a match, etc.
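
    A data graph like this can be sketched as weighted nodes plus weighted (undirected) adjacency lists. This is a minimal illustration, not a real system's API; the node names and edge costs below are invented for the running gene/publication example.

    ```python
    # A minimal sketch of a data graph: weighted nodes (tables, attributes,
    # field values) and cost-weighted edges (foreign keys, membership, ...).
    # Names and weights are illustrative only.

    class DataGraph:
        def __init__(self):
            self.node_weight = {}  # node -> weight (e.g., quality/authority score)
            self.edges = {}        # node -> {neighbor: edge cost}

        def add_node(self, node, weight=1.0):
            self.node_weight.setdefault(node, weight)
            self.edges.setdefault(node, {})

        def add_edge(self, u, v, cost):
            # Edges are stored in both directions so that search can follow
            # foreign keys "backwards" as well as forwards.
            self.add_node(u)
            self.add_node(v)
            self.edges[u][v] = cost
            self.edges[v][u] = cost

    g = DataGraph()
    g.add_edge("Term", "Term2Ontology", 0.2)       # foreign key
    g.add_edge("Term2Ontology", "Entry2Pub", 0.4)  # record link
    g.add_edge("Entry2Pub", "Pubs", 0.2)           # foreign key
    g.add_edge("Term", "plasma membrane", 0.1)     # field-value membership
    ```

    Lower edge cost means a stronger connection; search algorithms later in the chapter explore this graph lowest-cost-first.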

  • Querying the Data Graph
    - Queries are expressed as sets of keywords
    - We match keywords to nodes, then seek a way to connect the matches in a tree
    - The lowest-cost tree connecting a set of nodes is called a Steiner tree
    - Formally, we want the top-k Steiner trees; however, finding these is NP-hard in the size of the graph!
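
    To make the Steiner-tree notion concrete, here is a brute-force sketch that finds the cheapest tree connecting a set of terminal nodes by trying every subset of edges. This illustrates the definition (and why exact solutions do not scale); it is not an algorithm anyone would use on a real data graph.

    ```python
    # Exhaustive minimum-cost Steiner tree on a tiny graph: try every edge
    # subset, keep the cheapest tree that spans all terminals. Exponential
    # in the number of edges -- the problem is NP-hard in general.
    from itertools import combinations

    def connected(subset, nodes):
        """True if the edge subset connects all the given nodes."""
        adj = {n: [] for n in nodes}
        for u, v, _ in subset:
            adj[u].append(v)
            adj[v].append(u)
        seen, stack = set(), [next(iter(nodes))]
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(adj[n])
        return seen == nodes

    def steiner_cost(edges, terminals):
        """edges: list of (u, v, cost). Returns the cost of the cheapest
        tree whose nodes include all terminals, or None if none exists."""
        best = None
        for r in range(1, len(edges) + 1):
            for subset in combinations(edges, r):
                nodes = {u for u, v, c in subset} | {v for u, v, c in subset}
                if not set(terminals) <= nodes:
                    continue
                # A connected subgraph with r edges and r+1 nodes is a tree.
                if len(nodes) != r + 1 or not connected(subset, nodes):
                    continue
                cost = sum(c for _, _, c in subset)
                if best is None or cost < best:
                    best = cost
        return best

    # Going A-B-C (cost 2) beats the direct A-C edge (cost 3).
    edges = [("A", "B", 1), ("B", "C", 1), ("A", "C", 3)]
    best = steiner_cost(edges, ["A", "C"])
    ```

    Note that the cheapest connection may pass through intermediate ("Steiner") nodes that are not terminals, as node B does here.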

  • Data Graph Example: Gene Terms, Classifications, Publications
    - Blue nodes represent tables: genetic terms, record link to ontology, record link to publications, etc.
    - Pink nodes represent attributes (columns)
    - Brown rectangles represent field values
    - Edges represent foreign keys, membership, etc.

  • Querying the Data Graph (example)
    - The keywords "membrane" and "publication" match nodes in the graph
    - Relational query tree 1: Term, Term2Ontology, Entry2Pub, Pubs
    - Relational query tree 2: Term, Term2Ontology, Entry, Pubs
    [Figure: the two candidate join trees connecting the keyword matches; the index to tables is not part of the results]

  • Trees to Ranked Results
    - Each query Steiner tree becomes a conjunctive query
      - Return matching attributes, keys of matching relations
      - Nodes → relation atoms, variables, bound values
      - Edges → join predicates, inclusion, etc.
      - Keyword matches to value nodes → selection predicates
    - Query tree 1 becomes:
      q1(A,P,T) :- Term(A, "plasma membrane"), Term2Ontology(A, E), Entry2Pub(E, P), Pubs(P, T)
    - Computing and executing this query yields results
    - Assign a score to each, based on the weights in the query and similarity scores from approximate joins or matches
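
    The conjunctive query q1 is just a join query, so it can be rendered directly as SQL. The sketch below runs it against a toy in-memory SQLite database; the column names follow the slides' figures, and the table contents are invented for illustration.

    ```python
    # Render q1(A,P,T) :- Term(A,'plasma membrane'), Term2Ontology(A,E),
    #                     Entry2Pub(E,P), Pubs(P,T)
    # as SQL over a toy schema. Data is made up for the example.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.executescript("""
        CREATE TABLE Term(acc TEXT, name TEXT);
        CREATE TABLE Term2Ontology(acc TEXT, entry_ac TEXT);
        CREATE TABLE Entry2Pub(entry_ac TEXT, pub_id TEXT);
        CREATE TABLE Pubs(pub_id TEXT, title TEXT);
        INSERT INTO Term VALUES ('GO:0005', 'plasma membrane');
        INSERT INTO Term2Ontology VALUES ('GO:0005', 'E1');
        INSERT INTO Entry2Pub VALUES ('E1', 'P1');
        INSERT INTO Pubs VALUES ('P1', 'Membrane proteins survey');
    """)

    # Relation atoms become FROM/JOIN clauses, shared variables become
    # join predicates, and the bound value becomes a selection predicate.
    rows = cur.execute("""
        SELECT t.acc, p.pub_id, p.title
        FROM Term t
        JOIN Term2Ontology o ON o.acc = t.acc
        JOIN Entry2Pub ep   ON ep.entry_ac = o.entry_ac
        JOIN Pubs p         ON p.pub_id = ep.pub_id
        WHERE t.name = 'plasma membrane'
    """).fetchall()
    ```

    Each Steiner tree produced by the search yields one such query; scoring then ranks the answers across all of them.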

  • Where Do Weights Come From?
    - Node weights:
      - Expert scores
      - PageRank and other authoritativeness scores
      - Data quality metrics
    - Edge weights:
      - String similarity metrics (edit distance, TF*IDF, etc.)
      - Schema matching scores
      - Probabilistic matches
    - In some systems the weights are all learned

  • Scoring Query Results
    - The next issue: how to compose the scores in a query tree
    - Weights are treated as costs or dissimilarities; we want the k lowest-cost trees
    - Two common scoring models exist:
      - Sum the edge weights in the query tree
        - The tree may have a required root (in some models), or not
        - If there are node weights, move them onto extra edges (see text)
      - Sum the costs of the root-to-leaf paths
        - This is for trees with required roots
        - There may be multiple overlapping root-to-leaf paths; certain edges get double-counted, but the paths are treated as independent
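
    The two scoring models can be contrasted on a small rooted tree. In this sketch the tree shape and edge costs are invented; note how the edge shared by two leaves is counted once in the first model and twice in the second.

    ```python
    # Two scoring models for a rooted query tree.
    # tree maps each child to (parent, edge_cost); A is the root.
    tree = {
        "B": ("A", 1.0),
        "C": ("A", 2.0),
        "D": ("B", 0.5),
        "E": ("B", 0.5),
    }

    # Model 1: sum of edge weights -- each edge counted exactly once.
    edge_sum = sum(cost for _, cost in tree.values())

    # Model 2: sum of root-to-leaf path costs -- an edge near the root is
    # counted once per leaf below it (A-B is shared by leaves D and E).
    parents = {p for p, _ in tree.values()}
    leaves = [n for n in tree if n not in parents]

    def path_cost(node):
        cost = 0.0
        while node in tree:
            parent, c = tree[node]
            cost += c
            node = parent
        return cost

    path_sum = sum(path_cost(leaf) for leaf in leaves)
    ```

    Here `edge_sum` is 1 + 2 + 0.5 + 0.5 = 4, while `path_sum` is 2 + 1.5 + 1.5 = 5, because the A-B edge is double-counted under the root-to-leaf model.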

  • Outline
    - Basic concepts
    - Algorithms for ranked results
    - Keyword search for data integration

  • Top-k Answers
    - The challenge: efficiently computing the top-k scoring answers, at scale
    - Two general classes of algorithms:
      - Graph expansion: score is based on edge weights
        - Model data + schema as a single graph
        - Use a heuristic search strategy to explore from keyword matches to find trees
      - Threshold-based merging: score is a function of field values
        - Given a scoring function that depends on multiple attributes, how do we merge the results?
    - Often combinations of the two are used

  • Graph Expansion
    - Basic process:
      - Use an inverted index to find matches between keywords and graph nodes
      - Iteratively search from the matches until we find trees
    [Figure: the Term (acc, name, ...), Term2Ontology (go_id, entry_ac), Entry2Pub (entry_ac, pub_id), and Pubs (pub_id, title, ...) tables, with the keyword "membrane" matching the value "plasma membrane" and the keyword "title" matching the title attribute]

  • What Is the Expansion Process?
    - Assumptions here:
      - The query result will be a rooted tree; the root is determined by the direction of the foreign keys
      - The scoring model is the sum of edge weights (see text for other cases)
    - Two main heuristics:
      - Backwards expansion:
        - Create a cluster for each leaf node
        - Expand by following foreign keys backwards, lowest-cost-first
        - Repeat until clusters intersect
      - Bidirectional expansion:
        - Also maintain a cluster for the root node
        - Expand the clusters in a prioritized way
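
    A simplified backwards expansion can be sketched as a lowest-cost-first (Dijkstra-style) expansion from each cluster of keyword matches, looking for nodes reached by every cluster. This version explores exhaustively and treats edges as undirected for simplicity; a real implementation follows foreign-key direction and prunes once the top-k trees are guaranteed. The example graph and costs are invented.

    ```python
    # Sketch of backwards expansion: grow a cluster from each set of
    # keyword-match nodes, lowest-cost-first, and track nodes reached by
    # all clusters. Such a node is a candidate connecting root; the tree
    # cost is the sum of the per-cluster distances to it.
    import heapq

    def backward_expand(graph, clusters):
        """graph: {node: {neighbor: cost}}; clusters: list of start-node sets.
        Returns (root, total_cost) for the cheapest connecting node, or None."""
        dist = [dict.fromkeys(c, 0.0) for c in clusters]
        pq = [(0.0, i, n) for i, c in enumerate(clusters) for n in c]
        heapq.heapify(pq)
        best = None
        while pq:
            d, i, n = heapq.heappop(pq)
            if d > dist[i].get(n, float("inf")):
                continue  # stale queue entry
            if all(n in dist[j] for j in range(len(clusters))):
                total = sum(dist[j][n] for j in range(len(clusters)))
                if best is None or total < best[1]:
                    best = (n, total)
            for m, c in graph.get(n, {}).items():
                if d + c < dist[i].get(m, float("inf")):
                    dist[i][m] = d + c
                    heapq.heappush(pq, (d + c, i, m))
        return best

    # Path: membrane - Term - Term2Ontology - Entry2Pub - Pubs - publication
    graph = {
        "membrane": {"Term": 0.1},
        "Term": {"membrane": 0.1, "Term2Ontology": 0.2},
        "Term2Ontology": {"Term": 0.2, "Entry2Pub": 0.4},
        "Entry2Pub": {"Term2Ontology": 0.4, "Pubs": 0.2},
        "Pubs": {"Entry2Pub": 0.2, "publication": 0.1},
        "publication": {"Pubs": 0.1},
    }
    root, cost = backward_expand(graph, [{"membrane"}, {"publication"}])
    ```

    On this chain every node lies on the single connecting path, so the total cost is the full path length (1.0) regardless of which node the clusters meet at.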

  • Querying the Data Graph (revisited)
    [Figure: expansion clusters growing from the keyword matches on "membrane", "publication", and "title" until they intersect]

  • Graph vs. Attribute-Based Scores
    - The previous strategy focuses on finding different subgraphs to identify the tuples to return
      - Assumes the costs are defined from edge weights
      - Uses prioritized exploration to find connections
    - But part of the score may be defined in terms of the values of specific attributes in the query:
      score = ... + weight1 * T1.attrib1 + weight2 * T2.attrib2 + ...
    - Assume we have an index of partial tuples in sorted order of each attribute, and a way of computing the remaining results, e.g., by joining the partial tuples with others

  • Threshold-Based Merging with Random Access
    - Given multiple sorted indices L1, ..., Lm over the same stream of tuples, try to return the k best-cost tuples with the fewest I/Os
    - Assume the cost function t(x1, x2, ..., xm) is monotone, i.e., t(x1, x2, ..., xm) ≥ t(x1', x2', ..., xm') whenever xi ≥ xi' for every i
    - Assume we can retrieve/compute the tuples associated with each xi
    [Figure: indices L1 on x1, L2 on x2, ..., Lm on xm feed the threshold-based merge, which outputs the k best-ranked results with cost = t(x1, x2, ..., xm)]

  • The Basic Thresholding Algorithm with Random Access (Sketch)
    - In parallel, read each of the indices Li
    - For each xi retrieved from Li, retrieve the corresponding tuple R
      - Obtain the full set of tuples R' containing R; this may involve computing a join query with R
      - Compute the score t(R') for each such tuple R'
      - If t(R') is one of the k best scores, remember R' and t(R') (break ties arbitrarily)
    - For each index Li, let xi* be the lowest value of xi read so far from that index
    - Set the threshold value τ = t(x1*, x2*, ..., xm*)
    - Once we have seen k objects whose score is at least τ, halt and return the k highest-scoring tuples that have been remembered
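
    Here is a runnable sketch of this algorithm over two sorted lists, using the restaurant data from the example on the following slides. Sorted access reads the lists in parallel; random access fetches the missing score component for each tuple read. The function and variable names are ours, not from the text.

    ```python
    # Thresholding with random access over two score components, on the
    # restaurant example: t(rating, price) = 0.5*rating + 0.5*(5 - price).
    data = {  # name -> (rating, price)
        "Alma de Cuba": (4.0, 3.0),
        "Moshulu": (4.0, 4.0),
        "Sotto Varalli": (3.5, 3.0),
        "McGillin's": (4.0, 2.0),
        "DiNardo's Seafood": (3.0, 2.0),
    }

    def t(rating, price_comp):
        # Monotone scoring function; price_comp is the (5 - price) component.
        return 0.5 * rating + 0.5 * price_comp

    # The two sorted indices: Lrating by rating, Lprice by (5 - price).
    L_rating = sorted(((r, n) for n, (r, p) in data.items()), reverse=True)
    L_price = sorted(((5 - p, n) for n, (r, p) in data.items()), reverse=True)

    def top_k(k):
        seen = {}  # name -> full score, filled via random access
        for (x1, n1), (x2, n2) in zip(L_rating, L_price):
            for name in (n1, n2):
                r, p = data[name]          # random access for the other field
                seen[name] = t(r, 5 - p)
            threshold = t(x1, x2)          # score of a hypothetical unseen tuple
            best = sorted(seen.items(), key=lambda kv: -kv[1])[:k]
            # Halt once k remembered tuples score at least the threshold:
            # no unread tuple can beat them, by monotonicity.
            if len(best) == k and best[-1][1] >= threshold:
                return best
        return sorted(seen.items(), key=lambda kv: -kv[1])[:k]

    result = top_k(1)
    ```

    For k = 1 the algorithm halts after the first round of sorted accesses: McGillin's scores 3.5, which already matches the threshold t(4, 3) = 3.5.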

  • An Example: Tables & Indices
    - Full data:

      name                location               rating  price
      Alma de Cuba        1523 Walnut St.        4       3
      Moshulu             401 S. Columbus Blvd.  4       4
      Sotto Varalli       231 S. Broad St.       3.5     3
      McGillin's          1310 Drury St.         4       2
      DiNardo's Seafood   312 Race St.           3       2

    - Lrating: index by rating

      rating  name
      4       Alma de Cuba
      4       Moshulu
      4       McGillin's
      3.5     Sotto Varalli
      3       DiNardo's Seafood

    - Lprice: index by (5 - price)

      (5-price)  name
      3          McGillin's
      3          DiNardo's Seafood
      2          Alma de Cuba
      2          Sotto Varalli
      1          Moshulu

  • Reading and Merging Results
    Cost formula: t(rating, price) = 0.5 * rating + 0.5 * (5 - price), over the Lrating and Lprice indices above
    - Step 1: read the top entry of each index; random access fills in the other component
      - Alma de Cuba (from Lrating): talma = 0.5*4 + 0.5*2 = 3
      - McGillin's (from Lprice): tmcgillins = 0.5*4 + 0.5*3 = 3.5
      - Threshold τ = 0.5*4 + 0.5*3 = 3.5; no tuples score above τ!
    - Step 2: read the next entry of each index
      - Moshulu (from Lrating): tmoshulu = 0.5*4 + 0.5*1 = 2.5
      - DiNardo's Seafood (from Lprice): tdinardos = 0.5*3 + 0.5*3 = 3
      - τ is still 0.5*4 + 0.5*3 = 3.5
    - Step 3: the next entries (McGillin's in Lrating, Alma de Cuba in Lprice) have already been read!
    - Step 4: read Sotto Varalli from both indices: tsotto = 0.5*3.5 + 0.5*2 = 2.75
      - Threshold τ = 0.5*3.5 + 0.5*2 = 2.75; now 3 tuples (McGillin's, Alma de Cuba, DiNardo's Seafood) score above τ, so for k ≤ 3 we can halt

  • Summary of Top-k Algorithms
    - Algorithms for producing top-k results seek to minimize the amount of computation and I/O
    - Graph-based methods start with leaf and root nodes and do a prioritized search
    - Threshold-based algorithms seek to minimize the amount of full computation that needs to happen
      - They require a way of accessing subresults by each score component, in decreasing order of that component
    - These are the main building blocks of keyword search over databases, and they are sometimes used in combination

  • Outline
    - Basic concepts
    - Algorithms for ranked results
    - Keyword search for data integration