QueRIE: Collaborative Database Exploration


DESCRIPTION

Interactive database exploration is a key task in information mining. However, users who lack SQL expertise or familiarity with the database schema face great difficulties in performing this task. To aid these users, we developed the QueRIE system for personalized query recommendations. QueRIE continuously monitors the user's querying behavior and finds matching patterns in the system's query log, in an attempt to identify previous users with similar information needs. Subsequently, QueRIE uses these "similar" users and their queries to recommend queries that the current user may find interesting. In this work we describe an instantiation of the QueRIE framework, where the active user's session is represented by a set of query fragments. The recorded fragments are used to identify similar query fragments in the previously recorded sessions, which are in turn assembled into potentially interesting queries for the active user. We show through experimentation that the proposed method generates meaningful recommendations on real-life traces from the SkyServer database, and propose a scalable design that enables the incremental update of similarities, making real-time computations on large amounts of data feasible. Finally, we compare this fragment-based instantiation with our previously proposed tuple-based instantiation, discussing the advantages and disadvantages of each approach.

TRANSCRIPT

  • QueRIE: Collaborative Database Exploration

  • Abstract: Relational database users employ a query interface (typically, a web-based client) to issue a series of SQL queries that aim to analyse the data and mine it for interesting information. First-time users may not have the necessary knowledge to know where to start their exploration. Other times, users may simply overlook queries that retrieve important information. In this work we describe a framework to assist non-expert users by providing personalized query recommendations.

  • Literature Survey: Web-based query interfaces are used by systems such as Genome (http://genome.ucsc.edu/) and SkyServer (http://cas.sdss.org/). Related work includes personalized recommendations for keyword or free-form query interfaces; a multidimensional query recommendation system that addresses the problem of generating recommendations for data warehouses and OLAP systems; and recommendations based on past queries using the most frequently appearing tuple values.

  • Literature Survey Continued... 1. Hive - A petabyte scale data warehouse using Hadoop (A. Thusoo et al.): Hadoop is a popular open-source map-reduce implementation used by companies like Yahoo and Facebook to store and process extremely large data sets on commodity hardware. However, the map-reduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse. Hive is an open-source data warehousing solution built on top of Hadoop. Hive supports queries expressed in a SQL-like declarative language, HiveQL, which are compiled into map-reduce jobs that are executed using Hadoop.

  • Literature Survey Continued... 2. QueRIE: A recommender system supporting interactive database exploration (S. Mittal, J. S. V. Varman): This demonstration presents QueRIE, a recommender system that supports interactive database exploration. This system aims at assisting non-expert users of scientific databases by generating personalized query recommendations. Drawing inspiration from Web recommender systems, QueRIE tracks the querying behavior of each user and identifies potentially interesting parts of the database related to the corresponding data analysis task by locating those database parts that were accessed by similar users in the past. It then generates and recommends the queries that cover those parts to the user.

  • Literature Survey Continued... 3. Amazon.com recommendations: Item-to-item collaborative filtering (G. Linden, B. Smith, and J. York): At Amazon.com, we use recommendation algorithms to personalize the online store for each customer. The store radically changes based on customer interests, showing programming titles to a software engineer and baby toys to a new mother. There are three common approaches to solving the recommendation problem: traditional collaborative filtering, cluster models, and search-based methods. Here, we compare these methods with our algorithm, which we call item-to-item collaborative filtering. Unlike traditional collaborative filtering, our algorithm's online computation scales independently of the number of customers and the number of items in the product catalog. Our algorithm produces recommendations in real time, scales to massive data sets, and generates high-quality recommendations.

  • Literature Survey Continued... 4. Personalized queries under a generalized preference model (G. Koutrika and Y. Ioannidis): In this paper, we present a preference model that combines expressivity and concision. In addition, we provide efficient algorithms for the selection of preferences related to a query, and an algorithm for the progressive generation of personalized results, which are ranked based on user interest. Several classes of ranking functions are provided for this purpose. We present results of experiments, both synthetic and with real users, (a) demonstrating the efficiency of our algorithms, (b) showing the benefits of query personalization, and (c) providing insight as to the appropriateness of the proposed ranking functions.

  • Proposed Solution: The basic idea behind this project is to use the user query log to analyze the session summary; based on the session summary, to generate the target tuples; to generate recommended queries retrieving the target tuples; and to re-rank the results based on clarity scores.

  • Architecture: re-ranking based on KL divergence (the architecture diagram is not preserved in this transcript).

  • Methodology / Implementation Details: fragment-based recommendation (session summaries, recommendation seed computation, generation of query recommendations); query processing (query relaxation, query parsing); and result re-ranking based on clarity scores using the KL divergence method.

  • Fragment Based Recommendations: Session Summary. The session summary vector S_i for a user i consists of all the query fragments of the user's past queries. Let Q_i represent the set of queries posed by user i during a session, and let F represent the set of all distinct query fragments recorded in the query logs. We assume that the vector S_Q represents a single query Q ∈ Q_i. For a given fragment φ ∈ F, we define S_Q[φ] as a binary variable that represents the presence or absence of φ in a query Q. Then S_i[φ] represents the importance of φ in session S_i.
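    As a concrete illustration, here is a minimal Python sketch (not the authors' code) of these vectors; representing fragments as plain strings and weighting S_i[φ] by the fraction of the session's queries containing φ are simplifying assumptions:

      from collections import Counter

      def query_vector(query_fragments, all_fragments):
          # S_Q[phi] = 1 if fragment phi appears in query Q, else 0.
          present = set(query_fragments)
          return {phi: int(phi in present) for phi in all_fragments}

      def session_summary(session_queries, all_fragments):
          # S_i[phi]: importance of phi in the session; here, the fraction of
          # the session's queries containing phi (a simplifying assumption).
          counts = Counter(phi for q in session_queries for phi in set(q))
          n = max(len(session_queries), 1)
          return {phi: counts[phi] / n for phi in all_fragments}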

  • Fragment Based Recommendations: Recommendation Seed Computation. To generate recommendations, the framework computes a predicted summary S_pred that captures the predicted degree of interest of the active user; S_pred serves as the seed for the generation of recommendations. The predicted summary combines the active user's own queries with the behavior of similar past users, controlled by a mixing factor α ∈ [0, 1] that determines the importance of the active user's queries. Using the session summaries of the past users and a vector similarity metric, we construct the (|F| × |F|) fragment-fragment matrix that contains all similarities sim(φ_a, φ_b), φ_a, φ_b ∈ F.
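    A minimal sketch of the fragment-fragment matrix, assuming cosine similarity over the fragment columns of a sessions × fragments matrix (the slides only say "a vector similarity metric", so cosine is an assumption):

      import numpy as np

      def fragment_similarity(session_matrix):
          # session_matrix: (num_sessions x |F|) array whose row s is the session
          # summary S_s; column f holds the weights fragment f received across sessions.
          # Returns the (|F| x |F|) matrix of cosine similarities sim(phi_a, phi_b).
          cols = session_matrix.astype(float)
          norms = np.linalg.norm(cols, axis=0)
          norms[norms == 0] = 1.0  # unused fragments: avoid division by zero
          unit = cols / norms
          return unit.T @ unit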

  • Predicted Summary Computation
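    The formula on this slide did not survive the transcript. Under the item-based collaborative-filtering reading described above (an assumption, not necessarily the authors' exact definition), the predicted weight of each fragment φ could take the form

      S_pred[φ] = α · S_active[φ] + (1 − α) · ( Σ_{φ'∈F} sim(φ, φ') · S_active[φ'] ) / ( Σ_{φ'∈F} sim(φ, φ') )

    where S_active is the active user's session summary and α is the mixing factor from the previous slide.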

  • Generation of Query Recommendations: Once the predicted summary S_pred has been computed, the top-n fragments F_n (i.e. the fragments that have received the highest weights) are selected. Then every past query Q ∈ ∪_i Q_i receives a rank Q^R with respect to the top-n fragments.
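    A minimal sketch of this ranking step; since the slide's ranking formula is not preserved, scoring each query by the summed predicted weight of the top-n fragments it contains is an assumption:

      def rank_queries(past_queries, s_pred, n=10):
          # past_queries: dict mapping query text -> set of fragments it contains.
          # s_pred: dict mapping fragment -> predicted weight.
          top_n = set(sorted(s_pred, key=s_pred.get, reverse=True)[:n])
          score = lambda frags: sum(s_pred[phi] for phi in frags & top_n)
          return sorted(past_queries, key=lambda q: score(past_queries[q]), reverse=True)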

  • Query Relaxation: Because of the plethora of slightly dissimilar queries in the query logs, we decided to relax the queries in order to increase the number of matches between them, and thus the probability of finding similarities between different user sessions.
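    The relaxation rules are not shown in the transcript; a common choice (an assumption here) is to replace literal constants with placeholders, so that queries differing only in constants collapse to the same form:

      import re

      def relax(sql):
          # Replace string and numeric literals with '?' and normalize whitespace,
          # so queries that differ only in constants become identical.
          sql = re.sub(r"'[^']*'", "?", sql)
          sql = re.sub(r"\b\d+(\.\d+)?\b", "?", sql)
          return re.sub(r"\s+", " ", sql).strip().lower()

      # relax("SELECT ra, dec FROM PhotoObj WHERE ra > 185.0") and
      # relax("SELECT ra, dec FROM PhotoObj WHERE ra > 210.5") now match.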

  • Query Parsing
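    Only the slide title survives here. As an illustration, a deliberately simplified parser (an assumption, not the authors' parser) could split a relaxed query into attribute, table, and predicate fragments:

      import re

      def extract_fragments(relaxed_sql):
          # Toy SQL "parsing": pull SELECT, FROM and WHERE pieces out as fragments.
          # A real implementation would use a proper SQL parser.
          fragments = set()
          m = re.search(r"select (.*?) from (.*?)(?: where (.*))?$", relaxed_sql)
          if not m:
              return fragments
          select, tables, where = m.groups()
          fragments.update("attr:" + a.strip() for a in select.split(","))
          fragments.update("table:" + t.strip() for t in tables.split(","))
          if where:
              fragments.update("pred:" + p.strip() for p in re.split(r"\band\b", where))
          return fragments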

  • Our Contribution... Kullback-Leibler divergence: KL divergence is a special case of a broader class of divergences called f-divergences. It was originally introduced by Solomon Kullback and Richard Leibler in 1951 as the directed divergence between two distributions, and it can be derived from a Bregman divergence. For discrete probability distributions P and Q, the KL divergence of Q from P is defined to be

    D_KL(P ‖ Q) = Σ_i P(i) log( P(i) / Q(i) )

  • In words, it is the expectation of the logarithmic difference between the probabilities P and Q, where the expectation is taken using the probabilities P. The KL divergence is only defined if P and Q both sum to 1 and if Q(i) = 0 implies P(i) = 0 for all i (absolute continuity). If the quantity 0 log 0 appears in the formula, it is interpreted as zero because lim_{x→0+} x log x = 0.

    For distributions P and Q of a continuous random variable, the KL divergence is defined to be the integral

    D_KL(P ‖ Q) = ∫ p(x) log( p(x) / q(x) ) dx,

    where p and q denote the densities of P and Q.
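    A small sketch of the discrete definition in Python (straightforward, though not from the slides):

      import math

      def kl_divergence(p, q):
          # D_KL(P || Q) for discrete distributions given as sequences of probabilities.
          # Terms with P(i) = 0 contribute zero; Q(i) = 0 with P(i) > 0 is undefined.
          total = 0.0
          for pi, qi in zip(p, q):
              if pi == 0:
                  continue
              if qi == 0:
                  raise ValueError("undefined: Q(i) = 0 but P(i) > 0")
              total += pi * math.log(pi / qi)
          return total

      # Example: kl_divergence([0.5, 0.5], [0.9, 0.1]) ≈ 0.511 nats.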

  • More generally, if P and Q are probability measures over a set X, and P is absolutely continuous with respect to Q, then the Kullback-Leibler divergence from P to Q is defined as

    D_KL(P ‖ Q) = ∫_X log( dP/dQ ) dP,

    where dP/dQ is the Radon-Nikodym derivative of P with respect to Q, and provided the expression on the right-hand side exists. Equivalently, this can be written as

    D_KL(P ‖ Q) = ∫_X ( dP/dQ ) log( dP/dQ ) dQ,

  • which we recognize as the entropy of P relative to Q. Continuing in this case, if μ is any measure on X for which the densities p = dP/dμ and q = dQ/dμ exist, then the KL divergence from P to Q is given as

    D_KL(P ‖ Q) = ∫_X p log( p / q ) dμ.

    The logarithms in these formulae are taken to base 2 if information is measured in units of bits, or to base e if information is measured in nats. Most formulas involving the KL divergence hold irrespective of the log base.

  • Our Contribution... Agglomerative Clustering Algorithm: The algorithm forms clusters in a bottom-up manner, as follows:

    1. Initially, put each article in its own cluster.
    2. Among all current clusters, pick the two clusters with the smallest distance.
    3. Replace these two clusters with a new cluster, formed by merging the two original ones.
    4. Repeat steps 2 and 3 until there is only one remaining cluster in the pool.

    Thus, the agglomerative clustering algorithm results in a binary cluster tree with single-article clusters as its leaf nodes and a root node containing all the articles. In the clustering algorithm, we use a distance measure based on log likelihood. For articles A and B, the distance is defined as

    d(A, B) = LL(A) + LL(B) − LL(A ∪ B),

    where LL(·) is the log likelihood defined below.

  • The log likelihood LL(X) of an article or cluster X is given by a unigram model:

    LL(X) = Σ_w c_X(w) log p_X(w)

    Here, c_X(w) and p_X(w) are the count and probability, respectively, of word w in cluster X, and N_X is the total number of words occurring in cluster X. Notice that this definition is equivalent to the weighted information loss after merging two articles:

    d(A, B) = N_{A∪B} H(A ∪ B) − N_A H(A) − N_B H(B),

    where H(X) = −Σ_w p_X(w) log p_X(w) is the unigram entropy of X, so that LL(X) = −N_X H(X).

    To avoid expensive log likelihood recomputation after each cluster merging step, we define the distance between two clusters with multiple articles as the maximum pairwise distance of the articles from the two clusters:

    d(C1, C2) = max_{A ∈ C1, B ∈ C2} d(A, B),

    where C1 and C2 are two clusters, and A and B are articles from C1 and C2, respectively.
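    Putting the definitions above together, a compact sketch of the bottom-up procedure (a direct transcription of steps 1-4, not the authors' implementation; for brevity it recomputes merged-cluster distances directly rather than using the max-pairwise shortcut):

      import math
      from collections import Counter

      def log_likelihood(counts):
          # LL(X) = sum_w c_X(w) * log p_X(w) under a unigram model.
          n = sum(counts.values())
          return sum(c * math.log(c / n) for c in counts.values())

      def distance(a, b):
          # d(A, B) = LL(A) + LL(B) - LL(A ∪ B)
          return log_likelihood(a) + log_likelihood(b) - log_likelihood(a + b)

      def agglomerate(articles):
          # articles: list of word-count Counters; greedily merge the closest
          # pair of clusters until a single root cluster remains.
          clusters = [Counter(a) for a in articles]
          while len(clusters) > 1:
              i, j = min(((i, j) for i in range(len(clusters))
                          for j in range(i + 1, len(clusters))),
                         key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]))
              merged = clusters[i] + clusters[j]
              clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
              clusters.append(merged)
          return clusters[0]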

  • Once a cluster tree is created, we must decide where to slice the tree to obtain disjoint partitions for building cluster-specific LMs. This is equivalent to choosing the total number of clusters, and there is a tradeoff involved in this choice. Clusters close to the leaves can maintain more specifics of the word distributions, whereas clusters close to the root of the tree yield LMs with more reliable estimates, because of the larger amount of data. We roughly optimized the number of clusters by evaluating the perplexity of the Hub4 development test set. We created sets of 1, 5, 10, 15, and 20 article clusters by slicing the cluster tree at different points. A backoff trigram model was built for each cluster and interpolated with a trigram model derived from all articles for smoothing, to compensate for the different amounts of training data per cluster. Then, the set of LMs that maximizes the log likelihood of the Hub4 development data was selected. Given a cluster model set LM = {LM_i}, the test set log likelihood was obtained as an approximation to the mixture-of-clusters model:

    LL(A) ≈ log Σ_i P(A | LM_i) · P(LM_i)

  • Here P(LM_i) and P(LM_i | A) are the prior and posterior cluster probabilities, respectively. In training, A is the reference transcript for one story from the Hub4 development data; during testing, A is the 1-best hypothesis for the story, as determined using the standard LM.
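    A numerically stable sketch of evaluating this mixture in log space (the per-cluster scorers are assumed interfaces, not a real LM toolkit's API):

      import math

      def mixture_log_likelihood(article, cluster_lms, priors):
          # log P(A) = log sum_i P(A | LM_i) * P(LM_i), computed via log-sum-exp.
          # cluster_lms: callables returning log P(article | LM_i) (assumed interface).
          logs = [lm(article) + math.log(p) for lm, p in zip(cluster_lms, priors)]
          m = max(logs)
          return m + math.log(sum(math.exp(x - m) for x in logs))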

  • Our Contribution... Re-ranking based on clarity score: Re-ranking algorithms can mainly be categorized into two approaches: pseudo-relevance feedback and graph-based re-ranking. The pseudo-relevance feedback approach treats the top results as relevant samples and then collects some samples that are assumed to be irrelevant. The graph-based re-ranking approach usually follows two assumptions: first, the disagreement between the initial ranking list and the refined ranking list should be small; second, the approach constructs a graph where the vertices are images or videos and the edges reflect their pairwise similarities.
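    A minimal sketch of a clarity-style score, assuming (as the earlier slides suggest) that it is the KL divergence between the language model of a query's result set and the collection language model; the unigram estimation and the floor smoothing are simplifications:

      import math
      from collections import Counter

      def clarity_score(result_text, collection_text, eps=1e-9):
          # Clarity ≈ D_KL(results || collection); a higher score suggests a
          # less ambiguous, more focused result set.
          res = Counter(result_text.split())
          col = Counter(collection_text.split())
          n_res, n_col = sum(res.values()), sum(col.values())
          score = 0.0
          for w, c in res.items():
              p = c / n_res                 # P(w | results)
              q = max(col[w] / n_col, eps)  # P(w | collection), floored
              score += p * math.log(p / q)
          return score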

  • Thank You.
