A Query Model for Ad Hoc Queries using a Scanning Architecture

Erik Freed, Flurry/Yahoo (erikfreed@yahoo-inc.com)
Brian Anderson, Flurry/Yahoo (briananderson@yahoo-inc.com)

Abstract

Systems like Hadoop, HBase, and Hive have allowed the world to take huge strides in managing and analyzing large amounts of data. Products like Flurry Analytics make efficient use of large amounts of hardware, using these tools to build statistics for hundreds of thousands of applications. However, these tools require the end user to first set up the relevant analytics queries and then wait days for the results. If the results prompt new questions, or the original query is not quite right, the user must rerun and wait again.

We present the Burst system, developed at Flurry to support low-latency, single-pass queries over very large and complex mobile application event streams. We have created a data schema and query model that can answer very complex ad hoc queries and is highly parallelizable while maintaining low latency. We implement these scans to be time- and space-efficient using the advanced disk scanning techniques provided by the underlying operating system.

1. Introduction

Flurry gathers mobile application analytics for over 500,000 applications on hundreds of millions of devices. We have accumulated petabytes of metrics across a 2,000-node HBase cluster, and 1,000-node Hadoop jobs run throughout the day to calculate data for the graphs and displays that appear in the Flurry Developer Portal. These metrics and graphs must be specified in advance by the user of the portal: if the developer wants to explore the data with new graphs or metrics, they must change the definitions and wait days for the next job run.

Many of these metrics are based on a series of time-ordered dependent events. For example, funnels define a cohort event [COHORTS] that partitions the application's users into groups; funnel events then track significant events performed by users in that cohort. These metrics are not traditional associative and commutative aggregation functions but instead require finite state machine functions to calculate.

We developed Explorer as an ad hoc query product that lets the developer interactively explore their application metrics and get graphs and charts in sub-second time. This allows the user to do iterative deep dives into their application statistics in order to increase the retention and revenue of their application. The Burst system is the backend storage and query system that supports the Explorer product.

This paper discusses how Burst has chosen to focus on a scan-only architecture for processing very large amounts of data, and covers the Burst data and query execution models. The underlying architecture and implementation of Burst are covered in more detail elsewhere [BURST].

2. Background

An ad hoc query is one where the execution engine cannot predict what form the question will take. In the world of mobile analytics, developers are constantly asking iterative questions about their users and their usage of an application so they can improve adoption, retention, and ultimately revenue. The answer from one query drives the next, so the turnaround for results must be sub-second. As the developer sifts through the time-ordered record of events performed by a user in one or more of their applications, they perform multidimensional aggregates as well as temporal and causal analysis in the form of cohort and funnel analysis [COHORTS]. Flurry provides analytics as a service to hundreds of thousands of applications, so the Burst system supports hundreds of simultaneous queries by developers analyzing their event data.


Ad hoc queries over medium-sized datasets have traditionally been executed using the relational tuple model [CODD] with the declarative query language SQL, implemented on a medium to large single machine. As the size of the dataset grows, processing is distributed across multiple machines to increase parallelism.

To make such an ad hoc query system perform well, the backend devises complex query plans to reduce table scans, find the smallest join cardinalities, and utilize data indexes. Data is organized on disk for easy scans and index lookup. The OS helps with low-level APIs that load the contents of the disk directly into memory, and with pre-caching that anticipates reads. For distributed query engines, the cost of shuffling data between nodes must also be included when planning the query [DATE][HIVE].

However, as the size of the data or the complexity of the query grows, query performance degrades until usable interactive ad hoc queries are no longer possible. The database expert is called in to denormalize tables and precompute results, restoring performance at the expense of query flexibility. Even with this effort, the causal reasoning required for cohort analysis can become too challenging for the traditional index and join models of databases. At this point, the developer may have to look to alternative implementation strategies to regain performance, such as column-oriented storage structures [DRUID][REDSHIFT], time series databases [TSDB], or data warehouses [HIVE], at the cost of giving up query flexibility or complexity, especially in the area of cohort analysis.

3. Burst

In developing the Explorer product, we decided to assume that a full scan of the data is always required, and focused on building the Burst engine to do this as fast and efficiently as possible. Once we assumed we were always going to scan everything, we simplified the execution model so that everything we need can be accomplished in a single scan. We restricted the use of the join, as well as its more generalized cousin the sort, to create a single-pass execution model that scales linearly: as dataset sizes grow, we can bring more hardware online to maintain low-latency query performance.

3.1. The Data Model

The Flurry SDK continuously records events produced by an application. Ultimately, all of this data is gathered into a user object, with references to and collections of child objects chronicling the events the user performed.

Burst supports metadata-defined data schemas that are generalized from the type of analytics data the Flurry SDK gathers from mobile applications. The data model has the following characteristics:

- A dataset is a homogeneous collection of items. Each item is a rooted hierarchical object consisting of scalar value fields, vector value fields, scalar reference fields, vector reference fields, and value-to-value map fields.
- Values can be one of the following types: byte, short, integer, long, double, boolean, and string.
- Vectors are collections of scalars or references to items that can optionally be ordered on some key.


- References point to other items, but an item can only be referenced once: either as the member of a dataset, in a scalar reference field, or in a vector reference field. Items therefore form a tree graph.
- An item is versioned so that the data schema can evolve over time. A dataset, as well as scalar and vector reference fields, can contain items at different versions.
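To make the model concrete, here is a minimal sketch of such an item hierarchy as Scala case classes. The names (User, Session, Event) and their fields are illustrative assumptions modeled on the Flurry SDK data described above; Burst itself defines schemas through metadata rather than hard-coded classes.

// Illustrative sketch only: a Flurry-style rooted item hierarchy.
case class Event(
  eventType: String,                 // scalar value field
  start: Long,                       // scalar value field (epoch millis)
  duration: Long,                    // scalar value field
  parameters: Map[String, String])   // value-to-value map field

case class Session(
  start: Long,
  duration: Long,
  events: Seq[Event])                // vector reference field, ordered on start

case class User(                     // the root item of the dataset
  id: String,
  locale: String,
  sessions: Seq[Session])            // vector reference field

Because each item is referenced in exactly one place, a dataset of User items forms a forest of trees, which is what makes the single-pass traversal described below possible.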

3.2. Query Model

Use cases for mobile analytics are inherently unbounded, personalized, and constantly evolving. Queries range from simple counting aggregations and multidimensional aggregations up to complex time-sequence conditionals. Some examples include:

- Count the number of users by day with sessions where they spent more than $5 on in-app purchases.
- Count the number of users by day with sessions where they made a bet and then made an in-app purchase to buy more gold before betting again. (A filter like this over a time-ordered series of events would require one or more sub-queries in a relational system, and is typically unimplementable in time series databases.)

The low-level Burst query execution engine scans the dataset and produces a collection of tuples for each item. A tuple consists of a number of fields:

- An aggregation field is one of the following functions: count, sum, topK (an approximated single-pass topK), max, and min. These functions are associative and commutative, so they can be applied in any order.
- A dimension field is a scalar value that can partition data using group functions: enum, splits, month, day, year, and a menagerie of other time-partitioning functions.

As a tuple is created during the scan, it is combined with any existing tuple whose dimension fields all match. There is no required ordering of the scan of the items in a dataset; given the associativity and commutativity restrictions, the scan can be done in any order and split into any number of arbitrary streams that are merged together into a final set of tuples. This result set of tuples is the query result.
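As a sketch of this merge rule (an assumption-level illustration, not Burst's implementation; the names mergeTuple and mergeStreams are hypothetical), each tuple can be treated as a map entry keyed by its dimension values, here with sum-style aggregates:

type Dimensions = Vector[Any]   // e.g. (day, eventType)
type Aggregates = Vector[Long]  // e.g. (count, sum); sum-style ops assumed

// Fold one tuple into a result set, combining it with any existing tuple
// whose dimension fields all match.
def mergeTuple(results: Map[Dimensions, Aggregates],
               dims: Dimensions, aggs: Aggregates): Map[Dimensions, Aggregates] =
  results.updated(dims, results.get(dims) match {
    case Some(existing) => existing.zip(aggs).map { case (a, b) => a + b }
    case None           => aggs
  })

// Because the operations are associative and commutative, partial result
// sets from any number of parallel scan streams merge in any order.
def mergeStreams(a: Map[Dimensions, Aggregates],
                 b: Map[Dimensions, Aggregates]): Map[Dimensions, Aggregates] =
  b.foldLeft(a) { case (acc, (dims, aggs)) => mergeTuple(acc, dims, aggs) }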

Each item is evaluated by traversing it in a depth-first manner, starting at the root item and visiting each referenced item in a scalar or vector reference field. The evaluation is done by a single thread and is guaranteed to traverse the items in DFS order, following the item relationships defined by the scalar and vector reference fields of an item. During the traversal, tuple fields are assigned to build partial results for the item. However, while the traversal can be short-circuited, it cannot backtrack to reexamine items already visited.
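A minimal sketch of this forward-only evaluation over the illustrative case classes from section 3.1 (the closure-passing style and the evaluate name are assumptions for illustration):

// Single-threaded DFS: root first, then each child in order, with no
// backtracking to items already visited.
def evaluate(user: User,
             onSession: Session => Unit,
             onEvent: (Session, Event) => Unit): Unit =
  for (session <- user.sessions) {
    onSession(session)
    for (event <- session.events) onEvent(session, event)
  }

// Usage: total session time for one user, in the spirit of the GIST
// example in section 3.3.
val user = User("u1", "en_US", Seq(Session(start = 0L, duration = 42L, events = Seq.empty)))
var totalSessionTime = 0L
evaluate(user,
  onSession = s => totalSessionTime += s.duration,
  onEvent = (_, _) => ())   // totalSessionTime is now 42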

During evaluation, the query has two temporary data structures to help it keep state information about the item:

- A global register can store a single scalar value. It can be set and/or referenced at any point in the traversal.
- A route is defined by a number of steps as well as the valid transitions from one step to another (with optional time constraints). The route has at least one starting step with no transitions into it, and one terminating step with exit transitions; the starting and terminating steps may even be the same step. The route is a finite state machine that logs any valid step transition along with a timestamp. One can assert a step occurrence to the route at any time, but the route object will only record the assertion if a transition from the current route state is allowed. As the route records step transitions, it cuts the log into paths that always begin with a starting step and finish with a terminating step. At any time in the traversal, the route can be queried for any currently recorded steps or completed paths.

An evaluation can use multiple instances of each type, valid within the scope of a single item evaluation. Together they allow the engine to record enough state to evaluate complex event queries in a single pass.
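As a sketch of route semantics (our reading of the description above, with time constraints omitted and a single starting and terminating step assumed), steps can be modeled as integers and transitions as a map:

// Sketch of a route: records a step assertion only when a transition from
// the current state allows it, and cuts the log into completed paths.
// Assumption: asserting the starting step always begins a new candidate path.
class Route(transitions: Map[Int, Set[Int]], start: Int, terminal: Int) {
  private var current: Option[Int] = None
  private var log = Vector.empty[(Int, Long)]                // (step, timestamp)
  private var completed = Vector.empty[Vector[(Int, Long)]]  // finished paths

  def assertStep(step: Int, timestamp: Long): Boolean = {
    val allowed =
      if (step == start) true
      else current.exists(c => transitions.getOrElse(c, Set.empty).contains(step))
    if (allowed) {
      if (step == start) log = Vector.empty                  // restart the path
      log = log :+ (step -> timestamp)
      if (step == terminal) { completed = completed :+ log; log = Vector.empty; current = None }
      else current = Some(step)
    }
    allowed
  }

  def paths: Vector[Vector[(Int, Long)]] = completed         // completed paths so far
}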

Evaluation is defined so that it is always advancing through an item. The underlying storage layout of items in Burst takes advantage of this property to make evaluation very fast and efficient [BURST]. Some important sources of speedup are:

1. memory cache line prefetching in high-end CPUs
2. disk head read-ahead in the disk controller
3. disk-to-memory prefetching in the disk controller and OS
4. single-copy memory mapping support in the OS
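Item 4 can be sketched on the JVM with java.nio's memory mapping, which backs a buffer directly by the OS page cache: a strictly forward scan avoids a user-space copy and keeps kernel read-ahead engaged. This is a simplified sketch under stated assumptions, not Burst's actual I/O layer; files over 2 GB would need multiple mappings.

import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption}

// Scan a data file strictly forward through a read-only memory mapping.
def scanFile(path: String)(visit: Byte => Unit): Unit = {
  val channel = FileChannel.open(Paths.get(path), StandardOpenOption.READ)
  try {
    val buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size)
    var i = 0
    while (i < buffer.limit) {   // forward-only access, like a Burst item scan
      visit(buffer.get(i))
      i += 1
    }
  } finally channel.close()
}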

3.3. Imperative Query Plans

The Burst execution engine scans a dataset using a GIST plan. A GIST plan is an imperative execution plan that specifies the schema of the result tuples and the actions to perform during an item evaluation. To give the reader a sense of a GIST plan, the following example calculates the total length of all sessions for all users in Flurry's usual metric event schema:

Gist(
  Over(1L, 512, "America/Los_Angeles"),
  NoOptions,
  Declare(
    Gather("user", NoDimensions,
      Aggregations(
        Sum[Long]("total session time")
      )
    )
  ),
  VisitReferenceVector("user.sessions",
    pre = NoPre,
    post = Post { s ⇒
      if (!s.fldIsNull("user.sessions", "duration"))
        s.aggLongWr("total session time",
          s.fldScalarLong("user.sessions", "duration"))
    }
  )
)

This doesn't come close to showing the full power of GIST, but a more meaningful analytics query would be quite large. Notice that the plan consists of a number of gather clauses, each with a path identifying a location in the data schema plus a number of optional aggregation and dimension fields. There is always one root gather at the top of the schema, as well as optional nested gathers. A gather defines a join point where results are built from the partial results of its children. A gather also has a number of visit declarations, each with a path and a closure. The closure is executed when the traversal visits any item residing at that point in the schema (this will probably remind many readers of the execution model of the SAX XML parser). A closure can record an aggregation or dimension defined in the enclosing gather, or record information in a global register. A closure at some node only has visibility to the current visited item instance and all parent instances on the path; it can never access child items directly.

At every gather in the execution, the executor collects all the partial rows created by its visit closures as well as the partial rows from child gathers. When the gather finishes, the partial rows are cross-joined together to produce new partial tuples composed of all the fields. Partial results are created at the visits and then composed into larger results at the gathers; at the root, we end up with the total set of complete tuples for the item.

A GIST plan can also use a route structure in a gather to help evaluate causal and temporal data filters. A visit closure will assert the occurrence of a step based on some test of the current item's field values, but the route will only record the assertion if a transition from the current route state is allowed. A closure can also test for the occurrence of previous successful steps. For example, in order to find users that made a bet and then purchased something in a single session, the GIST plan declares a route with a simple linear progression of three steps. At a session item visit, the closure asserts the first step. At the event item level, the closure asserts step two if it sees a bet-type event, or step three if it sees an in-app purchase. Finally, when the root item is visited, if the route has at least one completed path, then this user satisfies the filter. Some of the cohort analysis queries in the Explorer product use two interacting funnels for their filtering.

3.4. Declarative Queries

Burst provides a high-level declarative query language, SILQ, for users and applications of the Burst system. SILQ queries insulate the user from much of the complexity inherent in writing GIST plans. The SILQ compiler takes SILQ text and builds a GIST plan from it. The SILQ equivalent of the previous GIST example, totaling the durations of all sessions for an application, is shown below (in fact, the previous GIST plan was generated directly by the SILQ compiler from this example):

over ("quo", 1) aggregate ( "total session time" as sum(user.sessions.duration); )

SILQ supports all of the features of our scan query engine that GIST does, including funnels, but in a declarative form. The compiler builds the GIST plan by:

- choosing the best gather nesting and field layout
- generating visit closures for condition tests
- using registers to support complex tests of collection conditions such as "where sum(events.duration) > 1"

4. Conclusions

The Burst system supports the constantly changing, complex, causality-based queries needed for modern application analytics. We use a unique single-pass scan model to satisfy these requirements, remain low-latency, and scale with the amount of data. The current Burst system is the backend for Flurry's Explorer product, which is running in production. The system performs sub-second ad hoc queries over 60 GB datasets, including complex funnel and user retention calculations that take hours in Flurry's traditional Hadoop infrastructure. The first major release of the backend query engine engineered to fully support this type of exploration was developed in the Flurry Yahoo Mobile Analytics group and released to a beta customer group in February 2015. The next major release, now being implemented, will improve the SILQ and GIST languages with new features as well as usability improvements.



References

[BLINK] AMPLab, "Queries with Bounded Errors and Bounded Response Times on Very Large Data": http://blinkdb.org/
[BURST] Erik Freed and Brian Anderson, "A General Purpose Extensible Scanning Query Architecture for Ad Hoc Analytics".
[CODD] E. F. Codd, "A Relational Model of Data for Large Shared Data Banks", Communications of the ACM, 1970: http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf
[COHORTS] Wikipedia, "Cohort Analysis": https://en.wikipedia.org/wiki/Cohort_analysis
[DATE] C. J. Date, An Introduction to Database Systems, 7th edition, Addison-Wesley, 2000.
[DREMEL] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis, "Dremel: Interactive Analysis of Web-Scale Datasets", Proc. of the 36th Int'l Conf. on Very Large Data Bases: http://research.google.com/pubs/pub36632.html
[DRILL] MapR, "Industry's First Schema-Free SQL Engine for Big Data": https://www.mapr.com/products/apache-drill
[DRUID] Druid, "Open Source Data Store for Interactive Analytics at Scale": http://druid.io/
[HIVE] Apache Software Foundation, "Hive: A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis": https://hive.apache.org/
[REDSHIFT] Amazon Web Services, "Amazon Redshift": http://aws.amazon.com/redshift/
[TSDB] OpenTSDB, "The Scalable Time Series Database": http://opentsdb.net/
