Apache Drill status


DESCRIPTION

Apache Drill status. Michael Hausenblas, Chief Data Engineer EMEA, MapR. HUG Munich, 2013-04-19. Kudos to http://cmx.io/. Workloads: batch processing (MapReduce); light-weight OLTP (HBase, Cassandra, etc.); stream processing (Storm, S4); search (Solr, Elasticsearch). PowerPoint PPT presentation.

TRANSCRIPT

Apache Drill status, 04/2013

Michael Hausenblas, Chief Data Engineer EMEA, MapR
HUG Munich, 2013-04-19

Kudos to http://cmx.io/

Workloads

Batch processing (MapReduce)

Light-weight OLTP (HBase, Cassandra, etc.)

Stream processing (Storm, S4)

Search (Solr, Elasticsearch)

Interactive, ad-hoc query and analysis (?)

Interactive Query at Scale

Low-latency approaches:

Hive: compile to MapReduce
Aster: external tables in MPP
Oracle/MySQL: export MapReduce results to RDBMS
Drill, Impala, CitusDB: real-time

Use Case I

Jane, a marketing analyst
Determine target segments
Data from different sources

Suppose a marketing analyst experimenting with ways to target user segments for the next campaign. She needs access to web logs stored in Hadoop, to user profiles stored in MongoDB, and to transaction data stored in a conventional database.

Use Case II

Logistics supplier status

Queries:
How many shipments from supplier X?
How many shipments in region Y?

Supplier table:

SUPPLIER_ID  NAME                 REGION
ACM          ACME Corp            US
GAL          GotALot Inc          US
BAP          Bits and Pieces Ltd  Europe
ZUP          Zu Pli               Asia

Shipment documents:

{ "shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01", "description": "first delivery today" },
{ "shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02", "description": "hope you enjoy it" }

Today's Solutions

RDBMS-focused:
ETL data from MongoDB and Hadoop
Query data using SQL
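Use Case II's two queries above can be sketched in plain Python against the sample supplier table and shipment documents from this slide. This is illustrative only; Drill would express this as SQL spanning the JSON documents and the relational table:

```python
# Answering the two Use Case II queries over the slide's sample data.
from collections import Counter

suppliers = {  # SUPPLIER_ID -> (NAME, REGION)
    "ACM": ("ACME Corp", "US"),
    "GAL": ("GotALot Inc", "US"),
    "BAP": ("Bits and Pieces Ltd", "Europe"),
    "ZUP": ("Zu Pli", "Asia"),
}

shipments = [  # the JSON documents, as Python dicts
    {"shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01",
     "description": "first delivery today"},
    {"shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02",
     "description": "hope you enjoy it"},
]

# "How many shipments from supplier X?"
by_supplier = Counter(s["supplier"] for s in shipments)

# "How many shipments in region Y?" -- join each shipment to its supplier's region
by_region = Counter(suppliers[s["supplier"]][1] for s in shipments)

print(by_supplier["ACM"])    # 1
print(by_region["Europe"])   # 1
```

The point of the use case is exactly this join across stores: the shipment documents live in a document database, the supplier table in an RDBMS.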

MapReduce-focused:
ETL from RDBMS and MongoDB
Use Hive, etc.

Requirements

Support for different data sources
Support for different query interfaces
Low-latency/real-time
Ad-hoc queries
Scalable, reliable

Re ad-hoc: you might not know ahead of time which queries you will want to make, and you may need to react to changing circumstances.

Google's Dremel*

*) http://research.google.com/pubs/pub36632.html

Two innovations: handling nested data in a columnar style (column-striped representation), and query push-down.

Apache Drill Overview

Inspired by Google's Dremel
Standard SQL 2003 support
Other query languages possible
Pluggable data sources
Support for nested data
Schema is optional
Community driven, open, 100s involved

High-level Architecture

Drillbits per node, to maximize data locality
Co-ordination, query planning, optimization, scheduling, and execution are distributed
By default, Drillbits hold all roles; modules can optionally be disabled
Any node/Drillbit can act as endpoint for a particular query

(Diagram: four nodes, each running a Drillbit alongside its storage process.)

High-level Architecture

Zookeeper maintains ephemeral cluster membership information only
A small distributed cache (embedded Hazelcast) maintains information about individual queue depth, cached query plans, metadata, locality information, etc.

(Diagram: Curator/Zk plus a distributed cache shared across the Drillbit nodes.)

High-level Architecture

The originating Drillbit acts as foreman: it manages all execution for its particular query, scheduling based on priority, queue depth, and locality information
Drillbit data communication is streaming and avoids any serialization/deserialization (SerDe)

(Diagram: the originating Drillbit is the root of the multi-level serving tree for its query.)

Principled Query Execution

Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution

Source queries (SQL 2003, DrQL, MongoQL, DSLs) enter via the parser API; data sources plug in via the scanner API; the optimizer takes the topology into account.

Logical plan fragment:

query: [
  { @id: "log", op: "sequence", do: [
    { op: "scan", source: "logs" },
    { op: "filter", condition: "x > 3" },
    ...

Source query: a human-written (e.g. DSL) or tool-written (e.g. SQL/ANSI-compliant) query.
The source query is parsed and transformed to produce the logical plan.
Logical plan: a dataflow of what should logically be done. Typically the logical plan lives in memory in the form of Java objects, but it also has a textual form.
The logical plan is then transformed and optimized into the physical plan. The optimizer introduces parallel computation, taking topology into account, and handles columnar data to improve processing speed.
The physical plan represents the actual structure of computation as it is done by the system: how physical and exchange operators should be applied, assignment to particular nodes and cores, and the actual query execution per node.
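As a toy illustration of the pipeline, a scan/filter plan like the fragment above can be interpreted in a few lines of Python. This is a sketch of the idea only, not Drill's reference interpreter, and the eval-based condition check is a stand-in for Drill's real expression language:

```python
# Minimal sketch of executing a "sequence" logical plan: "scan" produces
# records from a named source, "filter" keeps records matching a condition.

def execute(plan, sources):
    records = []
    for op in plan["do"]:
        if op["op"] == "scan":
            records = list(sources[op["source"]])
        elif op["op"] == "filter":
            # Toy evaluator: treat the record's fields as local variables.
            records = [r for r in records
                       if eval(op["condition"], {}, dict(r))]
    return records

sources = {"logs": [{"x": 1}, {"x": 5}, {"x": 7}]}
plan = {"op": "sequence",
        "do": [{"op": "scan", "source": "logs"},
               {"op": "filter", "condition": "x > 3"}]}

print(execute(plan, sources))  # [{'x': 5}, {'x': 7}]
```

In Drill the same textual plan would be optimized into a physical plan and executed in parallel across Drillbits; here both steps are collapsed into a single local loop.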

Drillbit Modules

(Diagram: RPC endpoint; parsers for SQL, HiveQL, and Pig; logical plan; optimizer; physical plan; foreman; scheduler; operators; distributed cache; storage engine interface with DFS, HBase, and Mongo engines.)

Key Features

Full ANSI SQL 2003
Nested data
Optional schema
Extensibility points

Full ANSI SQL 2003

SQL-like is often not enough
Integration with existing tools: Datameer, Tableau, Excel, SAP Crystal Reports
Use standard ODBC/JDBC drivers

Nested Data

Nested data is becoming prevalent: JSON/BSON, XML, ProtoBuf, Avro
Some data sources support it natively (MongoDB, etc.)
Flattening nested data is error-prone
Extension to ANSI SQL 2003

Optional Schema

Many data sources don't have rigid schemas
Schemas change rapidly
Different schema per record (e.g. HBase)
Supports queries against unknown schemas
User can define a schema, or it can be discovered
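A minimal sketch of schema-by-discovery (a hypothetical helper, not a Drill API): scan heterogeneous records and note which fields and value types actually occur, so a query can be planned even when no schema was declared:

```python
# Discover a schema from records that need not share one: union the
# observed field names, recording every value type seen for each field.
from collections import defaultdict

def discover_schema(records):
    schema = defaultdict(set)
    for rec in records:
        for field, value in rec.items():
            schema[field].add(type(value).__name__)
    return dict(schema)

records = [
    {"id": "0001", "type": "donut", "ppu": 0.55},
    {"id": "0002", "type": "donut"},   # field missing: schema is optional
    {"id": 3, "fresh": True},          # different type, extra field
]
print(discover_schema(records))
# e.g. {'id': {'str', 'int'}, 'type': {'str'}, 'ppu': {'float'}, 'fresh': {'bool'}}
```

This mirrors the bullet points above: fields can be absent, types can vary per record, and the discovered schema is a property of the data, not a precondition for querying it.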

Extensibility Points

Source query: parser API
Custom operators, UDFs: logical plan
Serving tree, CF, topology: physical plan/optimizer
Data sources & formats: scanner API

...and Hadoop?

HDFS can be a data source

Complementary use cases*

Use Apache Drill for:
Finding records with a specified condition
Aggregation under dynamic conditions

Use MapReduce for:
Data mining with multiple iterations
ETL

*) https://cloud.google.com/files/BigQueryTechnicalWP.pdf

Relation of Drill to Hadoop

Hadoop = HDFS + MapReduce

Drill for:
Finding particular records with specified conditions, e.g. finding the request logs for a specified account ID.
Quick aggregation of statistics under dynamically-changing conditions, e.g. summarizing the previous night's request traffic volume for a web application and drawing a graph from it.
Trial-and-error data analysis, e.g. identifying the cause of trouble and aggregating values by various conditions, including by hour, by day, etc.

MapReduce for:
Executing complex data mining on Big Data that requires multiple iterations and paths of data processing with programmed algorithms.
Executing large join operations across huge datasets.
Exporting large amounts of data after processing.

Example

https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo

Data source: donuts.json

{ "id": "0001",
  "type": "donut",
  "ppu": 0.55,
  "batters": {
    "batter": [
      { "id": "1001", "type": "Regular" },
      { "id": "1002", "type": "Chocolate" },
      ...

Logical plan: simple_plan.json

query: [
  { op: "sequence", do: [
    { op: "scan", ref: "donuts", source: "local-logs", selection: { data: "activity" } },
    { op: "filter", expr: "donuts.ppu < 2.00" },
    ...

Result: out.json

{ "sales": 700.0,  "typeCount": 1, "quantity": 700, "ppu": 1.0 }
{ "sales": 109.71, "typeCount": 2, "quantity": 159, "ppu": 0.69 }
{ "sales": 184.25, "typeCount": 2, "quantity": 335, "ppu": 0.55 }
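The full simple_plan.json is elided above, but the shape of out.json suggests a group-by-ppu aggregation. The following is a hypothetical reconstruction in plain Python; the input fields quantity and type are assumptions for illustration, not taken from the actual demo file:

```python
# Group donut records by ppu and compute per-group sales, distinct type
# count, and total quantity -- matching the shape of out.json above.
from collections import defaultdict

def aggregate_by_ppu(donuts):
    groups = defaultdict(list)
    for d in donuts:
        groups[d["ppu"]].append(d)
    return [{"sales": round(sum(d["quantity"] * ppu for d in ds), 2),
             "typeCount": len({d["type"] for d in ds}),
             "quantity": sum(d["quantity"] for d in ds),
             "ppu": ppu}
            for ppu, ds in groups.items()]

donuts = [
    {"type": "donut", "ppu": 0.55, "quantity": 300},
    {"type": "filled", "ppu": 0.55, "quantity": 35},
]
print(aggregate_by_ppu(donuts))
# [{'sales': 184.25, 'typeCount': 2, 'quantity': 335, 'ppu': 0.55}]
```

With these assumed inputs the single output group matches the third line of out.json, which is what motivates the group-by-ppu reading.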

Status

Heavy development by multiple organizations

Available:
Logical plan (ADSP)
Reference interpreter
Basic SQL parser
Basic demo
Basic HBase back-end

Status, April 2013

In progress:
Extend SQL syntax
Physical plan
In-memory compressed data interfaces
Distributed execution

Contributing

Learn where and how to contribute:
https://cwiki.apache.org/confluence/display/DRILL/Contributing

Jira, Git, Apache build and test tools

Preparing for dependencies:
Hazelcast
Netflix Curator

Contributing

General contributions appreciated:

Supersonic (?)
Test data & test queries
Use case scenarios (textual descriptions/SQL queries)
Documentation

Contributing

Dremel-inspired columnar formats: Twitter's Parquet, Hive's ORC file

Integration with Hive metastore (?)

DRILL-13 Storage Engine: Define Java Interface

DRILL-15 Build HBase storage engine implementation

Contributing

DRILL-48 RPC interface for query submission and physical plan execution

DRILL-53 Setup cluster configuration and membership mgmt system

Further schedule

Alpha: Q2
Beta: Q3

Kudos to:
Julian Hyde, Pentaho
Lisen Mu
Tim Chen, Microsoft
Chris Merrick, RJMetrics
David Alves, UT Austin
Sree Vaadi, SSS/NGData
Jacques Nadeau, MapR
Ted Dunning, MapR

Engage!

Follow @ApacheDrill on Twitter

Sign up for the mailing lists (user | dev): http://incubator.apache.org/drill/mailing-lists.html

Standing G+ hangouts every Tuesday at 18:00 CET

Keep an eye on http://drill-user.org/