apache drill use cases & roadmap - hadoop meetup nyc - september 28, 2015

© 2015 Dremio Corporation

Drill Use Cases & Roadmap

Hadoop Meetup NYC, September 28, 2015


Agenda
• Drill Background
• Common Use Cases
• Roadmap


About Dremio
• Currently in Stealth
• Founded in June 2015
• Building on open source technologies including Drill, Parquet, Spark

Jacques Nadeau, Founder & CTO
• Apache Drill PMC Chair
• Recognized SQL & NoSQL expert
• Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT)

Tomer Shiran, Founder & CEO
• Apache Drill Founder
• MapR (VP Product); Microsoft; IBM Research
• Carnegie Mellon, Technion

Julien Le Dem, Architect
• Apache Parquet Founder
• Apache Pig PMC Member
• Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect)

Top Silicon Valley VCs


Apache Drill
• Apache Foundation Project
• More than 40 contributors from 10 companies
• Open source SQL query engine for relational & non-relational datastores
• Designed for Interactive Queries
• Scales from one laptop to 1000s of servers


Drill Integrates With What You Have

Any Datastore (Relational or Not)
• File systems
– Traditional: Local files and NAS
– Hadoop: HDFS and MapR-FS
– Cloud storage: Amazon S3, Google Cloud Storage, Azure Blob Storage
• NoSQL databases
– MongoDB
– HBase
– MapR-DB
– Hive
• And you can add new datastores

Any Client
• Multiple interfaces: ODBC, JDBC, REST, C, Java
• BI tools
– Tableau, Qlik, MicroStrategy, TIBCO Spotfire
• Excel
• Command line (Drill shell)
• Web and mobile apps via REST API
– Many JSON-powered chart libraries (see D3.js)
• SAS, R, …


Drill is Built for Modern Analytical Organizations

Execute Fast
• Standard SQL
• Read data fast
• Leverage columnar encodings and execution
• Execute operations quickly
• Scale out, not up

Iterate Fast
• Work without prep
• Decentralize data management
• In-situ security
• Explore + query
• Access multiple sources
• Avoid the ETL rinse cycle


JSON Model, Columnar Speed

[Diagram: data formats (JSON, BSON, Mongo, HBase, NoSQL, Parquet, Avro, CSV, TSV) placed along two axes: schema-less vs. fixed schema, and flat vs. complex.]

RDBMS/SQL-on-Hadoop table:
Name Gender Age
Michael M 6
Jennifer F 3

Apache Drill table:
{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }


Drill Provides the Best of Both Worlds

Acts Like a Database
• ANSI SQL: SELECT, FROM, WHERE, JOIN, HAVING, ORDER BY, WITH, CTAS, ALL, EXISTS, ANY, IN, SOME
• VarChar, Int, BigInt, Decimal, VarBinary, Timestamp, Float, Double, etc.
• Subqueries, scalar subqueries, partition pruning, CTE
• Data warehouse offload
• Tableau, ODBC, JDBC
• TPC-H & TPC-DS-like workloads
• Supports Hive SerDes
• Supports Hive UDFs
• Supports Hive Metastore

Even When Your Data Doesn’t
• Path based queries and wildcards
– select * from /my/logs/
– select * from /revenue/*/q2
• Modern data types
– Map, Array, Any
• Complex Functions and Relational Operators
– FLATTEN, kvgen, convert_from, convert_to, repeated_count, etc.
• JSON Sensor analytics
• Complex data analysis
• Alternative DSLs
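The path-based queries above can be submitted from any client that speaks HTTP. The following is a minimal sketch, assuming a Drillbit reachable on localhost at Drill's default web port (8047) and a file-system storage plugin named `dfs`; the slide's paths are shorthand, so the plugin prefix and the backtick quoting are assumptions added here. The helper names `path_query`, `payload` and `build_request` are hypothetical. Nothing is sent over the network; the code only builds the requests.

```python
import json
import urllib.request

# Assumption: a local Drillbit with its web/REST endpoint on the default port.
DRILL_URL = "http://localhost:8047/query.json"


def path_query(path, plugin="dfs"):
    """Build a SELECT * over a directory or wildcard path.

    The plugin name is an assumption; backticks quote the path so that
    slashes and '*' are treated as part of the table name.
    """
    if "`" in path:
        raise ValueError("path must not contain a backtick")
    return f"SELECT * FROM {plugin}.`{path}`"


def payload(sql):
    """JSON body expected by Drill's REST query endpoint."""
    return {"queryType": "SQL", "query": sql}


def build_request(sql, url=DRILL_URL):
    """Prepare (but do not send) the POST request for one query."""
    body = json.dumps(payload(sql)).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The two path queries shown on the slide.
queries = [path_query("/my/logs/"), path_query("/revenue/*/q2")]
requests_to_send = [build_request(q) for q in queries]

# Against a live Drillbit, a request would be sent with:
#   urllib.request.urlopen(requests_to_send[0])
```

Keeping query construction separate from sending makes it easy to inspect the SQL before pointing it at a real cluster.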


Data Lake, More Like Data Maelstrom

[Diagram: data spread across clustered servers (an HBase & HDFS cluster, an HDFS cluster, a MongoDB cluster, a Cassandra cluster) and desktops (Windows desktop, Mac desktop).]


Run Drillbits Wherever; Whatever Your Data

[Diagram: the same environment with Drillbits running alongside the HDFS, HBase, mongod and Cassandra nodes and on the Windows and Mac desktops.]


EXAMPLE USE CASES


Interactive Query for Hadoop
• Challenge:
– Pre-existing workflow with Hive as data warehouse
– Analysts frustrated with query completion times
• Solution:
– Partition cluster between interactive and batch (MR/Spark) workloads
– Install Drill on interactive nodes
– Utilize Drill’s Hive metastore integration to expose existing datasets at interactive speeds

[Diagram: ODBC & JDBC BI tools, Drill JDBC, Drill ODBC, Hive Metastore.]


SQL for NoSQL
• Challenge:
– Data has been moved from Oracle to MongoDB
– Business Users unable to use Tableau
• Solution:
– Install Drill on each MongoDB Node
– Use Drill’s ODBC driver and powerful parallelization capabilities to provide interactive in-situ query capabilities

[Diagram: Drill ODBC.]


Data Warehouse Offload
• Challenge:
– Capacity-based Data Warehouse license constraints
– Data inflow rate too high
– Need broader time horizons
• Solution:
– Export data from traditional data warehouse
– Load data into Hadoop
– Use Drill on top of Hadoop nodes

[Diagram: Existing Warehouse (Teradata, Vertica, Netezza), ODBC & JDBC BI tools, Drill JDBC, Drill ODBC.]


Cloud JSON & Sensor Analytics
• Problem:
– Sensors logging into S3
– Complex JSON: map keys have meaning and can change for every record
• Solution:
– Spin Up EC2 Analysis Cluster
– Set up number of S3 Workspaces in Drill
– Leverage Drill’s FLATTEN and KVGEN capabilities to access key-based data
– Expose Data via REST API to custom application

[Diagram: JSON files in S3, EC2 nodes, REST API, Custom Reporting Application.]
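To make the FLATTEN and KVGEN step concrete, here is a small pure-Python sketch of what the two functions do to a record whose map keys carry meaning. It imitates the semantics only and is not Drill code: `kvgen` and `flatten` are hypothetical helpers, and the sensor records are invented for illustration. In Drill the comparable expression would be along the lines of `FLATTEN(KVGEN(readings))` inside a SELECT.

```python
def kvgen(mapping):
    """Turn a map with arbitrary keys into a list of {key, value} entries."""
    return [{"key": k, "value": v} for k, v in mapping.items()]


def flatten(rows, column):
    """Emit one output row per element of the repeated `column`,
    copying the other columns unchanged."""
    for row in rows:
        for element in row[column]:
            out = dict(row)
            out[column] = element
            yield out


# Hypothetical sensor records: the keys inside "readings" differ per record.
records = [
    {"device": "a", "readings": {"temp_01": 21.5, "temp_02": 22.0}},
    {"device": "b", "readings": {"humidity_07": 0.4}},
]

# Step 1: make the varying keys queryable as data.
rows = [{"device": r["device"], "kv": kvgen(r["readings"])} for r in records]

# Step 2: one row per key/value pair.
flat = list(flatten(rows, "kv"))

for row in flat:
    print(row["device"], row["kv"]["key"], row["kv"]["value"])
```

After the two steps every reading is an ordinary row with a `key` column, so filters and aggregates no longer need to know the key names in advance.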


Secure SQL for Everything
• Challenge:
– Data in MongoDB, Hadoop and RDBMS
– Provide a single endpoint for SQL-based access
– Ensure that different users and groups have different access
• Solution:
– Set up Drill leveraging chained security
– Use Drill views to expose row-level access control
– Leverage Drill User, Group and PAM integrations to control column filtering and masking
– Expose data utilizing JDBC, ODBC and REST APIs

[Diagram: a query by Frank reads MaskedSales.view (owner: Cindy), which reads GrossSales.view (owner: dba), which reads RawSales.parquet (owner: dba). Access labels: Frank file view perm, Cindy file view perm, dba delegated read. Also shown: MySQL, Drill ODBC.]
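The chained view example can be read as a simple rule: a user needs read permission on the object they query, and each view is then expanded with its owner's permissions. The sketch below is a toy model of that rule, not Drill's implementation. The `Obj` class, the `resolve` function, the reader sets and the `max_hops` limit are all assumptions made for illustration; only the object names and owners come from the slide.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Obj:
    name: str
    owner: str
    readers: FrozenSet[str]        # users with file-level read permission (assumed)
    source: Optional[str] = None   # object a view selects from; None for raw data


# Owners come from the slide; the reader sets are assumptions.
CATALOG = {
    "MaskedSales.view": Obj("MaskedSales.view", "Cindy",
                            frozenset({"Cindy", "Frank"}), "GrossSales.view"),
    "GrossSales.view": Obj("GrossSales.view", "dba",
                           frozenset({"dba", "Cindy"}), "RawSales.parquet"),
    "RawSales.parquet": Obj("RawSales.parquet", "dba", frozenset({"dba"})),
}


def resolve(user: str, name: Optional[str], catalog=CATALOG,
            max_hops: int = 3) -> List[Tuple[str, str]]:
    """Return the (acting user, object) pairs touched by a query.

    Raises PermissionError if any step lacks read permission or if the
    chain changes identity more than `max_hops` times (hypothetical limit).
    """
    chain = []
    acting, hops = user, 0
    while name is not None:
        obj = catalog[name]
        if acting not in obj.readers:
            raise PermissionError(f"{acting} cannot read {name}")
        chain.append((acting, name))
        if obj.source is not None and obj.owner != acting:
            hops += 1
            if hops > max_hops:
                raise PermissionError("too many chained identity changes")
        acting = obj.owner  # the underlying object is read as the view's owner
        name = obj.source
    return chain


print(resolve("Frank", "MaskedSales.view"))
```

In this model Frank can query MaskedSales.view, but a direct query by Frank against GrossSales.view or RawSales.parquet raises `PermissionError`, which is the point of exposing data only through views.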


ETL Among Modern Systems and Formats
• Challenge:
– Data held in MongoDB & JSON based log files
– Want to run large scale machine learning against data
• Solution:
– Do simple CTAS query in Drill
– Converts data into high performance Parquet format
– Large-scale parallel conversion and load into Hadoop
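The CTAS step can be sketched as two SQL statements: one selecting the output format and one creating the table from a query. The Python below only assembles the statement text. `store.format` is the Drill session option that controls the CTAS output format; the writable workspace `dfs.tmp`, the source `mongo.logs.events`, the target table name and the helper names are assumptions chosen for illustration.

```python
def backtick(name):
    """Quote an identifier with backticks, rejecting embedded backticks."""
    if "`" in name:
        raise ValueError("identifier must not contain a backtick")
    return f"`{name}`"


def ctas_statements(workspace, table, source_sql, fmt="parquet"):
    """Return the statements for a CREATE TABLE AS conversion.

    workspace  -- writable target workspace, e.g. 'dfs.tmp' (assumed)
    table      -- name of the table to create (hypothetical)
    source_sql -- SELECT that reads the source data
    fmt        -- output format for the written files
    """
    return [
        f"ALTER SESSION SET `store.format` = '{fmt}'",
        f"CREATE TABLE {workspace}.{backtick(table)} AS {source_sql}",
    ]


# Hypothetical source: a MongoDB collection exposed through a 'mongo' plugin.
statements = ctas_statements(
    "dfs.tmp",
    "events_parquet",
    "SELECT * FROM mongo.logs.`events`",
)

for stmt in statements:
    print(stmt)
```

Both statements would be run in the same session, for example from the Drill shell or over JDBC, so that the format setting applies to the CTAS that follows.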


UPCOMING FEATURES


Access Data in More Places
• NoSQL
– Available: HBase, MongoDB, MapR-DB
– Soon: Elasticsearch, Cassandra, Solr
• RDBMS
– Available: Hive, JDBC, MySQL
– Soon: Phoenix, Postgres, Oracle
• File Systems
– Available: HDFS, MapR-FS, S3, NAS, Azure
– Soon: DistDAS
• File Formats
– Available: JSON, Parquet, Text & CSV, Avro, Hive SerDes
– Soon: Excel, HTTPD Log, BSON


Enhanced Flexibility
• JSON literals in SQL
• Improved dynamic schema capabilities
• Type tools
• Transformation UDFs


Improved Management
• WebUI Authentication
• Web & SQL Authorization
• Advanced Workload Management
• Enhanced spooling/memory capabilities


Query Data in More Ways
• Spark Dataframe & RDD: Read & Write
– Use Drill to work with NoSQL in your Spark Pipeline
• MapReduce Input and Output Formats
– Use Drill
• Enhanced REST Capabilities
– Better support for Complex Data


Performance
• Faster Parquet and ORC readers
• Parquet enhancements
– More statistics, Bloom filters, indexes, page pruning
• Pin Tables in Memory
• Vectorization Enhancements


Recap
• Drill Background
– SQL on Everything
• Common Use Cases
– Interactive analyst experience, no matter where the data exists
• Roadmap
– More data, more access, more flexibility, easier management and higher performance


Thank You!
• Download at drill.apache.org
• Get in touch: [email protected]
• Ask questions: [email protected]
• Tweet: @DremioHQ, @ApacheDrill

