Page 1: Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015

© 2015 Dremio Corporation

Drill Use Cases & Roadmap

Hadoop Meetup NYC
September 28, 2015

Page 2: Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015

Agenda
• Drill Background
• Common Use Cases
• Roadmap

Page 3: Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015

About Dremio
• Currently in stealth
• Founded in June 2015
• Building on open source technologies including Drill, Parquet, Spark

Jacques Nadeau, Founder & CTO
• Apache Drill PMC Chair
• Recognized SQL & NoSQL expert
• Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT)

Tomer Shiran, Founder & CEO
• Apache Drill Founder
• MapR (VP Product); Microsoft; IBM Research
• Carnegie Mellon, Technion

Julien Le Dem, Architect
• Apache Parquet Founder
• Apache Pig PMC Member
• Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect)

Top Silicon Valley VCs

Page 4: Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015

Apache Drill
• Apache Foundation Project
• More than 40 contributors from 10 companies
• Open source SQL query engine for relational & non-relational datastores
• Designed for interactive queries
• Scales from one laptop to 1000s of servers

Page 5: Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015

Drill Integrates With What You Have

Any Datastore (Relational or Not)
• File systems
  – Traditional: Local files and NAS
  – Hadoop: HDFS and MapR-FS
  – Cloud storage: Amazon S3, Google Cloud Storage, Azure Blob Storage
• NoSQL databases
  – MongoDB
  – HBase
  – MapR-DB
  – Hive
• And you can add new datastores

Any Client
• Multiple interfaces: ODBC, JDBC, REST, C, Java
• BI tools
  – Tableau, Qlik, MicroStrategy, TIBCO Spotfire
• Excel
• Command line (Drill shell)
• Web and mobile apps via REST API
  – Many JSON-powered chart libraries (see D3.js)
• SAS, R, …
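
As a rough illustration of the REST route listed above, the sketch below builds the HTTP request a web or mobile backend might send to a Drillbit. It assumes Drill's web server is on its default port 8047 and accepts a JSON body at `/query.json`; the host name and the helper names `build_drill_request` and `run_query` are placeholders invented for this example, not part of the slides.

```python
import json
import urllib.request

# Assumption: a Drillbit's web server is reachable at this address
# (8047 is Drill's default web port; the host is a placeholder).
DRILL_URL = "http://localhost:8047/query.json"


def build_drill_request(sql: str, url: str = DRILL_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request that submits one SQL query."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql: str) -> list:
    """Send the query and return the result rows (needs a running Drillbit)."""
    with urllib.request.urlopen(build_drill_request(sql)) as response:
        return json.load(response)["rows"]
```

Because the rows come back as JSON objects, they can be handed more or less directly to a JSON-driven chart library on the client side.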

Page 6: Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015

Drill is Built for Modern Analytical Organizations

Execute Fast
• Standard SQL
• Read data fast
• Leverage columnar encodings and execution
• Execute operations quickly
• Scale out, not up

Iterate Fast
• Work without prep
• Decentralize data management
• In-situ security
• Explore + query
• Access multiple sources
• Avoid the ETL rinse cycle

Page 7: Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015

JSON Model, Columnar Speed

[Figure: data formats arranged along two axes, schema-less vs. fixed schema and flat vs. complex. Labels: JSON, BSON, Mongo, HBase, NoSQL, Parquet, Avro, CSV, TSV.]

RDBMS/SQL-on-Hadoop table:

Name      Gender  Age
Michael   M       6
Jennifer  F       3

Apache Drill table:

{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }

Page 8: Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015

Drill Provides the Best of Both Worlds

Acts Like a Database
• ANSI SQL: SELECT, FROM, WHERE, JOIN, HAVING, ORDER BY, WITH, CTAS, ALL, EXISTS, ANY, IN, SOME
• VarChar, Int, BigInt, Decimal, VarBinary, Timestamp, Float, Double, etc.
• Subqueries, scalar subqueries, partition pruning, CTE
• Data warehouse offload
• Tableau, ODBC, JDBC
• TPC-H & TPC-DS-like workloads
• Supports Hive SerDes
• Supports Hive UDFs
• Supports Hive Metastore

Even When Your Data Doesn’t
• Path based queries and wildcards
  – select * from /my/logs/
  – select * from /revenue/*/q2
• Modern data types
  – Map, Array, Any
• Complex Functions and Relational Operators
  – FLATTEN, kvgen, convert_from, convert_to, repeated_count, etc.
• JSON Sensor analytics
• Complex data analysis
• Alternative DSLs
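
The two path queries on this slide read as shorthand. In practice a file-system path is typically addressed through a storage plugin and quoted with backticks. The sketch below shows one way to assemble such a statement; the helper name `path_query` and the choice of the `dfs` plugin as the default are assumptions made for illustration.

```python
def path_query(path: str, plugin: str = "dfs", columns: str = "*") -> str:
    """Return a Drill SELECT over a file-system path (wildcards allowed)."""
    # A backtick inside the path would end the quoted identifier early.
    if "`" in path:
        raise ValueError("path must not contain a backtick")
    return f"SELECT {columns} FROM {plugin}.`{path}`"


# The two slide examples, written out in full.
all_logs = path_query("/my/logs/")
q2_revenue = path_query("/revenue/*/q2")
```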

Page 9: Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015

Data Lake, More Like Data Maelstrom

[Figure: data scattered across desktops and clustered servers. Labels: Windows Desktop, Mac Desktop, HBase & HDFS Cluster, HDFS Cluster, MongoDB Cluster, Cassandra Cluster.]

Page 10: Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015

Run Drillbits Wherever; Whatever Your Data

[Figure: Drillbits running alongside the HDFS, HBase, mongod, and Cassandra nodes and on the Windows and Mac desktops.]

Page 11: Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015

EXAMPLE USE CASES

Page 12: Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015

Interactive Query for Hadoop
• Challenge:
  – Pre-existing workflow with Hive as data warehouse
  – Analysts frustrated with query completion times
• Solution:
  – Partition cluster between interactive and batch (MR/Spark) workloads
  – Install Drill on interactive nodes
  – Utilize Drill’s Hive metastore integration to expose existing datasets at interactive speeds

[Figure labels: ODBC & JDBC BI Tools; Drill JDBC; Drill ODBC; Hive Metastore]

Page 13: Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015

SQL for NoSQL
• Challenge:
  – Data has been moved from Oracle to MongoDB
  – Business users unable to use Tableau
• Solution:
  – Install Drill on each MongoDB node
  – Use Drill’s ODBC driver and powerful parallelization capabilities to provide interactive in-situ query capabilities

[Figure label: Drill ODBC]

Page 14: Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015

Data Warehouse Offload
• Challenge:
  – Capacity-based data warehouse license constraints
  – Data inflow rate too high
  – Need broader time horizons
• Solution:
  – Export data from traditional data warehouse
  – Load data into Hadoop
  – Use Drill on top of Hadoop nodes

[Figure labels: Existing Warehouse (Teradata, Vertica, Netezza); ODBC & JDBC BI Tools; Drill JDBC; Drill ODBC]

Page 15: Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015

Cloud JSON & Sensor Analytics
• Problem:
  – Sensors logging into S3
  – Complex JSON: map keys have meaning and can change for every record
• Solution:
  – Spin up EC2 analysis cluster
  – Set up a number of S3 workspaces in Drill
  – Leverage Drill’s FLATTEN and KVGEN capabilities to access key-based data
  – Expose data via REST API to custom application

[Figure labels: S3; JSON files; EC2 Nodes; REST API; Custom Reporting Application]
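
To make the KVGEN and FLATTEN step concrete, here is a plain-Python imitation of what the two functions do to records whose map keys vary. It is only a sketch of the semantics, not Drill code, and the sensor records and field names (`device`, `readings`) are invented for the example.

```python
def kvgen(mapping: dict) -> list:
    """Mimic KVGEN: turn a map into a list of {"key", "value"} maps."""
    return [{"key": k, "value": v} for k, v in mapping.items()]


def flatten(rows: list, column: str) -> list:
    """Mimic FLATTEN: emit one output row per element of a list column."""
    out = []
    for row in rows:
        for element in row[column]:
            out.append({**row, column: element})
    return out


# Hypothetical sensor records: the keys under "readings" differ per record.
records = [
    {"device": "a1", "readings": {"temp": 21.5, "humidity": 40}},
    {"device": "b2", "readings": {"pressure": 1013}},
]

# Roughly what SELECT device, FLATTEN(KVGEN(readings)) AS reading would yield.
with_pairs = [{"device": r["device"], "reading": kvgen(r["readings"])} for r in records]
flat = flatten(with_pairs, "reading")
```

After this step every reading is an ordinary row with a `key` and a `value`, so the changing key names can be filtered and grouped like any other column.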

Page 16: Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015

Secure SQL for Everything
• Challenge:
  – Data in MongoDB, Hadoop and RDBMS
  – Provide a single endpoint for SQL-based access
  – Ensure that different users and groups have different access
• Solution:
  – Set up Drill leveraging chained security
  – Use Drill views to expose row-level access control
  – Leverage Drill User, Group and PAM integrations to control column filtering and masking
  – Expose data utilizing JDBC, ODBC and REST APIs

[Figure: chained view permissions. Labels: Query by Frank; Frank file view perm; MaskedSales.view (owner: Cindy); Cindy file view perm; GrossSales.view (owner: dba); dba delegated read; RawSales.parquet (owner: dba); MySQL; Drill ODBC]
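
The layering of views in this slide can be pictured with a small plain-Python model, in which each view is a function over the rows of the layer beneath it. The slide does not say what each view filters or masks, so the region filter, the masked `customer` column, and the sample rows are all assumptions chosen purely for illustration.

```python
# Hypothetical raw rows standing in for RawSales.parquet.
RAW_SALES = [
    {"region": "east", "customer": "Acme Corp", "amount": 120.0},
    {"region": "west", "customer": "Globex", "amount": 75.0},
    {"region": "east", "customer": "Initech", "amount": 30.0},
]


def gross_sales(rows, region):
    """Row-level control: a view that exposes only one region's rows."""
    return [row for row in rows if row["region"] == region]


def masked_sales(rows):
    """Column masking: a view over the first one that hides customer names."""
    return [{**row, "customer": "****"} for row in rows]


# The end user only ever sees the output of the outermost view.
visible = masked_sales(gross_sales(RAW_SALES, "east"))
```

The point of the chain is that the end user needs permission only on the outermost view, while each view reads the layer below it with its owner's rights.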

Page 17: Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015

ETL Among Modern Systems and Formats
• Challenge:
  – Data held in MongoDB & JSON-based log files
  – Want to run large-scale machine learning against data
• Solution:
  – Do simple CTAS query in Drill
  – Converts data into high-performance Parquet format
  – Large-scale parallel conversion and load into Hadoop
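
A CTAS conversion of this kind might be assembled as in the sketch below, which returns the statements to submit in order. It assumes the session option `store.format` selects the output format. The workspace `dfs.tmp`, the table name, and the MongoDB database and collection names are hypothetical, as is the helper `ctas_to_parquet`.

```python
def ctas_to_parquet(target: str, select_sql: str) -> list:
    """Return the statements for a CTAS that writes Parquet, in order."""
    return [
        "ALTER SESSION SET `store.format` = 'parquet'",
        f"CREATE TABLE {target} AS {select_sql}",
    ]


statements = ctas_to_parquet(
    "dfs.tmp.`events_parquet`",           # hypothetical writable workspace and table
    "SELECT * FROM mongo.logs.`events`",  # hypothetical MongoDB database and collection
)
```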

Page 18: Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015


UPCOMING FEATURES

Page 19: Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015

Access Data in More Places

               Available                                      Soon
NoSQL          HBase, MongoDB, MapR-DB                        Elasticsearch, Cassandra, Solr
RDBMS          Hive, JDBC, MySQL                              Phoenix, Postgres, Oracle
File Systems   HDFS, MapR-FS, S3, NAS, Azure                  DistDAS
File Formats   JSON, Parquet, Text & CSV, Avro, Hive SerDes   Excel, HTTPD Log, BSON

Page 20: Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015

Enhanced Flexibility
• JSON literals in SQL
• Improved dynamic schema capabilities
• Type tools
• Transformation UDFs

Page 21: Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015

Improved Management
• WebUI Authentication
• Web & SQL Authorization
• Advanced Workload Management
• Enhanced spooling/memory capabilities

Page 22: Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015

Query Data in More Ways
• Spark DataFrame & RDD: Read & Write
  – Use Drill to work with NoSQL in your Spark pipeline
• MapReduce Input and Output Formats
  – Use Drill
• Enhanced REST Capabilities
  – Better support for complex data

Page 23: Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015

Performance
• Faster Parquet and ORC readers
• Parquet enhancements
  – More statistics, Bloom filters, indexes, page pruning
• Pin tables in memory
• Vectorization enhancements

Page 24: Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015

Recap
• Drill Background
  – SQL on Everything
• Common Use Cases
  – Interactive analyst experience, no matter where the data exists
• Roadmap
  – More data, more access, more flexibility, easier management and higher performance

Page 25: Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015

Thank You!

• Download at drill.apache.org
• Get in touch: [email protected]
• Ask questions: [email protected]
• Tweet: @DremioHQ, @ApacheDrill