Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015

Post on 09-Jan-2017

© 2015 Dremio Corporation

Drill Use Cases & Roadmap

Hadoop Meetup NYC, September 28, 2015

Agenda
• Drill Background
• Common Use Cases
• Roadmap

About Dremio
• Currently in stealth
• Founded in June 2015
• Building on open source technologies including Drill, Parquet, Spark

Jacques Nadeau, Founder & CTO
• Apache Drill PMC Chair
• Recognized SQL & NoSQL expert
• Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT)

Tomer Shiran, Founder & CEO
• Apache Drill Founder
• MapR (VP Product); Microsoft; IBM Research
• Carnegie Mellon, Technion

Julien Le Dem, Architect
• Apache Parquet Founder
• Apache Pig PMC Member
• Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect)

Top Silicon Valley VCs

Apache Drill
• Apache Foundation Project
• More than 40 contributors from 10 companies
• Open source SQL query engine for relational & non-relational datastores
• Designed for interactive queries
• Scales from one laptop to 1000s of servers

Drill Integrates With What You Have

Any Datastore (Relational or Not)
• File systems
– Traditional: Local files and NAS
– Hadoop: HDFS and MapR-FS
– Cloud storage: Amazon S3, Google Cloud Storage, Azure Blob Storage
• NoSQL databases
– MongoDB
– HBase
– MapR-DB
– Hive
• And you can add new datastores

Any Client
• Multiple interfaces: ODBC, JDBC, REST, C, Java
• BI tools
– Tableau, Qlik, MicroStrategy, TIBCO Spotfire
• Excel
• Command line (Drill shell)
• Web and mobile apps via REST API
– Many JSON-powered chart libraries (see D3.js)
• SAS, R, …

Drill is Built for Modern Analytical Organizations

Execute Fast
• Standard SQL
• Read data fast
• Leverage columnar encodings and execution
• Execute operations quickly
• Scale out, not up

Iterate Fast
• Work without prep
• Decentralize data management
• In-situ security
• Explore + query
• Access multiple sources
• Avoid the ETL rinse cycle

JSON Model, Columnar Speed

[Figure: data formats placed on two axes, schema-less vs. fixed schema and flat vs. complex. Labels shown: JSON, BSON, Mongo, HBase, NoSQL, Parquet, Avro, CSV, TSV.]

RDBMS/SQL-on-Hadoop table:
Name      Gender  Age
Michael   M       6
Jennifer  F       3

Apache Drill table:
{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }

Drill Provides the Best of Both Worlds

Acts Like a Database
• ANSI SQL: SELECT, FROM, WHERE, JOIN, HAVING, ORDER BY, WITH, CTAS, ALL, EXISTS, ANY, IN, SOME
• VarChar, Int, BigInt, Decimal, VarBinary, Timestamp, Float, Double, etc.
• Subqueries, scalar subqueries, partition pruning, CTE
• Data warehouse offload
• Tableau, ODBC, JDBC
• TPC-H & TPC-DS-like workloads
• Supports Hive SerDes
• Supports Hive UDFs
• Supports Hive Metastore

Even When Your Data Doesn’t
• Path based queries and wildcards
– select * from /my/logs/
– select * from /revenue/*/q2
• Modern data types
– Map, Array, Any
• Complex Functions and Relational Operators
– FLATTEN, kvgen, convert_from, convert_to, repeated_count, etc
• JSON Sensor analytics
• Complex data analysis
• Alternative DSLs
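To make the complex-data functions listed above more concrete, here is a minimal Python sketch that emulates what FLATTEN and KVGEN do to a record. This is not Drill code: the helper names `flatten` and `kvgen` are hypothetical stand-ins written for illustration, and the sample records are the two from the "Apache Drill table" example earlier in the deck.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def kvgen(mapping: Dict[str, Any]) -> List[Record]:
    """Emulate KVGEN: turn a map into a list of {key, value} pairs."""
    return [{"key": k, "value": v} for k, v in mapping.items()]


def flatten(rows: List[Record], column: str) -> List[Record]:
    """Emulate FLATTEN: emit one output row per element of a repeated column."""
    out: List[Record] = []
    for row in rows:
        for element in row[column]:
            new_row = dict(row)        # copy the other columns unchanged
            new_row[column] = element  # replace the array with one element
            out.append(new_row)
    return out


# The two records from the deck's JSON example (district/preschool omitted).
records = [
    {"name": {"first": "Michael", "last": "Smith"}, "hobbies": ["ski", "soccer"]},
    {"name": {"first": "Jennifer", "last": "Gates"}, "hobbies": ["sing"]},
]

flat = flatten(records, "hobbies")
hobby_rows = [(r["name"]["first"], r["hobbies"]) for r in flat]
name_pairs = kvgen(records[0]["name"])
```

In this sketch, flattening `hobbies` yields three rows (two for Michael, one for Jennifer), and `kvgen` turns the `name` map into key/value pairs that can then be filtered like ordinary columns.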

Data Lake, More Like Data Maelstrom

[Diagram: data spread across clustered servers and desktops: an HBase & HDFS cluster, an HDFS cluster, a MongoDB cluster (mongod), a Cassandra cluster, a Windows desktop, and a Mac desktop.]

Run Drillbits Wherever; Whatever Your Data

[Diagram: the same environment with a Drillbit running on every node: next to HDFS, HBase, mongod, and Cassandra on the clustered servers, and on the Windows and Mac desktops.]

EXAMPLE USE CASES

Interactive Query for Hadoop
• Challenge:
– Pre-existing workflow with Hive as data warehouse
– Analysts frustrated with query completion times
• Solution:
– Partition cluster between interactive and batch (MR/Spark) workloads
– Install Drill on interactive nodes
– Utilize Drill’s Hive metastore integration to expose existing datasets at interactive speeds

[Diagram: ODBC & JDBC BI tools connect through Drill JDBC and Drill ODBC; Drill uses the Hive Metastore.]

SQL for NoSQL
• Challenge:
– Data has been moved from Oracle to MongoDB
– Business users unable to use Tableau
• Solution:
– Install Drill on each MongoDB node
– Use Drill’s ODBC driver and powerful parallelization capabilities to provide interactive in-situ query capabilities

[Diagram: clients connect through Drill ODBC.]

Data Warehouse Offload
• Challenge:
– Capacity-based data warehouse license constraints
– Data inflow rate too high
– Need broader time horizons
• Solution:
– Export data from traditional data warehouse
– Load data into Hadoop
– Use Drill on top of Hadoop nodes

[Diagram: Existing Warehouse (Teradata, Vertica, Netezza); ODBC & JDBC BI tools connect through Drill JDBC and Drill ODBC.]

Cloud JSON & Sensor Analytics
• Problem:
– Sensors logging into S3
– Complex JSON: map keys have meaning and can change for every record
• Solution:
– Spin up EC2 analysis cluster
– Set up number of S3 workspaces in Drill
– Leverage Drill’s FLATTEN and KVGEN capabilities to access key-based data
– Expose data via REST API to custom application

[Diagram: JSON files in S3 are read by a cluster of EC2 nodes; a REST API serves a Custom Reporting Application.]
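The last two solution steps can be sketched from the client side. The Python below builds, but does not send, a request to Drill's REST query endpoint, which to my knowledge accepts a JSON body with `queryType` and `query` fields at `/query.json` on the web port (8047 by default). Everything in the SQL text is an assumption made for illustration: the `s3` storage plugin name, the `sensors` workspace, the `logs` directory, and the `sensor_id` and `readings` columns.

```python
import json
import urllib.request

# Hypothetical names: the "s3" plugin, "sensors" workspace, "logs" directory,
# and the "sensor_id" / "readings" columns are illustrative assumptions.
QUERY = (
    "SELECT t.sensor_id, FLATTEN(KVGEN(t.readings)) AS reading "
    "FROM s3.sensors.`logs` t"
)


def build_drill_request(
    sql: str, host: str = "localhost", port: int = 8047
) -> urllib.request.Request:
    """Build (but do not send) a POST request for Drill's REST query endpoint."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_drill_request(QUERY)
payload = json.loads(request.data.decode("utf-8"))
```

A reporting application would pass `request` to `urllib.request.urlopen` against a running Drillbit and read the JSON response. Here KVGEN turns the variable-key `readings` map into key/value pairs, and FLATTEN emits one row per pair.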

Secure SQL for Everything
• Challenge:
– Data in MongoDB, Hadoop and RDBMS
– Provide a single endpoint for SQL-based access
– Ensure that different users and groups have different access
• Solution:
– Set up Drill leveraging chained security
– Use Drill views to expose row-level access control
– Leverage Drill User, Group and PAM integrations to control column filtering and masking
– Expose data utilizing JDBC, ODBC and REST APIs

[Diagram: a query by Frank reads MaskedSales.view (owner: Cindy) with Frank's file view permission; that view reads GrossSales.view (owner: dba) with Cindy's file view permission; that view reads RawSales.parquet (owner: dba) through dba delegated read. MySQL and Drill ODBC also appear.]
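The layered-view idea can be illustrated with a small Python emulation. This is only a sketch of the concept, not how Drill implements views or impersonation: the two functions stand in for GrossSales.view and MaskedSales.view, and the column names, the region filter, and the masking rule are all hypothetical, since the slide does not say what each view actually selects.

```python
from typing import Dict, List

Row = Dict[str, object]

# Hypothetical contents of RawSales.parquet; the columns are assumptions.
RAW_SALES: List[Row] = [
    {"region": "east", "customer": "4111-1111", "amount": 120},
    {"region": "west", "customer": "4222-2222", "amount": 75},
    {"region": "east", "customer": "4333-3333", "amount": 40},
]


def gross_sales_view(rows: List[Row], region: str) -> List[Row]:
    """Row-level control: expose only the rows for one region."""
    return [row for row in rows if row["region"] == region]


def masked_sales_view(rows: List[Row]) -> List[Row]:
    """Column masking: hide all but the last four characters of `customer`."""
    masked = []
    for row in rows:
        customer = str(row["customer"])
        hidden = "*" * (len(customer) - 4) + customer[-4:]
        masked.append({**row, "customer": hidden})
    return masked


# A query against the outer view only ever sees filtered, masked rows.
frank_result = masked_sales_view(gross_sales_view(RAW_SALES, "east"))
```

The point of the chain is that the end user needs permission only on the outermost view, while each view reads the layer beneath it on behalf of its owner.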

ETL Among Modern Systems and Formats
• Challenge:
– Data held in MongoDB & JSON-based log files
– Want to run large-scale machine learning against data
• Solution:
– Do simple CTAS query in Drill
– Converts data into high-performance Parquet format
– Large-scale parallel conversion and load into Hadoop
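A CTAS conversion of this kind comes down to two statements, sketched here as strings produced by a small Python helper. The helper `parquet_ctas` is hypothetical, as are the object names: a writable `dfs.tmp` workspace and a MongoDB storage plugin called `mongo` with database `app` and collection `events`. `store.format` is the Drill session option that selects the format CTAS writes.

```python
from typing import List


def parquet_ctas(target: str, source: str) -> List[str]:
    """Return the statements for a CTAS conversion to Parquet."""
    return [
        # store.format selects the file format that CTAS writes.
        "ALTER SESSION SET `store.format` = 'parquet'",
        f"CREATE TABLE {target} AS SELECT * FROM {source}",
    ]


# Hypothetical names: a writable dfs.tmp workspace and a MongoDB storage
# plugin called "mongo" with database "app" and collection "events".
statements = parquet_ctas("dfs.tmp.`events_parquet`", "mongo.app.`events`")
```

Each statement could be submitted in order through any Drill client (shell, JDBC, ODBC, or REST); the resulting Parquet files are then available to downstream jobs on the cluster.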

UPCOMING FEATURES

Access Data in More Places
• NoSQL
– Available: HBase, MongoDB, MapR-DB
– Soon: Elasticsearch, Cassandra, Solr
• RDBMS
– Available: Hive, JDBC, MySQL
– Soon: Phoenix, Postgres, Oracle
• File Systems
– Available: HDFS, MapR-FS, S3, NAS, Azure
– Soon: DistDAS
• File Formats
– Available: JSON, Parquet, Text & CSV, Avro, Hive SerDes
– Soon: Excel, HTTPD Log, BSON

Enhanced Flexibility
• JSON literals in SQL
• Improved dynamic schema capabilities
• Type tools
• Transformation UDFs

Improved Management
• WebUI Authentication
• Web & SQL Authorization
• Advanced Workload Management
• Enhanced spooling/memory capabilities

Query Data in More Ways
• Spark DataFrame & RDD: Read & Write
– Use Drill to work with NoSQL in your Spark pipeline
• MapReduce Input and Output Formats
– Use Drill
• Enhanced REST Capabilities
– Better support for complex data

Performance
• Faster Parquet and ORC readers
• Parquet enhancements
– More statistics, bloom filters, indexes, page pruning
• Pin tables in memory
• Vectorization enhancements

Recap
• Drill Background
– SQL on Everything
• Common Use Cases
– Interactive analyst experience, no matter where the data exists
• Roadmap
– More data, more access, more flexibility, easier management and higher performance

Thank You!

• Download at drill.apache.org

• Get in touch: jacques@dremio.com

• Ask questions: user@drill.apache.org

• Tweet: @DremioHQ, @ApacheDrill
