Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015
TRANSCRIPT
© 2015 Dremio Corporation1
Drill Use Cases & Roadmap
Hadoop Meetup NYC
September 28, 2015
Agenda
• Drill Background
• Common Use Cases
• Roadmap
About Dremio
• Currently in stealth
• Founded in June 2015
• Building on open source technologies including Drill, Parquet, Spark
Jacques Nadeau, Founder & CTO
• Apache Drill PMC Chair
• Recognized SQL & NoSQL expert
• Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT)
Tomer Shiran, Founder & CEO
• Apache Drill Founder
• MapR (VP Product); Microsoft; IBM Research
• Carnegie Mellon, Technion
Julien Le Dem, Architect
• Apache Parquet Founder
• Apache Pig PMC Member
• Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect)
Top Silicon Valley VCs
Apache Drill
• Apache Foundation Project
• More than 40 contributors from 10 companies
• Open source SQL query engine for relational & non-relational datastores
• Designed for interactive queries
• Scales from one laptop to 1000s of servers
Drill Integrates With What You Have
Any Datastore (Relational or Not)
• File systems
– Traditional: Local files and NAS
– Hadoop: HDFS and MapR-FS
– Cloud storage: Amazon S3, Google Cloud Storage, Azure Blob Storage
• NoSQL databases
– MongoDB
– HBase
– MapR-DB
– Hive
• And you can add new datastores:
Any Client
• Multiple interfaces: ODBC, JDBC, REST, C, Java
• BI tools
– Tableau, Qlik, MicroStrategy, TIBCO Spotfire
• Excel
• Command line (Drill shell)
• Web and mobile apps via REST API
– Many JSON-powered chart libraries (see D3.js)
• SAS, R, …
Drill is Built for Modern Analytical Organizations
Execute Fast
• Standard SQL
• Read data fast
• Leverage columnar encodings and execution
• Execute operations quickly
• Scale out, not up
Iterate Fast
• Work without prep
• Decentralize data management
• In-situ security
• Explore + query
• Access multiple sources
• Avoid the ETL rinse cycle
JSON Model, Columnar Speed
[Figure: data formats arranged along two axes, fixed schema to schema-less and flat to complex. Formats shown: CSV, TSV, Parquet, Avro, HBase, NoSQL, Mongo, JSON, BSON]
RDBMS/SQL-on-Hadoop table (flat, fixed schema):
Name      Gender  Age
Michael   M       6
Jennifer  F       3
Apache Drill table (complex, schema-less):
{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
Drill Provides the Best of Both Worlds
Acts Like a Database
• ANSI SQL: SELECT, FROM, WHERE, JOIN, HAVING, ORDER BY, WITH, CTAS, ALL, EXISTS, ANY, IN, SOME
• VarChar, Int, BigInt, Decimal, VarBinary, Timestamp, Float, Double, etc.
• Subqueries, scalar subqueries, partition pruning, CTE
• Data warehouse offload
• Tableau, ODBC, JDBC
• TPC-H & TPC-DS-like workloads
• Supports Hive SerDes
• Supports Hive UDFs
• Supports Hive Metastore
Even When Your Data Doesn’t
• Path based queries and wildcards
– select * from /my/logs/
– select * from /revenue/*/q2
• Modern data types
– Map, Array, Any
• Complex Functions and Relational Operators
– FLATTEN, kvgen, convert_from, convert_to, repeated_count, etc.
• JSON Sensor analytics
• Complex data analysis
• Alternative DSLs
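To make the complex functions listed on this slide concrete, here is a minimal pure-Python sketch of what FLATTEN and KVGEN do to rows. It is an emulation of the row-level behavior, not Drill code. The helper names `flatten` and `kvgen` and the sensor keys are hypothetical; the two person records are the ones from the "JSON Model, Columnar Speed" slide.

```python
# Pure-Python emulation of the row-level semantics of Drill's FLATTEN and
# KVGEN functions. This is NOT Drill code; the helper names are hypothetical.

def flatten(rows, column):
    """Emit one output row per element of the array stored in `column`."""
    out = []
    for row in rows:
        # In this sketch, rows without the array produce no output rows.
        for element in row.get(column, []):
            new_row = dict(row)        # copy the other columns unchanged
            new_row[column] = element  # replace the array with one element
            out.append(new_row)
    return out


def kvgen(mapping):
    """Turn a map with arbitrary keys into a list of {key, value} maps."""
    return [{"key": k, "value": v} for k, v in mapping.items()]


# The two records from the "JSON Model, Columnar Speed" slide.
people = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]

flat = flatten(people, "hobbies")
for row in flat:
    print(row["name"]["first"], row["hobbies"])

# Hypothetical sensor map whose keys carry meaning and vary per record.
pairs = kvgen({"sensor_17": 20.5, "sensor_42": 19.0})
print(pairs)
```

The two functions are often combined: KVGEN first turns a map with unpredictable keys into an array, and FLATTEN then turns that array into rows that ordinary SQL can filter and aggregate.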
Data Lake, More Like Data Maelstrom
[Diagram: data spread across clustered servers (HBase & HDFS cluster, HDFS cluster, MongoDB cluster, Cassandra cluster) and desktops (Windows desktop, Mac desktop)]
Run Drillbits Wherever; Whatever Your Data
[Diagram: the same HDFS, HBase, mongod, and Cassandra nodes and the Windows and Mac desktops, now with Drillbits running on them]
EXAMPLE USE CASES
Interactive Query for Hadoop
• Challenge:
– Pre-existing workflow with Hive as data warehouse
– Analysts frustrated with query completion times
• Solution:
– Partition cluster between interactive and batch (MR/Spark) workloads
– Install Drill on interactive nodes
– Utilize Drill’s Hive metastore integration to expose existing datasets at interactive speeds
[Diagram: ODBC & JDBC BI tools connect through Drill JDBC and Drill ODBC; Drill uses the Hive Metastore]
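As a rough illustration of the Hive metastore integration, the sketch below builds a Hive storage plugin configuration of the kind Drill accepts as JSON. The function name `hive_plugin_config` and the host `metastore.example.com` are hypothetical, and a real deployment may need further properties (for example file system settings) that are not shown here.

```python
import json


def hive_plugin_config(metastore_host, metastore_port=9083):
    """Return a Hive storage plugin configuration as a Python dict.

    The host and port are placeholders; 9083 is the conventional Hive
    metastore Thrift port, assumed here rather than taken from the slides.
    """
    return {
        "type": "hive",
        "enabled": True,
        "configProps": {
            "hive.metastore.uris": f"thrift://{metastore_host}:{metastore_port}",
            "hive.metastore.sasl.enabled": "false",
        },
    }


config = hive_plugin_config("metastore.example.com")
print(json.dumps(config, indent=2))
```

Assuming the plugin is registered under the name `hive`, existing Hive tables can then be queried from Drill with names of the form hive.`table_name`, without moving or converting the data.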
SQL for NoSQL
• Challenge:
– Data has been moved from Oracle to MongoDB
– Business users unable to use Tableau
• Solution:
– Install Drill on each MongoDB node
– Use Drill’s ODBC driver and powerful parallelization capabilities to provide interactive in-situ query capabilities
[Diagram: Drill ODBC]
Data Warehouse Offload
• Challenge:
– Capacity-based data warehouse license constraints
– Data inflow rate too high
– Need broader time horizons
• Solution:
– Export data from traditional data warehouse
– Load data into Hadoop
– Use Drill on top of Hadoop nodes
[Diagram: Existing Warehouse (Teradata, Vertica, Netezza); ODBC & JDBC BI tools connect through Drill JDBC and Drill ODBC]
Cloud JSON & Sensor Analytics
• Problem:
– Sensors logging into S3
– Complex JSON: map keys have meaning and can change for every record
• Solution:
– Spin up EC2 analysis cluster
– Set up a number of S3 workspaces in Drill
– Leverage Drill’s FLATTEN and KVGEN capabilities to access key-based data
– Expose data via REST API to custom application
[Diagram: JSON files in S3 are read by EC2 nodes and exposed through the REST API to a custom reporting application]
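The last two solution steps can be sketched as follows: a FLATTEN(KVGEN(...)) query wrapped in a request to Drill's REST endpoint `/query.json`. This is a sketch under stated assumptions, not the setup from the talk. The storage plugin name `s3`, the workspace `sensors`, the file path, the column `readings`, and the helper `build_drill_request` are all hypothetical. The request is only built, not sent, so the block runs without a cluster.

```python
import json
import urllib.request

# Hypothetical names: plugin `s3`, workspace `sensors`, the file path, and
# the map column `readings` are placeholders, not taken from the slides.
SENSOR_SQL = (
    "SELECT t.kv.`key` AS sensor_key, t.kv.`value` AS reading "
    "FROM (SELECT FLATTEN(KVGEN(readings)) AS kv "
    "FROM s3.sensors.`2015/09/28/log.json`) t"
)


def build_drill_request(sql, host="localhost", port=8047):
    """Build (but do not send) a POST request for Drill's /query.json."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_drill_request(SENSOR_SQL)
print(request.full_url)
print(request.data.decode("utf-8"))

# To run the query against a live Drillbit, a caller could do:
#   with urllib.request.urlopen(request) as response:
#       result = json.load(response)
```

A custom reporting application would issue requests like this one and render the JSON response, for example with a chart library.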
Secure SQL for Everything
• Challenge:
– Data in MongoDB, Hadoop and RDBMS
– Provide a single endpoint for SQL-based access
– Ensure that different users and groups have different access
• Solution:
– Set up Drill leveraging chained security
– Use Drill views to expose row-level access control
– Leverage Drill User, Group and PAM integrations to control column filtering and masking
– Expose data utilizing JDBC, ODBC and REST APIs
[Diagram: a query by Frank reads MaskedSales.view (owner: Cindy) using Frank's file view permission; that view reads GrossSales.view (owner: dba) using Cindy's file view permission; that view reads RawSales.parquet (owner: dba) through a dba delegated read. Also shown: MySQL, Drill ODBC]
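The chained security in the diagram can be modeled in a few lines. The sketch below is a simplified model, not Drill's implementation: the object names and owners come from the slide, while the `OBJECTS` structure and the `can_query` function are hypothetical. The idea it illustrates is that each hop below a view is checked as the view's owner rather than as the original caller.

```python
# Simplified model of chained ownership. Names and owners are from the
# slide's diagram; the data structure and function are hypothetical.
OBJECTS = {
    "MaskedSales.view": {
        "owner": "Cindy", "readers": {"Frank"}, "source": "GrossSales.view",
    },
    "GrossSales.view": {
        "owner": "dba", "readers": {"Cindy"}, "source": "RawSales.parquet",
    },
    "RawSales.parquet": {
        "owner": "dba", "readers": set(), "source": None,
    },
}


def can_query(user, name):
    """Walk the chain, checking each hop as the identity in effect there."""
    identity = user
    while name is not None:
        obj = OBJECTS[name]
        if identity != obj["owner"] and identity not in obj["readers"]:
            return False
        # Below this object, access is checked as its owner, not the caller.
        identity = obj["owner"]
        name = obj["source"]
    return True


for user, target in [
    ("Frank", "MaskedSales.view"),
    ("Frank", "GrossSales.view"),
    ("Frank", "RawSales.parquet"),
    ("Cindy", "GrossSales.view"),
]:
    print(user, target, can_query(user, target))
```

In this model Frank can read the masked view but neither the gross view nor the raw file, which is the separation of access the slide describes.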
ETL Among Modern Systems and Formats
• Challenge:
– Data held in MongoDB & JSON-based log files
– Want to run large-scale machine learning against data
• Solution:
– Do simple CTAS query in Drill
– Converts data into high-performance Parquet format
– Large-scale parallel conversion and load into Hadoop
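A CTAS conversion of this kind might look like the statements generated below. This is a sketch with assumed names: the writable workspace `dfs.tmp`, the target table `events_parquet`, the MongoDB database `logs` and collection `events`, and the helper `build_ctas` are all hypothetical. The statements are only composed as strings, so the block runs without a Drill cluster.

```python
def build_ctas(target_table, source_table, columns="*"):
    """Return the statements for a CTAS conversion to Parquet.

    `store.format` is the Drill session option that selects the output
    format of CREATE TABLE AS; the table names are supplied by the caller.
    """
    return [
        "ALTER SESSION SET `store.format` = 'parquet'",
        f"CREATE TABLE {target_table} AS SELECT {columns} FROM {source_table}",
    ]


# Hypothetical source and target names.
statements = build_ctas("dfs.tmp.`events_parquet`", "mongo.logs.`events`")
for statement in statements:
    print(statement + ";")
```

The statements would be submitted through any Drill client (shell, JDBC, ODBC, or REST), and the resulting Parquet files can then be read by other tools in the Hadoop ecosystem.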
UPCOMING FEATURES
Access Data in More Places
• NoSQL
– Available: HBase, MongoDB, MapR-DB
– Soon: Elasticsearch, Cassandra, Solr
• RDBMS
– Available: Hive, JDBC, MySQL
– Soon: Phoenix, Postgres, Oracle
• File Systems
– Available: HDFS, MapR-FS, S3, NAS, Azure
– Soon: DistDAS
• File Formats
– Available: JSON, Parquet, Text & CSV, Avro, Hive SerDes
– Soon: Excel, HTTPD Log, BSON
Enhanced Flexibility
• JSON literals in SQL
• Improved dynamic schema capabilities
• Type tools
• Transformation UDFs
Improved Management
• WebUI Authentication
• Web & SQL Authorization
• Advanced Workload Management
• Enhanced spooling/memory capabilities
Query Data in More Ways
• Spark DataFrame & RDD: Read & Write
– Use Drill to work with NoSQL in your Spark pipeline
• MapReduce Input and Output Formats
– Use Drill
• Enhanced REST Capabilities
– Better support for complex data
Performance
• Faster Parquet and ORC readers
• Parquet enhancements
– More statistics, bloom filters, indexes, page pruning
• Pin tables in memory
• Vectorization enhancements
Recap
• Drill Background
– SQL on Everything
• Common Use Cases
– Interactive analyst experience, no matter where the data exists
• Roadmap
– More data, more access, more flexibility, easier management and higher performance
Thank You!
• Download at drill.apache.org
• Get in touch:
– [email protected]
• Ask questions:
– [email protected]
• Tweet:
– @DremioHQ, @ApacheDrill