Apache Drill Overview - Tokyo Apache Drill Meetup 2015/09/15

Apache Drill Overview. M.C. Srivas, CTO and Co-Founder, MapR Technologies. Akihiko Kusanagi (草薙 昭彦), Data Engineer, MapR Technologies. September 15, 2015.

Upload: mapr-technologies-japan

Post on 13-Feb-2017


TRANSCRIPT

  • 2015 MapR Technologies 1

    Apache Drill Overview

    M.C. Srivas, CTO and Co-Founder, MapR Technologies
    Akihiko Kusanagi (草薙 昭彦), Data Engineer, MapR Technologies
    September 15, 2015

  • 2015 MapR Technologies 2

    (@nagix) MapR Technologies

    NS-SHAFT

    !

  • 2015 MapR Technologies 3

  • 2015 MapR Technologies 4

    Apache Drill 1.0 (5/19) http://drill.apache.org

  • 2015 MapR Technologies 5

    Apache Drill

  • 2015 MapR Technologies 6 2015 MapR Technologies

    Apache Drill

  • 2015 MapR Technologies 7

    [Figure: timeline with ticks at 1980, 1990, 2000, 2010, 2020; annotated "80%"]

    Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data

  • 2015 MapR Technologies 8

    [Figure: timeline with ticks at 1980, 1990, 2000, 2010, 2020; DB data volumes labeled GB to TB and TB to PB]

  • 2015 MapR Technologies 9

    SQL

    SQL NoSQL

    SQL

    BI (Tableau, MicroStrategy, etc.)

    HDFS (Parquet, JSON, etc.), HBase

  • 2015 MapR Technologies 10

    Industry's First Schema-free SQL engine

    for Big Data

  • 2015 MapR Technologies 11

    &

    BI

    ITBI

    BI

    ITBI ITBI

    BI

    IT

    IT ETL

    IT

    1980 -1990 2000

  • 2015 MapR Technologies 12

    Hadoop

    Hadoop

    :

    :

  • 2015 MapR Technologies 13

    Drill

    (Hive )

    2

    SCHEMA ON WRITE

    SCHEMA BEFORE READ

    SCHEMA ON THE FLY

  • 2015 MapR Technologies 14

    Drill

    JSON BSON

    HBase

    Parquet Avro

    CSV TSV

    Name      Gender  Age
    Michael   M       6
    Jennifer  F       3

    {
      name: {
        first: Michael,
        last: Smith
      },
      hobbies: [ski, soccer],
      district: Los Altos
    }
    {
      name: {
        first: Jennifer,
        last: Gates
      },
      hobbies: [sing],
      preschool: CCLC
    }

    RDBMS/SQL-on-Hadoop

    Apache Drill
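The two JSON records on this slide do not share a schema: one has `district`, the other `preschool`. As a rough illustration of what querying such data involves, here is a minimal Python sketch that projects dotted paths out of each record and yields `None` where a field is absent. The helper names `get_path` and `select` are hypothetical and only mimic the idea; they are not Drill's implementation.

```python
# Two records with different shapes, copied from the slide: no shared schema.
RECORDS = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]


def get_path(record, path):
    """Follow a dotted path such as 'name.first'; None if any step is missing."""
    value = record
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def select(records, *paths):
    """Project the given paths out of every record, one tuple per record."""
    return [tuple(get_path(rec, p) for p in paths) for rec in records]


if __name__ == "__main__":
    for row in select(RECORDS, "name.first", "district", "preschool"):
        print(row)
```

The point of the sketch is that no table definition is needed up front: the set of fields is discovered per record while reading.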

  • 2015 MapR Technologies 15

    - - HBase - Hive

    Drill SQL on Everything

    SELECT * FROM dfs.yelp.`business.json`

    - - Hive - HBase

    - DFS (Text, Parquet, JSON)
    - HBase/MapR-DB
    - Hive/HCatalog
    - Hadoop API

  • 2015 MapR Technologies 16

    (drillbit)

    (MapReduce, Spark, Tez)

    ZooKeeper drillbit ZooKeeper drillbit ZooKeeper drillbit

  • 2015 MapR Technologies 17

    Drill

    [Diagram: three nodes, each running a drillbit next to a data process (DataNode for HDFS/MapR-FS, RegionServer for HBase/MapR-DB, mongod for MongoDB), alongside three ZooKeeper instances]

  • 2015 MapR Technologies 18

    [Diagram: five numbered steps of a SELECT* query between a client (JDBC, ODBC, REST), ZooKeeper, and the drillbits]

    * CTAS (CREATE TABLE AS SELECT) 14

  • 2015 MapR Technologies 19

    drillbit

    SQL Hive

    HBase

    MongoDB

    DFS

    RPC

  • 2015 MapR Technologies 20 2015 MapR Technologies

  • 2015 MapR Technologies 21

    M.C. Srivas MapR Technologies CTO

    MapReduce, Bigtable

    Netapp

    AFS AFS

  • 2015 MapR Technologies 22

    Drill

    Raw Data Exploration JSON Analytics Data Hub Analytics

    Hive HBase

    {JSON}, Parquet Text

  • 2015 MapR Technologies 23

    IOT

    SaaS Apache Drill JSON BI ODBC

    ETL

  • 2015 MapR Technologies 24

    SQL Hadoop

    MapR Drill PigHiveQLSQL

    Drill Tableau Squirrel

    MapR 1/100 $1,000 / TB MapR Drill BI SQL Hadoop

    SQL $100,000 / TB

    ETL SQL

  • 2015 MapR Technologies 25

    Customer-facing Analytics as a Service Drill

    MapR Drill Drill

    Hadoop SQL

    Drill

    JSON Parquet 10GB4TB 160

    SLA

  • 2015 MapR Technologies 26

    MapR Optimized Data Architecture

    [Diagram labels: SaaS; Data Movement; Data Access; BI; Optimized Data Architecture; MAPR DISTRIBUTION FOR HADOOP with (Spark Streaming, Storm), (MapReduce, Spark, Hive, Pig), (Drill, Impala); MapR Data Platform with MapR-DB and MapR-FS]

  • 2015 MapR Technologies 27 2015 MapR Technologies

  • 2015 MapR Technologies 28

    Apache Drill

    Drill Beta (2014/9 - 2015/4)

    Drill 1.0 (2015/5)

    Drill 1.1 (2015/7)

    Drill 1.2 (2015/9)

    Drill 1.3()

  • 2015 MapR Technologies 29

    Apache Drill (2015)

    ANSI SQL
      o (Rank, Row_number, OVER, PARTITION BY)
      o CTAS
      o Hive &
      o Hive UDF
      o Hive Impersonation
      o AVRO
    (Beta) JDBC

    Drill 1.1

    ANSI SQL
      o (Lead, Lag, First_Value, Last_Value, NTile)
      o Drop Table
      o Hive
      o Hive
      o MapR-DB
    Drill Web UI

    Drill 1.2

    ANSI SQL
      o Insert/Append
      o Drill on MapR-DB JSON
      o MapR-DB
      o Parquet

    Drill 1.3

  • 2015 MapR Technologies 30

    Hive BI Hive Hive

    Hive Hive Drill Hive UDF Hive Drill Impersonation

    Hive

    Parquet & Text

    Hive

    Drill

    Drill ODBC

    Drill JDBC

    1.1

    1.2

  • 2015 MapR Technologies 31

    MapR-DB BI (Tableau,

    MicroStrategy, Qlikview, ) MapR-DB KV MapR-DB JSON

    MapR-DB SQL ES

    MapR-DB

    MapR-DB

    Drill

    Drill ODBC

    Drill JDBC

    1.2 1.3

    1.3

  • 2015 MapR Technologies 32

    ANSI SQL

    Count/Avg/Min/Max/Sum
    Over/Partition By
    Rank, Dense_Rank, Percent_Rank, Row_Number, Cume_Dist
    Lead, Lag, First_Value, Last_Value, Ntile

    SQL DDL
    Parquet
    Drop table
    Insert/Append

    1.1

    1.2

    1.1 1.2 1.3

    1.1

  • 2015 MapR Technologies 33

    PAM +

    Impersonation

    Drill View

    JDBC/ODBC

    Web UI Files HBase Hive

    Drill View 1

    Drill View 2

    U U U

    User

    1.2

  • 2015 MapR Technologies 34

    &

    BI

  • 2015 MapR Technologies 35 2015 MapR Technologies

  • 2015 MapR Technologies 36

    Drill (e-Stat)

  • 2015 MapR Technologies 37

    Drill (e-Stat)

    e-Stat Apache Drill http://nagix.hatenablog.com/entry/2015/05/21/232526

  • 2015 MapR Technologies 38

  • 2015 MapR Technologies 39

  • 2015 MapR Technologies 40

    Drill JDK 7

    $ wget http://getdrill.org/drill/download/apache-drill-1.1.0.tar.gz
    $ tar -xvzf apache-drill-1.1.0.tar.gz
    $ apache-drill-1.1.0/bin/drill-embedded
    0: jdbc:drill:zk=local>

  • 2015 MapR Technologies 41

    $ ls -l

  • 2015 MapR Technologies 42

    README

    $ cat README

  • 2015 MapR Technologies 43

  • 2015 MapR Technologies 44

    MySQL

    DROP TABLE IF EXISTS ``;
    CREATE TABLE `` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `createdon` timestamp NULL DEFAULT NULL,
      `createdby` int(11) DEFAULT NULL,
      ...
    ) ENGINE=InnoDB AUTO_INCREMENT=36993336 DEFAULT CHARSET=utf8;

    LOCK TABLES `` WRITE;
    INSERT INTO `` VALUES (9,'2002-01-17 02:15:08',0,'2011-10-14 13:47:31',20,2,2,1,1,0,19630, ... ),( ... ), ... ,( ... );
    INSERT INTO `` VALUES (2297,'2002-03-19 22:13:14',0,'2011-10-14 15:47:29',11,3,2,1,2,0,21891, ... ),( ... ), ... ,( ... );
    ...

  • 2015 MapR Technologies 45

    CSV

  • 2015 MapR Technologies 46

    MySQL CSV

    #!/usr/bin/perl

    while (<>) {
      s/^(--|\/\*| |\)|DROP|CREATE|LOCK).*//g; #
      s/^INSERT INTO .+ VALUES \(//g; # INSERT
      s/(?
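The Perl script is cut off in this transcript, so its exact logic is not recoverable. As a rough illustration of the same idea (turning the `INSERT INTO ... VALUES (...),(...);` lines of a MySQL dump into CSV rows), here is a minimal Python sketch. The names `insert_to_rows` and `rows_to_csv` are hypothetical, and the parser is deliberately simple: it assumes one statement per line, backslash escapes inside quoted strings are reduced to the escaped character itself, and a bare `NULL` becomes an empty field.

```python
import csv
import io
import re


def _finish(chars, quoted):
    """Turn collected characters into one CSV field; a bare NULL becomes empty."""
    text = "".join(chars)
    if quoted:
        return text
    text = text.strip()
    return "" if text == "NULL" else text


def insert_to_rows(line):
    """Split one INSERT statement into a list of rows (lists of strings)."""
    prefix = re.match(r"INSERT INTO .+? VALUES ", line)
    if prefix is None:
        return []  # comments, DDL, LOCK TABLES and so on are skipped
    rows, row, chars = [], [], []
    in_tuple = in_string = quoted = False
    i = prefix.end()
    while i < len(line):
        ch = line[i]
        if in_string:
            if ch == "\\" and i + 1 < len(line):
                i += 1
                chars.append(line[i])  # simplistic: keep the escaped character
            elif ch == "'":
                in_string = False
            else:
                chars.append(ch)
        elif not in_tuple:
            if ch == "(":
                in_tuple, row, chars, quoted = True, [], [], False
        elif ch == "'":
            in_string = quoted = True
        elif ch in ",)":
            row.append(_finish(chars, quoted))
            chars, quoted = [], False
            if ch == ")":
                rows.append(row)
                in_tuple = False
        else:
            chars.append(ch)
        i += 1
    return rows


def rows_to_csv(rows):
    """Serialize rows with the csv module so embedded commas get quoted."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()
```

Using the `csv` module for output, rather than joining on commas, keeps fields that contain commas or quotes readable by a CSV parser afterwards.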

  • 2015 MapR Technologies 47

    CSV SELECT

    About 31.97 million rows

    0: jdbc:drill:zk=local> SELECT count(*) FROM dfs.`/tmp/.csv`;
    +-----------+
    | EXPR$0    |
    +-----------+
    | 31971575  |
    +-----------+
    1 row selected (32.733 seconds)

  • 2015 MapR Technologies 48

    CSV SELECT

    CSV columns [a,b,...]

    0: jdbc:drill:zk=local> !set maxwidth 160
    0: jdbc:drill:zk=local> SELECT * FROM dfs.`/tmp/.csv` LIMIT 3;
    +---------+
    | columns |
    +---------+
    | ["9","2002-01-17 02:15:08","0","2011-10-14 13:47:31","20","2","2","1","1","0","19630","","",""," Ave.","Suite ","To |
    | ["10","2002-01-17 02:22:35","0","2011-10-14 13:47:31","10","2","3","2","2","0","19631","","",""," Ave","","York Region"," |
    | ["11","2002-01-17 20:17:27","0","2011-10-14 13:47:32","0","2","2","1","2","0","19632","","","","","","Toronto",""," |
    +---------+
    3 rows selected (0.564 seconds)

  • 2015 MapR Technologies 49

    CSV SELECT

    columns[0], columns[1]

    0: jdbc:drill:zk=local> SELECT columns[0], columns[1], columns[2], columns[3], columns[4] FROM dfs.`/tmp/.csv` LIMIT 3;
    +---------+----------------------+---------+----------------------+---------+
    | EXPR$0  | EXPR$1               | EXPR$2  | EXPR$3               | EXPR$4  |
    +---------+----------------------+---------+----------------------+---------+
    | 9       | 2002-01-17 02:15:08  | 0       | 2011-10-14 13:47:31  | 20      |
    | 10      | 2002-01-17 02:22:35  | 0       | 2011-10-14 13:47:31  | 10      |
    | 11      | 2002-01-17 20:17:27  | 0       | 2011-10-14 13:47:32  | 0       |
    +---------+----------------------+---------+----------------------+---------+
    3 rows selected (0.356 seconds)

  • 2015 MapR Technologies 50

    CSV SELECT

    MySQL

    0: jdbc:drill:zk=local> SELECT columns[0] AS id, columns[1] AS createdon, columns[2] AS createdby, columns[3] AS updatedon, columns[4] AS updatedby FROM dfs.`/tmp/.csv` LIMIT 3;
    +-----+----------------------+------------+----------------------+------------+
    | id  | createdon            | createdby  | updatedon            | updatedby  |
    +-----+----------------------+------------+----------------------+------------+
    | 9   | 2002-01-17 02:15:08  | 0          | 2011-10-14 13:47:31  | 20         |
    | 10  | 2002-01-17 02:22:35  | 0          | 2011-10-14 13:47:31  | 10         |
    | 11  | 2002-01-17 20:17:27  | 0          | 2011-10-14 13:47:32  | 0          |
    +-----+----------------------+------------+----------------------+------------+
    3 rows selected (0.327 seconds)
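The last three slides show a headerless CSV file surfacing as a single `columns` array, then individual elements picked by index, then those elements given names with `AS`. The same three steps can be mimicked in a few lines of Python. This is only a sketch of the data shape, not of how Drill reads files; `read_columns`, `project`, and the sample text (the first five fields of the rows shown on the slides) are illustrative.

```python
import csv
import io

# First five fields of the three rows shown on the slides.
SAMPLE = (
    "9,2002-01-17 02:15:08,0,2011-10-14 13:47:31,20\n"
    "10,2002-01-17 02:22:35,0,2011-10-14 13:47:31,10\n"
    "11,2002-01-17 20:17:27,0,2011-10-14 13:47:32,0\n"
)


def read_columns(text):
    """Each CSV line becomes one record holding a single list named 'columns'."""
    return [{"columns": row} for row in csv.reader(io.StringIO(text))]


def project(records, aliases):
    """Pick columns[index] for each alias, like `columns[0] AS id`."""
    return [
        {name: rec["columns"][index] for name, index in aliases.items()}
        for rec in records
    ]


if __name__ == "__main__":
    wanted = {"id": 0, "createdon": 1, "createdby": 2}
    for row in project(read_columns(SAMPLE), wanted):
        print(row)
```

Every value comes back as a string here, which matches the situation on the following slide where the fields still have to be cast to INT or TIMESTAMP.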

  • 2015 MapR Technologies 51

    CSV SELECT

    CSV VARCHAR CAST( AS )

    :

    0: jdbc:drill:zk=local> SELECT CAST(columns[0] AS INT) AS id, CAST(columns[1] AS TIMESTAMP) AS createdon, CAST(columns[2] AS INT) AS createdby, CAST(columns[3] AS TIMESTAMP) AS updatedon, CAST(columns[4] AS INT) AS updatedby FROM dfs.`/tmp/.csv` LIMIT 3;
    Error: SYSTEM ERROR: NumberFormatException:

    Fragment 1:2

    [Error Id: 33d800c9-78ea-473a-8e41-b13e38307af3 on node1:31010] (state=,code=0)

  • 2015 MapR Technologies 52

    CSV NULL

    1: CASE

    CASE
      WHEN columns[2] = '' THEN NULL
      ELSE CAST(columns[2] AS INT)
    END

    2:

    0: jdbc:drill:zk=local> ALTER SYSTEM SET `drill.exec.functions.cast_empty_string_to_null` = true;
    +-------+----------------------------------------------------------+
    | ok    | summary                                                  |
    +-------+----------------------------------------------------------+
    | true  | drill.exec.functions.cast_empty_string_to_null updated.  |
    +-------+----------------------------------------------------------+
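The failure on the previous slide and the CASE workaround here come down to one rule: an empty string cannot be converted to an integer, so it has to be mapped to NULL before the cast. A minimal Python analogue, with hypothetical function names, looks like this; `ValueError` plays the role of the `NumberFormatException`.

```python
def cast_int_strict(value):
    """Plain cast: raises ValueError on an empty string."""
    return int(value)


def cast_int_or_null(value):
    """Same logic as the CASE expression: '' becomes None, otherwise cast."""
    return None if value == "" else int(value)


if __name__ == "__main__":
    for raw in ["20", "0", ""]:
        print(repr(raw), "->", cast_int_or_null(raw))
```

The system option on this slide applies the same mapping to every cast, which saves wrapping each column in its own CASE expression.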

  • 2015 MapR Technologies 53

    CSV SELECT 2

    0: jdbc:drill:zk=local> SELECT CAST(columns[0] AS INT) AS id, CAST(columns[1] AS TIMESTAMP) AS createdon, CAST(columns[2] AS INT) AS createdby, CAST(columns[3] AS TIMESTAMP) AS updatedon, CAST(columns[4] AS INT) AS updatedby FROM dfs.`/tmp/.csv` LIMIT 3;
    +-----+------------------------+------------+------------------------+------------+
    | id  | createdon              | createdby  | updatedon              | updatedby  |
    +-----+------------------------+------------+------------------------+------------+
    | 9   | 2002-01-17 02:15:08.0  | 0          | 2011-10-14 13:47:31.0  | 20         |
    | 10  | 2002-01-17 02:22:35.0  | 0          | 2011-10-14 13:47:31.0  | 10         |
    | 11  | 2002-01-17 20:17:27.0  | 0          | 2011-10-14 13:47:32.0  | 0          |
    +-----+------------------------+------------+------------------------+------------+
    3 rows selected (0.734 seconds)

  • 2015 MapR Technologies 54

    25

    1 2

    0: jdbc:drill:zk=local> SELECT columns[25] AS gender, count(*) AS number, TRUNC(100.0 * count(*) / 31971575, 2) AS percent FROM dfs.`/tmp/.csv` GROUP BY columns[25] ORDER BY columns[25];
    +---------+-----------+----------+
    | gender  | number    | percent  |
    +---------+-----------+----------+
    |         | 9809      | 0.03     |
    | 0       | 2         | 0.0      |
    | 1       | 4414808   | 13.8     |
    | 2       | 27546956  | 86.16    |
    +---------+-----------+----------+
    4 rows selected (31.79 seconds)
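The `percent` column is computed as `TRUNC(100.0 * count(*) / 31971575, 2)`, that is, the share of the total cut off (not rounded) after two decimals. The arithmetic can be checked against the counts in the result table with a short Python sketch; the function name `percent_of_total` is illustrative.

```python
import math

# Counts per value of columns[25], copied from the query result on the slide.
COUNTS = {"": 9809, "0": 2, "1": 4414808, "2": 27546956}
TOTAL = sum(COUNTS.values())  # equals the row count used in the query


def percent_of_total(count, total):
    """Share of the total in percent, truncated to two decimal places."""
    return math.trunc(100.0 * count / total * 100) / 100


if __name__ == "__main__":
    for value, count in COUNTS.items():
        print(repr(value), count, percent_of_total(count, TOTAL))
```

Because of truncation, the four percentages add up to slightly less than 100; the individual rows are still exact to two decimals.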

  • 2015 MapR Technologies 55

    0: jdbc:drill:zk=local> SELECT columns[0] AS pnum, columns[1] AS email FROM dfs.`/tmp/.csv` WHERE columns[1] = '[email protected]';
    +-----------+------------------------------+
    | pnum      | email                        |
    +-----------+------------------------------+
    | 12655726  | [email protected]            |
    +-----------+------------------------------+
    1 row selected (10.566 seconds)

  • 2015 MapR Technologies 56

    The view is saved under /tmp as a .view.drill file (JSON)

    0: jdbc:drill:zk=local> CREATE VIEW dfs.tmp.`` AS SELECT
    . . . . . . . . . . . > CAST(columns[0] AS INT) AS id,
    . . . . . . . . . . . > CAST(columns[1] AS TIMESTAMP) AS createdon,
    . . . . . . . . . . . > CAST(columns[2] AS INT) AS createdby,
    . . . . . . . . . . . > CAST(columns[3] AS TIMESTAMP) AS updatedon,
    . . . . . . . . . . . > CAST(columns[4] AS INT) AS updatedby
    . . . . . . . . . . . > ...
    . . . . . . . . . . . > FROM
    . . . . . . . . . . . > dfs.`/tmp/.csv`
    . . . . . . . . . . . > ;

  • 2015 MapR Technologies 57

    2642 CSV files

    $ ls Transactions
    2008-03-21_downloaded.csv 2010-08-19_downloaded.csv 2013-01-16_downloaded.csv
    2008-03-22_downloaded.csv 2010-08-20_downloaded.csv 2013-01-17_downloaded.csv
    2008-03-23_downloaded.csv 2010-08-21_downloaded.csv 2013-01-18_downloaded.csv
    2008-03-24_downloaded.csv 2010-08-22_downloaded.csv 2013-01-19_downloaded.csv
    2008-03-25_downloaded.csv 2010-08-23_downloaded.csv 2013-01-20_downloaded.csv
    2008-03-26_downloaded.csv 2010-08-24_downloaded.csv 2013-01-21_downloaded.csv
    2008-03-27_downloaded.csv 2010-08-25_downloaded.csv 2013-01-22_downloaded.csv
    2008-03-28_downloaded.csv 2010-08-26_downloaded.csv 2013-01-23_downloaded.csv
    2008-03-29_downloaded.csv 2010-08-27_downloaded.csv 2013-01-24_downloaded.csv
    2008-03-30_downloaded.csv 2010-08-28_downloaded.csv 2013-01-25_downloaded.csv
    2008-03-31_downloaded.csv 2010-08-29_downloaded.csv 2013-01-26_downloaded.csv
    2008-04-01_downloaded.csv 2010-08-30_downloaded.csv 2013-01-27_downloaded.csv
    2008-04-02_downloaded.csv 2010-08-31_downloaded.csv 2013-01-28_downloaded.csv
    2008-04-03_downloaded.csv 2010-09-01_downloaded.csv 2013-01-29_downloaded.csv
    ...

  • 2015 MapR Technologies 58

    Top 10

    0: jdbc:drill:zk=local> SELECT columns[19] AS TXT_COUNTRY, count(*) AS number from dfs.`/tmp/Transactions` GROUP BY columns[19] ORDER BY count(*) DESC LIMIT 10;
    +--------------+----------+
    | TXT_COUNTRY  | number   |
    +--------------+----------+
    | US           | 7591509  |
    | CA           | 823746   |
    | BR           | 197032   |
    | AU           | 146745   |
    | TW           | 118338   |
    | CL           | 109875   |
    | ZA           | 78126    |
    | AR           | 75314    |
    | JP           | 74165    |
    | GB           | 57901    |
    +--------------+----------+

  • 2015 MapR Technologies 59

    CSV

    $ cd Transactions
    $ for file in `ls *.csv`; do
    > dir=`echo $file | cut -c 1-7 | tr - /`
    > if [ ! -d $dir ]; then
    > mkdir -p $dir
    > fi
    > mv $file $dir
    > done
    $ ls
    2008 2009 2010 2011 2012 2013 2014 2015
    $ ls 2008
    03 04 05 06 07 08 09 10 11 12
    $ ls 2008/03
    2008-03-21_downloaded.csv 2008-03-25_downloaded.csv 2008-03-29_downloaded.csv
    2008-03-22_downloaded.csv 2008-03-26_downloaded.csv 2008-03-30_downloaded.csv
    2008-03-23_downloaded.csv 2008-03-27_downloaded.csv 2008-03-31_downloaded.csv
    2008-03-24_downloaded.csv 2008-03-28_downloaded.csv
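The shell loop derives the target directory from the first seven characters of each file name (`cut -c 1-7`) and turns the hyphen into a path separator (`tr - /`), so `2008-03-21_downloaded.csv` lands in `2008/03`. A Python sketch of the same reorganization follows; `target_dir` and `reorganize` are hypothetical names, and the sketch assumes every CSV file name starts with a `YYYY-MM` prefix.

```python
import shutil
from pathlib import Path


def target_dir(filename):
    """'2008-03-21_downloaded.csv' -> '2008/03' (first 7 chars, '-' to '/')."""
    return filename[:7].replace("-", "/")


def reorganize(root):
    """Move every *.csv directly under root into root/YYYY/MM/."""
    root = Path(root)
    for path in sorted(root.glob("*.csv")):
        dest = root / target_dir(path.name)
        dest.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`
        shutil.move(str(path), str(dest / path.name))
```

Calling `reorganize("Transactions")` would produce the year and month layout listed on this slide.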

  • 2015 MapR Technologies 60

    dir0,dir1

    0: jdbc:drill:zk=local> SELECT dir0 AS year, dir1 AS month, TRUNC(SUM(CAST(REGEXP_REPLACE(REGEXP_REPLACE(columns[2], '^\\(', '-'), ',|\\)', '') AS DOUBLE)), 2) AS amount from dfs.`/tmp/Transactions` WHERE columns[2] <> 'AMOUNT' GROUP BY dir0, dir1 ORDER BY dir0, dir1;
    +-------+-------+-----------------+
    | dir0  | dir1  | amount          |
    +-------+-------+-----------------+
    | 2008  | 03    | 97676.25        |
    | 2008  | 04    | 266162.39       |
    | 2008  | 05    | 1330456.45      |
    | 2008  | 06    | 1630110.26      |
    | 2008  | 07    | 2590733.03      |
    | 2008  | 08    | 2743130.11      |
    | 2008  | 09    | 2436655.66      |
    | 2008  | 10    | 2534268.59      |
    | 2008  | 11    | 2934391.31      |
    ...
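The query does three things at once: it skips header lines (where the amount field is the literal `AMOUNT`), normalizes the amount text with two nested `REGEXP_REPLACE` calls, and groups by the directory levels `dir0` and `dir1`. The two regexes suggest that amounts may carry thousands separators and use parentheses for negative values, although the slide does not show raw data. A Python sketch of the same steps, under that assumption and with hypothetical names and sample records:

```python
import re
from collections import defaultdict


def parse_amount(text):
    """'(1,234.50)' -> -1234.5: leading '(' becomes '-', then ',' and ')' go."""
    text = re.sub(r"^\(", "-", text)
    text = re.sub(r",|\)", "", text)
    return float(text)


def monthly_totals(records):
    """Sum amounts per (dir0, dir1), taken from the first two path components.

    records: iterable of (relative_path, amount_text) pairs.
    """
    totals = defaultdict(float)
    for rel_path, amount in records:
        if amount == "AMOUNT":  # header line repeated in every file
            continue
        dir0, dir1 = rel_path.split("/")[:2]
        totals[(dir0, dir1)] += parse_amount(amount)
    return dict(sorted(totals.items()))


if __name__ == "__main__":
    sample = [
        ("2008/03/2008-03-21_downloaded.csv", "AMOUNT"),
        ("2008/03/2008-03-21_downloaded.csv", "100.25"),
        ("2008/03/2008-03-22_downloaded.csv", "(0.25)"),
        ("2008/04/2008-04-01_downloaded.csv", "1,000"),
    ]
    print(monthly_totals(sample))
```

Here the directory names stand in for partition columns: no date field has to be parsed from the data itself to group by year and month.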

  • 2015 MapR Technologies 61

  • 2015 MapR Technologies 62

    Apache Drill

  • 2015 MapR Technologies 63

    Q & A @mapr_japan maprjapan

    [email protected]

    MapR

    maprtech

    mapr-technologies