self-service bi for big data applications using apache drill (big data & analytics insight at...


Upload: mats-uddenfeldt

Post on 17-Feb-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: Self-Service BI for Big Data Applications Using Apache Drill (Big Data & Analytics Insight at Ba4all, 2015-10-13)

© 2015 MapR Technologies 1© 2015 MapR Technologies

Self-service BI for big data applications using Apache Drill

Page 2

Data Is Doubling Every Two Years

Unstructured data will account for more than 80% of the data collected by organizations.

[Chart: total data stored compared with IT resources, 1980 to 2020; data categories labeled STRUCTURED DATA and SEMI-STRUCTURED DATA]

Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data

Page 3

Data Increasingly Stored in Non-Relational Datastores

Timeline: 1980, 1990, 2000, 2010, 2020

RELATIONAL DATABASES
• Volume: GBs-TBs
• Database: Fixed schema; DBA controls structure
• Structure: Structured
• Development: Planned (release cycle = months-years)

NON-RELATIONAL DATASTORES
• Volume: TBs-PBs
• Database: Dynamic / Flexible schema; Application controls structure
• Structure: Structured, semi-structured and unstructured
• Development: Iterative (release cycle = days-weeks)

Page 4

How To Bring SQL Into An Unstructured Future?

Familiarity of SQL
• SQL
• BI (Tableau, MicroStrategy, etc.)
• Low latency
• Scalability

Agility & Flexibility of NoSQL
• No schema management
– HDFS (Parquet, JSON, etc.)
– HBase
– …
• No transform or silos of data

Page 5

Industry's First Schema-free SQL engine for Big Data

Page 6

Apache Drill Brings Flexibility & Performance

Access to any data type, any data source

• Relational

• Nested data

• Schema-less

Rapid time to insights

• Query data in-situ

• No Schemas required

• Easy to get started

Integration with existing tools

• ANSI SQL

• BI tool integration

Scale in all dimensions

• TB-PB of scale

• 1000s of users

• 1000s of nodes

Granular Security

• Authentication

• Row/column level controls

• De-centralized

Page 7

Extending Self Service to Schema-free Data

[Chart: use cases for BI over time; vertical axis: Agility & Business Value]

• 1980s-1990s: IT-Driven BI (IT-created reports, spreadsheets)
• 2000s: IT-Driven BI plus Self-Service BI (analyst-driven with IT support for ETL)
• Now: IT-Driven BI and Self-Service BI plus Schema-Free Data Exploration (analyst-driven with no IT dependency)

Page 8

Enabling “As-It-Happens” Business with Instant Analytics

Governed approach: Hadoop data → Data modeling → Transformation → Data movement (optional) → Users
Total time to insight: weeks to months

Exploratory approach: Hadoop data → Users
Total time to insight: minutes

Also shown: New business questions; Source data evolution

Page 9

Drill’s Role in the Enterprise Data Architecture

Raw data

• JSON, CSV, ...

“Optimized” data

• Parquet, …

Centrally-structured data

• Schemas in Hive Metastore

Relational data

• Highly-structured data

Hive, Impala, Spark SQL

Oracle, Teradata

Exploration

(known and unknown questions)

Page 10

Business Benefits

Rapid time-to-value for business analysts:
SQL specialists and BI analysts can query any dataset, including complex nested data, instantly, versus waiting several weeks for data preparation by IT.

Efficiency with easy governance for IT:
IT can avoid unnecessary ETL cycles and schema maintenance activities, but still ensure governance through easy-to-deploy granular access controls.

Accelerated big data adoption for businesses:
Organizations can use the existing and large SQL talent base and tools to rapidly discover new business insights from big data.

Page 11

Quick Tour

Self-Service Data Exploration with Apache Drill

Page 12

Data is growing fast and is scattered across various silos:

Website click logs

• JSON files

Product database

• MapR-DB NoSQL

Customers

• CSV files

Page 13

Apache Drill: SQL in a Non-Relational World

WANT
• ANSI SQL
• BI (Tableau, MicroStrategy, etc.)
• Low latency
• Scalability
• Agility

DON’T WANT
• Create and maintain schemas in advance:
– HDFS (Parquet, JSON, etc.)
– HBase
– …
• Transform, copy, or move data

Page 14

Closing The Gap Between Different Datasources using Drill

Product database (NoSQL: HBase / MapR-DB)
• Prod_id
• Productname
• Category
• Price

Website click logs (JSON)
• Trans_id
• Sess_date
• Cust_id
• Device
• Prod_id
• Purch_flag

Customers (CSV)
• Cust_id
• Customername
• State
• Gender
• Agg_rev
• Age
• Membership
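As an illustration of closing this gap, the Python sketch below assembles a single Drill SQL statement that joins the three sources on Cust_id and Prod_id. It is a minimal sketch under stated assumptions, not the query used in the presentation: the file paths, the `maprdb.products` table name, the `details` column family, the use of the product ID as row key, and the position of each CSV field are all hypothetical. It relies on two Drill conventions: fields of a header-less CSV file are read through the `columns` array, and HBase-style binary values are decoded with CONVERT_FROM.

```python
def build_join_query(clicks_path: str, customers_path: str, products_table: str) -> str:
    """Return one Drill SQL statement that joins click logs, customers and products.

    Assumptions (hypothetical, not from the slides):
      - the CSV has no header row, with Cust_id in columns[0] and Customername in columns[1]
      - the NoSQL table uses Prod_id as its row key and a column family named `details`
    """
    return f"""
SELECT cust.columns[1] AS customername,
       CONVERT_FROM(prod.details.productname, 'UTF8') AS productname,
       clk.device,
       clk.purch_flag
FROM dfs.`{clicks_path}` AS clk
JOIN dfs.`{customers_path}` AS cust
  ON CAST(clk.cust_id AS VARCHAR) = cust.columns[0]
JOIN {products_table} AS prod
  ON CAST(clk.prod_id AS VARCHAR) = CONVERT_FROM(prod.row_key, 'UTF8')
""".strip()


# Example with placeholder locations; replace them with real workspace paths.
query = build_join_query("/data/clicks.json", "/data/customers.csv", "maprdb.products")
print(query)
```

The resulting string could be submitted through any Drill client, for example over ODBC or JDBC from a BI tool.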

Page 15

Demo

Page 16

In lieu of the live demonstration, please see the links below:

• Apache Drill with Tableau (4:28): https://www.youtube.com/watch?v=EH0_vRTAkyk

• Twitter analytics with Apache Drill and MicroStrategy (5:02): https://www.youtube.com/watch?v=-gqwgahtc2Y

• Analyzing JSON and Packet Data with SAP Lumira and Apache Drill: https://www.youtube.com/watch?v=s-fEATDI2wA

Page 17

Access control that scales

• PAM Authentication + User Impersonation

• Fine-grained row and column level access control with Drill views; no centralized security repository required

[Diagram: users access Drill View 1 and Drill View 2, which sit on top of Files, HBase and Hive]

Page 18

Granular security permissions through Drill views

Raw File (/raw/cards.csv)
Owner: Admins; Permission: Admins

Name | City | State | Credit Card #
Dave | San Jose | CA | 1374-7914-3865-4817
John | Boulder | CO | 1374-9735-1794-9711

Data Scientist View (/views/maskedcards.csv)
Owner: Admins; Permission: Data Scientists
Not a physical data copy

Name | City | State | Credit Card #
Dave | San Jose | CA | 1374-1111-1111-1111
John | Boulder | CO | 1374-1111-1111-1111

Business Analyst View
Owner: Admins; Permission: Business Analysts

Name | City | State
Dave | San Jose | CA
John | Boulder | CO
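The two views on this slide can be sketched in Python to show what each audience would see. This is a minimal illustration under stated assumptions, not the presenter's implementation: the `mask_card` helper, the `dfs.views` workspace and the column aliases are hypothetical, and the SQL string is only indicative of how a masking view might be declared in Drill with CREATE OR REPLACE VIEW.

```python
# Rows from the raw file shown on the slide (/raw/cards.csv).
RAW_ROWS = [
    ("Dave", "San Jose", "CA", "1374-7914-3865-4817"),
    ("John", "Boulder", "CO", "1374-9735-1794-9711"),
]


def mask_card(number: str, keep_groups: int = 1, filler: str = "1111") -> str:
    """Keep the first `keep_groups` dash-separated groups and overwrite the rest."""
    groups = number.split("-")
    return "-".join(groups[:keep_groups] + [filler] * (len(groups) - keep_groups))


# Data Scientist view: all columns, card number masked.
data_scientist_view = [(n, c, s, mask_card(cc)) for n, c, s, cc in RAW_ROWS]

# Business Analyst view: the credit card column is dropped entirely.
business_analyst_view = [(n, c, s) for n, c, s, _ in RAW_ROWS]

# Indicative Drill SQL for the masked view (workspace and aliases are assumptions).
MASKED_VIEW_SQL = """
CREATE OR REPLACE VIEW dfs.views.maskedcards AS
SELECT columns[0] AS `name`,
       columns[1] AS `city`,
       columns[2] AS `state`,
       CONCAT(SUBSTR(columns[3], 1, 4), '-1111-1111-1111') AS `credit_card`
FROM dfs.`/raw/cards.csv`
""".strip()

for row in data_scientist_view:
    print(row)
```

Because a view stores only the query, neither audience receives a physical copy of the raw file; access is governed by the permissions on the view itself.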

Page 19

Case Studies

Page 20

Self-Service Data Exploration

Direct access to any data store from familiar tools (ANSI SQL compatible)

Use cases: Raw Data Exploration, JSON Analytics, DWH Offload, …

Data sources: Files, Directories, Hive, HBase

File formats: {JSON}, Parquet, Text Files, …

Page 21

Data Warehouse Offload with Drill & MapR

Ultimately replace existing expensive SQL analytics platform with Hadoop

OBJECTIVES AND CHALLENGES
• Mine credit card data and compare consumer shopping habits
• Require internal SQL specialists to gain instant access to data at all times
• Want to preserve instant access to data but at a lower price point
• Need a system that is reliable, does not lose data and is fast
• Must be able to leverage the SQL skill sets in the company

SOLUTION
• Apache Drill allows interactive analysis on large datasets with MapR as the underlying platform that meets scale, reliability and data protection needs
• SQL users did not have to learn Pig, HiveQL or any other language and continue to use Tableau and SQuirreL on top of Drill

BUSINESS IMPACT POTENTIAL
• Hadoop and Drill dramatically reduce the price point to less than $1,000 / TB
• MapR platform with Drill delivers reliability and performance for the end users
• Leverage existing BI and SQL skill-sets on Hadoop without retraining

Page 22

Telecom OEM application with Drill & MapR

Leverage Drill’s JSON capabilities to create revenue-generating IoT services

OBJECTIVES AND CHALLENGES
• Offer service to mobile operators to proactively monitor and improve their subscriber experience
• Instant availability of data from diverse and disparate sources
• Data is very diverse and dynamic using JSON as the key format
• Require interactive, ad-hoc analysis capabilities via standard BI tools such as Tableau and Spotfire

SOLUTION
• Apache Drill is being used to build the engine for the interactive experience
• Drill allows SQL queries on incoming JSON structures natively without requiring any centralized schema definitions
• Drill connects to all BI tools using standard ODBC connectors

BUSINESS IMPACT POTENTIAL
• Provide new revenue-generating services to mobile operators
• Enable deeper, instant intelligence about the networks and users
• Reduce maintenance costs - no IT intervention required for schema changes
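To make the idea of querying JSON without a centralized schema concrete, here is a minimal Python sketch that prepares a request for Drill's REST endpoint (`POST /query.json` with a `queryType` and a `query` field). It is an illustration under stated assumptions rather than the application described in this case study: the host and port (`localhost:8047`, Drill's default web port), the file path and the nested field names are hypothetical. The request is only built, not sent.

```python
import json
import urllib.request


def build_drill_request(sql: str, base_url: str = "http://localhost:8047") -> urllib.request.Request:
    """Build (but do not send) an HTTP POST carrying a SQL query for Drill's REST API."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Nested JSON fields are addressed with dotted paths; no schema is declared up front.
# The path and the field names below are placeholders.
sql = (
    "SELECT t.subscriber.id AS subscriber_id, t.device.model AS device_model "
    "FROM dfs.`/data/network_events.json` AS t LIMIT 10"
)
request = build_drill_request(sql)
print(request.full_url)

# With a Drillbit running, urllib.request.urlopen(request) would submit the query.
```

If a new field appears in the incoming JSON, only the SQL text changes; no table definition has to be altered first.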

Page 23

Recap: Apache Drill Enables Self-Service SQL for Big Data

AGILITY: INSTANT INSIGHTS TO BIG DATA
• Direct queries on self-describing data
• No schemas or ETL required

FLEXIBILITY: ONE INTERFACE FOR HADOOP & NOSQL
• Query HBase and other NoSQL stores
• Use SQL to natively operate on complex data types (such as JSON)

FAMILIARITY: EXISTING SKILLS & TECHNOLOGIES
• Leverage ANSI SQL skills and BI tools
• Plug-n-play with Hive schema, file formats, UDFs

Page 24

Learn more and get started with Apache Drill

New to MapR and/or Drill?

– Get started with Free MapR On Demand training

– Test Drive Drill on cloud with Amazon EMR

– Learn how to use Drill with Hadoop using MapR sandbox

Ready to play with your data?

– Try out the "Apache Drill in 10 Minutes" guide on your desktop

– Download Drill for your MapR cluster and start exploration

• Use both with relational and JSON datasets

– Comprehensive tutorials and documentation available

Ask questions – [email protected]

Page 25

Thank You

@mapr maprtech

[email protected]@mapr.com

MapRTechnologies

maprtech

mapr-technologies

Page 26

Backup Slides

Page 27

MapR with Drill is Top-Ranked SQL-on-Hadoop

Source: Gigaom Research, 2015

Key:

• Number indicates company’s relative strength across all vectors

• Size of ball indicates company’s relative strength along individual vector

Like other vendors’ offerings, Drill handles BI and interactive queries with great aplomb, but it is designed to serve these workloads with data complexity that goes well beyond the flat structured data that other SQL-on-Hadoop systems deal with.

Page 28

SQL technologies available on MapR

Feature | Drill | Hive | Impala | Spark SQL
Key Use Cases | Self-service Data Exploration; Interactive BI / Ad-hoc queries | Batch / ETL / Long-running jobs | Interactive BI / Ad-hoc queries | SQL as part of Spark pipelines / Advanced analytic workflows
Data Sources: Files support | Parquet, JSON, Text, all Hive file formats | Yes (all Hive file formats) | Yes (Parquet, Sequence, RC, Text, AVRO …) | Parquet, JSON, Text, all Hive file formats
Data Sources: HBase/MapR-DB | Yes | Yes, performance issues | Yes, performance issues | Same as Hive
Data Sources: Beyond Hadoop | Yes | No | No | Yes
Data Types: Relational | Yes | Yes | Yes | Yes
Data Types: Complex/Nested | Yes | Limited | No | Limited
Metadata: Schema-less / Dynamic schema | Yes | No | No | Limited
Metadata: Hive Meta store | Yes | Yes | Yes | Yes
SQL / BI tools: SQL support | ANSI SQL | HiveQL | HiveQL | ANSI SQL (limited) & HiveQL
SQL / BI tools: Client support | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC
Beyond Memory | Yes | Yes | Yes | Yes
Optimizer | Limited | Limited | Limited | Limited
Platform: Latency | Low | Medium | Low | Low (in-memory) / Medium
Platform: Concurrency | High | Medium | High | Medium

Page 29

Key Reasons for Selecting MapR

Respondents who had prior experience with another Hadoop distribution*

* Apache Hadoop, Cloudera or Hortonworks

Page 30

MapR: The Only Platform Architected For Big, Fast, Reliable

[Diagram: Apache Hadoop and OSS ecosystem running on the MapR Data Platform]

APACHE HADOOP AND OSS ECOSYSTEM
• Section labels: EXECUTION ENGINES; DATA GOVERNANCE AND OPERATIONS
• Category labels: Batch; Streaming; NoSQL & Search; ML, Graph; SQL; Security; Workflow & Data Governance; Provisioning & coordination; Data Integration & Access
• Components shown: YARN, MapReduce v1 & v2, Tez, Spark, Pig, Cascading, Spark Streaming, Storm, HBase, Solr, Mahout, MLLib, GraphX, Hive, Impala, Spark SQL, Drill, Sentry, Oozie, ZooKeeper, Sqoop, Flume, HttpFS, Hue, Juju, Savannah

MapR Data Platform (Random Read/Write)
• MapR-FS (HDFS and NFS APIs)
• MapR-DB (High-Performance NoSQL)

Callouts
• More efficient use of infrastructure (30-50% lower TCO)
• First new database designed for operational real-time
• Your choice of SQL
• Industry’s only mirroring, point-in-time consistent snapshots
• Trillion files
• 2-11x faster
• Open source projects ‘inherit’ MapR’s platform attributes

Page 31

MapR: Best Solution for Customer Success

Premier Investors

High Growth
• 2X Growth In Direct Customers
• 90% Subscription Licenses
• Software Margins
• 140% Dollar-based Net Expansion
• 700+ Customers
• 2X Growth In Annual Subscriptions (ACV)

Best Product

Apache Open Source