nosql and sql - open analytics summit

7/28/2019 NoSQL and SQL - Open Analytics Summit

1/28

NoSQL and SQL Work Side-by-Side

to Tackle Real-time Big Data Needs

Allen Day

MapR Technologies


2/28

Me

Allen Day

Principal Data Scientist @ MapR

Human Genomics / Bioinformatics

(PhD, UCLA School of Medicine)

@allenday

[email protected]

[email protected]


3/28

You

I'm assuming that the typical attendee:

is a software developer

is interested in and familiar with open source

is familiar with Hadoop, relational DBs

has heard of or has used some NoSQL technology


4/28

Big Data Workloads

Offline

ETL

Model creation & clustering & indexing

Web Crawling

Batch reporting

Online

Lightweight OLTP

Classification & anomaly detection

Stream processing

Interactive reporting

SQL


5/28

What is NoSQL? Why use it?

Traditional storage (relational DBs) is unable to accommodate the increasing # and variety of observations

Culprits: sensors, event logs, electronic payments

Solution: stay responsive by relaxing ACID storage requirements

Denormalize (#)

Loosen schema (variety), loosen consistency

This is the essence of NoSQL


6/28

NoSQL Impact on Business Processes

Traditional business intelligence (BI) tech stack assumes relational DB storage

Company decisions depend on this (reports, charts)

NoSQL-collected data aren't in a relational DB

Data volume/variety is still increasing

Tech and methods are still in flux

Decoupled data storage and decision support systems

BI can't access freshest, largest data sets

Very high opportunity cost to business


7/28

Ideal Solution Features

Scalable & Reliable

Distributed replicated storage

Distributed parallel processing

BI application support

Ad-hoc, interactive queries

Real-time responsiveness

Flexible

Handles rapid storage and schema evolution

Handles new analytics methods and functions

Hadoop FS

Map/Reduce, YARN

SQL Interface

Extensible for NoSQL, Advanced Analytics


8/28

From Ideals to Possibilities

Migrate NoSQL data/processing to SQL

High cost to marshal NoSQL data to SQL storage

SQL systems lack advanced analytics capabilities

Migrate SQL data to NoSQL

Breaks compatibility for BI-dependent functions, e.g. financial reporting

Limited support for relational operations (joins); high latency

NoSQL tech is still in flux (continuity)

Other Approaches?

Yes. First, let's consider a SQL/NoSQL use case


9/28

Impala

Interactive Queries & Hadoop

low-latency


10/28

Example Problem: Marketing Campaign

Jane is an analyst at an e-commerce company

How does she figure out good targeting segments for the next marketing campaign?

She has some ideas and lots of data

User profiles

Transaction information

Access logs


11/28

Traditional System Solution 1: RDBMS

ETL the data from MongoDB and Hadoop into the RDBMS

MongoDB data must be flattened, schematized, filtered and aggregated

Hadoop data must be filtered and aggregated

Query the data using any SQL-based tool

User profiles

Access logs

Transaction information


12/28

Traditional System Solution 2: Hadoop

ETL the data from Oracle and MongoDB into Hadoop

MongoDB data must be flattened and schematized

Work with the MapReduce team to write custom code to generate the desired analyses

User profiles

Access logs

Transaction information


13/28

Traditional System Solution 3: Hive

ETL the data from Oracle and MongoDB into Hadoop

MongoDB data must be flattened and schematized

But HiveQL queries are slow and BI tool support is limited

Marshaling/Coding

User profiles

Access logs

Transaction information


14/28

What Would Google Do?

Distributed File System    NoSQL       Interactive analysis    Batch processing
GFS                        BigTable    Dremel                  MapReduce
HDFS                       HBase       ???                     Hadoop MapReduce

Build Apache Drill to provide a true open source solution to interactive analysis of Big Data


15/28

Apache Drill Overview

Interactive analysis of Big Data using standard SQL

Fast

Low latency queries

Complement native interfaces and MapReduce/Hive/Pig

Open

Community driven open source project

Under Apache Software Foundation

Modern

Standard ANSI SQL:2003 (select/into)

Nested data support

Schema is optional

Supports RDBMS, Hadoop and NoSQL

Interactive queries (data analyst, reporting): 100 ms-20 min, served by Apache Drill

Data mining, modeling, large ETL: 20 min-20 hr, served by MapReduce, Hive, Pig


16/28

How Does It Work?

[Diagram labels: Drill Client; Tableau, MicroStrategy, Crystal Reports; Drill ODBC Driver; Driver; Drillbit (Coordinator) with SQL Query Parser and Query Planner; three Drillbits (Executor)]

SELECT * FROM
  oracle.transactions,
  mongo.users,
  hdfs.events
LIMIT 1


17/28

How Does It Work?

Drillbits run on each node, designed to maximize data locality

Processing is done outside MapReduce paradigm (but possibly within YARN)

Queries can be fed to any Drillbit

Coordination, query planning, optimization, scheduling, and execution are distributed

SELECT * FROM
  oracle.transactions,
  mongo.users,
  hdfs.events
LIMIT 1
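
To make the idea of one SQL statement spanning several sources concrete, here is a minimal sketch that runs the slide's query text unchanged. It is not Drill: as an assumption for illustration, Python's built-in sqlite3 module stands in for the query engine, and three attached in-memory databases named oracle, mongo and hdfs stand in for the real sources. The table columns and sample values are made up.

```python
import sqlite3

# One connection stands in for the single entry point a client would query.
conn = sqlite3.connect(":memory:")

# Assumption: three separate in-memory databases play the roles of the
# oracle, mongo and hdfs sources named in the slide's query.
for source in ("oracle", "mongo", "hdfs"):
    conn.execute(f"ATTACH DATABASE ':memory:' AS {source}")

# Hypothetical columns and sample rows, one table per source.
conn.execute("CREATE TABLE oracle.transactions (user_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE mongo.users (user_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE hdfs.events (user_id INTEGER, url TEXT)")
conn.execute("INSERT INTO oracle.transactions VALUES (1, 19.99)")
conn.execute("INSERT INTO mongo.users VALUES (1, 'Jane')")
conn.execute("INSERT INTO hdfs.events VALUES (1, '/checkout')")

# The query text from the slide, unchanged.
query = """
SELECT * FROM
  oracle.transactions,
  mongo.users,
  hdfs.events
LIMIT 1
"""
row = conn.execute(query).fetchone()
print(row)  # (1, 19.99, 1, 'Jane', 1, '/checkout')
```

The point of the sketch is only the shape of the query: the source name acts as a namespace prefix, so the analyst writes one statement instead of three ETL jobs.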


18/28

Apache Drill: Key Features

Full ANSI SQL:2003 support

Use any SQL-based tool

Nested data support

Flattening is error-prone and often impossible

Schema-less data source support

Schema can change rapidly and may be record-specific

Extensible

DSLs, UDFs

Custom operators (e.g. k-means clustering)

Well-documented data source & file format APIs


19/28

How Does Impala Fit In?

Impala Strengths

Beta currently available

Easy install and setup on top of Cloudera

Faster than Hive on some queries

SQL-like query language

Questions

Open Source Lite

Lacks RDBMS support

Lacks NoSQL support beyond HBase

Early row materialization increases footprint and reduces performance

Limited file format support

Query results must fit in memory!

Rigid schema is required

No support for nested data

SQL-like (not SQL)

Many important features are coming soon.

Architectural foundation is constrained. No community development.


20/28

Drill Status: Alpha Available July

Heavy active development by multiple organizations

Contributors from Oracle, IBM Netezza, Informatica, Clustrix, Pentaho

Available

Logical plan syntax and interpreter

Reference interpreter

In progress

SQL interpreter

Storage engine implementations for Accumulo, Cassandra, HBase and various file formats

Significant community momentum

Over 200 people on the Drill mailing list

Over 200 members of the Bay Area Drill User Group

Drill meetups across the US and Europe

Beta: Q3


21/28

Why Apache Drill Will Be Successful

Resources

Contributors have strong backgrounds from companies like Oracle, IBM Netezza, Informatica, Clustrix and Pentaho

Community

Development done in the open

Active contributors from multiple companies

Rapidly growing

Architecture

Full SQL

New data support

Extensible APIs

Full Columnar Execution

Beyond Hadoop

Bottom Line: Apache Drill enables NoSQL and SQL to work side-by-side to tackle real-time Big Data needs


22/28

Me

Allen Day

Principal Data Scientist @ MapR

@allenday

[email protected] [email protected]


23/28


24/28

ADDITIONAL SLIDES


25/28

Full SQL (ANSI SQL:2003)

Drill supports SQL (ANSI SQL:2003 standard)

Correlated subqueries, analytic functions,

SQL-like is not enough

Use any SQL-based tool with Apache Drill

Tableau, MicroStrategy, Excel, SAP Crystal Reports, Toad, SQuirreL,

Standard ODBC and JDBC drivers

[Diagram labels: Tableau, MicroStrategy, Excel, SAP Crystal Reports; Drill ODBC Driver; Client, Driver; Drillbit (SQL Query Parser, Query Planner); Drillbits (Drill Workers)]


26/28

Nested Data

Nested data is becoming prevalent

JSON, BSON, XML, Protocol Buffers, Avro, etc.

The data source may or may not be aware

MongoDB supports nested data natively

A single HBase value could be a JSON document (compound nested type)

Google Dremel's innovation was efficient columnar storage and querying of nested data

Flattening nested data is error-prone and often impossible

Think about repeated and optional fields at every level

Apache Drill supports nested data

Extensions to ANSI SQL:2003

Avro

enum Gender {
  MALE, FEMALE
}

record User {
  string name;
  Gender gender;
  long followers;
}

JSON

{
  "name": "Homer",
  "gender": "Male",
  "followers": 100,
  "children": [
    {"name": "Bart"},
    {"name": "Lisa"}
  ]
}
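
The claim that flattening is error-prone can be illustrated with the Homer record from this slide. The sketch below is an assumption for illustration only: the flatten helper and the child_name column are hypothetical, not part of Drill. It turns the repeated children field into one row per child and shows how the parent's values get duplicated along the way.

```python
import json


def flatten(user):
    """Flatten one nested user document into relational-style rows.

    One row is produced per child. A user without children (the optional
    field case) still yields a single row with child_name set to None.
    """
    base = {
        "name": user.get("name"),
        "gender": user.get("gender"),
        "followers": user.get("followers"),
    }
    children = user.get("children") or [{}]
    return [dict(base, child_name=child.get("name")) for child in children]


doc = json.loads("""
{
  "name": "Homer",
  "gender": "Male",
  "followers": 100,
  "children": [{"name": "Bart"}, {"name": "Lisa"}]
}
""")

rows = flatten(doc)
for row in rows:
    print(row)

# The repeated field duplicates the parent columns, so a naive aggregate
# over the flattened rows double counts Homer's followers.
naive_total = sum(row["followers"] for row in rows)
print(naive_total)  # 200, although the document says 100
```

With a second repeated field at another level the row count multiplies again, which is the situation the slide asks the reader to think about.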


27/28

Schema is Optional

Many data sources do not have rigid schemas

Schemas change rapidly

Each record may have a different schema, may be sparse/wide

Apache Drill supports querying against unknown schemas

Query any HBase, Cassandra or MongoDB table

User can define the schema or let the system discover it automatically

System of record may already have schema information

No need to manage schema evolution
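
As a rough illustration of letting the system discover a schema, the sketch below scans sparse rows and collects the column qualifiers seen under each column family. The rows mirror the HBase-style example on this slide. The discover_schema function is a hypothetical helper written for this sketch; it says nothing about how Drill implements discovery.

```python
# Sparse, wide rows keyed by row key; each row carries only the columns it has.
rows = {
    "com.cnn.www": {
        "contents:html": "",
        "anchor:my.look.ca": "CNN.com",
        "anchor:cnnsi.com": "CNN",
    },
    "com.foxnews.www": {
        "contents:html": "",
        "anchor:en.wikipedia.org": "Fox News",
    },
}


def discover_schema(rows):
    """Return the set of column qualifiers observed per column family."""
    schema = {}
    for columns in rows.values():
        for column in columns:
            family, qualifier = column.split(":", 1)
            schema.setdefault(family, set()).add(qualifier)
    return schema


schema = discover_schema(rows)
for family in sorted(schema):
    print(family, sorted(schema[family]))
```

No record has to declare its columns up front: a new anchor qualifier appearing in a later row simply extends the discovered set.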

Row Key              CF contents           CF anchor
"com.cnn.www"        contents:html = ""    anchor:my.look.ca = "CNN.com"
                                           anchor:cnnsi.com = "CNN"
"com.foxnews.www"    contents:html = ""    anchor:en.wikipedia.org = "Fox News"


28/28

Flexible and Extensible Architecture

Apache Drill is designed for extensibility

Well-documented APIs and interfaces

Data sources and file formats

Implement a custom scanner to support a new source/format

Query languages

SQL:2003 is the primary language

Implement a custom Parser to support a Domain Specific Language

UDFs

Optimizers

Drill will have a cost-based optimizer

Clear surrounding APIs support easy optimizer exploration

Operators

Custom operators can be implemented (e.g. k-Means clustering)

Operator push-down to data source (RDBMS)
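
The custom scanner and operator push-down ideas can be sketched in a few lines. Everything named here (Scanner, PushdownScanner, execute, rows_returned) is hypothetical and invented for this illustration; it is not Drill's actual API. The sketch shows an engine that hands a filter to a source able to evaluate it, and filters on its own side otherwise.

```python
from typing import Callable, Dict, Iterator, List, Optional

Row = Dict[str, object]
Predicate = Callable[[Row], bool]


class Scanner:
    """Hypothetical scanner interface (an assumption, not Drill's API)."""

    supports_pushdown = False

    def __init__(self, rows: List[Row]) -> None:
        self.rows = list(rows)
        self.rows_returned = 0  # how many rows left the source

    def scan(self, predicate: Optional[Predicate] = None) -> Iterator[Row]:
        # A plain source ignores the predicate and returns every row.
        for row in self.rows:
            self.rows_returned += 1
            yield row


class PushdownScanner(Scanner):
    """Source that can filter by itself, standing in for an RDBMS."""

    supports_pushdown = True

    def scan(self, predicate: Optional[Predicate] = None) -> Iterator[Row]:
        for row in self.rows:
            if predicate is None or predicate(row):
                self.rows_returned += 1
                yield row


def execute(scanner: Scanner, predicate: Predicate) -> List[Row]:
    """Filter in the source when it can, otherwise filter in the engine."""
    if scanner.supports_pushdown:
        return list(scanner.scan(predicate))
    return [row for row in scanner.scan() if predicate(row)]


data = [{"user": "jane", "amount": 5}, {"user": "bob", "amount": 50}]
big_spender = lambda row: row["amount"] > 10

plain, smart = Scanner(data), PushdownScanner(data)
plain_result = execute(plain, big_spender)
smart_result = execute(smart, big_spender)
print(plain_result == smart_result)              # True
print(plain.rows_returned, smart.rows_returned)  # 2 1
```

Both paths return the same answer; the difference is how many rows leave the source, which is why push-down matters when the source is a remote RDBMS.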
