nosql and sql - open analytics summit
TRANSCRIPT
-
7/28/2019 NoSQL and SQL - Open Analytics Summit
NoSQL and SQL Work Side-by-Side
to Tackle Real-time Big Data Needs
Allen Day
MapR Technologies
-
Me
Allen Day
Principal Data Scientist @ MapR
Human Genomics / Bioinformatics
(PhD, UCLA School of Medicine)
@allenday
-
You
I'm assuming that the typical attendee:
is a software developer
is interested in and familiar with open source
is familiar with Hadoop and relational DBs
has heard of or has used some NoSQL technology
-
Big Data Workloads
Offline
ETL
Model creation & clustering & indexing
Web Crawling
Batch reporting
Online
Lightweight OLTP
Classification & anomaly detection
Stream processing
Interactive reporting
SQL
-
What is NoSQL? Why use it?
Traditional storage (relational DBs) is unable to accommodate the increasing number and variety of observations
Culprits: sensors, event logs, electronic payments
Solution: stay responsive by relaxing ACID storage requirements
Denormalize (number)
Loosen schema (variety), loosen consistency
This is the essence of NoSQL
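The "denormalize" and "loosen schema" points can be illustrated with a minimal Python sketch. The `users` and `orders` tables, their field names, and the `denormalize` helper are all hypothetical, invented for illustration; they do not come from any particular NoSQL product.

```python
# Hypothetical normalized (relational-style) tables.
users = [
    {"user_id": 1, "name": "Jane"},
    {"user_id": 2, "name": "Raj"},
]
orders = [
    {"order_id": 10, "user_id": 1, "total": 25.0},
    # Extra "coupon" field on one record only: a loosened schema.
    {"order_id": 11, "user_id": 1, "total": 40.0, "coupon": "SPRING"},
]


def denormalize(users, orders):
    """Embed each user's orders in the user record so reads need no join."""
    by_user = {}
    for order in orders:
        # Copy everything except the foreign key; fields may vary per record.
        embedded = {k: v for k, v in order.items() if k != "user_id"}
        by_user.setdefault(order["user_id"], []).append(embedded)
    return [dict(user, orders=by_user.get(user["user_id"], [])) for user in users]


documents = denormalize(users, orders)
```

Each resulting document can be read with a single lookup, at the cost of duplicating data and giving up a fixed column list.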
-
NoSQL Impact on Business Processes
Traditional business intelligence (BI) tech stack assumes relational DB storage
Company decisions depend on this (reports, charts)
NoSQL collected data aren't in relational DB
Data volume/variety is still increasing
Tech and methods are still in flux
Decoupled data storage and decision support systems
BI can't access freshest, largest data sets
Very high opportunity cost to business
-
Ideal Solution Features
Scalable & Reliable
Distributed replicated storage
Distributed parallel processing
BI application support
Ad-hoc, interactive queries
Real-time responsiveness
Flexible
Handles rapid storage and schema evolution
Handles new analytics methods and functions
Scalable & Reliable: Hadoop FS, Map/Reduce, YARN
BI application support: SQL Interface
Flexible: Extensible for NoSQL, Advanced Analytics
-
From Ideals to Possibilities
Migrate NoSQL data/processing to SQL
High cost to marshal NoSQL data to SQL storage
SQL systems lack advanced analytics capabilities
Migrate SQL data to NoSQL
Breaks compatibility for BI-dependent functions, e.g. financial reporting
Limited support for relational operations (joins), high latency
NoSQL tech is still in flux (continuity)
Other approaches? Yes. First let's consider a SQL/NoSQL use case
-
Impala
Interactive Queries & Hadoop
low-latency
-
Example Problem: Marketing Campaign
Jane is an analyst at an e-commerce company
How does she figure out good targeting segments for the next marketing campaign?
She has some ideas and lots of data
User profiles
Transaction information
Access logs
-
Traditional System Solution 1: RDBMS
ETL the data from MongoDB and Hadoop into the RDBMS
MongoDB data must be flattened, schematized, filtered and aggregated
Hadoop data must be filtered and aggregated
Query the data using any SQL-based tool
User profiles
Access logs
Transaction information
-
Traditional System Solution 2: Hadoop
ETL the data from Oracle and MongoDB into Hadoop
MongoDB data must be flattened and schematized
Work with the MapReduce team to write custom code to generate the desired analyses
User profiles
Access logs
Transaction information
-
Traditional System Solution 3: Hive
ETL the data from Oracle and MongoDB into Hadoop
MongoDB data must be flattened and schematized
But HiveQL queries are slow and BI tool support is limited
Marshaling/Coding
User profiles
Access logs
Transaction information
-
What Would Google Do?
Distributed File System   NoSQL      Interactive analysis   Batch processing
GFS                       BigTable   Dremel                 MapReduce
HDFS                      HBase      ???                    Hadoop MapReduce
Build Apache Drill to provide a true open source solution to interactive analysis of Big Data
-
Apache Drill Overview
Interactive analysis of Big Data using standard SQL
Fast
Low latency queries
Complement native interfaces and MapReduce/Hive/Pig
Open
Community driven open source project
Under Apache Software Foundation
Modern
Standard ANSI SQL:2003 (select/into)
Nested data support
Schema is optional
Supports RDBMS, Hadoop and NoSQL
Interactive queries (data analyst, reporting), 100 ms-20 min: Apache Drill
Data mining, modeling, large ETL, 20 min-20 hr: MapReduce, Hive, Pig
-
How Does It Work?
Drillbit (Coordinator)
SQL Query Parser
Query Planner
Drillbit (Executor)
Drillbit (Executor)
Drillbit (Executor)
SELECT * FROM
oracle.transactions,
mongo.users,
hdfs.events
LIMIT 1
Drill Client
Tableau
Drill ODBC Driver
MicroStrategy
Crystal Reports
Driver
-
How Does It Work?
Drillbits run on each node, designed to
maximize data locality
Processing is done outside MapReduce
paradigm (but possibly within YARN)
Queries can be fed to any Drillbit
Coordination, query planning, optimization,
scheduling, and execution are distributed
SELECT * FROM
oracle.transactions,
mongo.users,
hdfs.events
LIMIT 1
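The two ideas on this slide, that any node can accept a query and that work is placed next to the data, can be sketched as a toy model in Python. This is not Drill's planner or API: the node names, the `BLOCK_LOCATIONS` layout, and the `plan_fragments` and `submit` helpers are all hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical cluster layout: which node stores which data block.
BLOCK_LOCATIONS = {
    "events-0": "node-a",
    "events-1": "node-b",
    "events-2": "node-c",
    "events-3": "node-a",
}


def plan_fragments(block_locations):
    """Group blocks by the node that stores them, so each scan runs locally."""
    fragments = defaultdict(list)
    for block, node in sorted(block_locations.items()):
        fragments[node].append(block)
    return dict(fragments)


def submit(query, entry_node, block_locations):
    """Whichever node receives the query acts as its coordinator."""
    return {
        "query": query,
        "coordinator": entry_node,
        "fragments": plan_fragments(block_locations),
    }


plan_a = submit("SELECT * FROM hdfs.events LIMIT 1", "node-a", BLOCK_LOCATIONS)
plan_c = submit("SELECT * FROM hdfs.events LIMIT 1", "node-c", BLOCK_LOCATIONS)
```

In this model the fragment assignment depends only on where the blocks live, so the same work plan results no matter which node was chosen as the entry point.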
-
Apache Drill: Key Features
Full ANSI SQL:2003 support
Use any SQL-based tool
Nested data support
Flattening is error-prone and often impossible
Schema-less data source support
Schema can change rapidly and may be record-specific
Extensible
DSLs, UDFs
Custom operators (e.g. k-means clustering)
Well-documented data source & file format APIs
-
How Does Impala Fit In?
Impala Strengths
Beta currently available
Easy install and setup on top of Cloudera
Faster than Hive on some queries
SQL-like query language
Questions
Open Source Lite
Lacks RDBMS support
Lacks NoSQL support beyond HBase
Early row materialization increases footprint and reduces performance
Limited file format support
Query results must fit in memory!
Rigid schema is required
No support for nested data
SQL-like (not SQL)
Many important features are coming soon.
Architectural foundation is constrained. No community development.
-
Drill Status: Alpha Available July
Heavy active development by multiple organizations
Contributors from Oracle, IBM Netezza, Informatica, Clustrix, Pentaho
Available
Logical plan syntax and interpreter
Reference interpreter
In progress
SQL interpreter
Storage engine implementations for Accumulo, Cassandra, HBase and various file formats
Significant community momentum
Over 200 people on the Drill mailing list
Over 200 members of the Bay Area Drill User Group
Drill meetups across the US and Europe
Beta: Q3
-
Why Apache Drill Will Be Successful
Resources
Contributors have strong backgrounds from companies like Oracle, IBM Netezza, Informatica, Clustrix and Pentaho
Community
Development done in the open
Active contributors from multiple companies
Rapidly growing
Architecture
Full SQL
New data support
Extensible APIs
Full Columnar Execution
Beyond Hadoop
Bottom Line: Apache Drill enables NoSQL and SQL to Work Side-by-Side to Tackle Real-time Big Data Needs
-
Me
Allen Day
Principal Data Scientist @ MapR
@allenday
-
ADDITIONAL SLIDES
-
Full SQL (ANSI SQL:2003)
Drill supports SQL (ANSI SQL:2003 standard)
Correlated subqueries, analytic functions,
SQL-like is not enough
Use any SQL-based tool with Apache Drill
Tableau, MicroStrategy, Excel, SAP Crystal Reports, Toad, SQuirreL,
Standard ODBC and JDBC drivers
Diagram labels: Client; Tableau, MicroStrategy, Excel, SAP Crystal Reports; Drill ODBC Driver; Driver; Drillbit (SQL Query Parser, Query Planner); Drillbits; Drill Worker, Drill Worker
-
Nested Data
Nested data is becoming prevalent
JSON, BSON, XML, Protocol Buffers, Avro, etc.
The data source may or may not be aware
MongoDB supports nested data natively
A single HBase value could be a JSON document (compound nested type)
Google Dremel's innovation was efficient columnar storage and querying of nested data
Flattening nested data is error-prone and often impossible
Think about repeated and optional fields at every level
Apache Drill supports nested data
Extensions to ANSI SQL:2003
Avro
enum Gender {
  MALE, FEMALE
}
record User {
  string name;
  Gender gender;
  long followers;
}
JSON
{
  "name": "Homer",
  "gender": "Male",
  "followers": 100,
  "children": [
    {"name": "Bart"},
    {"name": "Lisa"}
  ]
}
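To make the "flattening is error-prone" point concrete, here is a minimal Python sketch that flattens the Homer record from this slide into one row per child. The `flatten` helper and the second record (`ned`, which has no `children` field) are hypothetical, added only to show what happens with repeated and optional fields.

```python
def flatten(record, repeated_field):
    """Emit one flat row per element of a repeated field."""
    parent = {k: v for k, v in record.items() if k != repeated_field}
    # Optional field: if it is missing or empty, still keep one parent row.
    children = record.get(repeated_field) or [None]
    rows = []
    for child in children:
        row = dict(parent)
        for key, value in (child or {}).items():
            row[f"{repeated_field}.{key}"] = value
        rows.append(row)
    return rows


homer = {
    "name": "Homer",
    "gender": "Male",
    "followers": 100,
    "children": [{"name": "Bart"}, {"name": "Lisa"}],
}
# Hypothetical record without the optional "children" field.
ned = {"name": "Ned", "gender": "Male", "followers": 7}

rows = flatten(homer, "children") + flatten(ned, "children")

# Parent values are duplicated once per child, so a naive sum over the
# flattened rows counts Homer's followers twice.
naive_total_followers = sum(row["followers"] for row in rows)
```

The two users have 107 followers in total, but the flattened rows sum to 207, and the rows no longer share a single column set. With a second repeated field, or nesting at deeper levels, the duplication compounds.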
-
Schema is Optional
Many data sources do not have rigid schemas
Schemas change rapidly
Each record may have a different schema, may be sparse/wide
Apache Drill supports querying against unknown schemas
Query any HBase, Cassandra or MongoDB table
User can define the schema or let the system discover it automatically
System of record may already have schema information
No need to manage schema evolution
Row Key             CF contents          CF anchor
"com.cnn.www"       contents:html = ""   anchor:my.look.ca = "CNN.com"
                                         anchor:cnnsi.com = "CNN"
"com.foxnews.www"   contents:html = ""   anchor:en.wikipedia.org = "Fox News"
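As a rough illustration of letting the system discover a schema, the following Python sketch unions the columns seen across rows modeled on the table above. The `discover_schema` helper and its output shape are hypothetical; this is not how Drill implements discovery, only a sketch of the idea that each record may carry its own columns.

```python
def discover_schema(records):
    """Union the fields seen across records, with observed types and optionality."""
    seen = {}
    for record in records:
        for field, value in record.items():
            info = seen.setdefault(field, {"types": set(), "count": 0})
            info["types"].add(type(value).__name__)
            info["count"] += 1
    return {
        field: {
            "types": sorted(info["types"]),
            # A field is optional if at least one record lacks it.
            "optional": info["count"] < len(records),
        }
        for field, info in seen.items()
    }


# Each row has its own set of anchor columns (sparse/wide).
rows = [
    {"row_key": "com.cnn.www", "contents:html": "",
     "anchor:my.look.ca": "CNN.com", "anchor:cnnsi.com": "CNN"},
    {"row_key": "com.foxnews.www", "contents:html": "",
     "anchor:en.wikipedia.org": "Fox News"},
]
schema = discover_schema(rows)
```

Here `row_key` and `contents:html` come out as required, while every `anchor:` column is optional because no single row contains all of them.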
-
Flexible and Extensible Architecture
Apache Drill is designed for extensibility
Well-documented APIs and interfaces
Data sources and file formats
Implement a custom scanner to support a new source/format
Query languages
SQL:2003 is the primary language
Implement a custom Parser to support a Domain Specific Language
UDFs
Optimizers
Drill will have a cost-based optimizer
Clear surrounding APIs support easy optimizer exploration
Operators
Custom operators can be implemented (e.g. k-Means clustering)
Operator push-down to data source (RDBMS)
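Operator push-down to an RDBMS can be sketched with Python's built-in `sqlite3` module standing in for the relational source. The `transactions` table, its rows, and both helper functions are hypothetical; the sketch only contrasts filtering in the query engine with letting the source evaluate the predicate, and does not reflect Drill's own interfaces.

```python
import sqlite3

# An in-memory SQLite database plays the role of the RDBMS data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, 5.0), (1, 120.0), (2, 300.0), (3, 8.5)],
)


def scan_then_filter(conn, min_amount):
    """No push-down: pull every row, then filter inside the query engine."""
    rows = conn.execute("SELECT user_id, amount FROM transactions").fetchall()
    return len(rows), [row for row in rows if row[1] >= min_amount]


def filter_pushed_down(conn, min_amount):
    """Push-down: the source evaluates the predicate and returns only matches."""
    rows = conn.execute(
        "SELECT user_id, amount FROM transactions WHERE amount >= ?",
        (min_amount,),
    ).fetchall()
    return len(rows), rows


transferred_full, result_full = scan_then_filter(conn, 100.0)
transferred_pushed, result_pushed = filter_pushed_down(conn, 100.0)
```

Both paths return the same matching rows, but the pushed-down version transfers only the matches out of the source instead of the whole table.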