SQL on Hadoop for Enterprise Analytics
TRANSCRIPT
![Page 1: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/1.jpg)
A LITTLE BIT OF HISTORY
Everything old is new again. SQL Forever.
![Page 2: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/2.jpg)
The story so far
Why hasn’t SQL died yet? It’s 2016 and we’re still using it?!
![Page 3: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/3.jpg)
Everything old is new again
Existing architecture keeps reappearing
It takes time to figure out what tools are right for what jobs
SQL is still the best tool for business analytics
![Page 4: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/4.jpg)
A long long time ago…
![Page 5: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/5.jpg)
Growing pains
Late 1990s
![Page 6: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/6.jpg)
Database problems
Database outage
Data integrity issues
Data latency
Late 1990s
![Page 7: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/7.jpg)
Master Slave
Late 1990s
![Page 8: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/8.jpg)
Transactions
Late 1990s
![Page 9: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/9.jpg)
Performance
Late 1990s
![Page 10: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/10.jpg)
By the time I graduated, SQL was on its last legs
2009
![Page 11: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/11.jpg)
Cache all the things!
2009
![Page 12: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/12.jpg)
Stop copying Twitter!
2009
![Page 13: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/13.jpg)
SQL golden age ends, NoSQL takes off
2010
Column Graph
Key-Value Document
![Page 14: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/14.jpg)
NoSQL
2010
![Page 15: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/15.jpg)
Awesome things about NoSQL
No SQL, normal languages as APIs!
Non relational!
FAST!
2010
![Page 16: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/16.jpg)
Remember ORMs?
~2000
![Page 17: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/17.jpg)
Active Record
~2000
![Page 18: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/18.jpg)
ORMs 👎
2011
![Page 19: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/19.jpg)
Remember EAV (Entity Attribute Value)?
1968
![Page 20: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/20.jpg)
Kind of looks like columns…
1968
![Page 21: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/21.jpg)
Modern EAV
2010
![Page 22: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/22.jpg)
Tedious to query
2010
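To illustrate why EAV data is tedious to query, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `eav`, its columns, and the sample rows are assumptions made for illustration; they are not taken from the slides. Rebuilding one ordinary row takes one conditional aggregate per attribute.

```python
import sqlite3

# In-memory database with a hypothetical EAV table (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eav (entity_id INTEGER, attribute TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO eav VALUES (?, ?, ?)",
    [
        (1, "name", "Ada"),
        (1, "plan", "gold"),
        (2, "name", "Bob"),
        (2, "plan", "basic"),
    ],
)

# Every attribute we want back as a column needs its own CASE expression.
pivot_sql = """
SELECT entity_id,
       MAX(CASE WHEN attribute = 'name' THEN value END) AS name,
       MAX(CASE WHEN attribute = 'plan' THEN value END) AS plan
FROM eav
GROUP BY entity_id
ORDER BY entity_id
"""
rows = conn.execute(pivot_sql).fetchall()
print(rows)
```

With a conventional table of `(entity_id, name, plan)` columns the same result would be a plain `SELECT`, which is the point of the slide.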
![Page 23: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/23.jpg)
Voila!
2010
![Page 24: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/24.jpg)
No joins is a feature!
2010
![Page 25: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/25.jpg)
NoSQL has some rough bumps
2010
![Page 26: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/26.jpg)
NoSQL has A LOT of rough bumps…
2011
![Page 27: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/27.jpg)
Throwback Thursday!
2011
![Page 28: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/28.jpg)
Lock the doors
2011
![Page 29: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/29.jpg)
MPP columnar DBs! Wait... SQL is back?!
2015
![Page 30: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/30.jpg)
SQL on Hadoop
2016
![Page 31: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/31.jpg)
A long long time ago…
![Page 32: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/32.jpg)
What’s next?
~2020?
![Page 33: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/33.jpg)
What’s next?
~2020?
“If you have an architecture where you’re periodically trying to dump from one system to the other and synchronize, you can simplify your life quite a bit by just putting your data in this storage system called Kudu.” – Todd Lipcon
![Page 34: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/34.jpg)
SQL is far past hype
![Page 35: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/35.jpg)
Fin
“If it ain’t broke, don’t fix it”
![Page 36: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/36.jpg)
CUSTOMER STORY
Building an event analytics pipeline using Hadoop and Spark
![Page 37: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/37.jpg)
Why Consider a Big Data Pipeline?
You are rapidly exceeding the limits of your existing database
Everything on your website can be analyzed.
Waiting until the next day isn’t for you
Data comes and goes to many places, and you want one process for it
![Page 38: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/38.jpg)
BIG DATA CULTURE
Summary data is not good enough
Company is mandating new technologies
You want to build a data-driven culture
Big SQL is the heart of a data-driven culture
![Page 39: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/39.jpg)
CASE STUDY
A major healthcare provider wants to create a web event pipeline with:
Massive scaling: scales up during periods of healthcare registration and new coverage start, and can dial back the rest of the year
Large data volumes: 10-15M customers’ worth of data
Speed: provides data for analysis in under 1 minute
AND utilizes existing in-house technologies (such as Cloudera Impala)
All events processed: page loads, registrations, logins, errors
![Page 40: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/40.jpg)
Solution: Build an event processing framework
Events
Event Collector
Hadoop?
![Page 41: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/41.jpg)
High Level Process
Events
Event Collector
Message Processing
HDFS
Looker
To be designed
![Page 42: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/42.jpg)
Why is Hadoop so hard?
Need to write in Java and Scala
We don’t have structure
Not easy to get data out into BI tools
Event collectors don’t tend to feed to HDFS out of the box
Typically follows a batch processing framework
![Page 43: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/43.jpg)
Ingestion mechanism
Our ideal ingestion would have three key aspects:
Low latency
In-flight transformation and processing
Ability to populate multiple destinations
![Page 44: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/44.jpg)
Spark vs Storm
Two of the major players in data streaming / processing
Spark
• Own master server
• Runs on HDFS
• Micro-batching
• Exactly-once delivery (eliminates vulnerability)
VS
Storm
• Not native to Hadoop
• Less developed
• One at a time
• ETL in flight
• Sub-second latency
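The difference between micro-batching and one-at-a-time processing can be sketched in plain Python. This is only an illustration of the two delivery styles, not Spark or Storm code; the function names, the `(timestamp, payload)` event shape, and the sample events are all assumptions.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, payload); hypothetical shape


def process_one_at_a_time(events: Iterable[Event],
                          handle: Callable[[Event], None]) -> None:
    """One-at-a-time style: hand each event to the handler as it arrives."""
    for event in events:
        handle(event)


def micro_batches(events: Iterable[Event],
                  batch_seconds: float) -> List[List[Event]]:
    """Micro-batch style: group events into fixed-width time windows."""
    windows: Dict[int, List[Event]] = defaultdict(list)
    for ts, payload in events:
        windows[int(ts // batch_seconds)].append((ts, payload))
    # Return the windows in time order.
    return [windows[key] for key in sorted(windows)]


events = [(0.5, "page_load"), (12.0, "login"), (61.0, "error"), (119.9, "page_load")]

seen: List[Event] = []
process_one_at_a_time(events, seen.append)

batches = micro_batches(events, batch_seconds=60)
print(len(seen), [len(batch) for batch in batches])
```

One-at-a-time handling keeps latency low per event, while batching trades a little latency for fewer, larger units of work.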
![Page 45: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/45.jpg)
Flume
Source Interceptor Selector Channel Sinks
Managed by the Flume Agent
Web Server
Web Server
Web Server
Web Server Investor Channel
HDFS
No in-flight transformation, so this just needs to meet workload
![Page 46: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/46.jpg)
KAFKA
Broker
Broker
Broker
Producer Broker Consumer
Producer
Producer
Spark Streaming
Other
ZooKeeper
Broker
![Page 47: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/47.jpg)
Flume vs. Kafka
Use both: out of the box with Flafka and native connectors
Flume
Kafka
Source
Spark
Custom connector
Custom connector
Flume Kafka Source Spark
![Page 48: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/48.jpg)
Storing the output
Our streaming Spark cluster consumes messages from Kafka. We batch these every minute into an HDFS cluster. We chose this because:
Data can be queried via Hive, Impala, or Spark SQL
Cloudera is our Enterprise choice
We can process a subset in-stream with MLlib or other machine learning algorithms
We can output summaries to other RDBMS systems
![Page 49: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/49.jpg)
Final Result
Events
Event Collector Kafka
Flume Spark SQL Cloudera
Other storage (RDBMS)
Other storage (logs)
![Page 50: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/50.jpg)
Pipeline Summary
Our pipeline provides several points of flexibility and meets our key priorities.
Add data at any point of the pipeline
Kafka, Flume, Impala, and Looker work together without many custom connectors
Pipeline includes additional sources like Teradata and Oracle
Add in-flight predictive model training and execution without significant additional processing time
![Page 51: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/51.jpg)
Priority #1: Scale
Kafka is easy to scale. As more volume comes in, adding new brokers can be automated using the Partition Reassignment Tool
By monitoring batch times in Looker on Spark SQL, we can alert when we need to scale up the cluster using Scheduled Looks
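The batch-time alert can be sketched as a simple threshold check. This is a hedged illustration only: the function name, the 80% threshold, and the five-batch window are assumptions, and in the pipeline itself the check runs as a scheduled Looker query rather than Python. The 60-second interval matches the one-minute batches used in this case study.

```python
from typing import List


def needs_scale_up(batch_seconds: List[float],
                   interval_seconds: float = 60.0,
                   threshold: float = 0.8,
                   window: int = 5) -> bool:
    """Return True when the recent average batch time nears the batch interval.

    If batches take almost as long to process as the interval between them,
    the cluster is close to falling behind. Threshold and window are
    illustrative defaults, not values from the slides.
    """
    recent = batch_seconds[-window:]
    if not recent:
        return False
    return sum(recent) / len(recent) > threshold * interval_seconds


print(needs_scale_up([20, 22, 25, 21, 23]))   # comfortably below the interval
print(needs_scale_up([50, 52, 55, 49, 58]))   # close to the 60-second interval
```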
![Page 52: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/52.jpg)
Priority #2: Flexibility
Different events can be parsed out to different Spark Streaming applications with Kafka topics (or another type of consumer)
Add more data at any point (Flume, Kafka producer, or directly to Spark)
Looker connects to wherever the data lands, as long as we can query it. Perform analysis IN CLUSTER
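Routing different event types to different Kafka topics can be sketched as a lookup. The topic names, the `type` field, and the fallback topic below are hypothetical; the four event types come from the case study. A real producer would send each event to the chosen topic instead of collecting them in a dictionary.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical mapping from event type to Kafka topic name.
TOPIC_BY_EVENT_TYPE = {
    "page_load": "web.page_loads",
    "registration": "web.registrations",
    "login": "web.logins",
    "error": "web.errors",
}
DEFAULT_TOPIC = "web.other"  # assumed catch-all for unknown event types


def topic_for(event: dict) -> str:
    """Pick the topic for one event, falling back to the catch-all topic."""
    return TOPIC_BY_EVENT_TYPE.get(event.get("type"), DEFAULT_TOPIC)


def route(events: List[dict]) -> Dict[str, List[dict]]:
    """Group events by destination topic so each consumer sees only its own."""
    routed: Dict[str, List[dict]] = defaultdict(list)
    for event in events:
        routed[topic_for(event)].append(event)
    return dict(routed)


sample = [
    {"type": "login", "user": "u1"},
    {"type": "error", "code": 500},
    {"type": "login", "user": "u2"},
    {"type": "heartbeat"},
]
routed = route(sample)
print({topic: len(items) for topic, items in routed.items()})
```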
![Page 53: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/53.jpg)
Priority #3: Speed
Analyzing the stream
Events per hour
Identify missing batches
Volume and Timing
Right sizing hardware
Duplicate events
And missing information
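These stream checks can be sketched with SQL. The example below uses Python's built-in `sqlite3` as a stand-in for Impala or Spark SQL; the `events` table, its columns, and the sample rows are assumptions, and string functions such as `substr` may differ between engines.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT, event_time TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("e1", "2016-05-01 10:00:05"),
        ("e2", "2016-05-01 10:00:40"),
        ("e2", "2016-05-01 10:00:40"),  # duplicate delivery
        ("e3", "2016-05-01 10:01:10"),
        ("e4", "2016-05-01 10:03:30"),  # nothing landed for 10:02
        ("e5", "2016-05-01 11:00:02"),
    ],
)

# Events per hour: truncate the timestamp to 'YYYY-MM-DD HH'.
per_hour = conn.execute(
    "SELECT substr(event_time, 1, 13) AS event_hour, COUNT(*) "
    "FROM events GROUP BY event_hour ORDER BY event_hour"
).fetchall()

# Duplicate events: any event_id that appears more than once.
duplicates = conn.execute(
    "SELECT event_id, COUNT(*) FROM events "
    "GROUP BY event_id HAVING COUNT(*) > 1 ORDER BY event_id"
).fetchall()


def missing_minutes(seen, start, end):
    """List the one-minute batches in [start, end] that have no events."""
    fmt = "%Y-%m-%d %H:%M"
    current = datetime.strptime(start, fmt)
    last = datetime.strptime(end, fmt)
    missing = []
    while current <= last:
        label = current.strftime(fmt)
        if label not in seen:
            missing.append(label)
        current += timedelta(minutes=1)
    return missing


seen_minutes = {
    row[0]
    for row in conn.execute("SELECT DISTINCT substr(event_time, 1, 16) FROM events")
}
gaps = missing_minutes(seen_minutes, "2016-05-01 10:00", "2016-05-01 10:03")
print(per_hour, duplicates, gaps)
```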
![Page 54: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/54.jpg)
Priority #4: In-house Technologies
Provide access to Hadoop/Impala via a centralized data hub: a single place to access web-based reports, explores, BI tools and code libraries
Enable users to ask questions and query web data without writing SQL or knowing about the pipeline
![Page 55: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/55.jpg)
Analyzing the stream
Looking for Lost data
=/=
![Page 56: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/56.jpg)
Analyzing the stream
By connecting Looker to various points in the stream we can verify complete loads:
• Impala SQL
• Source Logs
• Summary Reports
We also mask the location of information; one dashboard may show a variety of reliable sources.
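Verifying complete loads amounts to comparing counts taken at two points in the stream. A minimal sketch, assuming per-batch counts have already been pulled from each source into dictionaries; the function name, batch labels, and numbers are illustrative.

```python
from typing import Dict, List


def find_count_mismatches(source_counts: Dict[str, int],
                          landed_counts: Dict[str, int]) -> List[str]:
    """Return batch labels whose event counts differ between two stages.

    A batch missing from either side is treated as a count of zero, so
    lost batches show up as mismatches too.
    """
    mismatched = []
    for batch in sorted(set(source_counts) | set(landed_counts)):
        if source_counts.get(batch, 0) != landed_counts.get(batch, 0):
            mismatched.append(batch)
    return mismatched


# Hypothetical counts per one-minute batch from source logs and from Impala.
source_logs = {"10:00": 120, "10:01": 95, "10:02": 101}
impala = {"10:00": 120, "10:01": 90}

mismatches = find_count_mismatches(source_logs, impala)
print(mismatches)
```

Here the `10:01` batch lost events and the `10:02` batch never landed, so both are flagged.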
![Page 57: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/57.jpg)
Other uses and benefits
Match data in flight to find bad user accounts
In-flight alerts for missing data
Analysis without needing to know the location in the stream
SQL on Hadoop BI solution doesn’t require new skillset
![Page 58: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/58.jpg)
THANK YOU!
![Page 59: SQL on Hadoop for Enterprise Analytics](https://reader037.vdocuments.net/reader037/viewer/2022103010/58a88dcd1a28ab68208b4ab7/html5/thumbnails/59.jpg)
Sources
http://www.slideshare.net/Dataversity/thu-1200-penchikalasrinicolor
http://seldo.com/weblog/2011/08/11/orm_is_an_antipattern
http://mashable.com/2010/10/04/foursquare-downtime/#aPh4mhYxLSq6
http://blogs.adobe.com/security/files/2011/04/NoSQL-But-Even-Less-Security.pdf?file=2011/04/NoSQL-But-Even-Less-Security.pdf
http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb
https://www.percona.com
http://techcrunch.com/
http://mashable.com/