4 ways to scale interactive bi and analytics on a data lake
TRANSCRIPT
© 2017 MapR Technologies 1
4 Ways to Scale Interactive BI and Analytics on a Data Lake
Sameer Nori, Saurabh Mahapatra, MapR
Steve Wooledge, Priyank Patel, Arcadia Data
April 5th, 2017
Agenda
• Market Trends & Data Lakes
• MapR Platform, Customer Usage & Apache Drill
• The Pros and Cons of Four Big Data BI Methods: BI Servers, Fast SQL Engines, Cubes, and Data Native
• How Arcadia Data Integrates with MapR
MARKET TRENDS AND DATA LAKES
Big Data Deployment Stage
What and Why Data Lakes?
• Customers looking to establish next-gen applications/analytics platform
• Capturing large volumes of new data
  – Machine/App logs
  – Social data
  – IoT
• Put all data in play
  – Near-line storage for cold data
  – Maintain access and query capability
• Bridging data silos
  – Aggregating data sources across business units
• Regulatory requirements
Increasingly More Intelligent Applications
Scale
Big data storage with commodity economics
Data warehouse offload
Data lake
Data hub
Batch analytics
Customer 360
Real-time monitoring
Operational analytics
IoT monitoring
SIEM
Recommendation engines
Anomaly detection
Predictive analytics
Fraud detection
Self-service analytics
Machine/deep learning
High-frequency decisions
Connected car
Autonomous driving
Disruptive innovative applications
COST REDUCTION
CAPABILITY
COMPLEXITY
MAPR PLATFORM, CUSTOMER USAGE & APACHE DRILL
MapR Converged Data Platform (architecture diagram)
Open Source Engines & Tools, Commercial Engines & Applications
Custom Apps, Cloud and Managed Services, Search and Others
Data Processing
Enterprise-Grade Platform Services
Web-Scale Storage (MapR-FS), Database (MapR-DB), Event Streaming (MapR Streams)
Real Time, Unified Security, Multi-tenancy, Disaster Recovery, Global Namespace, High Availability
Unified Management and Monitoring
HDFS API, POSIX, NFS, HBase API, OJAI API, Kafka API
Data Lake Architecture
Sources: RELATIONAL, SAAS, MAINFRAME; LOG FILES, CLICKSTREAMS; BLOGS, TWEETS, LINK DATA
MapR-DB: time series, structured data
MapR-FS: emails, blogs, tweets, log files, unstructured data
NFS/Sqoop/Flume: pure log files
Agile, self-service data exploration
ETL into operational reporting formats (e.g., Parquet)
Multi-tenancy: job/data placement control, volumes
Access controls: file, table, column, column family, doc, sub-doc levels
Auditing: compliance, analyze user accesses
Snapshots: track data lineage and history
Table Replication: global multi-master, business continuity
MapR Converged Data Platform
Web-Scale Storage (MapR-FS), Database (MapR-DB), Event Streaming (MapR Streams)
UHG Data Lake Architecture
DATA SOURCES
SMARTUGAPNDBDeath MasterODARDetectsSanction ProviderSEAL-TOPS837iPAPICDB… and more.
Data Movement, Data Access
Ingest, ETL, Batch Processing: Sqoop, NFS, Hive, Pig
Interactive SQL: Drill
Web-Scale Storage (MapR-FS), Database (MapR-DB), Event Streaming (MapR Streams)
MapR Converged Data Platform
OPERATIONAL APPS, MEMBER 360, CALL CENTER
VISUALIZATION
ANALYTICS
SEARCH
Insight as Service Apps
Transunion PRAMA Self-Service Analytics
“What started as an IT-led cost-reduction project focused on operational savings has turned into a strategic platform.”
“It’s like a Swiss Army knife. We’re doing data processing, interactive SQL, and statistical algorithms on the data. We can try different avenues and make a business case. It starts loose and branches out into new businesses and new revenue streams.”
Kevin McClowry, Director of Analytic Solution Development
Getting Value from the Data Lake
MAPR-FS: WEB SCALE STORAGE
BI Users
JSON
SQL Users
Parquet Delimited Text
?
Getting Value from the Data Lake
MAPR-FS: WEB SCALE STORAGE
BI Users
JSON Parquet Delimited Text
?
APACHE DRILL
SQL Users
Apache Drill: Unified SQL Layer
• In-memory SQL execution engine
• Built from the ground up
  – Query Hadoop native formats
  – Leverages Hadoop ecosystem components – YARN, ZooKeeper, Hive
• ANSI SQL syntax to query anything on Hadoop
• Storage plugins
  – Library of existing storage plugins: MapR-DB, MapR-FS, Hive
  – Custom storage plugins can be developed
• Industry standard APIs
  – ODBC/JDBC, ANSI SQL makes it easy for BI integration
  – REST
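As a rough illustration of the REST access path listed on this slide, the sketch below builds a request for Drill's REST query endpoint (`/query.json`) that runs SQL directly against a raw JSON file through the `dfs` storage plugin. The host, port, and file path are assumptions chosen for illustration, not values from the presentation, and the request is only constructed, not sent.

```python
import json
import urllib.request

# Assumed values for illustration only: 8047 is Drill's default web/REST port,
# and the host below is a placeholder for a real Drillbit address.
DRILL_URL = "http://localhost:8047/query.json"


def build_drill_request(sql: str, url: str = DRILL_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for Drill's REST query endpoint."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Query a raw JSON file in place through the dfs storage plugin.
# The path is hypothetical; no table definition is created beforehand.
sql = "SELECT * FROM dfs.`/data/logs/clicks.json` LIMIT 10"
request = build_drill_request(sql)

# To run against a live Drillbit, send the request and read the result rows:
#   with urllib.request.urlopen(request) as resp:
#       rows = json.load(resp)["rows"]
```

The same SQL text could be submitted from a BI tool over ODBC/JDBC; the REST form is shown here only because it needs nothing beyond the Python standard library.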
Drill’s Data Model is Flexible
Formats shown: JSON, BSON; HBase; Parquet, Avro; CSV, TSV
Axis labels: Flexibility (fixed schema to dynamic schema); Flat to Complex
RDBMS/SQL-on-Hadoop table:
Name      Gender  Age
Michael   M       6
Jennifer  F       3
Apache Drill table:
{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
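The contrast between the two table examples on this slide can be sketched in plain Python. This is only an illustration of the data shapes, not of Drill internals: the helper names `field_paths` and `flatten_hobbies` are hypothetical, and the flattening step is merely similar in spirit to what Drill's `FLATTEN` function does for repeated fields.

```python
# Fixed-schema, flat rows: every record has exactly the same columns.
columns = ("name", "gender", "age")
rows = [("Michael", "M", 6), ("Jennifer", "F", 3)]

# Self-describing documents: fields can nest, repeat, and differ per record.
documents = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]


def field_paths(value, prefix=""):
    """Return the set of dotted field paths present in one document."""
    if isinstance(value, dict):
        paths = set()
        for key, child in value.items():
            paths |= field_paths(child, f"{prefix}.{key}" if prefix else key)
        return paths
    return {prefix}


def flatten_hobbies(docs):
    """Emit one (first_name, hobby) pair per element of the hobbies list."""
    return [(d["name"]["first"], h) for d in docs for h in d.get("hobbies", [])]


per_doc_paths = [field_paths(d) for d in documents]
shared = set.intersection(*per_doc_paths)        # fields every document has
optional = set.union(*per_doc_paths) - shared    # fields only some documents have
pairs = flatten_hobbies(documents)
```

Here `optional` ends up holding `district` and `preschool`, the fields that a fixed-schema table would have to model as extra nullable columns ahead of time.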
Improving Performance of Known Workloads
• Interactive needs
  – Sub-second response times
• Drill is designed for fast interactive SQL workloads
  – Run them ASAP, fail fast, move on
  – No assumptions on what lies underneath and what’s coming in
    • Schema could change
    • Files could change
    • Queries could change
• How do you improve performance of well-understood workloads with high concurrency?
4 Ways to Scale BI on Data Lakes
Steve Wooledge
Priyank Patel
• Founding team from Teradata, Aster, 3PAR, IBM DB2
• On-cluster visual analytics and BI
• Large production customers in the Fortune 500
• 100s of companies use Arcadia Instant & Arcadia Enterprise
• Access all your data for agile enterprise-wide BI
Create business value from Big Data
– OUR FOUNDING VISION –
Four Approaches for Big Data Analytics
Axes: Scale, Agility
Separate BI Server (move data to BI server): Summary data only.
Strategy 1 of 4: Separate BI Server
Move Data to BI Server
BI & Visualization Server
Pros
• Least costly
• Use existing BI tools
Cons
✘ Shallow insights – summary data
✘ Requires IT/DBA: new views & data movement
✘ Separate security models
✘ Not real-time: batch data updates
✘ No access to unstructured data
✘ Heaviest burden on network
Four Approaches for Big Data Analytics
Axes: Scale, Agility
Separate BI Server (move data to BI server): Summary data only.
Fast SQL + BI Tools (ODBC/JDBC, Hive, Spark, Impala, Drill …) with BI Server: Simple SQL. 1-5 users.
Strategy 2 of 4: Fast SQL + BI Tools
Pros
• Can get detailed data with skilled data scientists/engineers
• Performs better than direct connect
Cons
✘ Lower user concurrency
✘ Lacks in-Hadoop advanced analytics
✘ Cannot access unstructured data (requires schema)
✘ Cost - Manage security in multiple tools, separate administration for metadata
(ODBC, JDBC, Hive, Spark, Drill, Impala …)
BI & Visualization Server
Four Approaches for Big Data Analytics
Axes: Scale, Agility
Separate BI Server (move data to BI server): Summary data only.
Fast SQL + BI Tools (ODBC/JDBC, Hive, Spark, Impala, Drill …) with BI Server: Simple SQL. 1-5 users.
Middleware Application Cubes (Edge Node): Static cubes only - No granular data access.
Strategy 3 of 4: Middleware Application Cubes
Pros
• Use existing BI tools
• Higher user concurrency
Cons
✘ Lacks ad-hoc freedom - Requires IT/DBA for new views
✘ Not real-time: batch data updates
✘ Lacks in-Hadoop advanced analytics
✘ Cannot access unstructured data (requires schema)
✘ Cost – Multiple tools & data duplication
✘ Increased administration – Separate security models, administration
Static View on Edge Node
Four Approaches for Big Data Analytics
Axes: Scale, Agility
Separate BI Server (move data to BI server): Summary data only.
Fast SQL + BI Tools (ODBC/JDBC, Hive, Spark, Impala, Drill …) with BI Server: Simple SQL. 1-5 users.
Middleware Application Cubes (Edge Node): Static cubes only - No granular data access.
Data-Native Visual Analytics (Native in-Cluster): Real-time & dynamic - 100s to 1000s of users.
Strategy 4 of 4: Data-Native Visual Analytics & Apps
Pros
• Greatest user concurrency
• Linear scalability
• Agility for analysts (drill to detail)
• Supports complex data sources
• Real time
• In-Hadoop advanced analytics
• Lowest TCO: simplified architecture
Cons
✘ Newer technology and approach
✘ Requires some Hadoop skills to set up and maintain
Data-Native BI & Analytics
Big Data Analytics: Alternatives
Capability | Separate BI Server | Hadoop SQL Engines + BI Tool | Big Data “Cubes” | Data-Native Visual Analytics
Dashboards and reporting | ✓ | ✓ | ✓ | ✓
Real-time visualizations | ✘ | ✘ | ✘ | ✓
Data Applications | ✘ | ✘ | ✘ | ✓
High user concurrency | ✓ | ✘ | ✓ | ✓
Ad-hoc drill to detail | ✘ | -- | ✘ | ✓
In-Hadoop advanced analytics (e.g., customer engagement flows, micro-segmentation) | ✘ | ✘ | ✘ | ✓
Multi-structured data access (e.g. NoSQL, S3, files, search) | -- | ✓ | ✘ | ✓
Unified Security | ✘ | ✘ | ✘ | ✓
Unified Administration | ✘ | ✘ | ✘ | ✓
Lower TCO | ✘ | ✘ | ✘ | ✓
Forrester Hadoop Native BI Wave
“Put Your BI right where Your Data Is”
Recognizing the need to move BI processing next to where the data is (in Hadoop)
“Other BI vendors will surely follow this trend; it’s a question of when, not if”
Traditional BI and Visualization tools – MicroStrategy, Tableau, QlikTech are not native to Hadoop
Arcadia Data scored highest (5/5) for Hadoop/Spark architecture
Source: Forrester Wave: Native Hadoop BI Platforms, Q3, 2016
Forrester Wave: Native Hadoop BI Platforms
Data Preparation Capabilities
User Interface
Source: Forrester Wave: Native Hadoop BI Platforms, Q3, 2016
Businesses are Hamstrung with Legacy BI on Data Lakes
Data summarization
Big data fidelity loss
No collaboration
Higher security risk
Operational complexity
High TCO - multiple systems
BI/VIZ TOOLS
BI/SERVER (CUBES)
DATA MART (EXTRACTS)
DATA WAREHOUSE (EXTRACTS)
ALL DATA
<EXTRACTS>
DATA USERS/ANALYSTS
Operational Data Sources: Machine Data, OLTP, CRM, Marketing Automation, Product / App Logs, Web Logs
Data Lake
Arcadia Makes it Simple
Operational Data Sources: Machine Data, OLTP, CRM, Marketing Automation, Product / App Logs, Web Logs
Data Lake
NO DATA MOVEMENT
High-definition data analysis
Collaborative, real-time insights on data
More secure
Lower TCO and complexity
MapR Converged Data Platform (architecture diagram)
BI & Viz
Analytical Views™
Custom Apps, Cloud and Managed Services, Search and Others
Data Processing
Web-Scale Storage (MapR-FS), Database (MapR-DB), Event Streaming (MapR Streams)
Unified Management and Monitoring
The First Data-Native Visual Analytics Platform for Big Data
Arcadia Visualization Engine
Arcadia Analytic Platform (Smart Acceleration™)
On-Premises
Drag-and-drop Visual Analytics & Dashboards
Hybrid
Cloud
Custom Data Applications
…
BIG DATA OS: Distributed execution, data storage, metadata, security
IN-CLUSTER ANALYTICS ENGINE: Scales linearly with cluster for speed and easier management
WEB-BASED INTERFACE: Drag & drop interface for visual analytics & app workflow
Data Platform
Linearly Scale Production Workloads with Smart Acceleration
MapR Converged Data Platform Cluster
Results (100x Faster)
Consumption Layer
Processing Layer
Smart Acceleration™
1. Start with exploration of raw data, no need to determine design of acceleration structures such as cubes ahead of time
2. Recommendation engine generates Analytical Views (AVs, derived forms of raw data) based on dynamic data usage within the cluster
3. Re-routes data queries to AVs transparently, providing automated acceleration when needed for production/high-concurrency uses
Automatically modeled and maintained within cluster
Keep logical data models simple without needing to target specific data cube structures
Queries
Queries automatically redirected
Analytical Views
Recommendation Engine
Stores Derived Forms of Raw Data in Cluster
Raw Data in MapR
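The three Smart Acceleration steps on this slide can be pictured with a small sketch. This is not Arcadia's implementation: the classes, the `route` and `recommend` functions, the hit threshold, and the rule of picking the smallest covering view are all assumptions made for illustration, and measures and filters are ignored to keep it short.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalyticalView:
    """A hypothetical pre-aggregated, derived form of the raw data."""
    name: str
    dimensions: frozenset
    row_count: int  # smaller views are assumed cheaper to scan


@dataclass(frozen=True)
class Query:
    """A query reduced to the set of dimensions it groups or filters by."""
    dimensions: frozenset


def route(query, views, raw_table="raw_data"):
    """Return the cheapest view covering the query's dimensions, else the raw table."""
    covering = [v for v in views if query.dimensions <= v.dimensions]
    if not covering:
        return raw_table
    return min(covering, key=lambda v: v.row_count).name


def recommend(query_log, views, min_hits=3):
    """Suggest dimension sets that are queried often but not covered by any view."""
    hits = Counter(q.dimensions for q in query_log)
    return [
        dims for dims, count in hits.most_common()
        if count >= min_hits and not any(dims <= v.dimensions for v in views)
    ]


views = [
    AnalyticalView("av_region_day", frozenset({"region", "day"}), 10_000),
    AnalyticalView("av_region", frozenset({"region"}), 50),
]
# Step 1: users explore raw data freely; their queries are logged.
log = [Query(frozenset({"store", "day"}))] * 4 + [Query(frozenset({"region"}))] * 5

# Step 2: frequently used but uncovered patterns become view candidates.
candidates = recommend(log, views)

# Step 3: covered queries are redirected without the user changing anything.
target = route(Query(frozenset({"region"})), views)
```

In this toy run, the `{store, day}` pattern is recommended as a new view, while a query on `region` alone is redirected to the smaller `av_region` view instead of the raw table.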
Campaign Analysis Application
Understand high-level metrics with the ability to drill down to details
Augment analysis with a variety of data types & sources such as actual display ad images
Retail Store Geo Analysis
YoY growth metrics plotted by county for the chosen sub-brand
Trellising allows for quick trend analysis across multiple stores.
Here showing store sales vs trade area sales to correlate potential shifts in buying pattern
Choose a specific state to drill down to county level
Retail Stores Drill Down
Interactive maps allow for easy visualization of spatial data and zooming into details
PRODUCT DEMONSTRATION
Relational, Real Time, or NoSQL Connections
Resources
Engage with us!
1. Read the Data Lake & Analytics Ebook
https://mapr.com/definitive-guide-bi-and-analytics-data-lake/
2. Forrester Wave: Native Hadoop BI
www.arcadiadata.com/lp/forrester-wave-hadoop-bi-research-report/
3. Get Started
• MapR Converged Data Platform
https://www.mapr.com/get-started-with-mapr
• Arcadia Instant
www.arcadiadata.com/download