4 Ways to Scale Interactive BI and Analytics on a Data Lake

© 2017 MapR Technologies. 4 Ways to Scale Interactive BI and Analytics on a Data Lake. Sameer Nori, Saurabh Mahapatra, MapR; Steve Wooledge, Priyank Patel, Arcadia Data. April 5th, 2017

Upload: mapr-technologies

Post on 12-Apr-2017


TRANSCRIPT

Page 1: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

4 Ways to Scale Interactive BI and Analytics on a Data Lake

Sameer Nori, Saurabh Mahapatra, MapR
Steve Wooledge, Priyank Patel, Arcadia Data
April 5th, 2017

Page 2: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Agenda

• Market Trends & Data Lakes

• MapR Platform, Customer Usage & Apache Drill

• The Pros and Cons of Four Big Data BI Methods: BI Servers, Fast SQL Engines, Cubes, and Data Native

• How Arcadia Data Integrates with MapR

Page 3: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

MARKET TRENDS AND DATA LAKES

Page 4: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Big Data Deployment Stage

Page 5: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

What and Why Data Lakes?

• Customers looking to establish a next-gen applications/analytics platform

• Capturing large volumes of new data
  – Machine/app logs
  – Social data
  – IoT

• Put all data in play
  – Near-line storage for cold data
  – Maintain access and query capability

• Bridging data silos
  – Aggregating data sources across business units

• Regulatory requirements

Page 6: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Increasingly More Intelligent Applications

[Figure: use cases plotted against axes labeled Scale, Cost Reduction, Capability, and Complexity]

Big data storage with commodity economics

• Data warehouse offload
• Data lake
• Data hub
• Batch analytics
• Customer 360
• Real-time monitoring
• Operational analytics
• IoT monitoring
• SIEM
• Recommendation engines
• Anomaly detection
• Predictive analytics
• Fraud detection
• Self-service analytics
• Machine/deep learning
• High-frequency decisions
• Connected car
• Autonomous driving

Disruptive innovative applications

Page 7: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

MAPR PLATFORM, CUSTOMER USAGE & APACHE DRILL

Page 8: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

[Figure: MapR Converged Data Platform architecture]

• Open Source Engines & Tools
• Commercial Engines & Applications
• Custom Apps
• Cloud and Managed Services
• Search and Others
• Data Processing
• Web-Scale Storage (MapR-FS), Database (MapR-DB), Event Streaming (MapR Streams)
• APIs: HDFS API, POSIX, NFS, HBase API, OJAI API, Kafka API
• Enterprise-Grade Platform Services: High Availability, Real Time, Unified Security, Multi-tenancy, Disaster Recovery, Global Namespace
• Unified Management and Monitoring

MapR Converged Data Platform

Page 9: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Data Lake Architecture

• Sources: relational, SaaS, mainframe; log files, clickstreams; blogs, tweets, link data
• NFS/Sqoop/Flume: pure log files
• MapR-DB: time series, structured data
• MapR-FS: emails, blogs, tweets, log files, unstructured data
• Agile, self-service data exploration
• ETL into operational reporting formats (e.g., Parquet)
• Multi-tenancy: job/data placement control, volumes
• Access controls: file, table, column, column family, doc, sub-doc levels
• Auditing: compliance, analyze user accesses
• Snapshots: track data lineage and history
• Table Replication: global multi-master, business continuity

MapR Converged Data Platform: Web-Scale Storage (MapR-FS), Database (MapR-DB), Event Streaming (MapR Streams)

Page 10: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

DATA SOURCES

SMARTUGAPNDBDeath MasterODARDetectsSanction ProviderSEAL-TOPS837iPAPICDB… and more.

UHG Data Lake Architecture

• Data Movement: Sqoop, NFS
• Ingest, ETL, Batch Processing: Hive, Pig
• Interactive SQL: Drill
• Data Access: operational apps, Member 360, call center; visualization; analytics; search

MapR Converged Data Platform: Web-Scale Storage (MapR-FS), Database (MapR-DB), Event Streaming (MapR Streams)

Page 11: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Insight as Service Apps

Transunion PRAMA Self-Service Analytics

“What started as an IT-led cost-reduction project focused on operational savings has turned into a strategic platform.”

“It’s like a Swiss Army knife. We’re doing data processing, interactive SQL, and statistical algorithms on the data. We can try different avenues and make a business case. It starts loose and branches out into new businesses and new revenue streams.”

Kevin McClowry, Director of Analytic Solution Development

Page 12: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Getting Value from the Data Lake

[Figure labels: BI Users; SQL Users; "?"; MapR-FS: Web-Scale Storage (JSON, Parquet, Delimited Text)]

Page 13: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Getting Value from the Data Lake

[Figure labels: BI Users; "?"; SQL Users; Apache Drill; MapR-FS: Web-Scale Storage (JSON, Parquet, Delimited Text)]

Page 14: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Apache Drill: Unified SQL Layer

• In-memory SQL execution engine

• Built from the ground up
  – Query Hadoop native formats
  – Leverages Hadoop ecosystem components: YARN, ZooKeeper, Hive

• ANSI SQL syntax to query anything on Hadoop

• Storage plugins
  – Library of existing storage plugins: MapR-DB, MapR-FS, Hive
  – Custom storage plugins can be developed

• Industry-standard APIs
  – ODBC/JDBC and ANSI SQL make BI integration easy
  – REST
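As a rough illustration of the REST bullet, the sketch below builds (but does not send) a query request for Drill's REST endpoint. The host, the file path, and the helper name `build_drill_query_request` are assumptions made for this example; `dfs` is used as the storage plugin name and 8047 as the web port, which are Drill's usual defaults but may differ in a given cluster.

```python
import json
import urllib.request


def build_drill_query_request(sql, host="localhost", port=8047):
    """Build an HTTP POST for Drill's REST query endpoint without sending it."""
    payload = {"queryType": "SQL", "query": sql}
    return urllib.request.Request(
        url=f"http://{host}:{port}/query.json",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical path: /data/logs/clicks.json is made up for illustration.
SQL = "SELECT * FROM dfs.`/data/logs/clicks.json` LIMIT 10"

request = build_drill_query_request(SQL)

# To run it against a live Drillbit (not executed here):
#   with urllib.request.urlopen(request) as resp:
#       result = json.load(resp)
```

The same SQL text could be submitted over ODBC/JDBC from a BI tool; REST is shown only because it needs nothing beyond the standard library.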

Page 15: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Drill’s Data Model is Flexible

[Figure: formats (JSON, BSON, HBase, Parquet, Avro, CSV, TSV) plotted by schema (fixed to dynamic) and structure (flat to complex), with both axes labeled Flexibility]

RDBMS/SQL-on-Hadoop table

Name      Gender  Age
Michael   M       6
Jennifer  F       3

Apache Drill table

{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
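A small sketch of what the flexible model implies, using the two records from the slide. This is plain Python, not Drill code: `discover_columns` and `flatten_hobbies` are hypothetical helpers that only mimic, in spirit, schema discovery at read time and flattening of a repeated field.

```python
# The two records from the slide: nested, and with different fields per record.
records = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]


def discover_columns(rows):
    """Collect top-level field names in first-seen order (no schema declared up front)."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    return columns


def flatten_hobbies(rows):
    """Emit one flat row per (person, hobby) pair."""
    flat = []
    for row in rows:
        for hobby in row.get("hobbies", []):
            flat.append({"first": row["name"]["first"], "hobby": hobby})
    return flat


columns = discover_columns(records)
flat = flatten_hobbies(records)
```

A fixed-schema table would need `district` and `preschool` declared ahead of time; here they simply appear when a record carries them.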

Page 16: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Improving Performance of Known Workloads

• Interactive needs
  – Sub-second response times

• Drill is designed for fast interactive SQL workloads
  – Run them ASAP, fail fast, move on
  – No assumptions on what lies underneath and what’s coming in
    • Schema could change
    • Files could change
    • Queries could change

• How do you improve performance of well-understood workloads with high concurrency?

Page 17: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

4 Ways to Scale BI on Data Lakes

Steve Wooledge
Priyank Patel

Page 18: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

• Founding team from Teradata, Aster, 3PAR, IBM DB2

• On-cluster visual analytics and BI

• Large production customers in the Fortune 500

• 100s of companies use Arcadia Instant & Arcadia Enterprise

• Access all your data for agile enterprise-wide BI

Create business value from Big Data

– OUR FOUNDING VISION –

Page 19: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Four Approaches for Big Data Analytics

[Figure: approaches plotted on Scale vs. Agility axes]

• Separate BI Server (move data to BI server): summary data only

Page 20: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Strategy 1 of 4: Separate BI Server

Move Data to BI Server

BI & Visualization Server

Pros

• Least costly
• Use existing BI tools

Cons

✘ Shallow insights: summary data
✘ Requires IT/DBA: new views & data movement
✘ Separate security models
✘ Not real-time: batch data updates
✘ No access to unstructured data
✘ Heaviest burden on network

Page 21: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Four Approaches for Big Data Analytics

[Figure: approaches plotted on Scale vs. Agility axes]

• Separate BI Server (move data to BI server): summary data only
• Fast SQL + BI Tools (ODBC/JDBC, Hive, Spark, Impala, Drill …) with a BI server: simple SQL, 1-5 users

Page 22: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Strategy 2 of 4: Fast SQL + BI Tools

(ODBC, JDBC, Hive, Spark, Drill, Impala …)

BI & Visualization Server

Pros

• Can get detailed data with skilled data scientists/engineers
• Performs better than direct connect

Cons

✘ Lower user concurrency
✘ Lacks in-Hadoop advanced analytics
✘ Cannot access unstructured data (requires schema)
✘ Cost: manage security in multiple tools, separate administration for metadata

Page 23: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Four Approaches for Big Data Analytics

[Figure: approaches plotted on Scale vs. Agility axes]

• Separate BI Server (move data to BI server): summary data only
• Fast SQL + BI Tools (ODBC/JDBC, Hive, Spark, Impala, Drill …) with a BI server: simple SQL, 1-5 users
• Middleware Application Cubes (edge node): static cubes only, no granular data access

Page 24: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Strategy 3 of 4: Middleware Application Cubes

Static View on Edge Node

Pros

• Use existing BI tools
• Higher user concurrency

Cons

✘ Lacks ad-hoc freedom: requires IT/DBA for new views
✘ Not real-time: batch data updates
✘ Lacks in-Hadoop advanced analytics
✘ Cannot access unstructured data (requires schema)
✘ Cost: multiple tools & data duplication
✘ Increased administration: separate security models, administration
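To make the "lacks ad-hoc freedom" con concrete, here is a toy sketch of a pre-aggregated cube. The rows, the dimension names, and the functions `build_cube` and `query_cube` are all invented for illustration and do not describe any particular cube product.

```python
from collections import defaultdict

# Hypothetical raw fact rows: (region, month, product, sales)
RAW_ROWS = [
    ("west", "2017-01", "shoes", 120.0),
    ("west", "2017-01", "hats", 30.0),
    ("east", "2017-01", "shoes", 80.0),
    ("west", "2017-02", "shoes", 100.0),
]

# Dimensions fixed ahead of time, e.g. by IT/DBA when the cube is designed.
CUBE_DIMENSIONS = ("region", "month")


def build_cube(rows):
    """Batch step: pre-aggregate sales by the fixed cube dimensions only."""
    cube = defaultdict(float)
    for region, month, _product, sales in rows:
        cube[(region, month)] += sales
    return dict(cube)


def query_cube(cube, group_by):
    """Answer from the cube, or fail if the query needs a dimension the cube dropped."""
    unknown = [d for d in group_by if d not in CUBE_DIMENSIONS]
    if unknown:
        raise KeyError(f"cube has no dimension(s) {unknown}; a rebuild is required")
    positions = [CUBE_DIMENSIONS.index(d) for d in group_by]
    result = defaultdict(float)
    for key, total in cube.items():
        result[tuple(key[i] for i in positions)] += total
    return dict(result)


cube = build_cube(RAW_ROWS)
by_region = query_cube(cube, ["region"])
```

Grouping by `region` works because the cube kept that dimension; asking for `product` raises an error, since the detail was discarded when the cube was built and only a new batch build can bring it back.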

Page 25: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Four Approaches for Big Data Analytics

[Figure: approaches plotted on Scale vs. Agility axes]

• Separate BI Server (move data to BI server): summary data only
• Fast SQL + BI Tools (ODBC/JDBC, Hive, Spark, Impala, Drill …) with a BI server: simple SQL, 1-5 users
• Middleware Application Cubes (edge node): static cubes only, no granular data access
• Data-Native Visual Analytics (native in-cluster): real-time & dynamic, 100s to 1000s of users

Page 26: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Strategy 4 of 4: Data-Native Visual Analytics & Apps

Data-Native BI & Analytics

Pros

• Greatest user concurrency
• Linear scalability
• Agility for analysts (drill to detail)
• Supports complex data sources
• Real time
• In-Hadoop advanced analytics
• Lowest TCO: simplified architecture

Cons

✘ Newer technology and approach
✘ Requires some Hadoop skills to set up and maintain

Page 27: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Big Data Analytics: Alternatives

Capability | Separate BI Server | Hadoop SQL Engines + BI Tool | Big Data “Cubes” | Data-Native Visual Analytics
Dashboards and reporting | ✓ | ✓ | ✓ | ✓
Real-time visualizations | ✘ | ✘ | ✘ | ✓
Data applications | ✘ | ✘ | ✘ | ✓
High user concurrency | ✓ | ✘ | ✓ | ✓
Ad-hoc drill to detail | ✘ | -- | ✘ | ✓
In-Hadoop advanced analytics (e.g., customer engagement flows, micro-segmentation) | ✘ | ✘ | ✘ | ✓
Multi-structured data access (e.g., NoSQL, S3, files, search) | -- | ✓ | ✘ | ✓
Unified security | ✘ | ✘ | ✘ | ✓
Unified administration | ✘ | ✘ | ✘ | ✓
Lower TCO | ✘ | ✘ | ✘ | ✓

Page 28: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Forrester Hadoop Native BI Wave

• “Put Your BI right where Your Data Is”
• Recognizing the need to move BI processing next to where the data is (in Hadoop)
• “Other BI vendors will surely follow this trend; it’s a question of when, not if”
• Traditional BI and visualization tools (MicroStrategy, Tableau, QlikTech) are not native to Hadoop
• Arcadia Data scored highest (5/5) for Hadoop/Spark architecture

Source: Forrester Wave: Native Hadoop BI Platforms, Q3, 2016

Page 29: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Forrester Wave: Native Hadoop BI Platforms

[Figure panels: Data Preparation Capabilities; User Interface]

Source: Forrester Wave: Native Hadoop BI Platforms, Q3, 2016

Page 30: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Businesses are Hamstrung with Legacy BI on Data Lakes

• Data summarization
• Big data fidelity loss
• No collaboration
• Higher security risk
• Operational complexity
• High TCO: multiple systems

[Figure labels, in order: BI/viz tools; BI server (cubes); data mart (extracts); data warehouse (extracts); all data; <extracts>; data users/analysts; operational data sources (machine data, OLTP, CRM, marketing automation, product/app logs, web logs); data lake]

Page 31: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Arcadia Makes it Simple

• NO DATA MOVEMENT
• High-definition data analysis
• Collaborative, real-time insights on data
• More secure
• Lower TCO and complexity

[Figure labels: operational data sources (machine data, OLTP, CRM, marketing automation, product/app logs, web logs); data lake]

Page 32: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

[Figure: MapR Converged Data Platform architecture (Data Processing; Web-Scale Storage: MapR-FS; Database: MapR-DB; Event Streaming: MapR Streams; Search and Others; Custom Apps; Cloud and Managed Services; Unified Management and Monitoring), with Analytical Views™ and BI & Viz added]

Page 33: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

The First Data-Native Visual Analytics Platform for Big Data

• WEB-BASED INTERFACE: drag & drop interface for visual analytics & app workflow

• IN-CLUSTER ANALYTICS ENGINE: scales linearly with cluster for speed and easier management

• BIG DATA OS: distributed execution, data storage, metadata, security

[Figure labels: Arcadia Visualization Engine; Drag-and-drop Visual Analytics & Dashboards; Custom Data Applications; Arcadia Analytic Platform (Smart Acceleration™); Data Platform; On-Premises; Hybrid Cloud; …]

Page 34: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Linearly Scale Production Workloads with Smart Acceleration

Smart Acceleration™

1. Start with exploration of raw data; there is no need to determine the design of acceleration structures such as cubes ahead of time

2. Recommendation engine generates AVs (Analytical Views, derived forms of raw data) based on dynamic data usage within the cluster

3. Re-routes data queries to AVs transparently, providing automated acceleration when needed for production/high-concurrency uses

• Automatically modeled and maintained within cluster

• Keep logical data models simple without needing to target specific data cube structures

[Figure: MapR Converged Data Platform cluster with a consumption layer and a processing layer. Labels: (1) queries; (2) recommendation engine; (3) queries automatically redirected; Analytical Views (stores derived forms of raw data in cluster); raw data in MapR; results (100x faster)]
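The three steps above can be mimicked with a toy router. This is only a sketch of the general idea (usage-driven materialization plus transparent re-routing), not Arcadia's implementation; the class `ToyAccelerator`, the threshold, and the sample rows are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical raw fact rows kept in the cluster: (region, product, sales)
RAW = [("west", "shoes", 120.0), ("west", "hats", 30.0), ("east", "shoes", 80.0)]
COLUMNS = {"region": 0, "product": 1}


class ToyAccelerator:
    """Counts query shapes and materializes an aggregate once a shape becomes hot."""

    def __init__(self, rows, hot_threshold=3):
        self.rows = rows
        self.hot_threshold = hot_threshold
        self.usage = Counter()   # observed usage per group-by column
        self.views = {}          # group-by column -> precomputed totals
        self.served_from = []    # "raw" or "view", one entry per query

    def _scan(self, column):
        """Full scan of the raw rows, summing sales per value of the column."""
        totals = defaultdict(float)
        for row in self.rows:
            totals[row[COLUMNS[column]]] += row[2]
        return dict(totals)

    def total_sales_by(self, column):
        if column in self.views:            # step 3: transparent re-routing
            self.served_from.append("view")
            return self.views[column]
        self.usage[column] += 1             # step 1: explore raw data directly
        self.served_from.append("raw")
        result = self._scan(column)
        if self.usage[column] >= self.hot_threshold:
            self.views[column] = result     # step 2: build the derived form
        return result


acc = ToyAccelerator(RAW)
for _ in range(4):
    by_region = acc.total_sales_by("region")
```

The caller issues the same query every time; only the internal routing changes once the view exists. A real system would also have to refresh or invalidate views when the raw data changes, which this sketch ignores.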

Page 35: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Campaign Analysis Application

Understand high-level metrics with the ability to drill down to details

Augment analysis with a variety of data types & sources such as actual display ad images

Page 36: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Retail Store Geo Analysis

YoY growth metrics plotted by county for the chosen sub-brand

Trellising allows for quick trend analysis across multiple stores.

Shown here: store sales vs. trade area sales, to correlate potential shifts in buying patterns

Choose a specific state to drill down to county level

Page 37: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Retail Stores Drill Down

Interactive maps allow for easy visualization of spatial data, with zooming into details

Page 38: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

PRODUCT DEMONSTRATION

Page 39: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Relational, Real Time, or NoSQL Connections

Page 40: 4 Ways to Scale Interactive BI and Analytics on a Data Lake

Resources

Engage with us!

1. Read the Data Lake & Analytics Ebook
   https://mapr.com/definitive-guide-bi-and-analytics-data-lake/

2. Forrester Wave: Native Hadoop BI
   www.arcadiadata.com/lp/forrester-wave-hadoop-bi-research-report/

3. Get Started
   • MapR Converged Data Platform
     https://www.mapr.com/get-started-with-mapr
   • Arcadia Instant
     www.arcadiadata.com/download