4 Ways to Scale Interactive BI and Analytics on a Data Lake

© 2017 MapR Technologies

Sameer Nori, Saurabh Mahapatra, MapR
Steve Wooledge, Priyank Patel, Arcadia Data
April 5th, 2017


Agenda

• Market Trends & Data Lakes

• MapR Platform, Customer Usage & Apache Drill

• The Pros and Cons of Four Big Data BI Methods: BI Servers, Fast SQL Engines, Cubes, and Data Native

• How Arcadia Data Integrates with MapR


MARKET TRENDS AND DATA LAKES


Big Data Deployment Stage



What and Why Data Lakes?

• Customers looking to establish next-gen applications/analytics platform

• Capturing large volumes of new data
– Machine/App logs
– Social data
– IoT

• Put all data in play
– Near-line storage for cold data
– Maintain access and query capability

• Bridging data silos
– Aggregating data sources across business units

• Regulatory requirements


Increasingly More Intelligent Applications

[Diagram: use cases arranged by Scale, ranging from COST REDUCTION to CAPABILITY and COMPLEXITY]

• Big data storage with commodity economics
• Data warehouse offload
• Data lake, data hub, batch analytics
• Customer 360
• Real-time monitoring, operational analytics
• IoT monitoring
• SIEM
• Recommendation engines, anomaly detection, predictive analytics, fraud detection, self-service analytics
• Machine/deep learning, high-frequency decisions
• Connected car
• Autonomous driving
• Disruptive innovative applications


MAPR PLATFORM, CUSTOMER USAGE & APACHE DRILL


MapR Converged Data Platform

[Architecture diagram; layers and labels shown:]

• Open Source Engines & Tools; Commercial Engines & Applications; Custom Apps; Search and Others; Cloud and Managed Services
• Data Processing
• APIs: HDFS API; POSIX, NFS; HBase API; OJAI API; Kafka API
• Web-Scale Storage (MapR-FS), Database (MapR-DB), Event Streaming (MapR Streams)
• Enterprise-Grade Platform Services: High Availability, Real Time, Unified Security, Multi-tenancy, Disaster Recovery, Global Namespace
• Unified Management and Monitoring


Data Lake Architecture

Sources: RELATIONAL, SAAS, MAINFRAME; LOG FILES, CLICKSTREAMS; BLOGS, TWEETS, LINK DATA

• NFS/Sqoop/Flume: pure log files
• MapR-FS: emails, blogs, tweets, log files, unstructured data
• MapR-DB: time series, structured data
• Agile, self-service data exploration
• ETL into operational reporting formats (e.g., Parquet)
• Multi-tenancy: job/data placement control, volumes
• Access controls: file, table, column, column family, doc, sub-doc levels
• Auditing: compliance, analyze user accesses
• Snapshots: track data lineage and history
• Table Replication: global multi-master, business continuity


UHG Data Lake Architecture

DATA SOURCES
SMARTUGAPNDBDeath MasterODARDetectsSanction ProviderSEAL-TOPS837iPAPICDB… and more.

Data Movement
• Ingest, ETL, Batch Processing: Sqoop, NFS, Hive, Pig

Data Access
• Interactive SQL: Drill

Storage
• Web-Scale Storage (MapR-FS), Database (MapR-DB), Event Streaming (MapR Streams)

Consumers
• OPERATIONAL APPS, MEMBER 360, CALL CENTER
• VISUALIZATION
• ANALYTICS
• SEARCH


Insight as Service Apps
TransUnion PRAMA Self-Service Analytics

“What started as an IT-led cost-reduction project focused on operational savings has turned into a strategic platform.”

“It’s like a Swiss Army knife. We’re doing data processing, interactive SQL, and statistical algorithms on the data. We can try different avenues and make a business case. It starts loose and branches out into new businesses and new revenue streams.”

Kevin McClowry, Director of Analytic Solution Development


Getting Value from the Data Lake

[Diagram elements: BI Users and SQL Users; a "?" for the access layer; data formats JSON, Parquet, Delimited Text; MAPR-FS: WEB SCALE STORAGE]

Getting Value from the Data Lake

[Diagram elements: BI Users with a "?"; SQL Users; APACHE DRILL as the access layer; data formats JSON, Parquet, Delimited Text; MAPR-FS: WEB SCALE STORAGE]

Apache Drill: Unified SQL Layer

• In-memory SQL execution engine

• Built from the ground up
– Query Hadoop native formats
– Leverages Hadoop ecosystem components: YARN, ZooKeeper, Hive

• ANSI SQL syntax to query anything on Hadoop

• Storage plugins
– Library of existing storage plugins: MapR-DB, MapR-FS, Hive
– Custom storage plugins can be developed

• Industry standard APIs
– ODBC/JDBC, ANSI SQL makes it easy for BI integration
– REST
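
As a rough illustration of the REST option listed above, the sketch below builds the HTTP request that Drill's REST query endpoint accepts (a JSON body with `queryType` and `query`, posted to `/query.json`). The host, port, and file path are assumptions for illustration only; the request is constructed but not sent, so nothing here depends on a running cluster.

```python
import json
from urllib import request

# Assumption: a Drillbit web port reachable on localhost:8047.
DRILL_URL = "http://localhost:8047/query.json"


def build_drill_request(sql, url=DRILL_URL):
    """Build (but do not send) an HTTP POST for Drill's REST query endpoint."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical Parquet path under the dfs storage plugin; substitute your own.
SQL = "SELECT * FROM dfs.`/data/logs/events.parquet` LIMIT 10"

req = build_drill_request(SQL)
body = json.loads(req.data)
print(req.get_method(), req.full_url)
print(body["query"])
# Against a live cluster, the request could be sent with request.urlopen(req).
```

BI tools would normally connect through ODBC/JDBC instead; the REST path is mainly convenient for scripts and custom applications.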


Drill’s Data Model is Flexible

[Diagram: data formats placed on two axes, schema (fixed to dynamic) and structure (flat to complex), with flexibility increasing along both. Formats shown: JSON, BSON, HBase, Parquet, Avro, CSV, TSV]

RDBMS/SQL-on-Hadoop table:

Name      Gender  Age
Michael   M       6
Jennifer  F       3

Apache Drill table:

{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
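
To make the fixed-versus-dynamic schema contrast concrete, here is a minimal Python sketch using the two records from the slide. It is not how Drill is implemented; it only illustrates the idea that a schema can be discovered while reading self-describing records, instead of being declared up front. The helper names `field_paths` and `discovered_schema` are made up for this example.

```python
# Nested, self-describing records as on the slide: fields differ per record.
docs = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]


def field_paths(record, prefix=""):
    """Yield dotted paths for every leaf field in a nested record."""
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from field_paths(value, prefix=path + ".")
        else:
            yield path


def discovered_schema(records):
    """Union of field paths seen while reading, in first-seen order."""
    seen = []
    for record in records:
        for path in field_paths(record):
            if path not in seen:
                seen.append(path)
    return seen


schema = discovered_schema(docs)
print(schema)

# Project a column that only some records have; missing values become None.
preschools = [d.get("preschool") for d in docs]
print(preschools)
```

A fixed-schema table would need `district` and `preschool` declared as columns before either record could be loaded; here they simply appear when a record carries them.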


Improving Performance of Known Workloads

• Interactive needs
– Sub-second response times

• Drill is designed for fast interactive SQL workloads
– Run them ASAP, fail fast, move on
– No assumptions on what lies underneath and what’s coming in
  • Schema could change
  • Files could change
  • Queries could change

• How do you improve performance of well-understood workloads with high concurrency?


4 Ways to Scale BI on Data Lakes

Steve Wooledge, Priyank Patel


• Founding team from Teradata, Aster, 3PAR, IBM DB2

• On-cluster visual analytics and BI

• Large production customers in the Fortune 500

• 100s of companies use Arcadia Instant & Arcadia Enterprise

• Access all your data for agile enterprise-wide BI

Create business value from Big Data

– OUR FOUNDING VISION –


Four Approaches for Big Data Analytics

[Diagram: approaches plotted by Scale and Agility]

1. Separate BI Server (move data to BI server): summary data only.

Strategy 1 of 4: Separate BI Server

BI & Visualization Server

Pros
• Least costly
• Use existing BI tools

Cons
✘ Shallow insights: summary data
✘ Requires IT/DBA: new views & data movement
✘ Separate security models
✘ Not real-time: batch data updates
✘ No access to unstructured data
✘ Heaviest burden on network



[Diagram: approaches plotted by Scale and Agility]

1. Separate BI Server: summary data only.
2. Fast SQL + BI Tools (ODBC/JDBC, Hive, Spark, Impala, Drill …): simple SQL, 1-5 users.

Strategy 2 of 4: Fast SQL + BI Tools

BI & Visualization Server (ODBC, JDBC, Hive, Spark, Drill, Impala …)

Pros
• Can get detailed data with skilled data scientists/engineers
• Performs better than direct connect

Cons
✘ Lower user concurrency
✘ Lacks in-Hadoop advanced analytics
✘ Cannot access unstructured data (requires schema)
✘ Cost: manage security in multiple tools, separate administration for metadata



[Diagram: approaches plotted by Scale and Agility]

1. Separate BI Server: summary data only.
2. Fast SQL + BI Tools (Impala, Drill …)
3. Middleware Application Cubes (edge node): static cubes only, no granular data access.

Strategy 3 of 4: Middleware Application Cubes

Static View on Edge Node

Pros
• Use existing BI tools
• Higher user concurrency

Cons
✘ Lacks ad-hoc freedom: requires IT/DBA for new views
✘ Not real-time: batch data updates
✘ Lacks in-Hadoop advanced analytics
✘ Cannot access unstructured data (requires schema)
✘ Cost: multiple tools & data duplication
✘ Increased administration: separate security models, administration
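
The "static cubes, no granular data access" limitation can be shown with a small Python sketch. The sales events, dimension names, and the `build_cube` helper are all hypothetical; the point is only that a cube pre-aggregated on chosen dimensions answers questions along those dimensions quickly, while any dimension left out has to be answered from the raw data again.

```python
from collections import defaultdict

# Hypothetical raw sales events in the data lake.
raw_events = [
    {"store": "A", "day": "2017-04-01", "sku": "x1", "amount": 10.0},
    {"store": "A", "day": "2017-04-01", "sku": "x2", "amount": 5.0},
    {"store": "B", "day": "2017-04-01", "sku": "x1", "amount": 7.5},
    {"store": "A", "day": "2017-04-02", "sku": "x1", "amount": 2.5},
]


def build_cube(events, dimensions):
    """Pre-aggregate amount by the chosen dimensions (the static cube)."""
    cube = defaultdict(float)
    for event in events:
        cube[tuple(event[d] for d in dimensions)] += event["amount"]
    return dict(cube)


# The cube designer chose (store, day) ahead of time.
cube = build_cube(raw_events, ("store", "day"))

# A question along a kept dimension is answerable from the cube alone.
sales_store_a = sum(v for (store, _day), v in cube.items() if store == "A")

# Per-SKU detail was aggregated away, so this needs the raw events again.
sku_x1_store_a = sum(
    e["amount"] for e in raw_events if e["store"] == "A" and e["sku"] == "x1"
)

print(cube[("A", "2017-04-01")], sales_store_a, sku_x1_store_a)
```

Adding `sku` to the cube would fix this one question, but it requires rebuilding the cube, which is the "requires IT/DBA for new views" cost listed in the cons.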



[Diagram: approaches plotted by Scale and Agility]

1. Separate BI Server: summary data only.
2. Fast SQL + BI Tools (Impala, Drill …)
3. Middleware Application Cubes (edge node): static cubes only, no granular data access.
4. Data-Native Visual Analytics (native in-cluster): real-time & dynamic, 100s to 1000s of users.


Strategy 4 of 4: Data-Native Visual Analytics & Apps

Data-Native BI & Analytics

Pros
• Greatest user concurrency
• Linear scalability
• Agility for analysts (drill to detail)
• Supports complex data sources
• Real time
• In-Hadoop advanced analytics
• Lowest TCO: simplified architecture

Cons
✘ Newer technology and approach
✘ Requires some Hadoop skills to set up and maintain


Big Data Analytics: Alternatives

Capability | Separate BI Server | Hadoop SQL Engines + BI Tool | Big Data “Cubes” | Data-Native Visual Analytics
Dashboards and reporting | ✓ | ✓ | ✓ | ✓
Real-time visualizations | ✘ | ✘ | ✘ | ✓
Data Applications | ✘ | ✘ | ✘ | ✓
High user concurrency | ✓ | ✘ | ✓ | ✓
Ad-hoc drill to detail | ✘ | -- | ✘ | ✓
In-Hadoop advanced analytics (e.g., customer engagement flows, micro-segmentation) | ✘ | ✘ | ✘ | ✓
Multi-structured data access (e.g. NoSQL, S3, files, search) | -- | ✓ | ✘ | ✓
Unified Security | ✘ | ✘ | ✘ | ✓
Unified Administration | ✘ | ✘ | ✘ | ✓
Lower TCO | ✘ | ✘ | ✘ | ✓


Forrester Hadoop Native BI Wave

“Put Your BI right where Your Data Is”

• Recognizing the need to move BI processing next to where the data is (in Hadoop)
• “Other BI vendors will surely follow this trend; it’s a question of when, not if”
• Traditional BI and visualization tools (MicroStrategy, Tableau, QlikTech) are not native to Hadoop
• Arcadia Data scored highest (5/5) for Hadoop/Spark architecture

Source: Forrester Wave: Native Hadoop BI Platforms, Q3, 2016


Forrester Wave: Native Hadoop BI Platforms

[Figure: scorecard panels for Data Preparation Capabilities and User Interface]

Source: Forrester Wave: Native Hadoop BI Platforms, Q3, 2016


Businesses are Hamstrung with Legacy BI on Data Lakes

• Data summarization
• Big data fidelity loss
• No collaboration
• Higher security risk
• Operational complexity
• High TCO: multiple systems

[Diagram: operational data sources (Machine Data, OLTP, CRM, Marketing Automation, Product / App Logs, Web Logs) feed the Data Lake (ALL DATA); extracts then pass through DATA WAREHOUSE (EXTRACTS), DATA MART (EXTRACTS), BI/SERVER (CUBES), and BI/VIZ TOOLS before reaching DATA USERS/ANALYSTS]


Arcadia Makes it Simple

[Diagram: operational data sources (Machine Data, OLTP, CRM, Marketing Automation, Product / App Logs, Web Logs) feed the Data Lake; NO DATA MOVEMENT]

• High-definition data analysis
• Collaborative, real-time insights on data
• More secure
• Lower TCO and complexity


[Diagram: BI & Viz and Analytical Views™ shown with the MapR Converged Data Platform layers: Data Processing; Web-Scale Storage (MapR-FS), Database (MapR-DB), Event Streaming (MapR Streams); Search and Others; Cloud and Managed Services; Custom Apps; Unified Management and Monitoring]


The First Data-Native Visual Analytics Platform for Big Data

• WEB-BASED INTERFACE (Arcadia Visualization Engine): drag & drop interface for visual analytics & app workflow
– Drag-and-drop Visual Analytics & Dashboards
– Custom Data Applications

• IN-CLUSTER ANALYTICS ENGINE (Arcadia Analytic Platform, Smart Acceleration™): scales linearly with cluster for speed and easier management

• BIG DATA OS (Data Platform …): distributed execution, data storage, metadata, security

• Deployment: On-Premises, Cloud, Hybrid


Linearly Scale Production Workloads with Smart Acceleration

[Diagram: MapR Converged Data Platform Cluster with a Consumption Layer and a Processing Layer; Smart Acceleration™; Results (100x Faster)]

1. Start with exploration of raw data; no need to determine the design of acceleration structures such as cubes ahead of time

2. Recommendation engine generates Analytical Views (AVs), derived forms of raw data, based on dynamic data usage within the cluster

3. Re-routes data queries to AVs transparently, providing automated acceleration when needed for production/high-concurrency uses

• Automatically modeled and maintained within cluster

• Keep logical data models simple without needing to target specific data cube structures

[Diagram flow: (1) queries run against raw data in MapR; (2) the recommendation engine creates Analytical Views, which store derived forms of raw data in the cluster; (3) queries are automatically redirected to the Analytical Views]
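
The three steps above can be sketched as a toy model in Python. This is not Arcadia's implementation or API; the `Accelerator` class, its `threshold` parameter, and the sample rows are all assumptions made up for illustration. It only shows the general pattern: observe which query shapes are used, materialize derived data for the frequent ones, and answer later queries from that derived data without the caller changing anything.

```python
from collections import Counter, defaultdict

# Hypothetical raw rows standing in for raw data in the cluster.
RAW = [
    {"state": "CA", "county": "Alameda", "sales": 4},
    {"state": "CA", "county": "Marin", "sales": 6},
    {"state": "NY", "county": "Kings", "sales": 5},
]


class Accelerator:
    """Toy model: track query shapes, materialize hot ones, reroute to them."""

    def __init__(self, raw, threshold=2):
        self.raw = raw
        self.threshold = threshold  # hypothetical cutoff for a "hot" query
        self.usage = Counter()      # step 2: observe data usage
        self.views = {}             # group-by column -> derived totals

    def _aggregate(self, group_by):
        totals = defaultdict(int)
        for row in self.raw:
            totals[row[group_by]] += row["sales"]
        return dict(totals)

    def recommend_and_build(self):
        """Materialize a view for each group-by seen at least `threshold` times."""
        for group_by, count in self.usage.items():
            if count >= self.threshold and group_by not in self.views:
                self.views[group_by] = self._aggregate(group_by)

    def total_sales_by(self, group_by):
        """Answer a query; also return the source so rerouting is visible."""
        self.usage[group_by] += 1
        if group_by in self.views:  # step 3: transparent reroute
            return self.views[group_by], "analytical_view"
        return self._aggregate(group_by), "raw"  # step 1: explore raw data


acc = Accelerator(RAW)
_, first = acc.total_sales_by("state")
_, second = acc.total_sales_by("state")
acc.recommend_and_build()
result, third = acc.total_sales_by("state")
print(first, second, third, result)
```

A real system would also have to refresh or invalidate the derived data when the raw data changes; this sketch leaves that out.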


Campaign Analysis Application

Understand high-level metrics with the ability to drill down to details

Augment analysis with a variety of data types & sources such as actual display ad images


Retail Store Geo Analysis

YoY growth metrics plotted by county for the chosen sub-brand

Trellising allows for quick trend analysis across multiple stores.

Here showing store sales vs trade area sales to correlate potential shifts in buying patterns

Choose a specific state to drill down to county level


Retail Stores Drill Down

Interactive maps allow for easy visualization of spatial data, zooming into details


PRODUCT DEMONSTRATION



Relational, Real Time, or NoSQL Connections


Resources

Engage with us!

1. Read the Data Lake & Analytics Ebook
https://mapr.com/definitive-guide-bi-and-analytics-data-lake/

2. Forrester Wave: Native Hadoop BI
www.arcadiadata.com/lp/forrester-wave-hadoop-bi-research-report/

3. Get Started
• MapR Converged Data Platform: https://www.mapr.com/get-started-with-mapr
• Arcadia Instant: www.arcadiadata.com/download
