Real-Time Analytics with Hadoop
TRANSCRIPT
Welcome to Today’s
DBTA Roundtable Discussion
Moderator
Stephen Faig
Manager
Unisphere Research and DBTA
Real-Time Analytics with Hadoop
Speakers
Dale Kim
Director of Industry Solutions
MapR
Paige Roberts
Hadoop & Analytics Evangelist
Actian
© 2015 MapR Technologies
Examples of Real-Time
Images licensed under https://creativecommons.org/licenses/by/2.0/
Time image courtesy of Daniel Oldfield: https://www.flickr.com/photos/democlez/4424898002/
Air bag image courtesy of Mike Babcock: https://www.flickr.com/photos/mikebabcock/3098836311/
Tied to clock time
Guaranteed response time
For real-time analytics, let’s use: “no built-in delays”
So what is real-time analytics with Hadoop?
Requirements for Real-Time Analytics with Hadoop
REAL-TIME DATA
REAL-TIME APPLICATIONS
REAL-TIME QUERIES
Real-Time Data
Definition: Provide immediate access to live Hadoop data
for analysis
Requirements:
• Analysis uses live real-time data, not batch-copied data
• Business can identify insights immediately (often through
an automated process)
– Critical for use cases such as ad targeting, personalization, and network
security analysis
• System avoids complexity of separate stream processing
or messaging system for recent data
Real-Time Data in Hadoop
For real-time:
• Log files should be written directly
into the cluster or synced across
remote data centers
• Operational applications should
run in the same cluster, or in a
separate cluster with real-time
table replication
• Immediate action should be taken
• E.g., difference between fraud
detection and fraud prevention
• Difference between on-demand ad
bid versus missing opportunity
Existing challenges:
• Log files must be batch uploaded
periodically (e.g., every 30
minutes)
• Due to HDFS limitations (not R/W,
file-close semantics, no direct NFS)
• Operational applications run on a
separate cluster/stack
• Data must be batch uploaded
• With batch uploads, the window to
respond is missed
• Fraud, cyber attacks, matches,
anomalies, etc.
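To make the batch-upload window concrete, here is a minimal sketch. It assumes uploads fire at fixed 30-minute boundaries (the interval from the example above) and that an event becomes available for analysis only at the next upload. The function name `minutes_until_visible` is hypothetical and used only for illustration.

```python
def minutes_until_visible(event_minute: float, batch_interval: float = 30.0) -> float:
    """Delay before an event is available for analysis under periodic batch upload.

    Assumes uploads fire at multiples of batch_interval (in minutes) and that
    an event arriving exactly on a boundary is picked up by that upload.
    """
    remainder = event_minute % batch_interval
    return 0.0 if remainder == 0 else batch_interval - remainder


# An event logged 1 minute after an upload waits 29 minutes for the next one.
long_wait = minutes_until_visible(1.0)
# An event logged 29 minutes after an upload waits only 1 minute.
short_wait = minutes_until_visible(29.0)
```

With direct writes into the cluster this waiting period disappears, which is the difference the slide draws between fraud detection and fraud prevention.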
Real-Time Applications
Definition: Run operational applications in the cluster
Requirements:
• Address use cases beyond batch and interactive
analysis
– E.g., end-to-end real-time marketing and security applications directly
on Hadoop
• Eliminate separate Hadoop and NoSQL
clusters/technology stacks for apps
Real-Time Applications in Hadoop
For real-time:
• Minimize impact of disruptive
“housekeeping” tasks to enable
consistent, real-time operations
• E.g., compactions, Java garbage
collection, “region splits”
• Process live, operational data in
Hadoop to avoid delays in batch
copies
Existing challenges:
• Other in-Hadoop databases suffer
disruptions, inhibiting real-time
• E.g., Compactions can significantly
slow down the system
• Garbage collection leads to
unpredictable system delays
• Region splits are required to spread
load, but impact responsiveness
and performance
• Other in-Hadoop databases require
separate clusters
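The effect of occasional housekeeping pauses shows up in tail latency more than in the median. The sketch below uses made-up latency numbers (not measurements from any system) and a simple nearest-rank percentile helper; both are assumptions for illustration only.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds: 98 fast requests plus
# 2 requests that stalled behind a compaction or garbage-collection pause.
latencies_ms = [5] * 98 + [800] * 2

median_ms = percentile(latencies_ms, 50)  # a typical request
p99_ms = percentile(latencies_ms, 99)     # a request caught by a pause
```

In this toy data the median stays at the fast value while the 99th percentile jumps to the stalled value, which is why unpredictable pauses matter for operational applications even when average performance looks fine.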
Real-Time Querying
Definition: Query any data as soon as it lands in the
cluster (self-service)
Requirements:
• Analysts can explore data immediately, no waiting
days/weeks for data prep by IT
• IT is not burdened with repeated schema management
and ETL requests
Real-Time Querying in Hadoop
For real-time:
• Minimize time to get started on
data exploration
• Leverage query engines that can
query data in place
– Eliminate IT dependencies for
schema preparation
Existing challenges:
• New data that lands in the cluster
necessarily requires IT-built
schemas
• Data exploration and analysis are
contingent on IT backlog
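The idea of querying data in place, without an IT-built schema, can be illustrated with a small schema-on-read sketch. This is plain Python over JSON lines, not the API of any particular query engine; the record fields and the `query` helper are hypothetical.

```python
import json

# Hypothetical raw records as they might land in the cluster: fields vary by record.
raw_lines = [
    '{"user": "a", "amount": 20}',
    '{"user": "b", "amount": 700, "country": "SG"}',
    '{"user": "c", "device": "mobile"}',
]


def query(lines, predicate):
    """Parse each JSON line at read time and yield the records matching the predicate."""
    for line in lines:
        record = json.loads(line)
        if predicate(record):
            yield record


# Fields are discovered per record at query time; a missing field is
# treated as absent rather than as a schema error.
large = [r["user"] for r in query(raw_lines, lambda r: r.get("amount", 0) > 500)]
```

No table definition or ETL step precedes the query, so a record with new fields can be explored as soon as it arrives.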
So How Are These Implemented?
Case Study: Global Financial Services Firm
Analytics + Operational Applications on one platform
[Diagram: MapR Distribution for Hadoop, with two groups labeled “Real-time Operational Applications” and “Analytics”. Component labels: online transactions, fraud detection, personalized offers, clickstream analysis, fraud investigation tool, fraud model, recommendations table. Users: fraud investigator, interactive marketer.]
REAL-TIME DATA
REAL-TIME APPLICATIONS
REAL-TIME QUERIES
Faster/Secure NFS Access
MapR data access options:
1. HDFS API – apps written for Hadoop
2. Standard read/write NFS (POSIX) – existing
file system-based apps, no code changes
3. MapR POSIX Client – advanced read/write
NFS requirements, includes:
– Compression
– Parallelism
– Authentication
– Encryption
[Diagram: client node(s) reaching the MapR cluster by three paths. (1) Hadoop applications (e.g., “hadoop fs -put”) use the HDFS API (hadoop-core-*.jar). (2) File-based apps/utils (e.g., cp, emacs) use the NFS client (included in OS) through an NFS gateway. (3) Native applications use the MapR POSIX Client. Redundant NFS gateways provide high availability.]
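Because option 2 exposes the cluster as a standard POSIX file system, an existing application can append to a log file with ordinary file I/O. A minimal sketch, assuming the cluster is NFS-mounted at a local path; the function name is hypothetical, and the example uses a temporary directory as a stand-in so it runs without any cluster.

```python
import os
import tempfile


def append_log_line(mount_dir: str, filename: str, line: str) -> str:
    """Append one line to a log file under mount_dir using plain file I/O."""
    path = os.path.join(mount_dir, filename)
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    return path


# Stand-in for an NFS mount point of the cluster; a real path depends on the deployment.
mount_dir = tempfile.mkdtemp()
log_path = append_log_line(mount_dir, "app.log", "login ok")
append_log_line(mount_dir, "app.log", "payment ok")

with open(log_path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
```

The application code contains nothing Hadoop-specific, which is the point of the "no code changes" claim for file system-based apps.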
MapR-DB and “Other NoSQL” Throughput on YCSB
Throughput performance in operations/second/node (higher is better)

YCSB Benchmark                     MapR-DB 4.X   Other NoSQL    MapR-DB Increase
Load (10, 100)*                    27,097        14,753         1.8x
Read (75, 150)                     4,402         1,902          2.3x
50% read / 50% update (75, 100)    8,684         2,012          4.3x
95% read / 5% update (75, 100)     3,776         1,127          3.4x
Scan (32, 32)                      478           Client hangs   N/A

*Numbers in parentheses represent threads per client used in test runs for MapR-DB, other NoSQL, respectively
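The “MapR-DB Increase” figures are the ratio of the two throughput numbers for each benchmark, rounded to one decimal place. A short check using the numbers from the slide (the dictionary layout is just for illustration; the Scan row is omitted because the other system produced no result):

```python
# Throughput in operations/second/node, copied from the slide: (MapR-DB, other NoSQL).
throughput = {
    "Load": (27097, 14753),
    "Read": (4402, 1902),
    "50% read / 50% update": (8684, 2012),
    "95% read / 5% update": (3776, 1127),
}

# Ratio of MapR-DB throughput to the other system, formatted like the slide's last column.
increase = {name: f"{mapr / other:.1f}x" for name, (mapr, other) in throughput.items()}
```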
REAL-TIME DATA
REAL-TIME APPLICATIONS
REAL-TIME QUERIES
YCSB Mixed (50% Read / 50% Put) - Compare Read Latency
[Chart: read latency of MapR-DB vs. HBase on other Hadoop distribution; lower is better]
MapR-DB Table Replication
Multi-master (aka, active/active)
replication
Active Read/Write
End Users
• Faster data access – minimize network
latency on global data with local clusters
• Reduced risk of data loss – real-time,
bi-directional replication for synchronized
data across active clusters
• Application failover – upon any cluster
failure, applications continue via
redirection to another cluster
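Application failover by redirection can be sketched as a client that tries clusters in order and moves on when one is unreachable. This is an illustration only, not MapR-DB's client API: the cluster names, the `read_with_failover` helper, and the use of `ConnectionError` to signal a cluster failure are all assumptions.

```python
def read_with_failover(clusters, key):
    """Try each cluster's read function in order; return (name, value) from the first that responds."""
    last_error = None
    for name, read in clusters:
        try:
            return name, read(key)
        except ConnectionError as error:
            last_error = error  # cluster unavailable, redirect to the next one
    raise ConnectionError("all clusters failed") from last_error


def failed_cluster(key):
    """Stand-in for a cluster that is down."""
    raise ConnectionError("cluster unreachable")


# With active/active replication each cluster holds the same data (hypothetical contents).
replica = {"user:42": "gold tier"}

clusters = [("london", failed_cluster), ("new-york", replica.__getitem__)]
served_by, value = read_with_failover(clusters, "user:42")
```

The redirect only returns correct data because replication keeps the second cluster synchronized; without it, failover would serve stale or missing records.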
MapR-DB Real-Time Analytics
Active clusters close to the end users,
with real-time analytics at central cluster
[Diagram: active read/write MapR-DB clusters in London, New York, and Singapore serving end users, connected to a MapR-DB/Hadoop cluster running Hadoop analytics]
Operational and analytical workloads
combined in a single deployment
[Diagram: operationally efficient, consolidated MapR cluster running database operations and Hadoop analytics for end users]
REAL-TIME DATA
REAL-TIME APPLICATIONS
REAL-TIME QUERIES
One SQL Interface for All Data Formats
Unstructured data will account for more than 80%
of the data collected by organizations
ANSI SQL queries on rapidly evolving schemas
[Chart: Total Data Stored, 1990 to 2020, structured data vs. unstructured data]
[Diagram: SQL Options for Analytics, spanning IT-Driven BI, Self-Service BI, and Self-Service Data Exploration; engine labels: Existing SQL Engines, Apache Drill]
Agility by Reducing Distance to Data
Short analytic life cycles with no upfront schema creation and management
FROM: Traditional Approach
Hadoop Data → Data Preparation (Schema Design, Transformation, Data Movement) → Users
New Business Questions
Total Time to Value: Weeks to Months
TO: New Approach
Hadoop Data → Users
New Business Questions
Total Time to Value: Minutes
Drill enables the “As It Happens” business
with instant SQL analytics on complex data
Summary
Batch Bottlenecks
1. Data streaming – real-time,
but…
2. Further analysis is limited by
batch loads into HDFS
3. Most databases must run in
separate cluster, leading to
batch copies
4. Append-only HDFS leads to
heavy I/O for database
defragmentation
(“compactions”)
5. Data exploration requires IT
intervention
Removing the Batch Limitations
1. Data streaming – real-time
as before, and now….
2. Further analysis is allowed
with real-time loading
3. MapR-DB runs in Hadoop
4. With full read/write file
system, defragmentation
delays are eliminated
5. Data exploration performed
in a self-service manner
Real-Time
Data
Real-Time
Applications
Real-Time
Querying
And Don’t Forget…
• Real-time analytics doesn’t help you if the other key pieces
aren’t in place
• Include security
– Interoperability with any authentication mechanism
– Fine-grained access controls
– Auditing capabilities beyond simple log files
• Also include enterprise-grade reliability
– An automated high availability configuration
– Incremental mirroring/replication for disaster recovery
– Consistent snapshots
• Talk to us about what else you should consider
Q & A
@mapr maprtech
Engage with us!
MapR
maprtech
mapr-technologies
Confidential © 2014 Actian Corporation
Real-Time Analytics
Paige Roberts
April, 2015
Hadoop & Analytics Evangelist
Actian Hadoop & Analytics Center of Excellence
Agenda
About Actian
Advantages of Data-Driven Business
What Do I Mean By Real-Time?
Real-Time Challenge: ATM Fraud
How Actian Does It
A Few Words About Actian
$140M Revenues + Profitable
10,000+ Customers
Global Presence: 8 worldwide offices, 7x24 multinational support model
“Fast becoming a big data powerhouse to challenge the market.” Forrester
“Actian is now very powerfully positioned in the big data and analytics markets.” Bloor
Companies Using Big Data Strategically Outperform
…. At the Expense of Those That Don’t
[Chart: Revenue and EBITDA growth for Big Data companies vs. Other Companies, by sector: Grocers, Online Retailers, Big Box Retailers, Casinos, Credit Cards, Insurance]
• Predictive
• Real-time
• All Data
• New Insights
• Accuracy
Note: Percentage, 10 year CAGR McKinsey Report on Big Data.
What Does Real-Time Mean to Us?
Human comfortable interactivity
Streaming data processing
Sub-second response
Real-Time Analytics – Many Applications
• Solar Power Company – New customer targeting – Smart meter data
• Sportswear Company – Brand loyalty – Wearable data
• Bank – ATM Fraud – Router data
Large US Bank Needs Help
• Multi-billion dollar American
bank / financial holding
company
• Provides deposit, credit,
trust, and investment
services to a broad range of
clients
• Operates nearly 1,500 retail
branches and more than
2,000 ATMs
Fraud Kept This Bank’s Execs Up at Night
[Chart axis label: Number of times faster than Impala]
What is the Worst Gotcha About ATM Fraud?
In the majority of cases, banks are required
to reimburse customer losses.
https://www.tycois.com/insights-and-opinions/articles/atm-skimming-costs-banks
In spite of that, 67% of U.S. adults would
switch to another institution after
experiencing ATM fraud or a data breach.
http://www.harrisinteractive.com/NewsRoom/HarrisPolls/tabid/447/ctl/ReadCustom%20Default/mid/1508/ArticleId/1515/Default.aspx
This is What You Call a Delayed Reaction
Time to Call in the Elephant
Actian Vortex: The Elephant’s Best Friend
[Diagram: Actian Vortex architecture]
DATA PLATFORM (Actian Vortex):
• Elastic Data Preparation: DataFlow
• SQL Analytics: Vector in Hadoop
• Graph Analytics: SPARQLverse
• Machine Learning & Predictive Analytics: DataFlow
• Library of Analytic Blueprints
• Actian Management Console
ANALYTIC APPS: Financial Services, Health Care, Other Verticals
SOURCE DATA: Databases / Marts, Warehouses, Cloud / SaaS Applications, Structured & Unstructured Data, Enterprise Applications
APPLICATION DEV: Application Development and Tools (SQL, Java, C/C++, Python)
INFRASTRUCTURE: Deployment Options (@Customer)
powered by KNIME
Actian Vortex: High Performance Analytics at Scale in Hadoop
Powered by KNIME
Stopping Fraud in Real Time
https://www.youtube.com/watch?v=u1QoHCpOUOU
Actian Vector in Hadoop: Built for Speed
• Vector Processing: Single Instruction Multiple Data
• 2nd Gen Column Store: Limit I/O; efficient real time updates
• Smarter Compression: Maximize throughput; vectorized decompression
• Exploiting Chip Cache: Process data on chip – not in RAM
• Multi-core Parallelism: Maximize system resource utilization…
• Storage Indexes: Quickly identify candidate data blocks; minimize IO
[Chart: Time / Cycles to Process vs. Data Processed for DISK, RAM, and CHIP. Cycle labels: 2-20, 150-250, Millions. Data labels: 10GB, 2-3GB, 40-400MB.]
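The storage-index idea, keeping a small summary per data block so a query can skip blocks that cannot contain matches, can be sketched with per-block minimum and maximum values. This is a generic illustration under that assumption, not Actian's actual implementation; the block size and data are made up.

```python
def build_index(values, block_size):
    """Split values into blocks and record (min, max) for each block."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    index = [(min(block), max(block)) for block in blocks]
    return blocks, index


def scan_equal(blocks, index, target):
    """Return matches for target, reading only blocks whose [min, max] range could contain it."""
    matches, blocks_read = [], 0
    for block, (low, high) in zip(blocks, index):
        if low <= target <= high:
            blocks_read += 1
            matches.extend(v for v in block if v == target)
    return matches, blocks_read


# Hypothetical column of transaction amounts, stored in sorted order.
amounts = [1, 3, 5, 7, 10, 12, 15, 18, 20, 25, 30, 35]
blocks, index = build_index(amounts, block_size=4)
matches, blocks_read = scan_equal(blocks, index, 15)
```

Here only one of the three blocks is read to answer the lookup; the benefit depends on how well the data is clustered, since overlapping block ranges force more reads.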
How Fast?
How Fast?
What to Look For in SQL in Hadoop
Open: No vendor lock-in
• Collaborative architecture
• Open access to Actian data storage formats
• Support for other formats
• Hadoop distribution and ecosystem application support
Fast: Results you need when you need them
• Fastest data prep and ingestion
• Fastest analytic engines
• Unbridled processing power on data nodes in a Hadoop cluster
Enterprise-Grade: Proven technology advantages
• Full SQL support
• Extreme scalability
• Full security
• High Availability & Disaster Recovery
Free Actian Vortex Express Edition
www.actian.com
facebook.com/actiancorp
@actiancorp
Thank You
Download Actian Vortex Express
Free Forever
http://bigdata.actian.com/sql-in-hadoop
Question and Answer Session
(please submit questions)
Q & A
Dale Kim
Director of Industry Solutions
MapR
Paige Roberts
Hadoop & Analytics Evangelist
Actian
Please use the same URL you used to view today’s live event
for the archive event, plus we will be sending you a follow-up
email with that URL once the archive is posted!
Thank you for participating in
today’s roundtable web event
Just by attending this event the winner of the
$100 AmEx Gift Card is…….