london hug

1©MapR Technologies - Confidential

Real-time and Long-time with Storm and Hadoop


Real-time and Long-time with Storm and Hadoop MapR


Contact:– [email protected]– @ted_dunning

Slides and such:– http://info.mapr.com/ted-uk-05-2012

Hash tag: #mapr_uk

Collective notes: http://bit.ly/JDCRhc

mailto:[email protected]

http://info.mapr.com/ted-uk-05-2012



Company Background

MapR provides the industry’s best Hadoop Distribution– Combines the best of the Hadoop community

contributions with significant internally financed infrastructure development

Background of Team– Deep management bench with extensive analytic,

storage, virtualization, and open source experience– Google, EMC, Cisco, VMWare, Network Appliance, IBM,

Microsoft, Apache Foundation, Aster Data, Brio, ParAccel Proven – MapR used across industries (Financial Services, Media,

Telcom, Health Care, Internet Services, Government) – Strategic OEM relationship with EMC and Cisco– Over 1,000 installs


Expanding Hadoop Use Cases

NFS for file-based

applications

Hadoop APIsfor HadoopApplications

ODBC (JDBC) for SQL-based applications

Blue = MapR Innovations

Real-time Applications

MissionCritical and SLA

dependent Applications


MapR’s Complete Distribution for Apache Hadoop

MapR Heatmap™

LDAP, NISIntegration

Quotas, Alerts, Alarms

CLI, REST APT

Hive Pig Oozle Sqoop HBase Whirr

Mahout Cascading NaglosIntegration

GangliaIntegration

Flume Zoo-keeper

MapR Control System

DirectAccess

NFS

Real-Time Streaming Volumes Mirrors Snap-

shotsData

Placement

No NameNodeArchitecture

High PerformanceDirect Shuffle

Stateful Failoverand Self Healing

2.7MapR’s Storage Services™

Integrated, tested, hardened andSupported

100% Hadoop, HBase, HDFS API compatible

Easy portability/migration between distributions

Unique advanced features

No changes required to Hadoop applications

Runs on commodity hardware


So what about that real-time stuff?


The Challenge

Hadoop is great of processing vats of data– But sucks for real-time (by design!)

Storm is great for real-time processing– But lacks any way to deal with batch processing

It sounds like there isn’t a solution– Neither fashionable solution handles everything


This is not a problem.

It’s an opportunity!


t

now

Hadoop is Not Very Real-time

UnprocessedData

Fully processed

Latest full period

Hadoop job takes this long for this data


Need to Plug the Hole in Hadoop

We have real-time data with limited state– Exactly what Storm does– And what Hadoop does not

We also have long-term analytics with lots of state– Exactly what Hadoop does– And what Storm does not

Can Storm and Hadoop be combined?


t

now

Hadoop works great back here

Storm workshere

Real-time and Long-time together

Blended view

Blended view

Blended View


An Example

I want to know how many queries I get– Per second, minute, day, week

Results should be available– within <2 seconds 99.9+% of the time– within 30 seconds almost always

History should last >3 years Should work for 0.001 q/s up to 100,000 q/s Failure tolerant, yadda, yadda


Rough Design – Data Flow

Search Engine

Query Event Spout

Logger Bolt

Counter Bolt

Raw Logs

LoggerBolt

Semi Agg

Hadoop Aggregator

Snap

Long agg

Query Event Spout

Counter Bolt

Logger Bolt


Counter Bolt Detail

Input: Labels to count Output: Short-term semi-aggregated counts– (time-window, label, count)

Input is logged until next flush Non-zero counts emitted on flush if– event count reaches threshold (typical 100K)– time since last count reaches threshold (typical 1-10s)

Tuples acked when counts emitted Double count probability is > 0 but very small


Counter Bolt Counterintuitivity

Counts are emitted for same label, same time window many times– these are semi-aggregated– this is a feature– tuples can be acked within 1s– time windows can be much longer than 1s

No need to send same label to same bolt– speeds failure recovery


Design Flexibility

Tuples can be ack’ed as soon as they hit the log– counter can recover state on failure– log is burn after write

Count flush interval can be extended without extending tuple timeout– Decreases currency of counts in semi-aggregates

Total bandwidth for log is typically not huge– All of twitter @10,000 messages per second = 10K x 2KB = 20MB/s


Counter Bolt No-nos

Cannot accumulate entire period in-memory– Tuples must be ack’ed much sooner– State must be persisted before ack’ing– State can easily grow too large to handle without disk access

Cannot persist entire count table at once – Incremental persistence required


Guarantees

Counter output volume is small-ish– the greater of k tuples per 100K inputs or k tuple/s– 1 tuple/s/label/bolt for this exercise

Persistence layer must provide guarantees– distributed against node failure– must have either readable flush or closed-append

HDFS is distributed, but provides no guarantees and strange semantics

MapRfs is distributed, provides all necessary guarantees


Failure Modes

Bolt failure– buffered tuples will go un’acked– after timeout, tuples will be resent– timeout ≈ 10s– if failure occurs after persistence, before acking, then double-counting is

possible Storage (with MapR)– most failures invisible– a few continue within 0-2s, some take 10s– catastrophic cluster restart can take 2-3 min– logger can buffer this much easily


Presentation Layer

Presentation must– read recent output of Logger bolt– read relevant output of Hadoop jobs– combine semi-aggregated records

User will see– counts that increment within 0-2 s of events– seamless meld of short and long-term data


Example 2 – Real-time learning

My system has to– learn a response model

and– select training data– in real-time

Data rate up to 100K queries per second


Door Number 3 – AB testing in real-time

I have 15 versions of my landing page Each visitor is assigned to a version– Which version?

A conversion or sale or whatever can happen– How long to wait?

Some versions of the landing page are horrible– Don’t want to give them traffic


Real-time Constraints

Selection must happen in <20 ms almost all the time Training events must be handled in <20 ms Failover must happen within 5 seconds Client should timeout and back-off– no need for an answer after 500ms

State persistence required


Rough Design

DRPC Spout Query Event Spout

Logger Bolt

Counter Bolt

Raw Logs

Model State

Timed Join Model

Logger Bolt

Conversion Detector

Selector Layer


A Quick Diversion

You see a coin– What is the probability of heads?– Could it be larger or smaller than that?

I flip the coin and while it is in the air ask again I catch the coin and ask again I look at the coin (and you don’t) and ask again Why does the answer change?– And did it ever have a single value?


A First Conclusion

Probability as expressed by humans is subjective and depends on information and experience


A Second Diversion

What is the mass of the moon?– 1/2 degree @ 385 Mm = ~ 3.8 Mm diameter (really about 3.4-ish)– V = 1/6 x pi x 3.83 x 1018 m3 = ~ 29 x 1018 m3 (really about 22)– m = rho V = 4 Mg/m3 x 29 x 1018 m3 = 1.2 x 1023 kg (really about 0.7)

Is that the exact number?– Shouldn’t we have confidence bounds?

Wikipedia says: 7.3477 × 1022 kg– Is that the exact number?– Shouldn’t they have confidence bounds?


A Second Conclusion

A single number is a bad way to express uncertain knowledge

A distribution of values might be better


I Dunno


5 and 5


2 and 10


Bayesian Bandit

Compute distributions based on data Sample p1 and p2 from these distributions

Put a coin in bandit 1 if p1 > p2

Else, put the coin in bandit 2


And it works!


Video Demo


The Code

Select an alternative

Select and learn

But we already know how to count!

n = dim(k)[1] p0 = rep(0, length.out=n) for (i in 1:n) { p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1) } return (which(p0 == max(p0)))

for (z in 1:steps) { i = select(k) j = test(i) k[i,j] = k[i,j]+1 } return (k)


The Basic Idea

We can encode a distribution by sampling Sampling allows unification of exploration and exploitation

Can be extended to more general response models


Contact:– [email protected]– @ted_dunning

Slides and such:– http://info.mapr.com/ted-uk-05-2012

mailto:[email protected]




MapR’s Innovations


Thank You

london hug

Technology

mapr innovations

challenge hadoop

paraccel proven mapr

period hadoop job

hadoop community contributions

hadoop applications

realtime processing

log counter