big data infrastructure - github pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · cs...

Big Data Infrastructure

Week 12: Real-Time Data Analytics (1/2)

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States��See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

CS 489/698 Big Data Infrastructure (Winter 2016)

Jimmy LinDavid R. Cheriton School of Computer Science

University of Waterloo

March 29, 2016

These slides are available at http://lintool.github.io/bigdata-2016w/

OLTP/OLAP Architecture

OLTP OLAPETL��

(Extract, Transform, and Load)

Twitter’s data warehousing architectureWhat’s the issue?

real-time vs.

online vs.

streaming

What is a data stream?¢  Sequence of items:

l  Structured (e.g., tuples)l  Ordered (implicitly or timestamped)

l  Arriving continuously at high volumes

l  Sometimes not possible to store entirely

l  Sometimes not possible to even examine all items

What to do with data streams?¢  Network traffic monitoring

¢  Datacenter telemetry monitoring

¢  Sensor networks monitoring

¢  Credit card fraud detection

¢  Stock market analysis

¢  Online mining of click streams

¢  Monitoring social media streams

What’s the scale? Packet data streams¢  Single 2 Gb/sec link; say avg. packet size is 50 bytes

l  Number of packets/sec = 5 million l  Time per packet = 0.2 microseconds

¢  If we only capture header information per packet: ��source/destination IP, time, no. of bytes, etc. – at least 10 bytesl  50 MB per second

l  4+ TB per day

l  Per link!

What if you wanted to do deep-packet inspection?Source: Minos Garofalakis, Berkeley CS 286

Converged IP/MPLS Network

PSTN

DSL/Cable Networks

Enterprise Networks

Network Operations Center (NOC)

BGP Peer

R1 R2 R3

What are the top (most frequent) 1000 (source, dest) pairs seen by R1 over the last month?

SELECT COUNT (R1.source, R1.dest) FROM R1, R2 WHERE R1.source = R2.source

SQL Join Query

How many distinct (source, dest) pairs have been seen by both R1 and R2 but not R3?

Set-Expression Query

Off-line analysis – Data access is slow, expensive

DBMS

Source: Minos Garofalakis, Berkeley CS 286

Common Architecture

¢  Data stream management system (DSMS) at observation pointsl  Voluminous streams-in, reduced streams-out

¢  Database management system (DBMS)l  Outputs of DSMS can be treated as data feeds to databases

DSMS

DSMS

DBMS

data streamsqueries

queriesdata feeds

Source: Peter Bonz

OLTP/OLAP Architecture

OLTP OLAPETL��

(Extract, Transform, and Load)

DBMS vs. DSMS

DBMSl  Model: persistent relations l  Relation: tuple set/bag

l  Data update: modifications

l  Query: transient

l  Query answer: exactl  Query evaluation: arbitrary

l  Query plan: fixed

DSMSl  Model: (mostly) transient relations l  Relation: tuple sequence

l  Data update: appends

l  Query: persistent

l  Query answer: approximatel  Query evaluation: one pass

l  Query plan: adaptive

Source: Peter Bonz

What makes it hard?¢  Intrinsic challenges:

l  Volumel  Velocity

l  Limited storage

l  Strict latency requirements

¢  System challenges:l  Load balancing

l  Unreliable and out-of-order message deliveryl  Fault-tolerance

l  Consistency semantics (at most once, exactly once, at least once)

What exactly do you do?¢  “Standard” relational operations:

l  Selectl  Project

l  Transform (i.e., apply custom UDF)

l  Group by

l  Joinl  Aggregations

¢  What else do you need to make this “work”?

Issues of Semantics¢  Group by… aggregate

l  When do you stop grouping and start aggregating?

¢  Joining a stream and a static sourcel  Simple lookup

¢  Joining two streamsl  How long do you wait for the join key in the other stream?

¢  Joining two streams, group by and aggregationl  When do you stop joining?

What’s the solution?

Windows¢  Mechanism for extracting finite relations from an infinite stream

¢  Windows restrict processing scope:

l  Windows based on ordering attributes (e.g., time) l  Windows based on item (record) counts

l  Windows based on explicit markers (e.g., punctuations)

l  Variants (e.g., some semantic partitioning constraint)

Windows on Ordering Attributes¢  Assumes the existence of an attribute that defines the order of

stream elements (e.g., time)

¢  Let T be the window size in units of the ordering attribute

t1 t2 t3 t4 t1' t2’ t3’ t4’

t1 t2 t3

sliding window

tumbling window

ti’ – ti = T

ti+1 – ti = T

Source: Peter Bonz

Windows on Counts¢  Window of size N elements (sliding, tumbling) over the stream

¢  Challenges:

l  Problematic with non-unique timestamps: non-deterministic outputl  Unpredictable window size (and storage requirements)

t1 t2 t3 t1' t2’ t3’ t4’

Source: Peter Bonz

Windows from “Punctuations”¢  Application-inserted “end-of-processing”

l  Example: stream of actions… “end of user session”

¢  Propertiesl  Advantage: application-controlled semantics

l  Disadvantage: unpredictable window size (too large or too small)

Common Techniques

Source: Wikipedia (Forge)

“Hello World” Stream Processing¢  Problem:

l  Count the frequency of items in the stream

¢  Why?l  Take some action when frequency exceeds a threshold

l  Data mining: raw counts → co-occurring counts → association rules

The Raw Stream…

stream

items

Source: Peter Bonz

Divide Into Windows…

window 1 window 2 window 3

Source: Peter Bonz

First Window

Source: Peter Bonz

Second Window

Next Window

+

FrequencyCounts

second window

frequency counts

FrequencyCounts

frequency counts

Source: Peter Bonz

Window Counting¢  What’s the issue?

Lessons learned?Solutions are approximate (or lossy)

General Strategies¢  Sampling

¢  Hashing

Reservoir Sampling¢  Task: select s elements from a stream of size N with uniform

probabilityl  N can be very very large

l  We might not even know what N is! (infinite stream)

¢  Solution: Reservoir samplingl  Store first s elements

l  For the k-th element thereafter, keep with probability s/k ��(randomly discard an existing element)

¢  Example: s = 10l  Keep first 10 elements

l  11th element: keep with 10/11l  12th element: keep with 10/12

l  …

Reservoir Sampling: How does it work?¢  Example: s = 10

l  Keep first 10 elementsl  11th element: keep with 10/11

¢  General case: at the (k + 1)th elementl  Probability of selecting each item up until now is s/k

l  Probability existing item is discarded: s/(k+1) × 1/s = 1/(k + 1)

l  Probability existing item survives: k/(k + 1)

l  Probability each item survives to (k + 1)th round: ��(s/k) × k/(k + 1) = s/(k + 1)

If we decide to keep it: sampled uniformly by definitionprobability existing item is discarded: 10/11 × 1/10 = 1/11probability existing item survives: 10/11

Hashing for Three Common Tasks¢  Cardinality estimation

l  What’s the cardinality of set S?l  How many unique visitors to this page?

¢  Set membershipl  Is x a member of set S?

l  Has this user seen this ad before?

¢  Frequency estimationl  How many times have we observed x?

l  How many queries has this user issued?

HashSet

HashSet

HashMap

HLL counter

Bloom Filter

CMS

HyperLogLog Counter¢  Task: cardinality estimation of set

l  size() → number of unique elements in the set

¢  Observation: hash each item and examine the hash codel  On expectation, 1/2 of the hash codes will start with 1

l  On expectation, 1/4 of the hash codes will start with 01



l  …

How do we take advantage of this observation?

Bloom Filters¢  Task: keep track of set membership

l  put(x) → insert x into the setl  contains(x) → yes if x is a member of the set

¢  Componentsl  m-bit bit vector

l  k hash functions: h1 … hk

0 0 0 0 0 0 0 0 0 0 0 0

Bloom Filters: put

0 0 0 0 0 0 0 0 0 0 0 0

xput h1(x) = 2h2(x) = 5h3(x) = 11

Bloom Filters: put

0 1 0 0 1 0 0 0 0 0 1 0

xput

Bloom Filters: contains

0 1 0 0 1 0 0 0 0 0 1 0

xcontains h1(x) = 2h2(x) = 5h3(x) = 11


0 1 0 0 1 0 0 0 0 0 1 0

xcontains h1(x) = 2h2(x) = 5h3(x) = 11

AND = YES A[h1(x)]A[h2(x)]A[h3(x)]


0 1 0 0 1 0 0 0 0 0 1 0

ycontains h1(y) = 2h2(y) = 6h3(y) = 9


0 1 0 0 1 0 0 0 0 0 1 0

ycontains h1(y) = 2h2(y) = 6h3(y) = 9

What’s going on here?

AND = NO A[h1(y)]A[h2(y)]A[h3(y)]

Bloom Filters¢  Error properties: contains(x)

l  False positives possiblel  No false negatives

¢  Usage:l  Constraints: capacity, error probability

l  Tunable parameters: size of bit vector m, number of hash functions k

Count-Min Sketches¢  Task: frequency estimation

l  put(x) → increment count of x by onel  get(x) → returns the frequency of x

¢  Componentsl  k hash functions: h1 … hk

l  m by k array of counters

0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0

m

k

Count-Min Sketches: put

0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0

xput h1(x) = 2h2(x) = 5h3(x) = 11h4(x) = 4


0 1 0 0 0 0 0 0 0 0 0 0

0 0 0 0 1 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 1 0

0 0 0 1 0 0 0 0 0 0 0 0

xput


0 1 0 0 0 0 0 0 0 0 0 0

0 0 0 0 1 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 1 0

0 0 0 1 0 0 0 0 0 0 0 0

xput h1(x) = 2h2(x) = 5h3(x) = 11h4(x) = 4


0 2 0 0 0 0 0 0 0 0 0 0

0 0 0 0 2 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 2 0

0 0 0 2 0 0 0 0 0 0 0 0

xput


0 2 0 0 0 0 0 0 0 0 0 0

0 0 0 0 2 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 2 0

0 0 0 2 0 0 0 0 0 0 0 0

yput h1(y) = 6h2(y) = 5h3(y) = 12h4(y) = 2


0 2 0 0 0 1 0 0 0 0 0 0

0 0 0 0 3 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 2 1

0 1 0 2 0 0 0 0 0 0 0 0

yput

Count-Min Sketches: get

0 2 0 0 0 1 0 0 0 0 0 0

0 0 0 0 3 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 2 1

0 1 0 2 0 0 0 0 0 0 0 0

xget h1(x) = 2h2(x) = 5h3(x) = 11h4(x) = 4


0 2 0 0 0 1 0 0 0 0 0 0

0 0 0 0 3 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 2 1

0 1 0 2 0 0 0 0 0 0 0 0

xget h1(x) = 2h2(x) = 5h3(x) = 11h4(x) = 4

A[h3(x)]MIN = 2

A[h1(x)]A[h2(x)]

A[h4(x)]


0 2 0 0 0 1 0 0 0 0 0 0

0 0 0 0 3 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 2 1

0 1 0 2 0 0 0 0 0 0 0 0

yget h1(y) = 6h2(y) = 5h3(y) = 12h4(y) = 2


0 2 0 0 0 1 0 0 0 0 0 0

0 0 0 0 3 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 2 1

0 1 0 2 0 0 0 0 0 0 0 0

yget h1(y) = 6h2(y) = 5h3(y) = 12h4(y) = 2

MIN = 1 A[h3(y)]

A[h1(y)]A[h2(y)]

A[h4(y)]

Count-Min Sketches¢  Error properties:

l  Reasonable estimation of heavy-hittersl  Frequent over-estimation of tail

¢  Usage:l  Constraints: number of distinct events, distribution of events, error

boundsl  Tunable parameters: number of counters m, number of hash functions k,

size of counters

Three Common Tasks¢  Cardinality estimation

l  What’s the cardinality of set S?l  How many unique visitors to this page?

¢  Set membershipl  Is x a member of set S?

l  Has this user seen this ad before?

¢  Frequency estimationl  How many times have we observed x?

l  How many queries has this user issued?

HashSet

HashSet

HashMap

HLL counter

Bloom Filter

CMS

Source: Wikipedia (River)

Stream Processing ArchitecturesNext time:

Source: Wikipedia (Japanese rock garden)

Questions?

big data infrastructure - github pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · cs...

Documents