big data infrastructure - github pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · cs...
TRANSCRIPT
Big Data Infrastructure
Week 12: Real-Time Data Analytics (1/2)
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States���See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
CS 489/698 Big Data Infrastructure (Winter 2016)
Jimmy LinDavid R. Cheriton School of Computer Science
University of Waterloo
March 29, 2016
These slides are available at http://lintool.github.io/bigdata-2016w/
OLTP/OLAP Architecture
OLTP OLAPETL���
(Extract, Transform, and Load)
Twitter’s data warehousing architectureWhat’s the issue?
real-time vs.
online vs.
streaming
What is a data stream?¢ Sequence of items:
l Structured (e.g., tuples)l Ordered (implicitly or timestamped)
l Arriving continuously at high volumes
l Sometimes not possible to store entirely
l Sometimes not possible to even examine all items
What to do with data streams?¢ Network traffic monitoring
¢ Datacenter telemetry monitoring
¢ Sensor networks monitoring
¢ Credit card fraud detection
¢ Stock market analysis
¢ Online mining of click streams
¢ Monitoring social media streams
What’s the scale? Packet data streams¢ Single 2 Gb/sec link; say avg. packet size is 50 bytes
l Number of packets/sec = 5 million l Time per packet = 0.2 microseconds
¢ If we only capture header information per packet: ���source/destination IP, time, no. of bytes, etc. – at least 10 bytesl 50 MB per second
l 4+ TB per day
l Per link!
What if you wanted to do deep-packet inspection?Source: Minos Garofalakis, Berkeley CS 286
Converged IP/MPLS Network
PSTN
DSL/Cable Networks
Enterprise Networks
Network Operations Center (NOC)
BGP Peer
R1 R2 R3
What are the top (most frequent) 1000 (source, dest) pairs seen by R1 over the last month?
SELECT COUNT (R1.source, R1.dest) FROM R1, R2 WHERE R1.source = R2.source
SQL Join Query
How many distinct (source, dest) pairs have been seen by both R1 and R2 but not R3?
Set-Expression Query
Off-line analysis – Data access is slow, expensive
DBMS
Source: Minos Garofalakis, Berkeley CS 286
Common Architecture
¢ Data stream management system (DSMS) at observation pointsl Voluminous streams-in, reduced streams-out
¢ Database management system (DBMS)l Outputs of DSMS can be treated as data feeds to databases
DSMS
DSMS
DBMS
data streamsqueries
queriesdata feeds
Source: Peter Bonz
OLTP/OLAP Architecture
OLTP OLAPETL���
(Extract, Transform, and Load)
DBMS vs. DSMS
DBMSl Model: persistent relations l Relation: tuple set/bag
l Data update: modifications
l Query: transient
l Query answer: exactl Query evaluation: arbitrary
l Query plan: fixed
DSMSl Model: (mostly) transient relations l Relation: tuple sequence
l Data update: appends
l Query: persistent
l Query answer: approximatel Query evaluation: one pass
l Query plan: adaptive
Source: Peter Bonz
What makes it hard?¢ Intrinsic challenges:
l Volumel Velocity
l Limited storage
l Strict latency requirements
¢ System challenges:l Load balancing
l Unreliable and out-of-order message deliveryl Fault-tolerance
l Consistency semantics (at most once, exactly once, at least once)
What exactly do you do?¢ “Standard” relational operations:
l Selectl Project
l Transform (i.e., apply custom UDF)
l Group by
l Joinl Aggregations
¢ What else do you need to make this “work”?
Issues of Semantics¢ Group by… aggregate
l When do you stop grouping and start aggregating?
¢ Joining a stream and a static sourcel Simple lookup
¢ Joining two streamsl How long do you wait for the join key in the other stream?
¢ Joining two streams, group by and aggregationl When do you stop joining?
What’s the solution?
Windows¢ Mechanism for extracting finite relations from an infinite stream
¢ Windows restrict processing scope:
l Windows based on ordering attributes (e.g., time) l Windows based on item (record) counts
l Windows based on explicit markers (e.g., punctuations)
l Variants (e.g., some semantic partitioning constraint)
Windows on Ordering Attributes¢ Assumes the existence of an attribute that defines the order of
stream elements (e.g., time)
¢ Let T be the window size in units of the ordering attribute
t1 t2 t3 t4 t1' t2’ t3’ t4’
t1 t2 t3
sliding window
tumbling window
ti’ – ti = T
ti+1 – ti = T
Source: Peter Bonz
Windows on Counts¢ Window of size N elements (sliding, tumbling) over the stream
¢ Challenges:
l Problematic with non-unique timestamps: non-deterministic outputl Unpredictable window size (and storage requirements)
t1 t2 t3 t1' t2’ t3’ t4’
Source: Peter Bonz
Windows from “Punctuations”¢ Application-inserted “end-of-processing”
l Example: stream of actions… “end of user session”
¢ Propertiesl Advantage: application-controlled semantics
l Disadvantage: unpredictable window size (too large or too small)
Common Techniques
Source: Wikipedia (Forge)
“Hello World” Stream Processing¢ Problem:
l Count the frequency of items in the stream
¢ Why?l Take some action when frequency exceeds a threshold
l Data mining: raw counts → co-occurring counts → association rules
The Raw Stream…
stream
items
Source: Peter Bonz
Divide Into Windows…
window 1 window 2 window 3
Source: Peter Bonz
First Window
Source: Peter Bonz
Second Window
Next Window
+
FrequencyCounts
second window
frequency counts
FrequencyCounts
frequency counts
Source: Peter Bonz
Window Counting¢ What’s the issue?
Lessons learned?Solutions are approximate (or lossy)
General Strategies¢ Sampling
¢ Hashing
Reservoir Sampling¢ Task: select s elements from a stream of size N with uniform
probabilityl N can be very very large
l We might not even know what N is! (infinite stream)
¢ Solution: Reservoir samplingl Store first s elements
l For the k-th element thereafter, keep with probability s/k ���(randomly discard an existing element)
¢ Example: s = 10l Keep first 10 elements
l 11th element: keep with 10/11l 12th element: keep with 10/12
l …
Reservoir Sampling: How does it work?¢ Example: s = 10
l Keep first 10 elementsl 11th element: keep with 10/11
¢ General case: at the (k + 1)th elementl Probability of selecting each item up until now is s/k
l Probability existing item is discarded: s/(k+1) × 1/s = 1/(k + 1)
l Probability existing item survives: k/(k + 1)
l Probability each item survives to (k + 1)th round: ���(s/k) × k/(k + 1) = s/(k + 1)
If we decide to keep it: sampled uniformly by definitionprobability existing item is discarded: 10/11 × 1/10 = 1/11probability existing item survives: 10/11
Hashing for Three Common Tasks¢ Cardinality estimation
l What’s the cardinality of set S?l How many unique visitors to this page?
¢ Set membershipl Is x a member of set S?
l Has this user seen this ad before?
¢ Frequency estimationl How many times have we observed x?
l How many queries has this user issued?
HashSet
HashSet
HashMap
HLL counter
Bloom Filter
CMS
HyperLogLog Counter¢ Task: cardinality estimation of set
l size() → number of unique elements in the set
¢ Observation: hash each item and examine the hash codel On expectation, 1/2 of the hash codes will start with 1
l On expectation, 1/4 of the hash codes will start with 01
l On expectation, 1/8 of the hash codes will start with 001
l On expectation, 1/16 of the hash codes will start with 0001
l …
How do we take advantage of this observation?
Bloom Filters¢ Task: keep track of set membership
l put(x) → insert x into the setl contains(x) → yes if x is a member of the set
¢ Componentsl m-bit bit vector
l k hash functions: h1 … hk
0 0 0 0 0 0 0 0 0 0 0 0
Bloom Filters: put
0 0 0 0 0 0 0 0 0 0 0 0
xput h1(x) = 2h2(x) = 5h3(x) = 11
Bloom Filters: put
0 1 0 0 1 0 0 0 0 0 1 0
xput
Bloom Filters: contains
0 1 0 0 1 0 0 0 0 0 1 0
xcontains h1(x) = 2h2(x) = 5h3(x) = 11
Bloom Filters: contains
0 1 0 0 1 0 0 0 0 0 1 0
xcontains h1(x) = 2h2(x) = 5h3(x) = 11
AND = YES A[h1(x)]A[h2(x)]A[h3(x)]
Bloom Filters: contains
0 1 0 0 1 0 0 0 0 0 1 0
ycontains h1(y) = 2h2(y) = 6h3(y) = 9
Bloom Filters: contains
0 1 0 0 1 0 0 0 0 0 1 0
ycontains h1(y) = 2h2(y) = 6h3(y) = 9
What’s going on here?
AND = NO A[h1(y)]A[h2(y)]A[h3(y)]
Bloom Filters¢ Error properties: contains(x)
l False positives possiblel No false negatives
¢ Usage:l Constraints: capacity, error probability
l Tunable parameters: size of bit vector m, number of hash functions k
Count-Min Sketches¢ Task: frequency estimation
l put(x) → increment count of x by onel get(x) → returns the frequency of x
¢ Componentsl k hash functions: h1 … hk
l m by k array of counters
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
m
k
Count-Min Sketches: put
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
xput h1(x) = 2h2(x) = 5h3(x) = 11h4(x) = 4
Count-Min Sketches: put
0 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 1 0 0 0 0 0 0 0 0
xput
Count-Min Sketches: put
0 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 1 0 0 0 0 0 0 0 0
xput h1(x) = 2h2(x) = 5h3(x) = 11h4(x) = 4
Count-Min Sketches: put
0 2 0 0 0 0 0 0 0 0 0 0
0 0 0 0 2 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 0
0 0 0 2 0 0 0 0 0 0 0 0
xput
Count-Min Sketches: put
0 2 0 0 0 0 0 0 0 0 0 0
0 0 0 0 2 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 0
0 0 0 2 0 0 0 0 0 0 0 0
yput h1(y) = 6h2(y) = 5h3(y) = 12h4(y) = 2
Count-Min Sketches: put
0 2 0 0 0 1 0 0 0 0 0 0
0 0 0 0 3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 1
0 1 0 2 0 0 0 0 0 0 0 0
yput
Count-Min Sketches: get
0 2 0 0 0 1 0 0 0 0 0 0
0 0 0 0 3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 1
0 1 0 2 0 0 0 0 0 0 0 0
xget h1(x) = 2h2(x) = 5h3(x) = 11h4(x) = 4
Count-Min Sketches: get
0 2 0 0 0 1 0 0 0 0 0 0
0 0 0 0 3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 1
0 1 0 2 0 0 0 0 0 0 0 0
xget h1(x) = 2h2(x) = 5h3(x) = 11h4(x) = 4
A[h3(x)]MIN = 2
A[h1(x)]A[h2(x)]
A[h4(x)]
Count-Min Sketches: get
0 2 0 0 0 1 0 0 0 0 0 0
0 0 0 0 3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 1
0 1 0 2 0 0 0 0 0 0 0 0
yget h1(y) = 6h2(y) = 5h3(y) = 12h4(y) = 2
Count-Min Sketches: get
0 2 0 0 0 1 0 0 0 0 0 0
0 0 0 0 3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 1
0 1 0 2 0 0 0 0 0 0 0 0
yget h1(y) = 6h2(y) = 5h3(y) = 12h4(y) = 2
MIN = 1 A[h3(y)]
A[h1(y)]A[h2(y)]
A[h4(y)]
Count-Min Sketches¢ Error properties:
l Reasonable estimation of heavy-hittersl Frequent over-estimation of tail
¢ Usage:l Constraints: number of distinct events, distribution of events, error
boundsl Tunable parameters: number of counters m, number of hash functions k,
size of counters
Three Common Tasks¢ Cardinality estimation
l What’s the cardinality of set S?l How many unique visitors to this page?
¢ Set membershipl Is x a member of set S?
l Has this user seen this ad before?
¢ Frequency estimationl How many times have we observed x?
l How many queries has this user issued?
HashSet
HashSet
HashMap
HLL counter
Bloom Filter
CMS
Source: Wikipedia (River)
Stream Processing ArchitecturesNext time:
Source: Wikipedia (Japanese rock garden)
Questions?