lahar: extracting events from probabilistic streams
DESCRIPTION
LAHAR: Extracting Events from Probabilistic Streams. Chris Re, Julie Letchner, Magdalena Balazinska and Dan Suciu, University of Washington. What is a Lahar? This is a Lahar. It’s a massive, fast stream of dirt(y data).
TRANSCRIPT
LAHAR: Extracting Events from Probabilistic Streams
Chris Re, Julie Letchner,
Magdalena Balazinska and Dan Suciu
University of Washington
Lahar -- SIGMOD 2008 -- Christopher Re
What is a Lahar?
This is a Lahar
May 18, 1980 ~ 8:27am … a few minutes later
It’s a massive, fast stream of dirt(y data)
Our system, Lahar, processes queries on massive, dirty streams of data
Event Queries
Motivating app: RFID
Event queries as in Cayuga, SASE and Snoop
Complex sequences using projections, predicates, …
Joe entered office 422 at t=8
Query: “Alert when Joe enters 422”
i.e., Joe is outside 422, then inside 422
Challenges: Tracking Joe’s Location
6th Floor in CS building
Blue ring is Joe’s Location
Antennas
Two problems:
1. Missed readings
2. Granularity mismatch
Proposal: infer location, keep the probabilities, and query with Lahar
Model-based view [Deshpande et al] of an HMM
Lahar retains probabilities, achieves higher quality (P/R) and is still efficient.
Outline
RFID streams to probabilistic streams
Lahar queries on probabilistic streams
Query algorithms: Regular and Extended Regular
Experiments
Tracking Joe’s Location
Blue ring is ground truth
Antennas
6th Floor in CS building
Probabilities via particle filter
Each orange particle is a guess of Joe’s location
Blue ring is ground truth
Antennas
Particles guess many locations per timestep, so data are uncertain
6th Floor in CS building
Tag  t  Loc    P
Joe  7  422    0.4
        Hall3  0.4
        Hall4  0.2
Joe  8  422    0.6
        Hall3  0.2
        Hall4  0.2
Sue  7  …      …
From particles to a probabilistic stream
At(tag,loc)
Query Particle Filter output via At – a model based view
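As a rough illustration of how a set of particles could become one timestep of At(tag,loc), the sketch below counts particles per location and normalizes. It assumes equally weighted particles, which the slides do not state; the function name and the ten-particle set are made up so that the result matches Joe's row at t=7.

```python
from collections import Counter

def particles_to_distribution(particle_locations):
    """Turn one timestep's particles into a {location: probability} map."""
    counts = Counter(particle_locations)
    total = len(particle_locations)
    return {loc: n / total for loc, n in counts.items()}

# Hypothetical particle set for Joe at t=7 (10 equally weighted guesses).
particles_t7 = ["422"] * 4 + ["Hall3"] * 4 + ["Hall4"] * 2
at_joe_t7 = particles_to_distribution(particles_t7)
print(at_joe_t7)  # {'422': 0.4, 'Hall3': 0.4, 'Hall4': 0.2}
```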
Semantics of the Model
A possible stream (world) of At(tag,loc):
Tag  t  Loc
Joe  7  Hall4
Joe  8  422
Sue  7  …
Prob = 0.2 * 0.6 * …
A query q returns the probability that q is true at each time t
“Joe enters 422” @ t=8: (0.4+0.2) * 0.6 = 0.36
where 0.4+0.2 is the probability that Joe is outside 422 (in Hall3, Hall4) at t=7
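The possible-worlds reading can be checked by brute force. The sketch below enumerates every combination of Joe's location at t=7 and t=8, assuming the two timesteps are independent (as in the 0.2 * 0.6 product on the slide), and sums the worlds in which he is outside 422 and then inside it. The dictionary layout and function name are illustrative only.

```python
from itertools import product

# Marginals for Joe taken from the At(tag, loc) table.
at_joe = {
    7: {"422": 0.4, "Hall3": 0.4, "Hall4": 0.2},
    8: {"422": 0.6, "Hall3": 0.2, "Hall4": 0.2},
}

def prob_enters_422(stream, t):
    """Sum the probability of each world where Joe is outside 422 at t-1 and in 422 at t."""
    total = 0.0
    for prev_loc, cur_loc in product(stream[t - 1], stream[t]):
        if prev_loc != "422" and cur_loc == "422":
            total += stream[t - 1][prev_loc] * stream[t][cur_loc]
    return total

p = prob_enters_422(at_joe, 8)
print(round(p, 2))  # 0.36
```

Enumerating worlds like this grows exponentially with the number of timesteps, which is why the talk turns to automata next.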
Lahar Queries by Example
Alert when Joe is in hallway 4 and later in office 422
At('Joe','Hall4') ; At('Joe','422')
Alert when Joe is in hallway 4, and immediately in office 422
σ_{l='422'}(At('Joe','Hall4') ; At('Joe',l))
[Automaton diagrams for the two queries; edge labels: Joe in Hall4, Joe in 422, Joe]
Inspired by Cayuga [Demers et al 2006, White et al 2007]
Challenge with probabilities: the naïve approach is exponential, which in general is unavoidable (#P)
A hierarchy of Lahar queries
Regular Queries (Efficient, streamable): Alert when Joe enters 422
σ_{l='422'}(At('Joe','Hall4') ; At('Joe',l))
Extended Regular (Efficient, streamable): Alert when anyone enters 422
σ_{l='422'}(At(p,'Hall4') ; At(p,l))
Safe (Efficient, but not streamable)
Unsafe (Inefficient)
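Review: A non-probabilistic example
Alert me when Joe enters 422
q = σ_{l='422'}(At('Joe','Hall4') ; At('Joe',l))
[Automaton diagram: “Joe in Hall4” leads to state 1, “Joe in 422” leads to state 2 (Final); one more edge is labeled “Joe”]
Stream 1 (states: {} → {1} → {2}):
Tag  T  Loc
Joe  7  Hall 4
Joe  8  422
Accept at t = 8
Stream 2 (states: {} → {1} → {}):
Tag  T  Loc
Joe  7  Hall 4
Joe  8  423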
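A minimal sketch of the deterministic run, tracking the set of active automaton states after each of Joe's readings. The transition rule is an assumption read off the slide: a Hall4 reading activates state 1, and state 2 is reached only if the very next reading is 422. It reproduces the two state traces shown for the 422 and 423 streams.

```python
def step(states, loc):
    """Advance the set of active states on one reading for Joe."""
    nxt = set()
    if loc == "Hall4":
        nxt.add(1)                      # subgoal At('Joe','Hall4') matched
    if 1 in states and loc == "422":
        nxt.add(2)                      # next reading is 422: accept
    return nxt

def run(locations):
    """Return the state set before any input and after each reading."""
    states, trace = set(), [set()]
    for loc in locations:
        states = step(states, loc)
        trace.append(states)
    return trace

print(run(["Hall4", "422"]))  # [set(), {1}, {2}]
print(run(["Hall4", "423"]))  # [set(), {1}, set()]
```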
… now with probabilities
Alert me when Joe enters 422
q = σ_{l='422'}(At('Joe','Hall4') ; At('Joe',l))
[Automaton diagram: “Joe in Hall4” leads to state 1, “Joe in 422” leads to state 2 (Final); one more edge is labeled “Joe”]
Tag  T  Loc    P
Joe  7  Hall4  0.5
Joe  8  423    0.3
        422    0.6
Distribution on States
Start: {} 1.0
After t=7: {} 0.5, {1} 0.5
After t=8: {} 0.65, {1} 0.05, {2} 0.3
Accept t=8 with p = 0.3
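The distribution on states can be reproduced with a few lines of code. This is only a sketch under two assumptions that the slide does not spell out: timesteps are independent, and the probability mass not listed in the table (0.5 at t=7, 0.1 at t=8) means "no reading for Joe", which leaves the state unchanged. State 0 stands for the slide's {}.

```python
from collections import defaultdict

def step(state, loc):
    """Deterministic automaton: 0 is the slide's {}, 1 and 2 are its states; loc None = no reading."""
    if state == 2:
        state = 0                      # assumption: after an accept the run starts over
    if loc is None:
        return state                   # assumption: a missing reading changes nothing
    if state == 1 and loc == "422":
        return 2                       # Joe was in Hall4 and his next reading is 422
    return 1 if loc == "Hall4" else 0

def advance(dist, event_probs):
    """Push a distribution over states through one timestep of uncertain readings."""
    outcomes = dict(event_probs)
    outcomes[None] = 1.0 - sum(event_probs.values())   # leftover mass: no reading
    nxt = defaultdict(float)
    for state, p_state in dist.items():
        for loc, p_loc in outcomes.items():
            nxt[step(state, loc)] += p_state * p_loc
    return dict(nxt)

dist = {0: 1.0}
for event_probs in ({"Hall4": 0.5}, {"423": 0.3, "422": 0.6}):
    dist = advance(dist, event_probs)
    print(sorted((s, round(p, 2)) for s, p in dist.items()))
# [(0, 0.5), (1, 0.5)]
# [(0, 0.65), (1, 0.05), (2, 0.3)]
```

The distribution never has more entries than the automaton has states, so its size depends on the query and not on the length of the stream.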
Lies in the preceding slides… (technical details)
Richer predication: “Alert when Joe enters any office”
Translate query and input into an alphabet
[Automaton diagram: “Joe in Hall4” leads to state 1, “Joe in 422” leads to state 2 (Final); one more edge is labeled “Joe”]
Key technical detail: the alphabet stays small as the data grows, so the query is streamable
See paper for compilation
Extension to Extended regular
“Alert when anyone enters 422”
q = σ_{l='422'}(At(p,'Hall4') ; At(p,l))
q[p → 'Tom'] = σ_{l='422'}(At('Tom','Hall4') ; At('Tom',l))
q[p → 'Joe'] = σ_{l='422'}(At('Joe','Hall4') ; At('Joe',l))
(Obs 1) Each query is regular
(Obs 2) The queries refer to disjoint sets of events, hence they are probabilistically independent
Algorithm: (Obs 1) suggests running an automaton for each person; (Obs 2) suggests multiplying to get the probability that any is true
Space = O(# persons), not # timesteps: can stream
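One way to read "multiply to get the probability that any is true" is to multiply the per-person probabilities that the query is false and take the complement, which is valid when the per-person events are independent. The sketch below shows only that last combining step; the per-person acceptance probabilities are made-up numbers standing in for the outputs of the individual automata.

```python
from math import prod

def prob_any(per_person_probs):
    """P(at least one person's query is true), assuming independence across persons."""
    return 1.0 - prod(1.0 - p for p in per_person_probs.values())

# Hypothetical per-person acceptance probabilities at one timestep.
accept = {"Joe": 0.3, "Tom": 0.2}
print(round(prob_any(accept), 2))  # 0.44
```

Only one number per person has to be kept between timesteps, which matches the O(# persons) space bound.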
Summary of Contributions
Regular Queries (Efficient, streamable): compiled to an automaton, streaming, O(1) space
Extended regular (Efficient, streamable): streaming with O(m) space, where m is the # of persons
Safe (Efficient, but not streamable)
Unsafe (Inefficient, most #P-hard)
See paper for Markovian correlations, more sophisticated predication, complete compilation and static analysis algorithms
Experimental Setup
Quality: How is P/R affected by keeping probs?
52 objects, 352 locations, 10k sq. ft.
2x30min trace with 10 min break in between
Participants marked down true locations
Query: “Alert when anyone enters a coffee room”
σ_{Person(p), Coffee(l_2), Hallway(l_1)}(At(p,l_1) ; At(p,l_2))
Baseline: Most Likely Estimate (MLE), i.e. at each timestep, for each person, take the most likely location
Quality: Realtime – Improve over MLE?
Declare an event “true” if its Pr > threshold
Vary threshold
[Plots: Precision, Recall and F1, each on an axis from 0 to 1]
10% improvement in F1
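To make the evaluation procedure concrete, here is a small sketch of thresholding event probabilities and scoring the result with precision, recall and F1. The event names, probabilities and ground truth are invented for illustration and are not data from the experiments.

```python
def precision_recall_f1(event_probs, truth, threshold):
    """Declare an event true if its probability exceeds threshold, then score it."""
    predicted = {e for e, p in event_probs.items() if p > threshold}
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up probabilities for four candidate events and made-up ground truth.
event_probs = {"e1": 0.9, "e2": 0.6, "e3": 0.3, "e4": 0.1}
truth = {"e1", "e3"}
for threshold in (0.2, 0.5, 0.8):
    print(threshold, precision_recall_f1(event_probs, truth, threshold))
```

Raising the threshold trades recall for precision, which is why the slide varies it rather than fixing one value.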
Performance: Is the cost too high?
Synthetic data – same query
Related Work
Event Queries – Deterministic: Cayuga, SASE, SnoopIB
Model-Based Views: BBQ; recently, Kanagal et al, ICDE 08
Probabilistic Databases: Mystiq, Trio, MayBMS, Maryland, Purdue, MCDB
Particle Filters on HMMs: Doucet, Godsill
Conclusion
Showed Lahar
Processed output of several inference tasks (HMMs)
Applies more generally than just RFID
Quality (F1) gains by keeping probability
Performance usable in real-time
Lots of concurrent tags
No indexing!
Overview of Regular Query Algorithm
1. Compile an event query q
   1. Automaton (A) over a language L
   2. Mapping (M) events to subsets of L
2. Runtime – Input is set of events E
   1. Map E into subsets of L via M
   2. Maintain set of possible states of A
Deterministic → Probabilistic: A stays same; M stays same; the mapped input becomes a distribution; the set of possible states becomes a distribution
Size of distribution depends only on the query, q.
NB: example to follow
For details, see paper
Why are ER queries hard?
Regular Queries ~ Regular Expressions
Mapping is non-trivial
Inspired by Cayuga [Demers et al. 06]
Queries have #P combined complexity
Encode mDNF as regular expression
Intuition: n-sized automaton leads to time(2^n)
Extended regular ~ 1 NFA per person
k persons implies O(k)-size automaton
Exponential cost
When ER, can avoid blowup
Regular and Extended Regular
Query is regular if no variable is shared between subgoals
σ_{l='502'}(At('Joe','501') ; At('Joe',l))
Query is extended regular if any variable shared by two subgoals is shared by all subgoals
σ_{l='502'}(At(p,'501') ; At(p,l))
p is shared between subgoals
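The two definitions can be turned into a small syntactic check. The sketch below is an illustration only: the `Var` marker class, the tuple encoding of subgoals and the "neither" label are assumptions, and the check looks only at variables inside the At subgoals.

```python
from collections import Counter

class Var(str):
    """Marks a term as a variable; plain str terms are constants."""

def classify(subgoals):
    """subgoals: list of term tuples, e.g. [("Joe", "501"), ("Joe", Var("l"))]."""
    occurrences = Counter()
    for terms in subgoals:
        # Count each variable once per subgoal it appears in.
        for v in {t for t in terms if isinstance(t, Var)}:
            occurrences[v] += 1
    shared = [v for v, n in occurrences.items() if n >= 2]
    if not shared:
        return "regular"
    if all(occurrences[v] == len(subgoals) for v in shared):
        return "extended regular"
    return "neither"

print(classify([("Joe", "501"), ("Joe", Var("l"))]))        # regular
print(classify([(Var("p"), "501"), (Var("p"), Var("l"))]))  # extended regular
```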
Correlations
Sequencing by example
Sequencing is parameterized [Cayuga]
σ_{l='502'}(At(p,'501') ; At(p,l))
[Timeline figure with events (Joe,501), (Bob,502), (Joe,502), (Joe,503)]
Semicolon means “the next event among those that match next goal”
Semicolon is not “after”
Compilation by example
Each goal “corresponds” to two letters:
move (m) – the query should advance
accept (a) – the next subgoal accepts
q_1 = σ_{l='502'}(At('Joe','501') ; At('Joe',l))
L_1 = {m_1, a_1, m_2, a_2}
Mapping M_q:
(Joe,501) ↦ {m_1, a_1, m_2}
(Joe,502) ↦ {m_2, a_2}
(Joe,l_0) ↦ {m_2}
Any other maps to empty set
[Automaton diagram; labels: a_1, a_2, {m_2}, Final, “Does not contain”, “Does contain”]
Subtle example..
q_1 = σ_{l='502'}(At('Joe','501') ; At('Joe',l))
L_1 = {m_1, a_1, m_2, a_2}
Mapping M_1:
(Joe,501) ↦ {m_1, a_1, m_2}
(Joe,502) ↦ {m_2, a_2}
(Joe,l_0) ↦ {m_2}
Any other maps to empty set
What about:
q_2 = At('Joe','501') ; At('Joe','502')
[Automaton diagram; labels: a_1, a_2, {m_2}, Final, “Does not contain”, “Does contain”, (Joe,l_0), M_2]
CUT II
Motivating Apps
RFID apps
Diary and Active Calendar Application: Alert if I go to a database meeting.
Supply chain: Alert if Mach 3 razors are being stolen
Many independent HMMs
Elder care [Intel/UW]: Alert if elder takes their medicine with water
Activity Recognition
Financial applications on predictive HMM: Alert if head-and-shoulders market
Compile Select and Filter
Intuition: goal maps to two letters:
match (m): matches filter
accept (a): accepted by select
q_filter = At('Joe','501') ; At('Joe','502')
q_select = σ_{l='502'}(At('Joe','501') ; At('Joe',l))
L = {m_1, a_1, m_2, a_2}
[Automaton diagram; labels: a_1, a_2, {m_2}, Final, “Does not contain”, “Does contain”]
Language and automaton are the same for both queries
Wrinkle in the language: Filter v. Selection
q_filter = At('Joe','501') ; At('Joe','502')
“Alert next time Joe is in 502 after he is in 501”
q_select = σ_{l='502'}(At('Joe','501') ; At('Joe',l))
“Alert if the next place Joe is in after 501 is 502”
[Timeline figure with events (Joe,501), (Joe,502), (Joe,503); answers shown: Yes, No]
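The difference between the two readings shows up on a deterministic stream in which Joe visits another room between 501 and 502. The sketch below implements the two English descriptions directly; the function names and the example stream are hypothetical and not taken from the slide's timeline.

```python
def filter_query(locations):
    """q_filter: alert the next time Joe is in 502 after he is in 501."""
    seen_501 = False
    for loc in locations:
        if seen_501 and loc == "502":
            return True
        if loc == "501":
            seen_501 = True
    return False

def select_query(locations):
    """q_select: alert if the next place Joe is in after 501 is 502."""
    for prev, cur in zip(locations, locations[1:]):
        if prev == "501" and cur == "502":
            return True
    return False

stream = ["501", "503", "502"]   # hypothetical sequence of Joe's locations
print(filter_query(stream), select_query(stream))  # True False
```

The filter version skips the 503 reading because it does not match the second subgoal, while the selection version consumes it as "the next place" and then fails the l='502' test.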
Recap of Algorithms
Regular Queries
Compiled them to an NFA, then used image
Data complexity O(1)
Extended regular
Several regulars multiplied together
Depends on number of distinct people in the data, not number of time steps.
Quality: Archived – Improve over Viterbi?
Smoothing v. Viterbi (MAP)
Lahar keeps track of Markovian correlations
Viterbi leverages correlations for MAP estimate
[Plots: Precision, Recall and F1, each on an axis from 0 to 1]
Approx. 30% gain in F1