MIDDLEWARE SYSTEMS RESEARCH GROUP
msrg.org
Predictive Publish/Subscribe Matching
Joint work with Vinod Muthusamy & Haifeng Liu
University of Toronto
P-ToPSS project
Hans-Arno Jacobsen

Little Anecdote
Date: Mon, 14 Sep … 10:37:26 -0400
From: "[email protected] ... "
To: …
Cc: … CNS Security Admin
Subject: DDoS attack originating from …

/var/log/secure* & LogWatch
aaron/password from 211.43.206.53: …
abdullah/password from 211.43.206.53:
abraham/password from 211.43.206.53:
abram/password from 211.43.206.53:
account/password from 142.150.237.133:
account/password from 211.43.206.53:
adam/password from 211.43.206.53:
addison/password from 211.43.206.53:
aditya/password from 211.43.206.53:
admin/password from 142.150.237.133: 18 Time(s)
admin/password from 211.43.206.53: 18 Time(s)
administrator/password from 142.150.237.133: 3 Time(s)
administrator/password from 211.43.206.53: 3 Time(s)
jacobsen/password from 191.43.206.53: 2 Time(s)

And So It Happened: Post-mortem forensics via events across different logs
… denied
John 211.43.206.53 successful timestamp…
John logoff timestamp…
John 190.35.106.46 successful timestamp…
John password changed
Had set user john with password john!

Predictive Analytics?
• Series of failed login attempts from same IP
  – System is under attack
• Series of failed login attempts from same IP, followed by successful login from that IP, followed by immediate logoff
  – System compromised
• Could we predict that the system is going to be compromised soon with a certain probability, after observing a partial match of the above pattern?
  – E.g., "failed logins from IP, successful login from IP"

Events, Subscriptions & Publish/Subscribe
• Here, events are
  – Login attempts, logoff, system compromised
• Here, subscriptions are
  – Specific patterns of interest
    • Series of login attempts from same IP
    • Series of login attempts from same IP, followed by logoff
• The publish/subscribe system is the abstraction that matches subscriptions based on events observed
• A match detects the event, e.g., system compromised

Outline
• Predictive Toronto Publish/Subscribe System
• Event & subscription language model
• Matching with P-ToPSS
• Predicting with P-ToPSS
• Evaluation

P-ToPSS is Latest ToPSS Member
• For many applications, raising an alert after a malicious activity has occurred is too late
  – Credit card fraud (fraud committed)
  – Network intrusion (system compromised)
  – Problem determination (problem occurred)
  – Root-cause analysis (system crashed, poor user experience)
• Capability to predict the probability that a given subscription will match in the future is needed.
• P-ToPSS computes the probability that a subscription will match based on the event history and based on partial matches observed so far.

P-ToPSS Model
[Figure: Match Engine with example outputs "cs1 will be matched with probability 0.5", "cs4 will be matched with probability 0.75", "cs2 is matched", "cs1 is fully matched", "cs1 will be matched with probability 0.8"]
Publish/Subscribe matching problem
• Find all matches
Publish/Subscribe prediction problem
• Find partial matches
• Determine subscriptions with matching probability > threshold

Event Model
An event: e = {(a1,v1), (a2,v2), …, (an,vn)}
Event stream: {e1, e2, …, ek, …}
Events are ordered (system timestamps)

Subscription Language Model
• Primitive subscriptions
  – S = p1 ∧ p2 ∧ p3 ∧ …
  – pi is a Boolean predicate
• Composite subscriptions
  – CS = R(S1, S2, S3, …, Sm)
• R: Operators
  – Temporal operators:
    • , : contiguous sequence (contiguous event sequence; no event can be skipped)
    • ; : non-contiguous sequence (non-contiguous event sequence; events can be skipped)
    • @ : explicit temporal operator
  – Boolean operators:
    • ∧ : conjunction
    • ∨ : disjunction

Example
s1: ip=$x login=denied
s2: ip=$x login=denied
s3: ip=$x login=success
s4: ip=$x login=success
s5: ip=$x action=passwd
s6: ip=$x action=logoff
cs_intrusion = s1; ( (s2; s3@(t3-t2<d)) (s4, s5) ); s6
cs_intrusion matched by {e0, e1}, e2, e3, e4
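The primitive subscriptions s1 to s6 can be read as sets of equality predicates in which $x is a variable shared across the composite subscription. The following is a minimal Python sketch under that reading. The function name match_primitive, the dictionary encoding of events and subscriptions, and the sample events are hypothetical illustrations, not P-ToPSS code.

```python
from typing import Optional

Event = dict[str, str]      # attribute -> value
Bindings = dict[str, str]   # variable name (e.g. "$x") -> bound value


def match_primitive(sub: dict[str, str], event: Event,
                    bindings: Bindings) -> Optional[Bindings]:
    """Return the extended bindings if `event` satisfies every predicate
    of `sub`, otherwise None. Only equality predicates are modeled."""
    new_bindings = dict(bindings)
    for attr, wanted in sub.items():
        if attr not in event:
            return None
        actual = event[attr]
        if wanted.startswith("$"):
            # Variable: bind on first use, later uses must see the same value.
            if new_bindings.setdefault(wanted, actual) != actual:
                return None
        elif wanted != actual:
            return None
    return new_bindings


# Hypothetical encoding of s1 and s3 from the slide.
s1 = {"ip": "$x", "login": "denied"}
s3 = {"ip": "$x", "login": "success"}

# Hypothetical events (IP addresses reused from the log excerpt).
e_denied = {"ip": "211.43.206.53", "login": "denied"}
e_same_ip = {"ip": "211.43.206.53", "login": "success"}
e_other_ip = {"ip": "190.35.106.46", "login": "success"}

bound = match_primitive(s1, e_denied, {})
same_ip_result = match_primitive(s3, e_same_ip, bound)
other_ip_result = match_primitive(s3, e_other_ip, bound)
```

Once s1 binds $x to an address, s3 only matches a successful login from that same address, which is what ties the steps of the intrusion pattern to one IP.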

Problem Statement
• Matching Problem
  – Given a set of composite subscriptions, CS, and an event stream, {ei}, find all cs = R(s1, s2, …, sn) such that there exists {ej1, ej2, …, ejn} ⊆ {ei} where ej1 matches s1, …, ejn matches sn subject to R, and all time constraints are satisfied.
• Prediction Problem
  – Find all partially matched cs such that Pr_cs(full match | partial) > θ_cs

Required Matching Tasks
• Composite subscription: s1; ( (s2; s3@(t3-t1<d)) (s4, s5) ); s6
• Primitive subscriptions, like si, matching single events (i.e., sets of attribute-value pairs)
• Sequences of primitive subscriptions matching consecutive and non-consecutive events in the input
• Boolean expressions, like term1 term2 above, matching higher-level patterns of events
• Computation of probabilities to predict full matches given partial matches

Matching Engine
[Figure: matching engine architecture. Components: Primitive Subscriptions Matcher, State Machine Engine, Boolean Expression Tree Matcher, Prediction Engine. Flow labels: Event stream; Primitive subscription matches; Derived events; Partial matches; Full matches; Predictions (subscription, matching probability > θS). Annotated with the example s1; ( (s2; s3@(t3-t1<d)) (s4, s5) ); s6 and its parts term1 term2, s2;s3 and s3]

Algorithms for Matching Tasks
• Primitive Subscription Matcher
  – BDD-based approach (our ICDCS’05 algorithm)
  – Alternatively, our SIGMOD’01 algorithm or our new indEX (fastest Boolean Expression Index in the market)
• Boolean Expression Tree Matcher (state-based)
  – Extension of the Rete algorithm as in-memory event processing network (Forgy, 1982)
  – For extensions & implementation, see our PADRES code base at padres.msrg.org

Algorithms for Matching Tasks
• State Machine Engine
  – Based on evaluating finite state machines (FSMs)
  – Combined with techniques to merge states to amortize processing of similar subscriptions
  – Combined with algorithms and data structures to track time conditions
• Prediction Engine
  – Based on training and evaluating a Markov model
    • Trained on past events
    • Evaluation over event stream

State Machine Engine
• State machine creation
• State machine evaluation

Example: F, F, F @(tN3-tN1<d), S
We abstract for ease of presentation
• F represents a primitive subscription that evaluates to true for a failed login
• S represents a primitive subscription that evaluates to true for a successful login
• Index in time constraint refers to position (state) in the subscription (FSM)
[Figure: FSM for the contiguous sequence operator with states N0, N1 (F), N2 (F,F), N3 (F,F,F), N4 (F,F,F,S); transitions labeled F, F, F @(tS3-tS1<d), S; each state records t, the time of the most recent transition into the state]
• Explicit temporal operator treated as another predicate to be evaluated over transition times tracked for all states
MIDDLEWARE SYSTEMSRESEARCH GROUP
msrg.org
20
N1
(F)
F FF
@(tS3-tS1<d) SN2
(F,F)N3
(F,F,F)
t1 t2 t3
FF
Event stream
F
time
N1
(F)
F FF
@(tS3-tS1<d) SN2
(F,F)N3
(F,F,F)
S = F, F, F @(tN3-tN1<d), S
Current state
N1
(F)t1
At t1
At t2
At t3
F
N1
(F)
F FF
@(tS3-tS1<d) SN2
(F,F)N3
(F,F,F)
F F
F
N2
(F,F)t2
N1
(F)t2
F
F
F
F
F
N3(F,F,F)
t3
N2
(F,F)t3
N1
(F)t3
@(tS3-tS1<d)
Contiguous sequence operator

Example: F; S1; F; S2@(tS2-tS1<T)
[Figure: FSM for the non-contiguous sequence operator with states N0, N1 (F), N2 (F;S), N3 (F;S;F), N4; links labeled F, S, F and a final S link carrying the time condition; self links labeled *]
• Events not contributing to matching a subscription are allowed to occur (must remain in current state; achieved via self-links)
• Upon a match of the next primitive subscription
  – Time conditions are checked, if any
  – Transition times are updated
• Transition times are only tracked for primary & secondary links
Link types:
• Primary link: first transition into state
• Secondary link: continued matching of primitive subscription that led to the transitioning into this state
• Self link: triggered for every event except those that trigger primary & secondary links

S1; S2 @T1; S3 @T2 @T3
T1: (tS2-tS1 < 3)
T2: (tS3-tS1 < 6)
T3: (tS3-tS2 > 3)
[Figure: FSM with states N0, N1 (S1), N2 (S1;S2), N3 (S1;S2;S3); links labeled S1, S2 @T1, S3 @T2 @T3; further links labeled not(S1), not(S2), S2 not(T1), not(S3), S3 ( not(T2) not(T3) )]
Event stream: S1:t1, S1:t2, S1:t3, S2:t4, S2:t5, S2:t6, S3:t7, S3:t8
Tracked transition times:
• Time(S1): S1:t1, S1:t2, S1:t3
• Time(S2): S2:t4 with Tc(S1) = {t2, t3}; S2:t5 with Tc(S1) = {t3}
• Time(S3): S3:t8 with Tc(S2) = {t4}, Tc(S1) = {t3}
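The Tc sets listed above can be reproduced with a short Python sketch. It assumes unit-spaced integer timestamps (t1 = 1, ..., t8 = 8), which the slide does not state, and the names candidates, t1_ok, t2_ok and t3_ok are hypothetical. Each constraint is applied to the gap between the current event time and an earlier tracked transition time.

```python
def candidates(earlier_times, now, constraint):
    """Earlier transition times whose gap to `now` satisfies `constraint`."""
    return [t for t in earlier_times if constraint(now - t)]


def t1_ok(gap):   # T1: tS2 - tS1 < 3
    return gap < 3


def t2_ok(gap):   # T2: tS3 - tS1 < 6
    return gap < 6


def t3_ok(gap):   # T3: tS3 - tS2 > 3
    return gap > 3


# Assumption: ti = i, so S1 occurs at 1, 2, 3; S2 at 4, 5, 6; S3 at 7, 8.
s1_times = [1, 2, 3]

# For each S2 event, the S1 times that satisfy T1.
s2_candidates = {t: candidates(s1_times, t, t1_ok) for t in (4, 5, 6)}
# Only S2 events with at least one S1 candidate are tracked further.
s2_times = [t for t, cands in s2_candidates.items() if cands]

# For each S3 event: (S2 times satisfying T3, S1 times satisfying T2).
s3_candidates = {
    t: (candidates(s2_times, t, t3_ok), candidates(s1_times, t, t2_ok))
    for t in (7, 8)
}
```

Under these assumptions the S2 event at t6 has no S1 candidate and the S3 event at t7 has no S2 candidate, which is consistent with only S2:t4, S2:t5 and S3:t8 appearing in the tracked times.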

Merging State Machines
Two states N1 and N2 are equivalent iff:
1. The numbers of incoming transitions of N1 and N2 are equal.
2. Any incoming transitions arrive from equivalent states and are triggered by the same set of events.
Initial states are equivalent.
Subscriptions merged:
• a; b; c
• a; b; d
• a
[Figure: the individual FSMs with states N0, N1 (a), N2 (a;b), N3 (a;b,c) or N3 (a;b,d), and the merged FSM with states M0, M1 (a), M2 (a;b), M3 (a;b,c), M4 (a;b,d)]
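For purely linear subscriptions, one simple way to realize this equivalence is a prefix tree: two states are shared exactly when they are reached from the shared initial state by the same sequence of (operator, predicate) steps. The sketch below is an assumption-laden illustration, not the P-ToPSS merging algorithm. The function name count_states and the tuple encoding of steps are hypothetical, and self links and time constraints are ignored.

```python
def count_states(subscriptions):
    """Return (unmerged, merged) state counts for linear subscriptions,
    each given as a list of (operator, predicate) steps."""
    # Without merging, every FSM has an initial state plus one per step.
    unmerged = sum(len(steps) + 1 for steps in subscriptions)

    root = {}     # prefix tree: step -> child node
    merged = 1    # the shared initial state
    for steps in subscriptions:
        node = root
        for step in steps:
            if step not in node:
                # No equivalent state exists yet, so create a new one.
                node[step] = {}
                merged += 1
            node = node[step]
    return unmerged, merged


# The three subscriptions from the slide: a; b; c, a; b; d, and a.
subs = [
    [("", "a"), (";", "b"), (";", "c")],
    [("", "a"), (";", "b"), (";", "d")],
    [("", "a")],
]
unmerged_count, merged_count = count_states(subs)
```

For these three subscriptions the merged count of five corresponds to the states M0 to M4 named in the figure.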

Markov Model for Prediction
• FSMs record incremental matches of subscriptions
• Probability of transitioning to next state for a given event depends only on current state
• Our FSMs are Markov processes
• Our prediction algorithm uses the properties of Markov processes to predict future matches based on current state and event history
  – Probability of reaching the final state in n events
  – … of reaching final state in the next 1, 2, 3, … n events
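The probability of reaching the final state within the next n events can be computed by treating the final state as absorbing and iterating over the transition probabilities. The following Python sketch shows one standard way to do this. The function name prob_reach_within, the dictionary encoding of the chain, the three-state example and the threshold value are all hypothetical, and the slides do not say how P-ToPSS performs this computation.

```python
def prob_reach_within(P, start, final, n):
    """Probability of reaching `final` within n transitions from `start`.
    P maps each state to a dict {next_state: probability}."""
    # reach[s] = probability of having reached `final` within k steps from s.
    reach = {s: 1.0 if s == final else 0.0 for s in P}
    for _ in range(n):
        reach = {
            s: 1.0 if s == final
            else sum(p * reach[t] for t, p in P[s].items())
            for s in P
        }
    return reach[start]


# Hypothetical chain: N2 is the final (full match) state.
P = {
    "N0": {"N0": 0.5, "N1": 0.5},
    "N1": {"N0": 0.5, "N2": 0.5},
    "N2": {"N2": 1.0},
}

theta = 0.6   # hypothetical prediction threshold
within_1 = prob_reach_within(P, "N1", "N2", 1)
within_3 = prob_reach_within(P, "N1", "N2", 3)
predict_match = within_3 > theta
```

Calling the function for n = 1, 2, 3, ... gives the look-ahead series named in the last bullet, and comparing each value with the threshold decides whether a prediction is reported.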

Prediction & Training
• Compute long-run transition probability of reaching a given state
• Based on the input (event history), we count the number of times transitions are taken
• Based on counters, we compute transition probabilities of the model
• Transition probability from state i to j is (# times transition taken) / (all incoming transitions)
• Complete Markov chain with finite state space
• p_ij = Pr(X_{n+1} = j | X_n = i)
  – Conditional probability of transitioning to j given i
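Counting transitions over a training history and normalizing the counters can be sketched as follows. This is a minimal illustration under a stated assumption: each count is divided by the total number of transitions observed out of state i, which is the usual maximum-likelihood estimate of Pr(next = j | current = i). The slide words its denominator as "all incoming transitions", so the exact normalization used in P-ToPSS may differ. The function name and the sample state sequence are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_transition_probs(state_sequence):
    """Estimate p_ij from an observed sequence of visited states."""
    counts = defaultdict(Counter)
    # Count how often each transition i -> j is taken.
    for i, j in zip(state_sequence, state_sequence[1:]):
        counts[i][j] += 1
    # Normalize each row by the number of transitions leaving state i.
    return {
        i: {j: c / sum(row.values()) for j, c in row.items()}
        for i, row in counts.items()
    }


# Hypothetical training history of visited FSM states.
history = ["N0", "N1", "N0", "N1", "N2"]
probs = estimate_transition_probs(history)
```

States that never appear as a source in the history get no row, so a real training step would need a policy for unseen transitions.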

Experiments
• Synthetic workload
• Real data set

Effect of Number of Subscriptions
[Figure: two plots, "Number of states" and "Average matching time per event"; labels: Gaussian, Uniform, More sharing, Less sharing]
Number of states
• Merging reduces number of states by up to 30% for given data set
• Number of states increases linearly in number of subscriptions
• More states are required for workloads with less state sharing potential
Average matching time per event
• Matching time increases in the number of subscriptions
• More sharing requires more processing as a given event may trigger more transitions

Effect of Number of Non-contiguous Operators
[Figure: average matching time per event]
• Matching time increases in number of non-contiguous operators
• More and more subscription instances are partially matched waiting for events
• Asks for a garbage collection scheme

Experiments on Synthetic Workload
• Precision decreases as look-ahead increases
• Precision increases as prediction-threshold increases and stabilizes for large thresholds

Expert Model (full) vs. Learned Model
[Figure: two panels, "Full model (about 1400 states)" and "Learned model (5 states)"]
Precision defined as True positives / All predictions
Result: With increasing look-ahead, the learned model results in higher precision.

Conclusions
• P-ToPSS is a new publish/subscribe model for event stream processing
• Predicts the probability a subscription will match in the future
• Performs traditional publish/subscribe matching
• Supports state-based, temporal and Boolean operators over predicates (complex subscriptions)
• Based on Markov chains for prediction
• Prediction performance of learned model is better than hand-crafted model in our experiments
