
Modeling and Mining Sequential Data

Machine Learning and Data Mining
Philipp Singer

CC image courtesy of user puliarfanita on Flickr

What is sequential data?

Stock share price (Bitcoin)

Screenshot from bitcoinwisdom.com

Daily degrees in Cologne

Screenshot from google.com (data from weather.com)

Human mobility

Screenshot from maps.google.com

Web navigation

[Figure: example navigation path between the articles Austria, Germany, and C.F. Gauss]

Song listening sequences

Screenshots from youtube.com

Let us distinguish two types of sequence data

Continuous time series: stock share price, daily degrees in Cologne

Categorical (discrete) sequences (focus): Sunny/Rainy weather sequences, human mobility, web navigation, song listening sequences

This lecture is about...

Modeling

Predicting

Pattern Mining


Markov Chains

[State diagram: states S1, S2, S3 with transition probabilities 1/2, 1/2, 1/3, 2/3, 1]

Markov Chain Model

Stochastic model

Transitions between states

[Same state diagram, with the states and transition probabilities labeled]

Markov Chain Model

Markovian property: the next state in a sequence depends only on the current state, not on the sequence of preceding ones


Classic weather example

[Diagram: Sunny → Sunny 0.9, Sunny → Rainy 0.1, Rainy → Sunny 0.5, Rainy → Rainy 0.5]

Formal definition

State space $S = \{s_1, \dots, s_n\}$

Amounts to a sequence of random variables $X_1, X_2, \dots, X_t$

Markovian memoryless property: $P(X_{t+1} = s_j \mid X_1 = s_{i_1}, \dots, X_t = s_i) = P(X_{t+1} = s_j \mid X_t = s_i)$

Transition matrix $P$ with single transition probabilities $p_{ij} = P(X_{t+1} = s_j \mid X_t = s_i)$; rows sum to 1: $\sum_j p_{ij} = 1$

Example

[Diagram: Sunny ⇄ Rainy with the probabilities 0.9, 0.1, 0.5, 0.5 as above]

Transition matrix:

        Sunny  Rainy
Sunny    0.9    0.1
Rainy    0.5    0.5

Likelihood

Transition probabilities are the parameters

$\mathcal{L} = \prod_{i,j} p_{ij}^{\,n_{ij}}$, where $p_{ij}$ is a transition probability and $n_{ij}$ the corresponding transition count

Maximum Likelihood (MLE)

Given some sequence data, how can we determine the parameters?

MLE estimation: maximize the likelihood; this yields $\hat{p}_{ij} = \frac{n_{ij}}{\sum_k n_{ik}}$

See ref [1]

[1] http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0102070
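A minimal sketch of the MLE estimator in Python; the function name and the example sequence are illustrative, not from the slides:

```python
import numpy as np

def mle_transition_matrix(sequence, states):
    """MLE for a first-order Markov chain:
    p_ij = n_ij / sum_k n_ik (row-normalized transition counts)."""
    idx = {s: i for i, s in enumerate(states)}
    counts = np.zeros((len(states), len(states)))
    for a, b in zip(sequence, sequence[1:]):
        counts[idx[a], idx[b]] += 1
    # Normalize each row by its total outgoing count (guard empty rows).
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts, counts / np.where(row_sums == 0, 1, row_sums)

# Illustrative Sunny/Rainy sequence.
counts, P = mle_transition_matrix(list("SSRSSRSSSR"), ["S", "R"])
```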

Prediction

Simply derived from the transition probabilities: which state follows at $t+1$?

One option: take the state with maximum probability

Prediction

What about $t+3$? Apply the transition matrix three times: the current state's row of $P^3$ gives the distribution over states at $t+3$.
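A short sketch of both prediction variants (matrix values from the weather example; `current` is a hypothetical current state):

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])  # weather example transition matrix

current = 0  # index of the current state (here: Sunny), illustrative

# t+1: one option is to take the most probable next state.
next_state = np.argmax(P[current])

# t+3: the current state's row of P^3 is the distribution at t+3.
dist_t3 = np.linalg.matrix_power(P, 3)[current]
```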

Pattern mining

Simply derived from the (non-normalized) transition count matrix:

        Sunny  Rainy
Sunny     90      2
Rainy      2      1

Most common transition = sequential pattern (here: Sunny → Sunny)
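A tiny sketch, assuming the counts above form a 2×2 matrix with rows and columns ordered Sunny, Rainy:

```python
import numpy as np

# Non-normalized transition counts from the slide.
counts = np.array([[90, 2],
                   [ 2, 1]])

# The most common transition is the largest count in the matrix.
i, j = np.unravel_index(np.argmax(counts), counts.shape)
# Here (i, j) = (0, 0): the pattern Sunny -> Sunny with 90 observations.
```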

Full example

[Figure: training sequence of Sunny/Rainy observations]

Transition counts:

        Sunny  Rainy
Sunny      5      2
Rainy      2      1

Transition matrix (MLE):

        Sunny  Rainy
Sunny    5/7    2/7
Rainy    2/3    1/3

Full example

Likelihood of the given sequence: multiply the MLE transition probabilities along the sequence, with the assumption that we start with Sunny.
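A sketch of the likelihood computation; the slide's actual training sequence is not recoverable from the transcript, so the input sequence here is illustrative:

```python
import numpy as np

P = np.array([[5/7, 2/7],
              [2/3, 1/3]])  # MLE matrix from the example
idx = {"S": 0, "R": 1}

def sequence_likelihood(seq, P, idx):
    """Product of transition probabilities along the sequence
    (the first state is conditioned on, not modeled)."""
    L = 1.0
    for a, b in zip(seq, seq[1:]):
        L *= P[idx[a], idx[b]]
    return L

# Illustrative sequence starting with Sunny: S -> S -> R -> S.
print(sequence_likelihood(list("SSRS"), P, idx))  # 5/7 * 2/7 * 2/3
```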

Full example

Prediction? Read off the current state's row of the MLE transition matrix and take the most probable next state.

Higher order Markov Chain models

Drop the memoryless assumption?

Models of increasing order: 2nd order MC model, 3rd order MC model, ...

2nd order example

$X_{t+1}$ depends on the two preceding states: $P(X_{t+1} \mid X_t, X_{t-1})$

Higher order to first order transformation

Transform the state space: in the 2nd order example, pairs of consecutive states become new compound states (see the sketch below)
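A minimal sketch of the transformation (the function name is illustrative); the compound sequence can then be fed to the same first-order MLE estimator as before:

```python
def to_compound_states(sequence):
    """Rewrite a sequence over states into a sequence over bigram
    (compound) states, so a 2nd order chain becomes a 1st order
    chain over the new state space."""
    return [(a, b) for a, b in zip(sequence, sequence[1:])]

# Example: S S R S  ->  (S,S) (S,R) (R,S)
print(to_compound_states(["S", "S", "R", "S"]))
```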

2nd order example

Transition counts (row order of the compound states assumed):

                Sunny  Rainy
(Sunny,Sunny)      3      1
(Sunny,Rainy)      1      1
(Rainy,Sunny)      1      0
(Rainy,Rainy)      1      1

Transition matrix (MLE):

                Sunny  Rainy
(Sunny,Sunny)    3/4    1/4
(Sunny,Rainy)    1/2    1/2
(Rainy,Sunny)    1/1     0
(Rainy,Rainy)    1/2    1/2

Reset states

[Figure: training sequence with reset states R inserted]

Marking start and end of sequences

Makes the transformation easier (same number of transitions)

Comparing models

1st vs. 2nd order

Statistical model comparison is necessary

Nested models: a higher order model always fits at least as well

Account for potential overfitting

Model comparison

Likelihood ratio test: ratio between the likelihoods of order $m$ and order $k$; the statistic $-2\ln(\mathcal{L}_m/\mathcal{L}_k)$ follows a $\chi^2$ distribution with degrees of freedom equal to the difference in the number of free parameters. Only for nested models.

Akaike Information Criterion (AIC): $AIC = 2p - 2\ln\mathcal{L}$ with $p$ free parameters; the lower the better

Bayesian Information Criterion (BIC): $BIC = p\ln n - 2\ln\mathcal{L}$ with sample size $n$; the lower the better

Bayes factors: ratio of evidences (marginal likelihoods)

Cross validation

See http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0102070
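A sketch of the three numeric criteria, assuming the log-likelihoods of the fitted models are available (uses SciPy for the chi-squared tail probability):

```python
import numpy as np
from scipy.stats import chi2

def aic(log_l, p):
    """Akaike Information Criterion; lower is better."""
    return 2 * p - 2 * log_l

def bic(log_l, p, n):
    """Bayesian Information Criterion for n observations; lower is better."""
    return p * np.log(n) - 2 * log_l

def lr_test(log_l_low, log_l_high, df):
    """Likelihood ratio test for nested models; the statistic follows
    a chi-squared distribution with df = difference in the number of
    free parameters. Returns (statistic, p-value)."""
    stat = -2 * (log_l_low - log_l_high)
    return stat, chi2.sf(stat, df)
```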

AIC example

[Figure: training sequence augmented with reset states R]

[Tables: 1st order transition matrix over Sunny, Rainy, and the reset state (entries such as 5/8, 2/8, 1/8 and 2/3, 1/3) and 2nd order transition matrix over compound states, most of whose entries are zero]

Number of 1st order parameters vs. number of 2nd order parameters


Example on blackboard

Markov Chain applications

Google's PageRank

DNA sequence modeling

Web navigation

Mobility

Hidden Markov Models

Hidden Markov Models

Extends Markov chain model

Hidden state sequence

Observed emissions

What is the weather like?

Forward-Backward algorithm

Given emission sequence

Probability of emission sequence?

Probable sequence of hidden states?

[Figure: hidden state sequence above, observed emission sequence below]

Check out this YouTube tutorial: https://www.youtube.com/watch?v=7zDARfKVm7s
Further material: cs229.stanford.edu/section/cs229-hmm.pdf

Setup

[HMM diagram: transition probabilities 0.7, 0.3 and 0.6, 0.4 between the hidden states; emission probabilities 0.9, 0.1 and 0.2, 0.8; reset state R entered and left with probability 0.5, 0.5]

Note: the literature usually uses a start probability and a uniform end probability for the forward-backward algorithm.
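A sketch of the setup in code. The exact cell layout of the matrices is not fully recoverable from the transcript, so the row/column assignment below is an assumption:

```python
import numpy as np

states = ["Sunny", "Rainy"]

# Transition probabilities between hidden states (assumed layout:
# rows = current hidden state, columns = next hidden state).
A = np.array([[0.7, 0.3],
              [0.6, 0.4]])

# Emission probabilities (assumed layout: rows = hidden state,
# columns = observation symbol).
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Uniform start distribution from the reset state R.
pi = np.array([0.5, 0.5])
```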

Forward

[Trellis diagram; forward probabilities at t1: 0.4 and 0.1]

Forward

[Trellis diagram; forward probabilities at t2: 0.034 and 0.144]

What is the probability of going to each possible state at t2, given t1?

Forward

[Trellis diagram; forward probabilities at t3: 0.011 and 0.061]

Forward

[Trellis diagram; forward probabilities at t4: 0.035 and 0.006; the final reset transition contributes probability 0.5]
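A minimal sketch of the forward pass (without the reset-state bookkeeping used on the slides):

```python
import numpy as np

def forward(obs, A, B, pi):
    """Forward pass: alpha[t, i] = P(o_1..o_t, X_t = i).
    obs is a list of observation symbol indices."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        # Sum over all predecessor states, then weight by the emission.
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

# The probability of the whole emission sequence is alpha[-1].sum().
```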

Backwards

[Trellis diagram; backward probabilities at t4 initialized from the reset transition: 0.5 and 0.5]

Backwards

[Trellis diagram; backward probabilities at t3: 0.31 and 0.28]

What is the probability of arriving at t4, given each possible state at t3?

Backwards

[Trellis diagram; backward probabilities at t2: 0.010 and 0.12]

Backwards

[Trellis diagram; backward probabilities at t1: 0.039 and 0.049; at t2: 0.097 and 0.12; legend: backward, reset, emission]
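A matching sketch of the backward pass, again without the reset-state bookkeeping (the slides instead initialize the last step with the 0.5 reset transition):

```python
import numpy as np

def backward(obs, A, B):
    """Backward pass: beta[t, i] = P(o_{t+1}..o_T | X_t = i)."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        # Weight each successor state by its transition probability,
        # its emission probability for the next symbol, and its beta.
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta
```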

Forward-Backward

Most likely state at t2: multiply the forward and backward probabilities at t2

[Trellis diagram with all forward values (0.4, 0.1; 0.034, 0.144; 0.011, 0.061; 0.035, 0.006) and backward values (0.039, 0.049; 0.097, 0.12; 0.31, 0.28; 0.5, 0.5)]

Forward-Backward

Posterior decoding

Most likely state at each t

For most likely sequence: Viterbi algorithm
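A sketch of posterior decoding from the two passes above; Viterbi instead replaces the sum in the forward recursion with a max plus backtracking to recover the single most likely state sequence:

```python
import numpy as np

def posterior_decode(alpha, beta):
    """Most likely state at each t: argmax of P(X_t = i | o_1..o_T),
    which is proportional to alpha[t, i] * beta[t, i]."""
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)  # normalize per time step
    return gamma.argmax(axis=1)
```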

Learning parameters

Train the parameters of an HMM

No tractable closed-form MLE solution is known

Baum-Welch algorithm: a special case of the EM algorithm

Uses Forward-Backward
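In practice one rarely implements Baum-Welch by hand. A sketch using the hmmlearn library (assuming its CategoricalHMM class; the API differs between versions, and older versions call this model MultinomialHMM):

```python
import numpy as np
from hmmlearn import hmm

# Observation sequence as a column vector of symbol indices.
X = np.array([[0], [1], [0], [0], [1]])

# Fit a 2-state HMM with Baum-Welch (EM).
model = hmm.CategoricalHMM(n_components=2, n_iter=100)
model.fit(X)
print(model.transmat_)  # learned transition matrix
```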

HMM applications

Speech recognition

POS tagging

Translation

Gene prediction

Other related methods

Sequential Pattern Mining

PrefixSpan

Apriori Algorithm

GSP Algorithm

SPADE

Reference: rakesh.agrawal-family.com/papers/icde95seq.pdf

Graphical models

Bayesian networks

Random variables

Conditional dependence

Directed acyclic graph

Markov random fields

Random variables

Markov property

Undirected graph

Questions?

Philipp Singer
[email protected]