process mining - chapter 5 - process discovery

Post on 07-Nov-2014

521 Views

Category:

Business

6 Downloads

Preview:

Click to see full reader

DESCRIPTION

Slides supporting the book "Process Mining: Discovery, Conformance, and Enhancement of Business Processes" by Wil van der Aalst. See also http://springer.com/978-3-642-19344-6 (ISBN 978-3-642-19344-6) and the website http://www.processmining.org/book/start providing sample logs.

TRANSCRIPT

Chapter 5Process Discovery: An Introduction

prof.dr.ir. Wil van der Aalstwww.processmining.org

Overview

PAGE 1

Part I: Preliminaries

Chapter 2 Process Modeling and Analysis

Chapter 3Data Mining

Part II: From Event Logs to Process Models

Chapter 4 Getting the Data

Chapter 5 Process Discovery: An Introduction

Chapter 6 Advanced Process Discovery Techniques

Part III: Beyond Process Discovery

Chapter 7 Conformance Checking

Chapter 8 Mining Additional Perspectives

Chapter 9 Operational Support

Part IV: Putting Process Mining to Work

Chapter 10 Tool Support

Chapter 11 Analyzing “Lasagna Processes”

Chapter 12 Analyzing “Spaghetti Processes”

Part V: Reflection

Chapter 13Cartography and Navigation

Chapter 14Epilogue

Chapter 1 Introduction

Process discovery

PAGE 2

software system

(process)model

eventlogs

modelsanalyzes

discovery

records events, e.g., messages,

transactions, etc.

specifies configures implements

analyzes

supports/controls

enhancement

conformance

“world”

people machines

organizationscomponents

businessprocesses

Process discovery = Play-In

PAGE 3

event log process model

Play-In

event logprocess model

Play-Out

event log process model

Replay

• extended model showing times, frequencies, etc.

• diagnostics• predictions• recommendations

Example

PAGE 4

a

b

c

de

p2

end

p4

p3p1

start

Event log contains all possible traces of model and vice versa.

Another example

PAGE 5

a

b

c

df

p2

end

p4

p3p1

start

e

p5

Generalization: event log contains only subset of all possible traces of model.

Notation is less relevant (e.g. BPMN)

PAGE 6

a

b

c

de

p2

end

p4

p3p1

start

a

start end

b

c

e

d

Another BPMN example

PAGE 7

a

b

c

df

p2

end

p4

p3p1

start

e

p5

a

start end

b

c

e

d

f

Challenge

• In general, there is a trade-off between the following four quality criteria:

1.Fitness: the discovered model should allow for the behavior seen in the event log.

2.Precision (avoid underfitting): the discovered model should not allow for behavior completely unrelated to what was seen in the event log.

3.Generalization (avoid overfitting): the discovered model should generalize the example behavior seen in the event log.

4.Simplicity: the discovered model should be as simple as possible.

PAGE 8

Process Discovery: example of algorithm

PAGE 9

α

PAGE 10

>,→,||,# relations

• Direct succession: x>y iff for some case x is directly followed by y.

• Causality: x→y iff x>y and not y>x.

• Parallel: x||y iff x>y and y>x

• Choice: x#y iff not x>y and not y>x.

a>ba>ca>eb>cb>dc>bc>de>d

a→ba→ca→eb→dc→de→d

b||cc||b

abcdacbdaed

b#ee#bc#ea#d…

PAGE 11

Basic Idea Used by α Algorithm (1)

a b

(a) sequence pattern: a→b

PAGE 12

Basic Idea Used by α Algorithm (2)

a

b

c

(b) XOR-split pattern:a→b, a→c, and b#c

a

b

c

(c) XOR-join pattern:a→c, b→c, and a#b

a

b

c

(b) XOR-split pattern:a→b, a→c, and b#c

PAGE 13

Basic Idea Used by α Algorithm (3)

a

b

c

(d) AND-split pattern:a→b, a→c, and b||c

a

b

c

(e) AND-join pattern:a→c, b→c, and a||b

a

b

c

(d) AND-split pattern:a→b, a→c, and b||c

Example Revisited

PAGE 14

a

b

c

de

p2

end

p4

p3p1

start

Result produced by α algorithm

a>ba>ca>eb>cb>dc>bc>de>d

a→ba→ca→eb→dc→de→d

b||cc||b

b#ee#bc#ea#d…

Footprint of L1

PAGE 15

a

b

c

de

p2

end

p4

p3p1

start

Footprint of L2

PAGE 16

a

b

c

df

p2

end

p4

p3p1

start

e

p5

Simple patterns

PAGE 17

a b

(a) sequence pattern: a→b

a

b

c

(b) XOR-split pattern:a→b, a→c, and b#c

a

b

c

(c) XOR-join pattern:a→c, b→c, and a#b

a

b

c

(d) AND-split pattern:a→b, a→c, and b||c

a

b

c

(e) AND-join pattern:a→c, b→c, and a||b

Algorithm

PAGE 18

Let L be an event log over T. α(L) is defined as follows. 1. TL = { t ∈ T | ∃σ ∈ L t ∈ σ}, 2. TI = { t ∈ T | ∃σ ∈ L t = first(σ) }, 3. TO = { t ∈ T | ∃σ ∈ L t = last(σ) }, 4. XL = { (A,B) | A ⊆ TL ∧ A ≠ ø ∧ B ⊆ TL ∧ B ≠ ø ∧

∀a ∈ A∀b ∈ B a →L b ∧ ∀a1,a2 ∈ A a1#L a2 ∧ ∀b1,b2 ∈ B b1#L b2 }, 5. YL = { (A,B) ∈ XL | ∀(A′,B′) ∈ XL

A ⊆ A′ ∧B ⊆ B′⇒ (A,B) = (A′,B′) }, 6. PL = { p(A,B) | (A,B) ∈ YL } ∪{iL,oL}, 7. FL = { (a,p(A,B)) | (A,B) ∈ YL ∧ a ∈ A } ∪ { (p(A,B),b) | (A,B) ∈

YL ∧ b ∈ B } ∪{ (iL,t) | t ∈ TI} ∪{ (t,oL) | t ∈ TO}, and 8. α(L) = (PL,TL,FL).

Key idea: find places

PAGE 19

a1

...

a2

am

b1

b2

bn

p(A,B) ...

A={a1,a2, … am} B={b1,b2, … bn}

4. XL = { (A,B) | A ⊆ TL ∧ A ≠ ø ∧ B ⊆ TL ∧ B ≠ ø ∧∀a ∈ A∀b ∈ B a →L b ∧ ∀a1,a2 ∈ A a1#L a2 ∧ ∀b1,b2 ∈ B b1#L b2 },

5. YL = { (A,B) ∈ XL | ∀(A′,B′) ∈ XLA ⊆ A′ ∧B ⊆ B′⇒ (A,B) = (A′,B′) },

Places as footprints

PAGE 20

a1

...

a2

am

b1

b2

bn

p(A,B) ...

A={a1,a2, … am} B={b1,b2, … bn}

PAGE 21

a

b

c

de

p2

end

p4

p3p1

start

Another event log L3

PAGE 22

Model for L3

PAGE 23

a b

c

d

e

p({a,f},{b})iL

p({b},{c})

p({b},{d})

p({c},{e})

p({d},{e})

g

oLp({e},{f,g})

f

Another event log L4

PAGE 24

b

c

p({a,b},{c}) oL

a

iL e

d

p({c},{d,e})

Event log L5

PAGE 25

PAGE 26

Discovered model

PAGE 27

b

p({a},{e})

oLaiL

c

e

f

d

p({e},{f})

p({b},{c,f})p({a,d},{b})

p({c},{d})

Limitation of α algorithm(implicit places)

PAGE 28

g

a

c

d

e

fb

p1

p2

Green places are implicit!

Limitation of α algorithm(loops of length 1)

PAGE 29

a c

b

a c

b

Limitation of α algorithm(loops of length 2)

PAGE 30

a b d

c

a

b

d

c

Limitation of α algorithm(non-local dependencies)

PAGE 31

b

c

a

e

dp1

p2

Green places are not discovered!

Difficult constructs for α algorithm

PAGE 32

a

b

c

Taking the transactional life-cycle into account

PAGE 33

start

assigned

assign

running

complete

a

start

running

complete

b

suspended

suspend

resume

start

assigned

assign

running

complete

c

Rediscovering process models

PAGE 34

The rediscovery problem: Is the discovered model N’ equivalent to the original model N?

discoveredprocessmodel

original process model event

log

simulate discover

N=N’ ?N N’

Equivalence: trace equivalence, bisimilarity, and branching bisimilarity

PAGE 35

s1

birth

s2

s3 s4

hell

curse

curse

curse

birth

curse

hell

s5

s6

s7

birth

curse

s8

s9

s11

curse

curses10

TS1 TS2 TS3

heaven

heaven heaven hell

heaven

hell

?

heaven

Three trace equivalent transition systems: TS1 and TS2are not bisimilar, but TS2 and TS3 are bisimilar

Branching bisimilarity defined for YAWL

PAGE 36

s1

TS1 TS2

start

check

end

reject accept

c1 c2

s2

acceptreject

s3 s4

s5

s6

check

s7

acceptreject

s8

τ τ

start

end

reject accept

c3

check check

?

TS1 and TS2 are not branching bisimilar (although trace equivalent).

Challenge: finding the right representational bias

PAGE 37

a a

start endp

There is no WF-net with unique visible labels that exhibits this behavior.

Another example

PAGE 38

a b

start endp1

c

τ

p1

(a)

a b

start endp1

c

p1

(b)

a

a b

start endp1

c

p1

(c)

There is no WF-net with unique visible labels that exhibits this behavior.

Challenge: noise and incompleteness

• To discover a suitable process model it is assumed that the event log contains a representative sample of behavior.

• Two related phenomena:−Noise: the event log contains rare and

infrequent behavior not representative for the typical behavior of the process.

− Incompleteness: the event log contains too few events to be able to discover some of the underlying control-flow structures.

PAGE 39

More on incompleteness

PAGE 40

See also chapter 3 (cross-validation, precision, recall, etc.)

PAGE 41

Challenge: Balancing Between Underfitting and Overfitting

Challenge: four competing quality criteria

PAGE 42

process discovery

fitness

precisiongeneralization

simplicity

“able to replay event log” “Occam’s razor”

“not overfitting the log” “not underfitting the log”

Flower model

PAGE 43

g

ac

d

ef

b

start end

h

PAGE 44

What is the best model?

A D

C

EB

A D

C

EB

ACDACEBCEBCD

990850

PAGE 45

What is the best model?

A D

C

EB

A D

C

EB

ACDACEBCEBCD

99888578

PAGE 46

What is the best model?

A D

C

EB

A D

C

EB

ACDACEBCEBCD

992853

Example: one log four models

PAGE 47

astart register

request

bexamine thoroughly

cexamine casually

d checkticket

decide

pay compensation

reject request

reinitiate requeste

g

hfend

astart register

request

cexamine casually

dcheckticket

decide reject request

e hend

N3 : fitness = +, precision = -, generalization = +, simplicity = +

N2 : fitness = -, precision = +, generalization = -, simplicity = +

astart register

request

bexamine

thoroughly

cexamine casually

dcheck ticket

decide

pay compensation

reject request

reinitiate request

e

g

h

f

end

N1 : fitness = +, precision = +, generalization = +, simplicity = +

astart register

request

cexamine casually

dcheckticket

decide reject request

e hend

N4 : fitness = +, precision = +, generalization = -, simplicity = -

aregister request

dexamine casually

ccheckticket

decide reject request

e h

a cexamine casually

dcheckticket

decide

e g

a dexamine casually

ccheckticket

decide

e g

register request

register request

pay compensation

pay compensation

aregister request

b dcheckticket

decide reject request

e h

aregister request

d bcheckticket

decide reject request

e h

a b dcheckticket

decide

e gregister request

pay compensation

examine thoroughly

examine thoroughly

examine thoroughly

… (all 21 variants seen in the log)

acdeh

abdeg

adceh

abdeh

acdeg

adceg

adbeh

acdefdbeh

adbeg

acdefbdeh

acdefbdeg

acdefdbeg

adcefcdeh

adcefdbeh

adcefbdeg

acdefbdefdbeg

adcefdbeg

adcefbdefbdeg

adcefdbefbdeh

adbefbdefdbeg

adcefdbefcdefdbeg

455

191

177

144

111

82

56

47

38

33

14

11

9

8

5

3

2

2

1

1

1

# trace

1391

process discovery

fitness

precisiongeneralization

simplicity

“able to replay event log” “Occam’s razor”

“not overfitting the log” “not underfitting the log”

Model N1

PAGE 48

acdeh

abdeg

adceh

abdeh

acdeg

adceg

adbeh

acdefdbeh

adbeg

acdefbdeh

acdefbdeg

acdefdbeg

adcefcdeh

adcefdbeh

adcefbdeg

acdefbdefdbeg

adcefdbeg

adcefbdefbdeg

adcefdbefbdeh

adbefbdefdbeg

adcefdbefcdefdbeg

455

191

177

144

111

82

56

47

38

33

14

11

9

8

5

3

2

2

1

1

1

# trace

1391

astart register

request

bexamine

thoroughly

cexamine casually

dcheck ticket

decide

pay compensation

reject request

reinitiate request

e

g

h

f

end

N1 : fitness = +, precision = +, generalization = +, simplicity = +

Model N2

PAGE 49

acdeh

abdeg

adceh

abdeh

acdeg

adceg

adbeh

acdefdbeh

adbeg

acdefbdeh

acdefbdeg

acdefdbeg

adcefcdeh

adcefdbeh

adcefbdeg

acdefbdefdbeg

adcefdbeg

adcefbdefbdeg

adcefdbefbdeh

adbefbdefdbeg

adcefdbefcdefdbeg

455

191

177

144

111

82

56

47

38

33

14

11

9

8

5

3

2

2

1

1

1

# trace

1391

astart register

request

cexamine casually

dcheckticket

decide reject request

e hend

N2 : fitness = -, precision = +, generalization = -, simplicity = +

Model N3

PAGE 50

acdeh

abdeg

adceh

abdeh

acdeg

adceg

adbeh

acdefdbeh

adbeg

acdefbdeh

acdefbdeg

acdefdbeg

adcefcdeh

adcefdbeh

adcefbdeg

acdefbdefdbeg

adcefdbeg

adcefbdefbdeg

adcefdbefbdeh

adbefbdefdbeg

adcefdbefcdefdbeg

455

191

177

144

111

82

56

47

38

33

14

11

9

8

5

3

2

2

1

1

1

# trace

1391

astart register

request

bexamine thoroughly

cexamine casually

d checkticket

decide

pay compensation

reject request

reinitiate requeste

g

hfend

N3 : fitness = +, precision = -, generalization = +, simplicity = +

Model N4

PAGE 51

acdeh

abdeg

adceh

abdeh

acdeg

adceg

adbeh

acdefdbeh

adbeg

acdefbdeh

acdefbdeg

acdefdbeg

adcefcdeh

adcefdbeh

adcefbdeg

acdefbdefdbeg

adcefdbeg

adcefbdefbdeg

adcefdbefbdeh

adbefbdefdbeg

adcefdbefcdefdbeg

455

191

177

144

111

82

56

47

38

33

14

11

9

8

5

3

2

2

1

1

1

# trace

1391

astart register

request

cexamine casually

dcheckticket

decide reject request

e hend

N4 : fitness = +, precision = +, generalization = -, simplicity = -

aregister request

dexamine casually

ccheckticket

decide reject request

e h

a cexamine casually

dcheckticket

decide

e g

a dexamine casually

ccheckticket

decide

e g

register request

register request

pay compensation

pay compensation

aregister request

b dcheckticket

decide reject request

e h

aregister request

d bcheckticket

decide reject request

e h

a b dcheckticket

decide

e gregister request

pay compensation

examine thoroughly

examine thoroughly

examine thoroughly

… (all 21 variants seen in the log)

Why is process mining such a difficult problem?

• There are no negative examples (i.e., a log shows what has happened but does not show what could not happen).

• Due to concurrency, loops, and choices the search space has a complex structure and the log typically contains only a fraction of all possible behaviors.

• There is no clear relation between the size of a model and its behavior (i.e., a smaller model may generate more or less behavior although classical analysis and evaluation methods typically assume some monotonicity property).

PAGE 52

Creating a 2-D slice of a 3-D reality

PAGE 53

Creating a 2-D slice of a 3-D reality: the process is viewed from a specific angle, the process is scoped using a frame, and the resolution determines the granularity of the resulting model

top related