
Workflow Mining: Discovering Process Models from Event Logs

Wil van der Aalst, Ton Weijters, and Laura Maruster

Abstract: Contemporary workflow management systems are driven by explicit process models, i.e., a completely specified workflow design is required in order to enact a given workflow process. Creating a workflow design is a complicated, time-consuming process and, typically, there are discrepancies between the actual workflow processes and the processes as perceived by the management. Therefore, we have developed techniques for discovering workflow models. The starting point for such techniques is a so-called "workflow log" containing information about the workflow process as it is actually being executed. We present a new algorithm to extract a process model from such a log and represent it in terms of a Petri net. However, we will also demonstrate that it is not possible to discover arbitrary workflow processes. In this paper, we explore a class of workflow processes that can be discovered. We show that the $\alpha$-algorithm can successfully mine any workflow represented by a so-called SWF-net.

Index Terms: Workflow mining, workflow management, data mining, Petri nets.

1 INTRODUCTION

DURING the last decade, workflow management concepts and technology [3], [5], [15], [26], [28] have been applied in many enterprise information systems. Workflow management systems such as Staffware, IBM MQSeries, COSA, etc., offer generic modeling and enactment capabilities for structured business processes. By making graphical process definitions, i.e., models describing the life-cycle of a typical case (workflow instance) in isolation, one can configure these systems to support business processes. Besides pure workflow management systems, many other software systems have adopted workflow technology. Consider, for example, ERP (Enterprise Resource Planning) systems such as SAP, PeopleSoft, Baan and Oracle, CRM (Customer Relationship Management) software, etc. Despite its promise, many problems are encountered when applying workflow technology. One of the problems is that these systems require a workflow design, i.e., a designer has to construct a detailed model accurately describing the routing of work. Modeling a workflow is far from trivial: It requires deep knowledge of the workflow language and lengthy discussions with the workers and management involved.

Instead of starting with a workflow design, we start by gathering information about the workflow processes as they take place. We assume that it is possible to record events such that

1. each event refers to a task (i.e., a well-defined step in the workflow),
2. each event refers to a case (i.e., a workflow instance), and
3. events are totally ordered (i.e., in the log events are recorded sequentially, even though tasks may be executed in parallel).

Any information system using transactional systems such as ERP, CRM, or workflow management systems will offer this information in some form. Note that we do not assume the presence of a workflow management system. The only assumption we make is that it is possible to collect workflow logs with event data. These workflow logs are used to construct a process specification which adequately models the behavior registered. We use the term process mining for the method of distilling a structured process description from a set of real executions.

To illustrate the principle of process mining, we consider the workflow log shown in Table 1. This log contains information about five cases (i.e., workflow instances). The log shows that for four cases (1, 2, 3, and 4), the tasks A, B, C, and D have been executed. For the fifth case, only three tasks are executed: tasks A, E, and D. Each case starts with the execution of A and ends with the execution of D. If task B is executed, then task C is also executed. However, for some cases, task C is executed before task B. Based on the information shown in Table 1 and by making some assumptions about the completeness of the log (i.e., assuming that the cases are representative and a sufficiently large subset of possible behaviors is observed), we can deduce, for example, the process model shown in Fig. 1. The model is represented in terms of a Petri net [39]. The Petri net starts with task A and finishes with task D. These tasks are represented by transitions. After executing A, there is a choice between either executing B and C in parallel, or just executing task E. To execute B and C in parallel, two nonobservable tasks (AND-split and AND-join) have been added. These tasks have been added for routing purposes only and are not present in the workflow log. Note that we assume that two tasks are in parallel if they appear in any order. However, by distinguishing between start events and end events for tasks, it is possible to explicitly detect

1128 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 9, SEPTEMBER 2004

The authors are with the Department of Technology Management, Eindhoven University of Technology, PO Box 513, NL-5600 MB, Eindhoven, The Netherlands. E-mail: {w.m.p.v.d.aalst, A.J.M.M.Weijters, l.maruster}@tm.tue.nl.

Manuscript received 22 Mar. 2002; revised 15 May 2003; accepted 30 July 2003.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 116148.
1041-4347/04/$20.00 2004 IEEE. Published by the IEEE Computer Society.



parallelism. Start events and end events can also be used to indicate that tasks take time. However, to simplify the presentation, we assume tasks to be atomic without losing generality. In fact, in our tool EMiT [4], we refine this even further and assume a customizable transaction model for tasks involving events like "start task," "withdraw task," "resume task," "complete task," etc. [4]. Nevertheless, it is important to realize that such an approach only works if events like these are recorded at the time of their occurrence.

The basic idea behind process mining, also referred to as workflow mining, is to construct Fig. 1 from the information given in Table 1. In this paper, we will present a new algorithm and prove its correctness.

Process mining is useful for at least two reasons. First of all, it could be used as a tool to find out how people and/or procedures really work. Consider, for example, processes supported by an ERP system like SAP (e.g., a procurement process). Such a system logs all transactions, but in many cases does not enforce a specific way of working. In such an environment, process mining could be used to gain insight into the actual process. Another example would be the flow of patients in a hospital. Note that in such an environment, all activities are logged, but information about the underlying process is typically missing. In this context, it is important to stress that management information systems provide information about key performance indicators like resource utilization, flow times, and service levels, but not about the underlying business processes (e.g., causal relations, ordering of activities, etc.). Second, process mining could be used for Delta analysis, i.e., comparing the actual process with some predefined process. Note that in many situations, there is a descriptive or prescriptive process model. Such a model specifies how people and organizations are assumed/expected to work. By comparing the descriptive or prescriptive process model with the discovered model, discrepancies between both can be detected and used to improve the process. Consider, for example, the so-called reference models in the context of SAP. These models describe how the system should be used. Using process mining, it is possible to verify whether this is the case. In fact, process mining could also be used to compare different departments/organizations using the same ERP system.

An additional benefit of process mining is that information about the way people and/or procedures really work and differences between actual processes and predefined processes can be used to trigger Business Process Reengineering (BPR) efforts or to configure "process-aware information systems" (e.g., workflow, ERP, and CRM systems).

Table 1 contains the minimal information we assume to be present. In many applications, the workflow log contains a timestamp for each event and this information can be used to extract additional causality information. Moreover, we are also interested in the relation between attributes of the case and the actual route taken by a particular case. For example, when handling traffic violations: Is the make of a car relevant for the routing of the corresponding traffic violations? (For example, "People driving a Ferrari always pay their fines in time.")

For this simple example, it is quite easy to construct a process model that is able to regenerate the workflow log. For larger workflow models this is much more difficult. For example, if the model exhibits alternative and parallel routing, then the workflow log will typically not contain all possible combinations. Consider 10 tasks which can be executed in parallel. The total number of interleavings is $10! = 3628800$. It is not realistic that each interleaving is present in the log. Moreover, certain paths through the process model may have a low probability and, therefore, remain undetected. Noisy data (i.e., logs containing rare events, exceptions, and/or incorrectly recorded data) can further complicate matters.
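The interleaving count quoted above can be checked with a short calculation. The following is only an illustrative sketch; the four-task cross-check uses hypothetical task names and is not taken from the paper.

```python
import math
from itertools import permutations

# n tasks that may all run in parallel can be recorded in n! different orders.
n_tasks = 10
total = math.factorial(n_tasks)

# Brute-force cross-check on a small, hypothetical instance with 4 tasks.
small = len(set(permutations("ABCD")))

print(total, small)
```

Even for ten parallel tasks, a log would need millions of distinct traces to show every interleaving, which motivates the weaker completeness notion used later in the paper.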

In this paper, we do not focus on issues such as noise. We assume that there is no noise and that the workflow log


TABLE 1. A Workflow Log

Fig. 1. A process model corresponding to the workflow log.



contains sufficient information. Under these ideal circumstances, we investigate whether it is possible to rediscover the workflow process, i.e., for which class of workflow models is it possible to accurately construct the model by merely looking at their logs. This is not as simple as it seems. Consider, for example, the process model shown in Fig. 1. The corresponding workflow log shown in Table 1 does not show any information about the AND-split and the AND-join. Nevertheless, they are needed to accurately describe the process. These and other problems are addressed in this paper. For this purpose, we use workflow nets (WF-nets). WF-nets are a class of Petri nets specifically tailored toward workflow processes. Fig. 1 shows an example of a WF-net.

To illustrate the rediscovery problem we use Fig. 2. Suppose we have a log based on many executions of the process described by a WF-net $WF_1$. Based on this workflow log and using a mining algorithm, we construct a WF-net $WF_2$. An interesting question is whether $WF_1 = WF_2$. In this paper, we explore the class of WF-nets for which $WF_1 = WF_2$. Note that the rediscovery problem is only addressed to explore the theoretical limits of process mining and to test the algorithm presented in this paper. We have used these results to develop tools that can discover unknown processes and have successfully applied these tools to mine real processes.

The remainder of this paper is organized as follows: First, we introduce some preliminaries, i.e., Petri nets and WF-nets. In Section 3, we formalize the problem addressed in this paper. Section 4 discusses the relation between causality detected in the log and places connecting transitions in the WF-net. Based on these results, an algorithm for process mining is presented. The quality of this algorithm is supported by the fact that it is able to rediscover a large class of workflow processes. The paper finishes with an overview of related work and some conclusions.

2 PRELIMINARIES

This section introduces the techniques used in the remainder of this paper. First, we introduce standard Petri-net notations, then we define the class of WF-nets.

2.1 Petri Nets

We use a variant of the classic Petri-net model, namely, Place/Transition nets. For an elaborate introduction to Petri nets, the reader is referred to [12], [37], [39].

Definition 2.1 (P/T-nets).¹ A Place/Transition net, or simply P/T-net, is a tuple $(P, T, F)$, where:

1. $P$ is a finite set of places.
2. $T$ is a finite set of transitions such that $P \cap T = \emptyset$.
3. $F \subseteq (P \times T) \cup (T \times P)$ is a set of directed arcs, called the flow relation.

A marked P/T-net is a pair $(N, s)$, where $N = (P, T, F)$ is a P/T-net and where $s$ is a bag over $P$ denoting the marking of the net. The set of all marked P/T-nets is denoted $\mathcal{N}$.

A marking is a bag over the set of places $P$, i.e., it is a function from $P$ to the natural numbers. We use square brackets for the enumeration of a bag, e.g., $[a^2, b, c^3]$ denotes the bag with two $a$s, one $b$, and three $c$s. The sum of two bags ($X + Y$), the difference ($X - Y$), the presence of an element in a bag ($a \in X$), and the notion of subbags ($X \leq Y$) are defined in a straightforward way and they can handle a mixture of sets and bags.

Let $N = (P, T, F)$ be a P/T-net. Elements of $P \cup T$ are called nodes. A node $x$ is an input node of another node $y$ iff there is a directed arc from $x$ to $y$ (i.e., $(x, y) \in F$). Node $x$ is an output node of $y$ iff $(y, x) \in F$. For any $x \in P \cup T$, $\overset{N}{\bullet} x = \{ y \mid (y, x) \in F \}$ and $x \overset{N}{\bullet} = \{ y \mid (x, y) \in F \}$; the superscript $N$ may be omitted if clear from the context.

Fig. 1 shows a P/T-net consisting of eight places and seven transitions. Transition A has one input place and one output place, transition AND-split has one input place and two output places, and transition AND-join has two input places and one output place. The black dot in the input place of A represents a token. This token denotes the initial marking. The dynamic behavior of such a marked P/T-net is defined by a firing rule.

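Definition 2.2 (Firing rule). Let $(N = (P, T, F), s)$ be a marked P/T-net. Transition $t \in T$ is enabled, denoted $(N, s)[t\rangle$, iff $\bullet t \leq s$. The firing rule $\_\,[\,\_\,\rangle\,\_ \subseteq \mathcal{N} \times T \times \mathcal{N}$ is the smallest relation satisfying for any $(N = (P, T, F), s) \in \mathcal{N}$ and any $t \in T$, $(N, s)[t\rangle \Rightarrow (N, s)\,[t\rangle\,(N, s - \bullet t + t \bullet)$.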
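The firing rule can be made concrete with a small sketch in Python. This is an illustration only, not the authors' implementation: markings are represented as `collections.Counter` bags, and the net is a hypothetical encoding of Fig. 1 in which the place names `c1` to `c6` are invented here because the figure does not name its internal places.

```python
from collections import Counter

# Hypothetical encoding of the net in Fig. 1. Only "i" and "o" are named in
# the paper; c1..c6 are assumed names for the six internal places.
PRE = {
    "A": ["i"], "AND-split": ["c1"], "E": ["c1"],
    "B": ["c2"], "C": ["c3"], "AND-join": ["c4", "c5"], "D": ["c6"],
}
POST = {
    "A": ["c1"], "AND-split": ["c2", "c3"], "E": ["c6"],
    "B": ["c4"], "C": ["c5"], "AND-join": ["c6"], "D": ["o"],
}

def enabled(marking, t):
    """t is enabled iff its preset (as a bag) is a subbag of the marking."""
    need = Counter(PRE[t])
    return all(marking[p] >= k for p, k in need.items())

def fire(marking, t):
    """Return the marking s - pre(t) + post(t); t must be enabled in s."""
    if not enabled(marking, t):
        raise ValueError(f"{t} is not enabled")
    result = Counter(marking)
    result.subtract(Counter(PRE[t]))
    result.update(Counter(POST[t]))
    return +result  # unary plus drops places that hold zero tokens

s0 = Counter({"i": 1})       # initial marking: one token in the source place
s1 = fire(s0, "A")           # one token in c1
s2 = fire(s1, "AND-split")   # one token in c2 and one in c3
```

After firing A, both E and AND-split are enabled in `s1`, but firing either one consumes the single token in `c1` and disables the other.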

In the marking shown in Fig. 1 (i.e., one token in the source place), transition A is enabled and firing this transition removes the token from the input place and puts a token in the output place. In the resulting marking, two


Fig. 2. The rediscovery problem: For which class of WF-nets is it guaranteed that $WF_2$ is equivalent to $WF_1$?

1. In the literature, the class of Petri nets introduced in Definition 2.1 is sometimes referred to as the class of (unlabeled) ordinary P/T-nets to distinguish it from the class of Petri nets that allows more than one arc between a place and a transition.



transitions are enabled: E and AND-split. Although both are enabled, only one can fire. If AND-split fires, one token is consumed and two tokens are produced.

Definition 2.3 (Reachable markings). Let $(N, s_0)$ be a marked P/T-net in $\mathcal{N}$. A marking $s$ is reachable from the initial marking $s_0$ iff there exists a sequence of enabled transitions whose firing leads from $s_0$ to $s$. The set of reachable markings of $(N, s_0)$ is denoted $[N, s_0\rangle$.
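For a bounded net, the set of reachable markings can be enumerated by a breadth-first search over the firing rule. The sketch below is an assumption-laden illustration: it reuses a hypothetical encoding of Fig. 1 (place names `c1` to `c6` are invented), and it would not terminate on an unbounded net.

```python
from collections import Counter, deque

# Hypothetical encoding of Fig. 1; internal place names c1..c6 are assumed.
PRE = {
    "A": ["i"], "AND-split": ["c1"], "E": ["c1"],
    "B": ["c2"], "C": ["c3"], "AND-join": ["c4", "c5"], "D": ["c6"],
}
POST = {
    "A": ["c1"], "AND-split": ["c2", "c3"], "E": ["c6"],
    "B": ["c4"], "C": ["c5"], "AND-join": ["c6"], "D": ["o"],
}

def freeze(marking):
    """Hashable form of a bag: sorted (place, count) pairs with count > 0."""
    return tuple(sorted((p, k) for p, k in marking.items() if k > 0))

def successors(marking):
    """Yield (t, next marking) for every transition t enabled in marking."""
    for t in PRE:
        if all(marking[p] >= k for p, k in Counter(PRE[t]).items()):
            nxt = Counter(marking)
            nxt.subtract(Counter(PRE[t]))
            nxt.update(Counter(POST[t]))
            yield t, +nxt

def reachable(initial):
    """Breadth-first enumeration of all markings reachable from initial."""
    seen = {freeze(initial)}
    queue = deque([initial])
    while queue:
        s = queue.popleft()
        for _, nxt in successors(s):
            key = freeze(nxt)
            if key not in seen:
                seen.add(key)
                queue.append(nxt)
    return seen

markings = reachable(Counter({"i": 1}))
print(len(markings))
```

Under this encoding the search visits the initial marking, the final marking with a single token in `o`, and the intermediate markings in between.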

The marked P/T-net shown in Fig. 1 has eight reachable markings. Sometimes, it is convenient to know the sequence of transitions that are fired in order to reach some given marking. This paper uses the following notations for sequences. Let $A$ be some alphabet of identifiers. A sequence of length $n$, for some natural number $n \in \mathbb{N}$, over alphabet $A$ is a function $\sigma : \{0, \ldots, n-1\} \to A$. The sequence of length zero is called the empty sequence and written $\varepsilon$. For the sake of readability, a sequence of positive length is usually written by juxtaposing the function values: For example, a sequence $\sigma = \{(0, a), (1, a), (2, b)\}$, for $a, b \in A$, is written $aab$. The set of all sequences of arbitrary length over alphabet $A$ is written $A^*$.

Definition 2.4 (Firing sequence). Let $(N, s_0)$ with $N = (P, T, F)$ be a marked P/T-net. A sequence $\sigma \in T^*$ is called a firing sequence of $(N, s_0)$ iff, for some natural number $n \in \mathbb{N}$, there exist markings $s_1, \ldots, s_n$ and transitions $t_1, \ldots, t_n \in T$ such that $\sigma = t_1 \ldots t_n$ and, for all $i$ with $0 \leq i < n$, $(N, s_i)[t_{i+1}\rangle$ and $s_{i+1} = s_i - \bullet t_{i+1} + t_{i+1} \bullet$. (Note that $n = 0$ implies that $\sigma = \varepsilon$ and that $\varepsilon$ is a firing sequence of $(N, s_0)$.) Sequence $\sigma$ is said to be enabled in marking $s_0$, denoted $(N, s_0)[\sigma\rangle$. Firing the sequence $\sigma$ results in a marking $s_n$, denoted $(N, s_0)\,[\sigma\rangle\,(N, s_n)$.

Definition 2.5 (Connectedness). A net $N = (P, T, F)$ is weakly connected, or simply connected, iff, for every two nodes $x$ and $y$ in $P \cup T$, $x (F \cup F^{-1})^* y$, where $R^{-1}$ is the inverse and $R^*$ the reflexive and transitive closure of a relation $R$. Net $N$ is strongly connected iff, for every two nodes $x$ and $y$, $x F^* y$.

We assume that all nets are weakly connected and have at least two nodes. The P/T-net shown in Fig. 1 is connected, but not strongly connected because there is no directed path from the sink place to the source place, or from D to A, etc.

Definition 2.6 (Boundedness, safeness). A marked net $(N = (P, T, F), s)$ is bounded iff the set of reachable markings $[N, s\rangle$ is finite. It is safe iff, for any $s' \in [N, s\rangle$ and any $p \in P$, $s'(p) \leq 1$. Note that safeness implies boundedness.

The marked P/T-net shown in Fig. 1 is safe (and, therefore, also bounded) because none of the eight reachable states puts more than one token in a place.

Definition 2.7 (Dead transitions, liveness). Let $(N = (P, T, F), s)$ be a marked P/T-net. A transition $t \in T$ is dead in $(N, s)$ iff there is no reachable marking $s' \in [N, s\rangle$ such that $(N, s')[t\rangle$. $(N, s)$ is live iff, for every reachable marking $s' \in [N, s\rangle$ and $t \in T$, there is a reachable marking $s'' \in [N, s'\rangle$ such that $(N, s'')[t\rangle$. Note that liveness implies the absence of dead transitions.

None of the transitions in the marked P/T-net shown in Fig. 1 is dead. However, the marked P/T-net is not live since it is not possible to enable each transition continuously.

2.2 Workflow Nets

Most workflow systems offer standard building blocks such as the AND-split, AND-join, OR-split, and OR-join [5], [15], [26], [28]. These are used to model sequential, conditional, parallel, and iterative routing (WFMC [15]). Clearly, a Petri net can be used to specify the routing of cases. Tasks are modeled by transitions and causal dependencies are modeled by places and arcs. In fact, a place corresponds to a condition which can be used as pre- and/or postcondition for tasks. An AND-split corresponds to a transition with two or more output places, and an AND-join corresponds to a transition with two or more input places. OR-splits/OR-joins correspond to places with multiple outgoing/ingoing arcs. Given the close relation between tasks and transitions, we use the terms interchangeably.

A Petri net which models the control-flow dimension of a workflow is called a WorkFlow net (WF-net). It should be noted that a WF-net specifies the dynamic behavior of a single case in isolation.

Definition 2.8 (Workflow nets). Let $N = (P, T, F)$ be a P/T-net and $\bar{t}$ a fresh identifier not in $P \cup T$. $N$ is a workflow net (WF-net) iff:

1. object creation: $P$ contains an input place $i$ such that $\bullet i = \emptyset$,
2. object completion: $P$ contains an output place $o$ such that $o \bullet = \emptyset$, and
3. connectedness: $\bar{N} = (P, T \cup \{\bar{t}\}, F \cup \{(o, \bar{t}), (\bar{t}, i)\})$ is strongly connected.

The P/T-net shown in Fig. 1 is a WF-net. Note that, although the net is not strongly connected, the short-circuited net $\bar{N} = (P, T \cup \{\bar{t}\}, F \cup \{(o, \bar{t}), (\bar{t}, i)\})$ (i.e., the net with transition $\bar{t}$ connecting $o$ to $i$) is strongly connected. Even if a net meets all the syntactical requirements stated in Definition 2.8, the corresponding process may exhibit errors such as deadlocks, tasks which can never become active, livelocks, garbage being left in the process after termination, etc. Therefore, we define the following correctness criterion.

Definition 2.9 (Sound). Let $N = (P, T, F)$ be a WF-net with input place $i$ and output place $o$. $N$ is sound iff:

1. safeness: $(N, [i])$ is safe,
2. proper completion: for any marking $s \in [N, [i]\rangle$, $o \in s$ implies $s = [o]$,
3. option to complete: for any marking $s \in [N, [i]\rangle$, $[o] \in [N, s\rangle$, and
4. absence of dead tasks: $(N, [i])$ contains no dead transitions.

The set of all sound WF-nets is denoted $\mathcal{W}$.
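The four soundness conditions of Definition 2.9 can be checked by brute force on small nets by exploring the reachability graph. The sketch below is not the analysis technique used by tools such as Woflan; it is a naive illustration that assumes an ordinary net (at most one arc between a place and a transition), a hypothetical encoding of Fig. 1 with invented place names `c1` to `c6`, and a second, deliberately broken net made up for contrast.

```python
from collections import Counter, deque

def freeze(marking):
    """Hashable form of a bag: sorted (place, count) pairs with count > 0."""
    return tuple(sorted((p, k) for p, k in marking.items() if k > 0))

def successors(marking, pre, post):
    """Yield (t, marking after firing t) for every enabled transition t."""
    for t in pre:
        if all(marking[p] >= 1 for p in pre[t]):
            nxt = Counter(marking)
            nxt.subtract(pre[t])
            nxt.update(post[t])
            yield t, +nxt

def is_sound(pre, post, source="i", sink="o"):
    init = Counter({source: 1})
    final = freeze(Counter({sink: 1}))
    graph = {freeze(init): set()}  # marking -> set of successor markings
    fired = set()
    queue = deque([init])
    while queue:
        s = queue.popleft()
        if any(k > 1 for k in s.values()):
            return False  # condition 1 (safeness) violated
        if s[sink] >= 1 and freeze(s) != final:
            return False  # condition 2 (proper completion) violated
        for t, nxt in successors(s, pre, post):
            fired.add(t)
            key = freeze(nxt)
            graph[freeze(s)].add(key)
            if key not in graph:
                graph[key] = set()
                queue.append(nxt)
    # Condition 3: [o] must be reachable from every reachable marking.
    can_finish = {final} if final in graph else set()
    changed = True
    while changed:
        changed = False
        for m, succ in graph.items():
            if m not in can_finish and succ & can_finish:
                can_finish.add(m)
                changed = True
    if can_finish != set(graph):
        return False
    # Condition 4: every transition fires in some reachable marking.
    return fired == set(pre)

# Hypothetical encoding of Fig. 1 (internal place names are assumed).
PRE = {"A": ["i"], "AND-split": ["c1"], "E": ["c1"], "B": ["c2"],
       "C": ["c3"], "AND-join": ["c4", "c5"], "D": ["c6"]}
POST = {"A": ["c1"], "AND-split": ["c2", "c3"], "E": ["c6"], "B": ["c4"],
        "C": ["c5"], "AND-join": ["c6"], "D": ["o"]}

# Made-up unsound variant: an AND-split whose branches merge in one place.
BAD_PRE = {"A": ["i"], "AND-split": ["c1"], "B": ["c2"], "C": ["c3"], "D": ["c6"]}
BAD_POST = {"A": ["c1"], "AND-split": ["c2", "c3"], "B": ["c6"], "C": ["c6"], "D": ["o"]}

print(is_sound(PRE, POST), is_sound(BAD_PRE, BAD_POST))
```

The search stops as soon as a marking with more than one token in a place is found, so it terminates even when the net under test is not safe.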

The WF-net shown in Fig. 1 is sound. Soundness can be verified using standard Petri-net-based analysis techniques. In fact, soundness corresponds to liveness and safeness of the corresponding short-circuited net [1], [2], [5]. This way, efficient algorithms and tools can be applied. An example of a tool tailored toward the analysis of WF-nets is Woflan [47].




3 THE REDISCOVERY PROBLEM

After introducing some preliminaries, we return to the topic of this paper: workflow mining. The goal of workflow mining is to find a workflow model (e.g., a WF-net) on the basis of a workflow log. Table 1 shows an example of a workflow log. Note that the ordering of events within a case is relevant, while the ordering of events among cases is of no importance. Therefore, we define a workflow log as follows.

Definition 3.1 (Workflow trace, Workflow log). Let $T$ be a set of tasks. $\sigma \in T^*$ is a workflow trace and $W \in \mathcal{P}(T^*)$ is a workflow log.²

The workflow trace of case 1 in Table 1 is $ABCD$. The workflow log corresponding to Table 1 is $\{ABCD, ACBD, AED\}$.

Note that in this paper, we abstract from the identity of cases. Clearly, the identity and the attributes of a case are relevant for workflow mining. However, for the theoretical results in this paper, we can abstract from this. For similar reasons, we abstract from the frequency of workflow traces. In Table 1, workflow trace $ABCD$ appears twice (case 1 and case 3), workflow trace $ACBD$ also appears twice (case 2 and case 4), and workflow trace $AED$ (case 5) appears only once. These frequencies are not registered in the workflow log $\{ABCD, ACBD, AED\}$. Note that when dealing with noise, frequencies are of the utmost importance. However, in this paper, we do not deal with issues such as noise. Therefore, this abstraction is made to simplify notation. For readers interested in how we deal with noise and related issues, we refer to [31], [32], [48], [49], [50].
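The abstraction from case identity and trace frequency can be sketched as follows. The event records below are an assumption: they are one possible recording order that is consistent with the per-case traces stated in the text (cases 1 and 3: ABCD, cases 2 and 4: ACBD, case 5: AED), not a verified copy of Table 1.

```python
from collections import defaultdict

# Assumed (case id, task) records in recording order; events of different
# cases may interleave, but events within one case keep their order.
events = [
    (1, "A"), (2, "A"), (3, "A"), (3, "B"), (1, "B"), (1, "C"),
    (2, "C"), (4, "A"), (2, "B"), (2, "D"), (5, "A"), (4, "C"),
    (1, "D"), (3, "C"), (3, "D"), (4, "B"), (5, "E"), (5, "D"), (4, "D"),
]

# Group events per case, preserving the order within each case.
traces = defaultdict(list)
for case, task in events:
    traces[case].append(task)

# A workflow log is a set of traces: case ids and frequencies are dropped.
log = {"".join(tasks) for tasks in traces.values()}
print(sorted(log))
```

Using a set rather than a list is the design choice that discards frequencies; a noise-aware approach would keep a multiset of traces instead.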

To find a workflow model on the basis of a workflow log, the log should be analyzed for causal dependencies, e.g., if a task is always followed by another task, it is likely that there is a causal relation between both tasks. To analyze these relations, we introduce the following notations.

Definition 3.2 (Log-based ordering relations). Let $W$ be a workflow log over $T$, i.e., $W \in \mathcal{P}(T^*)$. Let $a, b \in T$:

. $a >_W b$ iff there is a trace $\sigma = t_1 t_2 t_3 \ldots t_{n-1}$ and $i \in \{1, \ldots, n-2\}$ such that $\sigma \in W$ and $t_i = a$ and $t_{i+1} = b$,
. $a \to_W b$ iff $a >_W b$ and $b \not>_W a$,
. $a \#_W b$ iff $a \not>_W b$ and $b \not>_W a$, and
. $a \|_W b$ iff $a >_W b$ and $b >_W a$.
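The four relations of Definition 3.2 can be computed directly from a log. The following is a minimal sketch that assumes each task is a single character and each trace is a string; the function and variable names are illustrative only.

```python
from itertools import product

def ordering_relations(log):
    """Return (follows, causal, unrelated, parallel) for a set of traces."""
    tasks = sorted({t for trace in log for t in trace})
    # a > b: b directly follows a in at least one trace.
    follows = {(x, y) for trace in log for x, y in zip(trace, trace[1:])}
    causal, unrelated, parallel = set(), set(), set()
    for a, b in product(tasks, repeat=2):
        ab, ba = (a, b) in follows, (b, a) in follows
        if ab and not ba:
            causal.add((a, b))      # a -> b
        elif ab and ba:
            parallel.add((a, b))    # a || b
        elif not ab and not ba:
            unrelated.add((a, b))   # a # b
        # remaining case (ba only) is the inverse causal relation b -> a
    return follows, causal, unrelated, parallel

follows, causal, unrelated, parallel = ordering_relations({"ABCD", "ACBD", "AED"})
print(sorted(causal), sorted(parallel))
```

For the log of Table 1, B and C end up in the parallel relation because each directly follows the other in some trace, while pairs such as (B, E) never follow each other and fall into the # relation.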

Consider the workflow log $W = \{ABCD, ACBD, AED\}$ (i.e., the log shown in Table 1). Relation $>_W$ describes which tasks appeared in sequence (one directly following the other). Clearly, $A >_W B$, $A >_W C$, $A >_W E$, $B >_W C$, $B >_W D$, $C >_W B$, $C >_W D$, and $E >_W D$. Relation $\to_W$ can be computed from $>_W$ and is referred to as the (direct) causal relation derived from workflow log $W$. $A \to_W B$, $A \to_W C$, $A \to_W E$, $B \to_W D$, $C \to_W D$, and $E \to_W D$. Note that $B \not\to_W C$ because $C >_W B$. Relation $\|_W$ suggests potential parallelism. For log $W$, tasks B and C seem to be in parallel, i.e., $B \|_W C$ and $C \|_W B$. If two tasks can follow each other directly in any order, then all possible interleavings are present and, therefore, they are likely to be in parallel. Relation $\#_W$ gives pairs of transitions that never follow each other directly. This means that there are no direct causal relations and parallelism is unlikely.

Property 3.1. Let $W$ be a workflow log over $T$. For any $a, b \in T$: $a \to_W b$, or $b \to_W a$, or $a \#_W b$, or $a \|_W b$. Moreover, the relations $\to_W$, $\to_W^{-1}$, $\#_W$, and $\|_W$ are mutually exclusive and partition $T \times T$.³

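This property can easily be verified. Note that $\to_W = (>_W \setminus >_W^{-1})$, $\to_W^{-1} = (>_W^{-1} \setminus >_W)$, $\#_W = (T \times T) \setminus (>_W \cup >_W^{-1})$, and $\|_W = (>_W \cap >_W^{-1})$. Therefore, $T \times T = \to_W \cup \to_W^{-1} \cup \#_W \cup \|_W$. If no confusion is possible, the subscript $W$ is omitted.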

To simplify the use of logs and sequences, we introduce some additional notations.

Definition 3.3 ($\in$, first, last). Let $A$ be a set, $a \in A$, and $\sigma = a_1 a_2 \ldots a_n \in A^*$ a sequence over $A$ of length $n$. $\in$, first, and last are defined as follows:

1. $a \in \sigma$ iff $a \in \{a_1, a_2, \ldots, a_n\}$,
2. $first(\sigma) = a_1$, if $n \geq 1$, and
3. $last(\sigma) = a_n$, if $n \geq 1$.

To reason about the quality of a workflow mining algorithm, we need to make assumptions about the completeness of a log. For a complex process, a handful of traces will not suffice to discover the exact behavior of the process. Relations $\to_W$, $\to_W^{-1}$, $\#_W$, and $\|_W$ will be crucial information for any workflow-mining algorithm. Since these relations can be derived from $>_W$, we assume the log to be complete with respect to this relation.

Definition 3.4 (Complete workflow log). Let $N = (P, T, F)$ be a sound WF-net, i.e., $N \in \mathcal{W}$. $W$ is a workflow log of $N$ iff $W \in \mathcal{P}(T^*)$ and every trace $\sigma \in W$ is a firing sequence of $N$ starting in state $[i]$ and ending in $[o]$, i.e., $(N, [i])\,[\sigma\rangle\,(N, [o])$. $W$ is a complete workflow log of $N$ iff 1) for any workflow log $W'$ of $N$: $>_{W'} \subseteq >_W$, and 2) for any $t \in T$ there is a $\sigma \in W$ such that $t \in \sigma$.
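For a small sound and safe net, completeness of a log can be tested by comparing the log's directly-follows relation with the one the net itself allows. The sketch below is an illustration under stated assumptions: it uses the net $N_1$ given in Section 4.1, represents markings as sets (which is only valid for safe nets), relies on soundness so that every reachable step lies on some run from $[i]$ to $[o]$, and does not check that the traces are actually firing sequences of the net.

```python
# N_1 from Section 4.1, written as preset/postset maps per transition.
PRE = {"A": {"i"}, "B": {"p1"}, "C": {"p2"}, "D": {"p3", "p4"}}
POST = {"A": {"p1", "p2"}, "B": {"p3"}, "C": {"p4"}, "D": {"o"}}

def log_follows(log):
    """Directly-follows pairs observed in the log."""
    return {(x, y) for trace in log for x, y in zip(trace, trace[1:])}

def net_follows(pre, post, source="i"):
    """Pairs (t, u) such that u can fire directly after t in the net."""
    start = frozenset({source})
    seen, stack, pairs = {start}, [start], set()
    while stack:
        s = stack.pop()
        for t in pre:
            if pre[t] <= s:  # t is enabled in s
                nxt = frozenset((s - pre[t]) | post[t])
                pairs |= {(t, u) for u in pre if pre[u] <= nxt}
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return pairs

def is_complete(log, pre, post):
    """Both requirements of the completeness definition, for a safe sound net."""
    tasks_seen = {t for trace in log for t in trace}
    return net_follows(pre, post) <= log_follows(log) and set(pre) <= tasks_seen

print(is_complete({"ABCD", "ACBD"}, PRE, POST), is_complete({"ABCD"}, PRE, POST))
```

The single trace ABCD is not complete for this net because it never shows C directly followed by B, even though the net allows it.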

A workflow log of a sound WF-net only contains behaviors that can be exhibited by the corresponding process. A workflow log is complete if all tasks that potentially directly follow each other, in fact, directly follow each other in some trace in the log. Note that transitions that connect the input place $i$ of a WF-net to its output place $o$ are "invisible" for $>_W$. Therefore, the second requirement has been added. If there are no such transitions, this requirement can be dropped as is illustrated by the following property.

Property 3.2. Let $N = (P, T, F)$ be a sound WF-net. If $W$ is a complete workflow log of $N$, then

$\{ t \in T \mid \exists_{t' \in T}\; t >_W t' \lor t' >_W t \} = \{ t \in T \mid t \notin i\bullet \cap \bullet o \}$.

Proof. Consider a transition $t \in T$. Since $N$ is sound, there is a firing sequence containing $t$. If $t \in i\bullet \cap \bullet o$, then this


2. $\mathcal{P}(T^*)$ is the powerset of $T^*$, i.e., $W \subseteq T^*$.

3. $\to_W^{-1}$ is the inverse of relation $\to_W$, i.e., $\to_W^{-1} = \{ (y, x) \in T \times T \mid x \to_W y \}$.



sequence has length 1 and $t$ cannot appear in $>_W$ because this is the only firing sequence containing $t$. If $t \notin i\bullet \cap \bullet o$, then the sequence has at least length 2, i.e., $t$ is directly preceded or followed by a transition and, therefore, appears in $>_W$. □

The definition of completeness given in Definition 3.4 may seem arbitrary, but it is not. Note that it would be unrealistic to assume that all possible firing sequences are present in the log. First of all, the number of possible sequences may be infinite (in case of loops). Second, parallel processes typically have an exponential number of states and, therefore, the number of possible firing sequences may be enormous. Finally, even if there is no parallelism and no loops but just $N$ binary choices, the number of possible sequences may be $2^N$. Therefore, we need a weaker notion of completeness. If there is no parallelism and no loops but just $N$ binary choices, the number of cases required may be as little as 2 using our notion of completeness. Of course, for a large $N$, it is unlikely that all choices are observed in just two cases, but it still indicates that this requirement is considerably less demanding than observing all possible sequences. The same holds for processes with loops and parallelism. If a process has $N$ sequential fragments which each exhibit parallelism, the number of cases needed to observe all possible combinations is exponential in the number of fragments. Using our notion of completeness, this is not the case. One could consider even weaker notions of completeness; however, as will be shown in the remainder, even this notion of completeness (i.e., Definition 3.4) is in some situations too weak to detect certain advanced routing patterns.
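The gap between "all sequences" and "complete with respect to the directly-follows relation" can be illustrated on a made-up process. The sketch assumes a hypothetical sequential process with $N$ binary choices, where the k-th choice picks task `ak` or `bk` and choices are separated by routing tasks `s0`, `s1`, ..., `sN`; this structure is an assumption chosen for illustration, not an example from the paper.

```python
from itertools import product

def follows(log):
    """Directly-follows pairs observed in a set of traces (tuples of tasks)."""
    return {(x, y) for trace in log for x, y in zip(trace, trace[1:])}

def trace_for(picks):
    """Build the trace s0, <pick>1, s1, <pick>2, s2, ... for picks like 'abba'."""
    trace = ["s0"]
    for k, pick in enumerate(picks, start=1):
        trace += [f"{pick}{k}", f"s{k}"]
    return tuple(trace)

n = 8
# Every combination of choices: 2^n distinct traces.
full_log = {trace_for(picks) for picks in product("ab", repeat=n)}
# Just two cases: always the 'a' branch, and always the 'b' branch.
two_cases = {trace_for("a" * n), trace_for("b" * n)}

print(len(full_log), follows(two_cases) == follows(full_log))
```

In this hypothetical process the two extreme cases already exhibit every directly-follows pair and every task, so they would form a complete log in the sense of Definition 3.4, whereas observing all sequences would require $2^8 = 256$ cases.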

We will formulate the rediscovery problem introduced in Section 1 assuming a complete workflow log as described in Definition 3.4. Before formulating this problem, we define what it means for a WF-net to be rediscovered.

Definition 3.5 (Ability to rediscover). Let $N = (P, T, F)$ be a sound WF-net, i.e., $N \in \mathcal{W}$, and let $\alpha$ be a mining algorithm which maps workflow logs of $N$ onto sound WF-nets, i.e., $\alpha : \mathcal{P}(T^*) \to \mathcal{W}$. If for any complete workflow log $W$ of $N$, the mining algorithm returns $N$ (modulo renaming of places), then $\alpha$ is able to rediscover $N$.

Note that no mining algorithm is able to find names of places. Therefore, we ignore place names, i.e., $\alpha$ is able to rediscover $N$ iff $\alpha(W) = N$ modulo renaming of places.

The goal of this paper is twofold. First of all, we are looking for a mining algorithm that is able to rediscover sound WF-nets, i.e., based on a complete workflow log, the corresponding workflow process model can be derived. Second, given such an algorithm, we want to indicate the class of workflow nets which can be rediscovered. Clearly, this class should be as large as possible. Note that there is no mining algorithm which is able to rediscover all sound WF-nets. For example, if in Fig. 1 we add a place $p$ connecting transitions A and D, there is no mining algorithm able to detect $p$ since this place is implicit, i.e., the addition of the place does not change the behavior of the net and, therefore, is not visible in the log.

To conclude, we summarize the rediscovery problem: "Find a mining algorithm able to rediscover a large class of sound WF-nets on the basis of complete workflow logs." This problem was illustrated in the introduction using Fig. 2.

4 WORKFLOW MINING

In this section, the rediscovery problem is tackled. Before we present a mining algorithm able to rediscover a large class of sound WF-nets, we investigate the relation between the causal relations detected in the log (i.e., $\to_W$) and the presence of places connecting transitions. First, we show that causal relations in $\to_W$ imply the presence of places. Then, we explore the class of nets for which the reverse also holds. Based on these observations, we present a mining algorithm.

4.1 Causal Relations Imply Connecting Places

If there is a causal relation between two transitions according to the workflow log, then there has to be a place connecting these two transitions.

Theorem 4.1. Let $N = (P, T, F)$ be a sound WF-net and let $W$ be a complete workflow log of $N$. For any $a, b \in T$: $a \to_W b$ implies $a\bullet \cap \bullet b \neq \emptyset$.

Proof. Assume $a \to_W b$ and $a\bullet \cap \bullet b = \emptyset$. We will show that this leads to a contradiction and, thus, prove the theorem. Since $a > b$, there is a firing sequence $\sigma = t_1 t_2 t_3 \ldots t_{n-1}$ and $i \in \{1, \ldots, n-2\}$ such that $\sigma \in W$ and $t_i = a$ and $t_{i+1} = b$. Let $s$ be the state just before firing $a$, i.e., $(N, [i])\,[\sigma'\rangle\,(N, s)$ with $\sigma' = t_1 \ldots t_{i-1}$. Let $s'$ be the marking after firing $b$ in state $s$, i.e., $(N, s)\,[b\rangle\,(N, s')$. Note that $b$ is enabled in $s$ because it is enabled after firing $a$ and $a\bullet \cap \bullet b = \emptyset$ (i.e., $a$ does not produce tokens for any of the input places of $b$). $a$ cannot be enabled in $s'$; otherwise, $b > a$ and not $a \to_W b$. Since $a$ is enabled in $s$ but not in $s'$, $b$ consumes a token from an input place of $a$ and does not return it, i.e., $(\bullet b \setminus b\bullet) \cap \bullet a \neq \emptyset$. There is a place $p$ such that $p \in \bullet a$, $p \in \bullet b$, and $p \notin b\bullet$. Moreover, $a\bullet \cap \bullet b = \emptyset$. Therefore, $p \notin a\bullet$. Since the net is safe, $p$ contains precisely one token in marking $s$. This token is consumed by $t_i = a$ and not returned. Hence, $b$ cannot be enabled after firing $t_i$. Therefore, $\sigma$ cannot be a firing sequence of $N$ starting in $[i]$. □

Let

$N_1 = (\{i, p_1, p_2, p_3, p_4, o\}, \{A, B, C, D\}, \{(i, A), (A, p_1), (A, p_2), (p_1, B), (B, p_3), (p_2, C), (C, p_4), (p_3, D), (p_4, D), (D, o)\})$.

(This is the WF-net with B and C in parallel, see $N_1$ in Fig. 4.) $W_1 = \{ABCD, ACBD\}$ is a complete log over $N_1$. Since $A \to_{W_1} B$, there has to be a place between A and B. This place corresponds to $p_1$ in $N_1$. Let

$N_2 = (\{i, p_1, p_2, o\}, \{A, B, C, D\}, \{(i, A), (A, p_1), (p_1, B), (B, p_2), (p_1, C), (C, p_2), (p_2, D), (D, o)\})$.

(This is the WF-net with a choice between B and C, see $N_2$ in Fig. 4.) $W_2 = \{ABD, ACD\}$ is a complete log over $N_2$. Since $A \to_{W_2} B$, there has to be a place between A and B. Similarly, $A \to_{W_2} C$ and, therefore, there has to be a place




between A and C. Both places correspond to $p_1$ in $N_2$. Note that in the first example ($N_1$/$W_1$), the two causal relations $A \to_{W_1} B$ and $A \to_{W_1} C$ correspond to two different places, while in the second example ($N_2$/$W_2$), the two causal relations $A \to_{W_2} B$ and $A \to_{W_2} C$ correspond to a single place.
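The statement of Theorem 4.1 can be checked on the two example nets. The sketch below encodes the flow relations of $N_1$ and $N_2$ as given in the text and computes, for every causal pair derived from the log, the places in $a\bullet \cap \bullet b$; the helper names are illustrative and traces are assumed to be strings of single-character tasks.

```python
# Flow relations of N_1 (B and C in parallel) and N_2 (choice between B and C).
F1 = {("i", "A"), ("A", "p1"), ("A", "p2"), ("p1", "B"), ("B", "p3"),
      ("p2", "C"), ("C", "p4"), ("p3", "D"), ("p4", "D"), ("D", "o")}
F2 = {("i", "A"), ("A", "p1"), ("p1", "B"), ("B", "p2"),
      ("p1", "C"), ("C", "p2"), ("p2", "D"), ("D", "o")}

def preset(flow, x):
    """Input nodes of x: all y with an arc (y, x)."""
    return {y for (y, z) in flow if z == x}

def postset(flow, x):
    """Output nodes of x: all z with an arc (x, z)."""
    return {z for (y, z) in flow if y == x}

def causal(log):
    """Pairs (a, b) with a > b and not b > a."""
    gt = {(x, y) for trace in log for x, y in zip(trace, trace[1:])}
    return {(a, b) for (a, b) in gt if (b, a) not in gt}

def connecting_places(flow, log):
    """Map each causal pair (a, b) to the places in post(a) & pre(b)."""
    return {(a, b): postset(flow, a) & preset(flow, b) for (a, b) in causal(log)}

places1 = connecting_places(F1, {"ABCD", "ACBD"})
places2 = connecting_places(F2, {"ABD", "ACD"})
print(places1[("A", "B")], places1[("A", "C")], places2[("A", "B")], places2[("A", "C")])
```

Every causal pair maps to a nonempty set of places in both nets. In $N_1$ the pairs (A, B) and (A, C) are witnessed by different places, while in $N_2$ they share one place.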

4.2 Connecting Places Often Imply Causal Relations

In this section, we investigate which places can be detected by simply inspecting the log. Clearly, not all places can be detected. For example, places may be implicit, which means that they do not affect the behavior of the process. These places remain undetected. Therefore, we limit our investigation to WF-nets without implicit places.

Definition 4.1 (Implicit place). Let $N = (P, T, F)$ be a P/T-net with initial marking $s$. A place $p \in P$ is called implicit in $(N, s)$ iff, for all reachable markings $s' \in [N, s\rangle$ and transitions $t \in p\bullet$, $s' \geq \bullet t \setminus \{p\} \Rightarrow s' \geq \bullet t$.

Fig. 1 contains no implicit places. However, as indicated before, adding a place $p$ connecting transition $A$ and $D$ yields an implicit place. No mining algorithm is able to detect $p$ since the addition of the place does not change the behavior of the net and, therefore, is not visible in the log.

For the rediscovery problem, it is very important that the structure of the WF-net clearly reflects its behavior. Therefore, we also rule out the constructs shown in Fig. 3. The left construct illustrates the constraint that choice and synchronization should never meet. If two transitions share an input place and, therefore, fight for the same token, they should not require synchronization. This means that choices (places with multiple output transitions) should not be mixed with synchronizations. The right-hand construct in Fig. 3 illustrates the constraint that if there is a synchronization, all preceding transitions should have fired, i.e., it is not allowed to have synchronizations directly preceded by an OR-join. WF-nets which satisfy these requirements are named structured workflow nets.

Definition 4.2 (SWF-net). A WF-net $N = (P, T, F)$ is an SWF-net (Structured workflow net) iff:

1. For all $p \in P$ and $t \in T$ with $(p, t) \in F$: $|p\bullet| > 1$ implies $|\bullet t| = 1$.

2. For all $p \in P$ and $t \in T$ with $(p, t) \in F$: $|\bullet t| > 1$ implies $|\bullet p| = 1$.

3. There are no implicit places.

At first sight, the three requirements in Definition 4.2 seem quite restrictive. From a practical point of view, this is not the case. First of all, SWF-nets allow for all routing constructs encountered in practice, i.e., sequential, parallel, conditional, and iterative routing are possible and the basic workflow building blocks (AND-split, AND-join, OR-split, and OR-join) are supported. Second, WF-nets that are not SWF-nets are typically difficult to understand and should be avoided, if possible. Third, many workflow management systems only allow for workflow processes that correspond to SWF-nets. The latter observation can be explained by the fact that most workflow management systems use a language with separate building blocks for OR-splits and AND-joins. Finally, there is a very pragmatic argument. If we drop any of the requirements stated in Definition 4.2, relation $>_W$ does not contain enough information to successfully mine all processes in the resulting class.

The reader familiar with Petri nets will observe that SWF-nets belong to the class of free-choice nets [12]. This allows us to use efficient analysis techniques and advanced theoretical results. For example, using these results, it is possible to decide soundness in polynomial time [2].
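Requirements 1 and 2 of Definition 4.2 are purely structural, so they can be checked directly on the flow relation. The sketch below does only that; it does not test requirement 3 (implicit places), which depends on the reachable markings. The helper names `pre`, `post`, and `satisfies_structural_swf` and the encoding of a net as a set of places plus a set of arcs are assumptions made for illustration.

```python
def pre(node, flow):
    """Preset of a node: all x with an arc (x, node)."""
    return {x for (x, y) in flow if y == node}

def post(node, flow):
    """Postset of a node: all y with an arc (node, y)."""
    return {y for (x, y) in flow if x == node}

def satisfies_structural_swf(places, flow):
    """Check requirements 1 and 2 of Definition 4.2 (not requirement 3)."""
    for (p, t) in flow:
        if p not in places:
            continue  # only arcs from a place to a transition are constrained
        if len(post(p, flow)) > 1 and len(pre(t, flow)) != 1:
            return False  # requirement 1: choice mixed with synchronization
        if len(pre(t, flow)) > 1 and len(pre(p, flow)) != 1:
            return False  # requirement 2: synchronization preceded by an OR-join
    return True

# N1 from Section 4.1 (B and C in parallel)
P1 = {"i", "p1", "p2", "p3", "p4", "o"}
F1 = {("i", "A"), ("A", "p1"), ("A", "p2"), ("p1", "B"), ("B", "p3"),
      ("p2", "C"), ("C", "p4"), ("p3", "D"), ("p4", "D"), ("D", "o")}
print(satisfies_structural_swf(P1, F1))
```

A fragment in which a place `p` feeds both `t1` and `t2` while `t2` also needs a token from a second place `q` fails requirement 1, which corresponds to the left-hand construct that the text rules out.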

SWF-nets also satisfy another interesting property.

Property 4.1. Let $N = (P, T, F)$ be an SWF-net. For any $a, b \in T$ and $p_1, p_2 \in P$: if $p_1 \in a\bullet \cap \bullet b$ and $p_2 \in a\bullet \cap \bullet b$, then $p_1 = p_2$.

This property follows directly from the definition of SWF-nets and states that no two transitions are connected by multiple places. This property illustrates that the structure of an SWF-net clearly reflects its behavior and vice versa. This is exactly what we need to be able to rediscover a WF-net from its log.

Fig. 3. Two constructs not allowed in SWF-nets.

Fig. 4. Five sound SWF-nets.

We already showed that causal relations in $\rightarrow_W$ imply the presence of places. Now, we try to prove the reverse for the class of SWF-nets. First, we focus on the relation between the presence of places and $>_W$.

Theorem 4.2. Let $N = (P, T, F)$ be a sound SWF-net and let $W$ be a complete workflow log of $N$. For any $a, b \in T$: $a\bullet \cap \bullet b \neq \emptyset$ implies $a >_W b$.

Proof. See [6]. $\Box$

Unfortunately, $a\bullet \cap \bullet b \neq \emptyset$ does not imply $a \rightarrow_W b$. To illustrate this, consider Fig. 4. For the first two nets (i.e., $N_1$ and $N_2$), two tasks are connected iff there is a causal relation. This does not hold for $N_3$ and $N_4$. In $N_3$, $A \rightarrow_{W_3} B$, $A \rightarrow_{W_3} D$, and $B \rightarrow_{W_3} D$. However, not $B \rightarrow_{W_3} B$. Nevertheless, there is a place connecting $B$ to $B$. In $N_4$, although there are places connecting $B$ to $C$ and vice versa, $B \not\rightarrow_{W_4} C$ and $C \not\rightarrow_{W_4} B$. These examples indicate that loops of length one (see $N_3$) and length two (see $N_4$) are harmful. Fortunately, loops of length three or longer are no problem, as is illustrated in the following theorem.

Theorem 4.3. Let $N = (P, T, F)$ be a sound SWF-net and let $W$ be a complete workflow log of $N$. For any $a, b \in T$: $a\bullet \cap \bullet b \neq \emptyset$ and $b\bullet \cap \bullet a = \emptyset$ implies $a \rightarrow_W b$.

Proof. See [6]. $\Box$

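Acyclic nets have no loops of length one or length two. Therefore, it is easy to derive the following property.

Property 4.2. Let $N = (P, T, F)$ be an acyclic sound SWF-net and let $W$ be a complete workflow log of $N$. For any $a, b \in T$: $a\bullet \cap \bullet b \neq \emptyset$ iff $a \rightarrow_W b$.

The results presented thus far focus on the correspondence between connecting places and causal relations. However, causality ($\rightarrow_W$) is just one of the four log-based ordering relations defined in Definition 4.2. The following theorem explores the relation between the sharing of input and output places and $\#_W$.

Theorem 4.4. Let $N = (P, T, F)$ be a sound SWF-net such that for any $a, b \in T$: $a\bullet \cap \bullet b = \emptyset$ or $b\bullet \cap \bullet a = \emptyset$, and let $W$ be a complete workflow log of $N$.

1. If $a, b \in T$ and $a\bullet \cap b\bullet \neq \emptyset$, then $a \#_W b$.
2. If $a, b \in T$ and $\bullet a \cap \bullet b \neq \emptyset$, then $a \#_W b$.
3. If $a, b, t \in T$, $a \rightarrow_W t$, $b \rightarrow_W t$, and $a \#_W b$, then $a\bullet \cap b\bullet \cap \bullet t \neq \emptyset$.
4. If $a, b, t \in T$, $t \rightarrow_W a$, $t \rightarrow_W b$, and $a \#_W b$, then $\bullet a \cap \bullet b \cap t\bullet \neq \emptyset$.

Proof. See [6]. $\Box$

The relations $\rightarrow_W$, $\rightarrow_W^{-1}$, $\#_W$, and $\parallel_W$ are mutually exclusive. Therefore, we can derive that for sound SWF-nets with no short loops, $a \parallel_W b$ implies $a\bullet \cap b\bullet = \bullet a \cap \bullet b = \emptyset$. Moreover, $a \rightarrow_W t$, $b \rightarrow_W t$, and $a\bullet \cap b\bullet \cap \bullet t = \emptyset$ implies $a \parallel_W b$. Similarly, $t \rightarrow_W a$, $t \rightarrow_W b$, and $\bullet a \cap \bullet b \cap t\bullet = \emptyset$ also implies $a \parallel_W b$. These results will be used to underpin the mining algorithm presented in the following section.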

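4.3 Mining Algorithm

Based on the results in the previous sections, we now present an algorithm for mining processes. The algorithm uses the fact that for many WF-nets, two tasks are connected iff their causality can be detected by inspecting the log.

Definition 4.3 (Mining algorithm $\alpha$). Let $W$ be a workflow log over $T$. $\alpha(W)$ is defined as follows:

1. $T_W = \{t \in T \mid \exists_{\sigma \in W}\; t \in \sigma\}$,

2. $T_I = \{t \in T \mid \exists_{\sigma \in W}\; t = \mathit{first}(\sigma)\}$,

3. $T_O = \{t \in T \mid \exists_{\sigma \in W}\; t = \mathit{last}(\sigma)\}$,

4. $X_W = \{(A, B) \mid A \subseteq T_W \wedge B \subseteq T_W \wedge \forall_{a \in A} \forall_{b \in B}\; a \rightarrow_W b \wedge \forall_{a_1, a_2 \in A}\; a_1 \#_W a_2 \wedge \forall_{b_1, b_2 \in B}\; b_1 \#_W b_2\}$,

5. $Y_W = \{(A, B) \in X_W \mid \forall_{(A', B') \in X_W}\; A \subseteq A' \wedge B \subseteq B' \Rightarrow (A, B) = (A', B')\}$,

6. $P_W = \{p_{(A,B)} \mid (A, B) \in Y_W\} \cup \{i_W, o_W\}$,

7. $F_W = \{(a, p_{(A,B)}) \mid (A, B) \in Y_W \wedge a \in A\} \cup \{(p_{(A,B)}, b) \mid (A, B) \in Y_W \wedge b \in B\} \cup \{(i_W, t) \mid t \in T_I\} \cup \{(t, o_W) \mid t \in T_O\}$, and

8. $\alpha(W) = (P_W, T_W, F_W)$.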

The mining algorithm constructs a net $(P_W, T_W, F_W)$. Clearly, the set of transitions $T_W$ can be derived by inspecting the log. In fact, as shown in Property 3.2, if there are no traces of length one, $T_W$ can be derived from $>_W$. Since it is possible to find all initial transitions $T_I$ and all final transitions $T_O$, it is easy to construct the connections between these transitions and $i_W$ and $o_W$. Besides the source place $i_W$ and the sink place $o_W$, places of the form $p_{(A,B)}$ are added. For such a place, the subscript refers to the set of input and output transitions, i.e., $\bullet p_{(A,B)} = A$ and $p_{(A,B)}\bullet = B$. A place is added in-between $a$ and $b$ iff $a \rightarrow_W b$. However, some of these places need to be merged in case of OR-splits/joins rather than AND-splits/joins. For this purpose, the relations $X_W$ and $Y_W$ are constructed. $(A, B) \in X_W$ if there is a causal relation from each member of $A$ to each member of $B$ and the members of $A$ (and, likewise, the members of $B$) never occur next to one another. Note that, if $a \rightarrow_W b$, $b \rightarrow_W a$, or $a \parallel_W b$, then $a$ and $b$ cannot both be in $A$ (or $B$). Relation $Y_W$ is derived from $X_W$ by taking only the largest elements with respect to set inclusion. (See the end of this section for an example.)
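The eight steps of Definition 4.3 translate almost line by line into code. The following is a minimal sketch rather than a reference implementation: traces are encoded as non-empty strings, places $p_{(A,B)}$ are encoded as tuples `("p", A, B)`, and only non-empty sets $A$ and $B$ are enumerated (an assumption that matches the worked example in this section). The names `alpha` and `nonempty_subsets` are illustrative. The enumeration is exponential in the number of tasks.

```python
from itertools import chain, combinations

def nonempty_subsets(items):
    """All non-empty subsets of items, as tuples."""
    items = sorted(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(1, len(items) + 1))

def alpha(log):
    """Return (P_W, T_W, F_W) following steps 1-8 of Definition 4.3."""
    T_W = {t for trace in log for t in trace}               # step 1
    T_I = {trace[0] for trace in log}                       # step 2
    T_O = {trace[-1] for trace in log}                      # step 3
    succ = {(x, y) for trace in log for x, y in zip(trace, trace[1:])}

    def causal(a, b):     # a -> b
        return (a, b) in succ and (b, a) not in succ

    def unrelated(a, b):  # a # b
        return (a, b) not in succ and (b, a) not in succ

    # Sets whose members are pairwise (and self-) unrelated
    candidates = [frozenset(s) for s in nonempty_subsets(T_W)
                  if all(unrelated(x, y) for x in s for y in s)]
    X_W = {(A, B) for A in candidates for B in candidates
           if all(causal(a, b) for a in A for b in B)}      # step 4
    Y_W = {(A, B) for (A, B) in X_W
           if not any((A, B) != (A2, B2) and A <= A2 and B <= B2
                      for (A2, B2) in X_W)}                 # step 5
    P_W = {("p", A, B) for (A, B) in Y_W} | {"i_W", "o_W"}  # step 6
    F_W = ({(a, ("p", A, B)) for (A, B) in Y_W for a in A}
           | {(("p", A, B), b) for (A, B) in Y_W for b in B}
           | {("i_W", t) for t in T_I}
           | {(t, "o_W") for t in T_O})                     # step 7
    return P_W, T_W, F_W                                    # step 8

P_W, T_W, F_W = alpha({"ABCD", "ACBD", "AED"})
for place in sorted(p for p in P_W if isinstance(p, tuple)):
    print(sorted(place[1]), "->", sorted(place[2]))
```

Because a task that directly follows itself is not unrelated to itself, it can never be a member of any $A$ or $B$ in this sketch, so such a task ends up unconnected in the mined net.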

Based on $\alpha$ defined in Definition 4.3, we turn to the rediscovery problem. Is it possible to rediscover WF-nets using $\alpha(W)$? Consider the five SWF-nets shown in Fig. 4. If $\alpha$ is applied to a complete workflow log of $N_1$, the resulting net is $N_1$ modulo renaming of places. Similarly, if $\alpha$ is applied to a complete workflow log of $N_2$, the resulting net is $N_2$ modulo renaming of places. As expected, $\alpha$ is not able to rediscover $N_3$ and $N_4$ (see Fig. 5). $\alpha(W_3)$ is like $N_3$, but without the arcs connecting $B$ to the place in-between $A$ and $D$ and two new places. $\alpha(W_4)$ is like $N_4$, but the input and output arc of $C$ are removed. $\alpha(W_3)$ is not a WF-net since $B$ is not connected to the rest of the net. $\alpha(W_4)$ is not a WF-net since $C$ is not connected to the rest of the net. In both cases, two arcs are missing in the resulting net. $N_3$ and $N_4$ illustrate that the mining algorithm is unable to deal with short loops. Loops of length three or longer are no problem. For example, $\alpha(W_5) = N_5$ modulo renaming of places. The following theorem proves that $\alpha$ is able to rediscover the class of SWF-nets provided that there are no short loops.

Theorem 4.4. Let $N = (P, T, F)$ be a sound SWF-net and let $W$ be a complete workflow log of $N$. If for all $a, b \in T$: $a\bullet \cap \bullet b = \emptyset$ or $b\bullet \cap \bullet a = \emptyset$, then $\alpha(W) = N$ modulo renaming of places.

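Proof. Let $\alpha(W) = (P_W, T_W, F_W)$. Since $W$ is complete, it is easy to see that $T = T_W$. It remains to be proven that every place in $N$ corresponds to a place in $\alpha(W)$ and vice versa. In the remainder, a superscript on $\bullet$ indicates the net in which the preset or postset is taken.

Let $p \in P$. We need to prove that there is a $p_W \in P_W$ such that $\overset{N}{\bullet}p = \overset{\alpha(W)}{\bullet}p_W$ and $p\overset{N}{\bullet} = p_W\overset{\alpha(W)}{\bullet}$. If $p = i$, i.e., the source place, or $p = o$, i.e., the sink place, then it is easy to see that there is a corresponding place in $\alpha(W)$. Transitions in $i\overset{N}{\bullet} \cup \overset{N}{\bullet}o$ can fire only once, directly at the beginning of a sequence or at the end. Therefore, the construction given in Definition 4.3 involving $i_W$, $o_W$, $T_I$, and $T_O$ yields a source and sink place with identical input/output transitions. If $p \notin \{i, o\}$, then let $A = \overset{N}{\bullet}p$, $B = p\overset{N}{\bullet}$, and $p_W = p_{(A,B)}$. If $p_W$ is indeed a place of $\alpha(W)$, then $\overset{N}{\bullet}p = \overset{\alpha(W)}{\bullet}p_W$ and $p\overset{N}{\bullet} = p_W\overset{\alpha(W)}{\bullet}$. This follows directly from the definition of the flow relation $F_W$ in Definition 4.3. To prove that $p_W = p_{(A,B)}$ is a place of $\alpha(W)$, we need to show that $(A, B) \in Y_W$. $(A, B) \in X_W$ because

1. Theorem 4.3 implies that $\forall_{a \in A} \forall_{b \in B}\; a \rightarrow_W b$,
2. Theorem 4.4, item 1 implies that $\forall_{a_1, a_2 \in A}\; a_1 \#_W a_2$, and
3. Theorem 4.4, item 2 implies that $\forall_{b_1, b_2 \in B}\; b_1 \#_W b_2$.

To prove that $(A, B) \in Y_W$, we need to show that it is not possible to have $(A', B') \in X_W$ such that $A \subseteq A'$, $B \subseteq B'$, and $(A, B) \neq (A', B')$ (i.e., $A \subset A'$ or $B \subset B'$). Suppose that $A \subset A'$. There is an $a' \in T \setminus A$ such that $\forall_{b \in B}\; a' \rightarrow_W b$ and $\forall_{a \in A}\; a \#_W a'$. Theorem 4.4, item 3 implies that $a\overset{N}{\bullet} \cap a'\overset{N}{\bullet} \cap \overset{N}{\bullet}b \neq \emptyset$ for some $b \in B$. Let $p' \in a\overset{N}{\bullet} \cap a'\overset{N}{\bullet} \cap \overset{N}{\bullet}b$. Property 4.1 implies $p' = p$. However, $a' \notin A = \overset{N}{\bullet}p$ and $a' \in \overset{N}{\bullet}p'$, and we find a contradiction ($p' = p$ and $p' \neq p$). Suppose that $B \subset B'$. There is a $b' \in T \setminus B$ such that $\forall_{a \in A}\; a \rightarrow_W b'$ and $\forall_{b \in B}\; b \#_W b'$. Using Theorem 4.4, item 4 and Property 4.1, we can show that this leads to a contradiction. Therefore, $(A, B) \in Y_W$ and $p_W \in P_W$.

Let $p_W \in P_W$. We need to prove that there is a $p \in P$ such that $\overset{N}{\bullet}p = \overset{\alpha(W)}{\bullet}p_W$ and $p\overset{N}{\bullet} = p_W\overset{\alpha(W)}{\bullet}$. If $p_W = i_W$ or $p_W = o_W$, then $p_W$ corresponds to $i$ or $o$, respectively. This is a direct consequence of the construction given in Definition 4.3 involving $i_W$, $o_W$, $T_I$, and $T_O$. If $p_W \notin \{i_W, o_W\}$, then there are sets $A$ and $B$ such that $(A, B) \in Y_W$ and $p_W = p_{(A,B)}$, with $\overset{\alpha(W)}{\bullet}p_W = A$ and $p_W\overset{\alpha(W)}{\bullet} = B$. It remains to be proven that there is a $p \in P$ such that $\overset{N}{\bullet}p = A$ and $p\overset{N}{\bullet} = B$. Since $(A, B) \in Y_W$ implies that $(A, B) \in X_W$, for any $a \in A$ and $b \in B$ there is a place connecting $a$ and $b$ (use $a \rightarrow_W b$ and Theorem 4.1). Using Theorem 4.4, we can prove that there is just one such place. Let $p$ be this place. Clearly, $\overset{N}{\bullet}p \supseteq A$ and $p\overset{N}{\bullet} \supseteq B$. It remains to be proven that $\overset{N}{\bullet}p = A$ and $p\overset{N}{\bullet} = B$. Suppose that $a' \in \overset{N}{\bullet}p \setminus A$ (i.e., $\overset{N}{\bullet}p \neq A$). Select an arbitrary $a \in A$ and $b \in B$. Using Theorem 4.3, we can show that $a' \rightarrow_W b$. Using Theorem 4.4, item 1, we can show that $a \#_W a'$. This holds for any $a \in A$ and $b \in B$. Therefore, $(A \cup \{a'\}, B) \in X_W$. However, this is not possible since $(A, B) \in Y_W$ ($(A, B)$ should be maximal). Therefore, we find a contradiction. We find a similar contradiction if we assume that there is a $b' \in p\overset{N}{\bullet} \setminus B$. Therefore, we conclude that $\overset{N}{\bullet}p = A$ and $p\overset{N}{\bullet} = B$. $\Box$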

Nets $N_1$, $N_2$, and $N_5$ shown in Fig. 4 satisfy the requirements stated in Theorem 4.4. Therefore, it is no surprise that $\alpha$ is able to rediscover these nets. The net shown in Fig. 1 is also an SWF-net with no short loops. Therefore, we can successfully rediscover the net if the AND-split and the AND-join are visible in the log. The latter assumption is not realistic if these two transitions do not correspond to real work. The fact that the log shown in Table 1 does not list the occurrence of these events indicates that this assumption is not valid. Therefore, the AND-split and the AND-join should be considered invisible. However, if we apply $\alpha$ to this log

$$W = \{ABCD, ACBD, AED\},$$

then the result is quite surprising. The resulting net $\alpha(W)$ is shown in Fig. 6.


Fig. 5. The $\alpha$ algorithm is unable to rediscover $N_3$ and $N_4$.



To illustrate the $\alpha$ algorithm, we show the result of each step using the log $W = \{ABCD, ACBD, AED\}$ (i.e., a log like the one shown in Table 1):

1. $T_W = \{A, B, C, D, E\}$,

2. $T_I = \{A\}$,

3. $T_O = \{D\}$,

4. $X_W = \{(\{A\}, \{B\}), (\{A\}, \{C\}), (\{A\}, \{E\}), (\{B\}, \{D\}), (\{C\}, \{D\}), (\{E\}, \{D\}), (\{A\}, \{B, E\}), (\{A\}, \{C, E\}), (\{B, E\}, \{D\}), (\{C, E\}, \{D\})\}$,

5. $Y_W = \{(\{A\}, \{B, E\}), (\{A\}, \{C, E\}), (\{B, E\}, \{D\}), (\{C, E\}, \{D\})\}$,

6. $P_W = \{i_W, o_W, p_{(\{A\},\{B,E\})}, p_{(\{A\},\{C,E\})}, p_{(\{B,E\},\{D\})}, p_{(\{C,E\},\{D\})}\}$,

7. $F_W = \{(i_W, A), (A, p_{(\{A\},\{B,E\})}), (p_{(\{A\},\{B,E\})}, B), \ldots, (D, o_W)\}$, and

8. $\alpha(W) = (P_W, T_W, F_W)$ (as shown in Fig. 6).

Although the resulting net is not an SWF-net, it is a sound WF-net whose observable behavior is identical to the net shown in Fig. 1. Also note that the WF-net shown in Fig. 6 can be rediscovered, although it is not an SWF-net. This example shows that the applicability is not limited to SWF-nets. However, for arbitrary sound WF-nets, it is not possible to guarantee that they can be rediscovered.

4.4 Limitations of the Algorithm

As demonstrated through Theorem 4.4, the $\alpha$ algorithm is able to rediscover a large class of processes. However, we did not prove that the class of processes is maximal, i.e., that there is not a better algorithm able to rediscover even more processes. Therefore, we reflect on the requirements stated in Definition 4.2 (SWF-nets) and Theorem 4.4 (no short loops).

Let us first consider the requirements stated in Definition 4.2. To illustrate the necessity of the first two requirements, consider Figs. 7 and 8. The WF-net $N_6$ shown in Fig. 7 is sound, but not an SWF-net since the first requirement is violated ($N_6$ is not free-choice). If we apply the mining algorithm to a complete workflow log $W_6$ of $N_6$, we obtain the WF-net $N_7$ also shown in Fig. 7 (i.e., $\alpha(W_6) = N_7$). Clearly, $N_6$ cannot be rediscovered using $\alpha$. Although $N_7$ is a sound SWF-net, its behavior is different from $N_6$, e.g., workflow trace $ACE$ is possible in $N_7$ but not in $N_6$. This example motivates the first requirement in Definition 4.2. The second requirement is motivated by Fig. 8. $N_8$ violates the second requirement. If we apply the mining algorithm to a complete workflow log $W_8$ of $N_8$, we obtain the WF-net $\alpha(W_8) = N_9$ also shown in Fig. 8. Although $N_9$ is behaviorally equivalent, $N_8$ cannot be rediscovered using the mining algorithm.

Fig. 6. Another process model corresponding to the workflow log shown in Table 1.

Fig. 7. The nonfree-choice WF-net $N_6$ cannot be rediscovered by the $\alpha$ algorithm.

Fig. 8. WF-net $N_8$ cannot be rediscovered by the $\alpha$ algorithm. Nevertheless, $\alpha$ returns a WF-net which is behaviorally equivalent.

Although the requirements stated in Definition 4.2 are necessary in order to prove that this class of workflow processes can be rediscovered on the basis of a complete workflow log, the applicability is not limited to SWF-nets. The examples given in this section show that in many situations, a behaviorally equivalent WF-net can be derived. Note that the third requirement stated in Definition 4.2 (no implicit places) can be dropped, thus allowing even more behaviorally equivalent WF-nets. Even in the cases where the resulting WF-net is not behaviorally equivalent, the results are meaningful, e.g., the process represented by $N_7$ is different from the process represented by $N_6$ (cf. Fig. 7). Nevertheless, $N_7$ is similar and captures most of the behavior.

Another requirement imposed by Theorem 4.4 is the absence of short loops. We already showed examples motivating this requirement. However, it is clear that the $\alpha$ algorithm can be improved to deal with short loops. Unfortunately, this is less trivial than it may seem. We use Fig. 9 and Fig. 10 to illustrate this. Fig. 9 shows two WF-nets. Although both WF-nets have clearly different behaviors, their complete logs will have identical $>$ relations. Note that the two WF-nets are not SWF-nets because of the nonfree-choice construct involving $E$. However, the essence of the problem is the short loop involving $C$. Because of this short loop, the $\alpha$ algorithm will create for both $N_{10}$ and $N_{11}$ the Petri net where $C$ is unconnected to the rest of the net. Clearly, this can be improved since for complete logs of $N_{10}$ and $N_{11}$ the following relations hold: $B > C$, $C > B$, $C > D$, $D > C$, $B \parallel C$, $C \parallel D$, and $B \parallel D$. These relations suggest that $B$, $C$, and $D$ are in parallel. However, in any complete log, it can be seen that this is not the case since $A$ is never followed by $C$. Despite this information, no algorithm will be able to distinguish $N_{10}$ and $N_{11}$ because they are identical with respect to $>$. Fig. 10 gives another example demonstrating that dealing with short loops is far from trivial. $N_{12}$ and $N_{13}$ are SWF-nets that are behaviorally equivalent; therefore, no algorithm will be able to distinguish $N_{12}$ from $N_{13}$ (assuming the notion of completeness and not logging explicit start and end events). This problem may seem similar to Fig. 8 or examples with and without implicit places. However, $N_{12}$ and $N_{13}$ are SWF-nets, while $N_8$ or examples with implicit places are just WF-nets not satisfying the requirements stated in Definition 4.2.

Despite the problems illustrated by Fig. 9 and Fig. 10, it is possible to solve the problem using a slightly stronger notion of completeness. Assume that it is possible to identify 1-loops and 2-loops. This can be done by looking for patterns $AA$ (1-loop) and $ABA$ (2-loop). Then, refine relation $>$ into a new relation which splits the existing transitions involved in a short loop into two or three transitions. Tasks involved in a 1-loop are mapped onto three transitions (e.g., start, execute, and end). Tasks involved in a 2-loop, but not a 1-loop, are mapped onto two transitions (e.g., start and end). The goal of this translation is to make the short loops longer by using additional information about 1-loops and 2-loops (while preserving the original properties). This refined relation can be used as input for the $\alpha$ algorithm. This approach is able to deal with all possible situations except the one illustrated in Fig. 10. Note that no algorithm will be able to distinguish the logs of the two processes shown in Fig. 10. However, this is not a real problem since they are behaviorally equivalent. In formal terms, we can replace the requirement in Theorem 4.4 by the weaker requirement that there are no two transitions having identical input and output places. A detailed description of this preprocessing algorithm and a proof of its correctness are, however, beyond the scope of this paper.
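The first step of this preprocessing idea, spotting the patterns $AA$ and $ABA$ in the traces, is straightforward to sketch. The code below only identifies candidate 1-loops and 2-loops; it does not perform the subsequent splitting of tasks into start/execute/end transitions, whose details are not given here. The function name `short_loop_patterns` and the string encoding of traces are assumptions for illustration.

```python
def short_loop_patterns(log):
    """Find tasks matching the 1-loop pattern AA and pairs matching the 2-loop pattern ABA."""
    one_loops = set()   # tasks x with ...xx... in some trace
    two_loops = set()   # pairs (x, y) with ...xyx... in some trace, x != y
    for trace in log:
        for x, y in zip(trace, trace[1:]):
            if x == y:
                one_loops.add(x)
        for x, y, z in zip(trace, trace[1:], trace[2:]):
            if x == z and x != y:
                two_loops.add((x, y))
    return one_loops, two_loops

ones, twos = short_loop_patterns({"ABBD", "ABCBD"})
print(ones, twos)
```

A single pass over each trace suffices, so this detection step stays linear in the size of the log.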

Besides the explicit requirements stated in Definition 4.2 and Theorem 4.4, there are also a number of implicit requirements. As indicated before, hidden tasks cannot be detected (cf. the AND-split and AND-join in Fig. 1). Moreover, our definition of a Petri net assumes that each transition bears a unique label. Instead, we could have used a labeled Petri net [39]. The latter choice would have been more realistic but would also complicate matters enormously. The current definition of a WF-net assumes that each task appears only once in the network. Without this assumption, every occurrence of some task $t$ could refer to one of multiple indistinguishable transitions with label $t$. Preliminary investigations show that the problems of hidden tasks and duplicate tasks are highly related, e.g., a WF-net with two transitions having the same label $t$ has a corresponding WF-net with hidden transitions but only one transition labeled $t$. Moreover, there are relations between the explicit requirements stated in Definition 4.2 and Theorem 4.4 and the implicit requirements just mentioned, e.g., in some cases, a nonfree-choice WF-net can be made free choice by inserting hidden transitions. These findings indicate that the class of SWF-nets is close to the upper bound of workflow processes that can be mined successfully given the notion of completeness stated in Definition 3.4. To move beyond SWF-nets, we either have to resort to heuristics or strengthen the notion of completeness (and thus require more observations).

Fig. 9. Although both WF-nets are not behaviorally equivalent, they are identical with respect to $>$.

Fig. 10. Both SWF-nets are behaviorally equivalent and, therefore, any algorithm will be unable to distinguish $N_{12}$ from $N_{13}$ (assuming a notion of completeness based on $>$).

To conclude this section, we consider the complexity of the $\alpha$ algorithm. Event logs may be huge, containing millions of events. Fortunately, the $\alpha$ algorithm is driven by relation $>$. The time it takes to build this relation is linear in the size of the log. The complexity of the remaining steps in the $\alpha$ algorithm is exponential in the number of tasks. However, note that the number of tasks is typically less than 100 and does not depend on the size of the log. Therefore, the complexity is not a bottleneck for large-scale application.

5 RELATED WORK

The idea of process mining is not new [7], [9], [10], [11], [19], [20], [21], [22], [23], [24], [31], [32], [33], [41], [42], [43], [44], [45], [48], [49], [50]. Cook and Wolf have investigated similar issues in the context of software engineering processes. In [9], they describe three methods for process discovery: one using neural networks, one using a purely algorithmic approach, and one Markovian approach. The authors consider the latter two the most promising approaches. The purely algorithmic approach builds a finite state machine where states are fused if their futures (in terms of possible behavior in the next $k$ steps) are identical. The Markovian approach uses a mixture of algorithmic and statistical methods and is able to deal with noise. Note that the results presented in [9] are limited to sequential behavior. Related, but in a different domain, is the work presented in [29], [30], also using a Markovian approach restricted to sequential processes. Cook and Wolf extend their work to concurrent processes in [10]. They propose specific metrics (entropy, event type counts, periodicity, and causality) and use these metrics to discover models out of event streams. However, they do not provide an approach to generate explicit process models. Recall that the final goal of the approach presented in this paper is to find explicit representations for a broad range of process models, i.e., we want to be able to generate a concrete Petri net rather than a set of dependency relations between events. In [11], Cook and Wolf provide a measure to quantify discrepancies between a process model and the actual behavior as registered using event-based data.

The idea of applying process mining in the context of workflow management was first introduced in [7]. This work is based on workflow graphs, which are inspired by workflow products such as IBM MQSeries workflow (formerly known as Flowmark) and InConcert. In [7], two problems are defined. The first problem is to find a workflow graph generating events appearing in a given workflow log. The second problem is to find the definitions of edge conditions. A concrete algorithm is given for tackling the first problem. The approach is quite different from other approaches: Because of the nature of workflow graphs, there is no need to identify the nature (AND or OR) of joins and splits. As shown in [27], workflow graphs use true and false tokens which do not allow for cyclic graphs. Nevertheless, [7] partially deals with iteration by enumerating all occurrences of a given task and then folding the graph. However, the resulting conformal graph is not a complete model. In [33], a tool based on these algorithms is presented. Schimm [41], [42], [45] has developed a mining tool suitable for discovering hierarchically structured workflow processes. This requires all splits and joins to be balanced.

Herbst also addresses the issue of process mining in the context of workflow management [21], [19], [20], [23], [24], [22] using an inductive approach. The work presented in [22], [24] is limited to sequential models. The approach described in [21], [19], [20], [23] also allows for concurrency. It uses stochastic task graphs as an intermediate representation and it generates a workflow model described in the ADONIS modeling language. In the induction step, task nodes are merged and split in order to discover the underlying process. A notable difference with other approaches is that the same task can appear multiple times in the workflow model, i.e., the approach allows for duplicate tasks. The graph generation technique is similar to the approach of [7], [33]. The nature of splits and joins (i.e., AND or OR) is discovered in the transformation step, where the stochastic task graph is transformed into an ADONIS workflow model with block-structured splits and joins.

In contrast to the previous papers, our work [31], [32], [48], [49], [50] is characterized by the focus on workflow processes with concurrent behavior (rather than adding ad hoc mechanisms to capture parallelism). In [48], [49], [50], a heuristic approach using rather simple metrics is used to construct so-called dependency/frequency tables and dependency/frequency graphs. In [31], another variant of this technique is presented using examples from the health-care domain. The preliminary results presented in [31], [48], [49], [50] only provide heuristics and focus on issues such as noise. The approach described in this paper differs from these approaches in the sense that for the $\alpha$ algorithm it is proven that for certain subclasses, it is possible to find the right workflow model. In [4], the EMiT tool is presented, which uses an extended version of the $\alpha$ algorithm to incorporate timing information. Note that [4] contains neither a detailed description of the $\alpha$ algorithm nor a proof of its correctness.

Process mining can be seen as a tool in the context of Business (Process) Intelligence (BPI). In [18], [40], a BPI toolset on top of HP's Process Manager is described. The BPI toolset includes a so-called "BPI Process Mining Engine." However, this engine does not provide any techniques as discussed before. Instead, it uses generic mining tools such as SAS Enterprise Miner for the generation of decision trees relating attributes of cases to information about execution paths (e.g., duration). In order to do workflow mining, it is convenient to have a so-called "process data warehouse" to store audit trails. Such a data warehouse simplifies and speeds up the queries needed to derive causal relations. In [13], [34], [35], [36], the design of such a warehouse and related issues are discussed in the context of workflow logs. Moreover, [36] describes the PISA tool which can be used to extract performance metrics from workflow logs. Similar diagnostics are provided by the ARIS Process Performance Manager (PPM) [25]. The latter tool is commercially available, and a customized version of PPM is the Staffware Process Monitor (SPM) [46], which is tailored toward mining Staffware logs. Note that the goal of the latter tools is not to extract the process model. The main focus is on clustering and performance analysis rather than causal relations as in [7], [9], [10], [11], [19], [20], [21], [22], [23], [24], [31], [32], [33], [41], [42], [43], [44], [45], [48], [49], [50].

More from a theoretical point of view, the rediscovery problem discussed in this paper is related to the work discussed in [8], [16], [17], [38]. In these papers, the limits of inductive inference are explored. For example, in [17], it is shown that the computational problem of finding a minimum finite-state acceptor compatible with given data is NP-hard. Several of the more generic concepts discussed in these papers could be translated to the domain of process mining. It is possible to interpret the problem described in this paper as an inductive inference problem specified in terms of rules, a hypothesis space, examples, and criteria for successful inference. The comparison with literature in this domain raises interesting questions for process mining, e.g., how to deal with negative examples (i.e., suppose that besides log $W$, there is a log $V$ of traces that are not possible, e.g., added by a domain expert). However, despite the many relations with the work described in [8], [16], [17], [38], there are also many differences, e.g., we are mining at the net level rather than sequential or lower-level representations (e.g., Markov chains, finite state machines, or regular expressions).

Additional related work is the seminal work on regions [14]. This work investigates which transition systems can be represented by (compact) Petri nets (i.e., the so-called synthesis problem). Although the setting is different and our notion of completeness is much weaker than actually knowing the transition system, there are related problems such as duplicate transitions, etc.

6 CONCLUSION

In this paper, we addressed the workflow rediscovery problem. This problem was formulated as follows: "Find a mining algorithm able to rediscover a large class of sound WF-nets on the basis of complete workflow logs." We presented the $\alpha$ algorithm that is able to rediscover a large and relevant class of workflow processes (SWF-nets). Through examples, we also showed that the algorithm provides interesting analysis results for workflow processes outside this class. At this point in time, we are improving the mining algorithm such that it is able to rediscover an even larger class of WF-nets. We have tackled the problem of short loops and are now focusing on hidden tasks, duplicate tasks, and advanced routing constructs. However, given the observation that the class of SWF-nets is close to the upper limit of what one can do assuming this notion of completeness, new results will either provide heuristics or require stronger notions of completeness (i.e., more observations).

It is important to see the results presented in this paper in the context of a larger effort [31], [32], [48], [49], [50]. The rediscovery problem is not a goal by itself. The overall goal is to be able to analyze any workflow log without any knowledge of the underlying process and in the presence of noise. The theoretical results presented in this paper provide insights that are consistent with empirical results found earlier [31], [32], [48], [49], [50]. It is quite interesting to see that the challenges encountered in practice match the challenges encountered in theory. For example, the fact that workflow processes exhibiting nonfree-choice behavior (i.e., violating the first requirement of Definition 4.2) are difficult to mine was observed both in theory and in practice. Therefore, we consider the work presented in this paper as a stepping stone for good and robust process mining techniques.

We have applied our workflow mining techniques to two real applications. The first application is in health care, where the flow of multidisciplinary patients is analyzed. We have analyzed workflow logs (visits to different specialists) of patients with peripheral arterial vascular diseases of the Elizabeth Hospital in Tilburg and the Academic Hospital in Maastricht. Patients with peripheral arterial vascular diseases are a typical example of multidisciplinary patients. We have preliminary results showing that process mining is very difficult given the "spaghetti-like" nature of this process. Only by focusing on specific tasks and abstracting from infrequent tasks are we able to successfully mine such processes. The second application concerns the processing of fines by the CJIB (Centraal Justitieel Incasso Bureau), the Dutch Judicial Collection Agency located in Leeuwarden. We have successfully mined the process using information from 130,136 cases. The process comprises 99 tasks and has been validated by the CJIB. This positive result shows that process mining based on the $\alpha$ algorithm and using tools like EMiT and Little Thumb is feasible for at least structured processes. These findings are encouraging and show the potential of the $\alpha$ algorithm presented in this paper.

ACKNOWLEDGMENTS

The authors would like to thank Ana Karla Alves de Medeiros and Eric Verbeek for proofreading earlier versions of this paper and Boudewijn van Dongen for his efforts in developing EMiT and solving the problem of short loops.

REFERENCES

[1] W.M.P. van der Aalst, "Verification of Workflow Nets," Application and Theory of Petri Nets, P. Azema and G. Balbo, eds., pp. 407-426, Berlin: Springer-Verlag, 1997.

[2] W.M.P. van der Aalst, "The Application of Petri Nets to Workflow Management," The J. Circuits, Systems and Computers, vol. 8, no. 1, pp. 21-66, 1998.

[3] Business Process Management: Models, Techniques, and Empirical Studies, Lecture Notes in Computer Science, W.M.P. van der Aalst, J. Desel, and A. Oberweis, eds., vol. 1806, Berlin: Springer-Verlag, 2000.

[4] W.M.P. van der Aalst and B.F. van Dongen, "Discovering Workflow Performance Models from Timed Logs," Proc. Int'l Conf. Eng. and Deployment of Cooperative Information Systems (EDCIS 2002), Y. Han, S. Tai, and D. Wikarski, eds., vol. 2480, pp. 45-63, 2002.

[5] W.M.P. van der Aalst and K.M. van Hee, Workflow Management: Models, Methods, and Systems. Cambridge, Mass.: MIT Press, 2002.

[6] W.M.P. van der Aalst, A.J.M.M. Weijters, and L. Maruster, "Workflow Mining: Which Processes can be Rediscovered?" BETA Working Paper Series, WP 74, Eindhoven Univ. of Technology, Eindhoven, 2002.

[7] R. Agrawal, D. Gunopulos, and F. Leymann, "Mining Process Models from Workflow Logs," Proc. Sixth Int'l Conf. Extending Database Technology, pp. 469-483, 1998.




[8] D. Angluin and C.H. Smith, "Inductive Inference: Theory and Methods," Computing Surveys, vol. 15, no. 3, pp. 237-269, 1983.

[9] J.E. Cook and A.L. Wolf, "Discovering Models of Software Processes from Event-Based Data," ACM Trans. Software Eng. and Methodology, vol. 7, no. 3, pp. 215-249, 1998.

[10] J.E. Cook and A.L. Wolf, "Event-Based Detection of Concurrency," Proc. Sixth Int'l Symp. the Foundations of Software Eng. (FSE-6), pp. 35-45, 1998.

[11] J.E. Cook and A.L. Wolf, "Software Process Validation: Quantitatively Measuring the Correspondence of a Process to a Model," ACM Trans. Software Eng. and Methodology, vol. 8, no. 2, pp. 147-176, 1999.

[12] J. Desel and J. Esparza, Free Choice Petri Nets, Cambridge Tracts in Theoretical Computer Science, vol. 40, Cambridge, UK: Cambridge Univ. Press, 1995.

[13] J. Eder, G.E. Olivotto, and W. Gruber, "A Data Warehouse for Workflow Logs," Proc. Int'l Conf. Eng. and Deployment of Cooperative Information Systems (EDCIS 2002), Y. Han, S. Tai, and D. Wikarski, eds., pp. 1-15, 2002.

[14] A. Ehrenfeucht and G. Rozenberg, "Partial (Set) 2-Structures - Part 1 and Part 2," Acta Informatica, vol. 27, no. 4, pp. 315-368, 1989.

[15] Workflow Handbook 2001, Workflow Management Coalition, L. Fischer, ed. Lighthouse Point, Fla.: Future Strategies, 2001.

[16] E.M. Gold, "Language Identification in the Limit," Information and Control, vol. 10, no. 5, pp. 447-474, 1967.

[17] E.M. Gold, "Complexity of Automaton Identification from Given Data," Information and Control, vol. 37, no. 3, pp. 302-320, 1978.

[18] D. Grigori, F. Casati, U. Dayal, and M.C. Shan, "Improving Business Process Quality through Exception Understanding, Prediction, and Prevention," Proc. 27th Int'l Conf. Very Large Data Bases (VLDB '01), P. Apers, P. Atzeni, S. Ceri, S. Paraboschi, K. Ramamohanarao, and R. Snodgrass, eds., pp. 159-168, 2001.

[19] J. Herbst, "A Machine Learning Approach to Workflow Management," Proc. 11th European Conf. Machine Learning, pp. 183-194, 2000.

[20] J. Herbst, "Dealing with Concurrency in Workflow Induction," Proc. European Concurrent Eng. Conf., U. Baake, R. Zobel, and M. Al-Akaidi, eds., 2000.

[21] J. Herbst, "Ein induktiver Ansatz zur Akquisition und Adaption von Workflow-Modellen," PhD thesis, Universität Ulm, Nov. 2001.

[22] J. Herbst and D. Karagiannis, "Integrating Machine Learning and Workflow Management to Support Acquisition and Adaptation of Workflow Models," Proc. Ninth Int'l Workshop Database and Expert Systems Applications, pp. 745-752, 1998.

[23] J. Herbst and D. Karagiannis, "An Inductive Approach to the Acquisition and Adaptation of Workflow Models," Proc. Workshop Intelligent Workflow and Process Management: The New Frontier for AI in Business, M. Ibrahim and B. Drabble, eds., pp. 52-57, Aug. 1999.

[24] J. Herbst and D. Karagiannis, "Integrating Machine Learning and Workflow Management to Support Acquisition and Adaptation of Workflow Models," Int'l J. Intelligent Systems in Accounting, Finance, and Management, vol. 9, pp. 67-92, 2000.

[25] IDS Scheer, "ARIS Process Performance Manager (ARIS PPM)," http://www.ids-scheer.com, 2002.

[26] S. Jablonski and C. Bussler, Workflow Management: Modeling Concepts, Architecture, and Implementation. London: Int'l Thomson Computer Press, 1996.

[27] B. Kiepuszewski, "Expressiveness and Suitability of Languages for Control Flow Modelling in Workflows," PhD thesis, Queensland Univ. of Technology, Brisbane, Australia, 2002, available via http://www.tm.tue.nl/it/research/patterns.

[28] F. Leymann and D. Roller, Production Workflow: Concepts and Techniques. Upper Saddle River, N.J.: Prentice-Hall PTR, 1999.

[29] H. Mannila and D. Rusakov, "Decomposing Event Sequences into Independent Components," Proc. First SIAM Conf. Data Mining, V. Kumar and R. Grossman, eds., pp. 1-17, 2001.

[30] H. Mannila, H. Toivonen, and A.I. Verkamo, "Discovery of Frequent Episodes in Event Sequences," Data Mining and Knowledge Discovery, vol. 1, no. 3, pp. 259-289, 1997.

[31] L. Maruster, W.M.P. van der Aalst, A.J.M.M. Weijters, A. van den Bosch, and W. Daelemans, "Automated Discovery of Workflow Models from Hospital Data," Proc. 13th Belgium-Netherlands Conf. Artificial Intelligence (BNAIC 2001), B. Kröse, M. de Rijke, G. Schreiber, and M. van Someren, eds., pp. 183-190, 2001.

[32] L. Maruster, A.J.M.M. Weijters, W.M.P. van der Aalst, and A. van den Bosch, "Process Mining: Discovering Direct Successors in Process Logs," Proc. Fifth Int'l Conf. Discovery Science (Discovery Science 2002), pp. 364-373, 2002.

[33] M.K. Maxeiner, K. Küspert, and F. Leymann, "Data Mining von Workflow-Protokollen zur teilautomatisierten Konstruktion von Prozessmodellen," Proc. Datenbanksysteme in Büro, Technik und Wissenschaft, pp. 75-84, 2001.

[34] M. zur Mühlen, "Process-Driven Management Information Systems - Combining Data Warehouses and Workflow Technology," Proc. Int'l Conf. Electronic Commerce Research (ICECR-4), B. Gavish, ed., pp. 550-566, 2001.

[35] M. zur Mühlen, "Workflow-Based Process Controlling-Or: What You Can Measure You Can Control," Workflow Handbook 2001, Workflow Management Coalition, L. Fischer, ed., pp. 61-77, Lighthouse Point, Fla.: Future Strategies, 2001.

[36] M. zur Mühlen and M. Rosemann, "Workflow-Based Process Monitoring and Controlling - Technical and Organizational Issues," Proc. 33rd Hawaii Int'l Conf. System Science (HICSS-33), R. Sprague, ed., pp. 1-10, 2000.

[37] T. Murata, "Petri Nets: Properties, Analysis and Applications," Proc. IEEE, vol. 77, no. 4, pp. 541-580, Apr. 1989.

[38] L. Pitt, "Inductive Inference, DFAs, and Computational Complexity," Proc. Int'l Workshop Analogical and Inductive Inference (AII), K.P. Jantke, ed., pp. 18-44, 1989.

[39] Lectures on Petri Nets I: Basic Models, Lecture Notes in Computer Science, vol. 1491, W. Reisig and G. Rozenberg, eds., Berlin: Springer-Verlag, 1998.

[40] M. Sayal, F. Casati, M.C. Shan, and U. Dayal, "Business Process Cockpit," Proc. 28th Int'l Conf. Very Large Data Bases (VLDB '02), pp. 880-883, 2002.

[41] G. Schimm, "Process Mining," http://www.processmining.de/, 2004.

[42] G. Schimm, "Generic Linear Business Process Modeling," Proc. ER 2000 Workshop Conceptual Approaches for E-Business and The World Wide Web and Conceptual Modeling, S.W. Liddle, H.C. Mayr, and B. Thalheim, eds., pp. 31-39, 2000.

[43] G. Schimm, "Process Mining Elektronischer Geschäftsprozesse," Proc. Elektronische Geschäftsprozesse, 2001.

[44] G. Schimm, "Process Mining linearer Prozessmodelle - Ein Ansatz zur Automatisierten Akquisition von Prozesswissen," Proc. 1. Konferenz Professionelles Wissensmanagement, 2001.

[45] G. Schimm, "Process Miner - A Tool for Mining Process Schemes from Event-Based Data," Proc. Eighth European Conf. Artificial Intelligence (JELIA), S. Flesca and G. Ianni, eds., pp. 525-528, 2002.

[46] Staffware, "Staffware Process Monitor (SPM)," http://www.staffware.com, 2002.

[47] H.M.W. Verbeek, T. Basten, and W.M.P. van der Aalst, "Diagnosing Workflow Processes Using Woflan," The Computer J., vol. 44, no. 4, pp. 246-279, 2001.

[48] A.J.M.M. Weijters and W.M.P. van der Aalst, "Process Mining: Discovering Workflow Models from Event-Based Data," Proc. 13th Belgium-Netherlands Conf. Artificial Intelligence (BNAIC 2001), B. Kröse, M. de Rijke, G. Schreiber, and M. van Someren, eds., pp. 283-290, 2001.

[49] A.J.M.M. Weijters and W.M.P. van der Aalst, "Rediscovering Workflow Models from Event-Based Data," Proc. 11th Dutch-Belgian Conf. Machine Learning (Benelearn 2001), V. Hoste and G. de Pauw, eds., pp. 93-100, 2001.

[50] A.J.M.M. Weijters and W.M.P. van der Aalst, "Workflow Mining: Discovering Workflow Models from Event-Based Data," Proc. ECAI Workshop Knowledge Discovery and Spatial Data, C. Dousson, F. Höppner, and R. Quiniou, eds., pp. 78-84, 2002.




Wil van der Aalst is a full professor of information systems and head of the section of information and technology in the Department of Technology Management at Eindhoven University of Technology. He is also a part-time full professor at the Computing Science section in the Department of Mathematics and Computer Science at the same university and an adjunct professor at Queensland University of Technology. His research interests include information systems, simulation, Petri nets, process models, workflow management systems, verification techniques, enterprise resource planning systems, computer supported cooperative work, and interorganizational business processes.

Ton Weijters is an associate professor in the Department of Technology Management at the Eindhoven University of Technology (TUE), and a member of the BETA research group. Currently, he is working on 1) the application of knowledge engineering and machine learning techniques for planning, scheduling, and process mining, and 2) fundamental research in the domain of machine learning and knowledge discovery. He is the author of many scientific publications in these research fields.

Laura Maruster received the BS degree in 1994 and the MS degree in 1995, both from the Computer Science Department at West University of Timisoara, Romania. At present, she is a PhD candidate in the Department of Technology Management at Eindhoven University of Technology, Eindhoven, The Netherlands. Her research interests include induction of machine learning and statistical models, process mining, and knowledge discovery.


