


Developing Process Mining Tools

    An Implementation of Sequence Clustering for ProM

    Gabriel Martins Veiga

    Dissertation for the degree of

Master of Science in Information Systems and Computer Engineering

    Jury

    President: Prof. José Tribolet

    Supervisor: Prof. Diogo R. Ferreira

    Committee: Prof. Andreas Wichert

    September 2009

Acknowledgments

To my family, especially my parents and my brother, who always supported me and made my academic path possible.

To Prof. Diogo Ferreira for his excellent supervision and availability to help. His suggestions and guidance throughout this year greatly improved the value of this dissertation.

To the other members of our research group, namely Pedro Martins and Gil Aires, for the support given and for the exchange of ideas that occurred during this past year.

Also to all my friends, especially the ones who accompanied me during the years spent in college.


Abstract

The goal of process mining is to extract useful information from event logs that record the activities an organization performs. Many process mining techniques exist to discover a process model based on some event log. These techniques perform well on structured processes, but have problems with less structured ones, where the logs are very confusing and contain large amounts of noise, making it difficult to extract useful information. The models generated for such logs tend to be difficult to read and to contain unrelated behavior.

In this work we present an approach that aims at overcoming these difficulties by extracting only the useful data and presenting it in an understandable manner. For this purpose, sequence clustering algorithms are used to partition the log into smaller logs (clusters), each corresponding to a set of related cases. For each cluster, a model in the form of a Markov chain is presented. A preprocessing stage was also developed to clean the log of certain irrelevant elements that complicate the generated models.

The approach was fully implemented in the ProM framework and all the experiments were performed in that environment. Taking into account the results achieved for a real-world case study and the results of several experiments, we conclude that the approach is capable of dealing with complex logs, eliminating unnecessary behavior and partitioning different types of behavior into more understandable models. We also conclude that the sequence clustering algorithm provides good results when compared to other clustering methods for dividing sequences in a process mining context.

    Keywords

Process Mining, Preprocessing, Sequence Clustering, ProM, Markov Chains, Event Logs, Hierarchical Clustering, Process Models


Resumo

The goal of process mining is to obtain relevant information from the event logs that record the activities executed in an organization. Several techniques in this area generate process models from such logs. These techniques yield good results for well-structured processes, but run into problems when applied to less structured ones. In those cases the logs are very confusing and contain a large amount of noise, making it difficult to extract useful information. For such logs, the generated model is hard to understand and may mix behavior from quite distinct cases.

In this work we present an approach that aims at overcoming these difficulties by extracting only the relevant information and presenting it in a readable form. To that end, sequence clustering algorithms are used to divide the log into smaller logs (clusters) that correspond to sets of related cases. For each cluster, a model in the form of a Markov chain is presented. A preprocessing stage was also developed to clean the log of elements that could needlessly complicate the resulting models.

The approach was implemented in the ProM framework and all experiments were carried out in that environment. Taking into account the results obtained in a real-world case study and in several experiments, we conclude that the approach can handle complex logs, eliminating unnecessary behavior and dividing different types of behavior into more comprehensible models. We also conclude that the sequence clustering algorithm yields good results compared with other clustering algorithms for dividing sequences in a process mining context.

    Palavras Chave

Process Mining, Preprocessing, Sequence Clustering, ProM, Markov Chains, Event Logs, Hierarchical Clustering, Process Models


Contents

1 Introduction
  1.1 Process Mining
  1.2 Motivation
  1.3 Organization

2 Process Mining Tools
  2.1 ProM
  2.2 Mining Tools
  2.3 Analysis Tools
  2.4 Process Mining with Clustering
  2.5 Trace Clustering Approach for Process Mining
  2.6 Conclusion

3 Sequence Clustering for ProM
  3.1 Sequence Clustering
  3.2 Applications of Sequence Clustering
  3.3 Preprocessing
  3.4 Implementation within ProM
    3.4.1 Preprocessing Stage
    3.4.2 Sequence Clustering Stage
  3.5 Hierarchical Clustering
  3.6 Conclusion

4 Experiments and Evaluation
  4.1 Issue Handling Process
  4.2 Patient Treatment Process
  4.3 Telephone Repair Process: Comparing Clustering Methods

5 Case Study: Application Server Logs
  5.1 Case study description
  5.2 Log Structure
  5.3 Preprocessing stage
  5.4 Sequence Clustering results

6 Conclusion
  6.1 Main contributions
  6.2 Future work

A Published paper

List of Figures

2.1 Overview of the ProM Framework (adapted from [1])
2.2 MXML Snapshot
2.3 Process Model of the example log
2.4 DWS mining result
2.5 Trace Clustering result
2.6 Process Model for cluster (1,1)
3.1 Example of a cluster model (Markov chain) displayed in the sequence clustering plug-in
3.2 Markov chain – Matrix representation
3.3 Sequence Clustering plug-in in the ProM framework
3.4 Preprocessing stage for the Sequence Clustering plug-in
3.5 Cluster Inspection in the Sequence Clustering plug-in
3.6 Cluster model with no threshold
3.7 Cluster model with an edge threshold of 0.06
4.1 Model for the initial log of the issue handling process
4.2 Sequences present in the log of the Issue Handling Process
4.3 Cluster 1: Issue Handling Process
4.4 Cluster 2: Issue Handling Process
4.5 Cluster 3: Issue Handling Process
4.6 Cluster 4: Issue Handling Process
4.7 Cluster 3.1: Issue Handling Process
4.8 Cluster 3.2: Issue Handling Process
4.9 Model for the initial log of the Patient Treatment Process
4.10 Events present in the log of the Patient Treatment Process
4.11 Sequences present in the log of the Patient Treatment Process
4.12 Cluster 1: Patient Treatment Process
4.13 Cluster 2: Patient Treatment Process
4.14 Cluster 3: Patient Treatment Process
5.1 System infrastructure of a public institution
5.2 Application Server Logs Snapshot
5.3 Spaghetti model obtained from the application server logs using the heuristics miner
5.4 Events related with exceptions in the application server logs
5.5 Some of the behavioral patterns discovered from the application server logs using the sequence clustering plug-in

List of Tables

2.1 Example of an event log with 70 process instances, for the process of patient care in a hospital (A: Register patient, B: Contact family doctor, C: Treat patient, D: Give prescription to the patient, E: Discharge patient)
4.1 Correspondence between letters and events
4.2 Complexity metrics of the process models from the clusters generated by the three different clustering methods

CHAPTER 1

    Introduction

The growing demand for faster and more structured procedures in organizations has resulted in the proliferation of information systems. However, the existence of an information system to accomplish a given task does not ensure the most efficient way to execute that task, especially when several systems are required to execute it. Performance issues are a common problem that organizations face, and therefore optimization is frequently a priority. Optimizing the way an organization performs its processes leads to an increase in efficiency, adding value to the organization. To optimize a process, an organization must first understand how that process is being executed; traditionally this involves a long period of analysis, including interviews with all the people responsible for each part of the process.

The appearance and proliferation of Process-Aware Information Systems [2] (such as ERP, WFM, CRM and SCM systems) has opened the door to a more efficient way of studying the execution of processes, called process mining [3]. These systems typically record the events executed during a business process, and analyzing these logs can yield important knowledge to improve the execution of processes and the quality of the organization's services. This is where process mining comes in.

    1.1 Process Mining

The process mining area is concerned with the discovery, monitoring and improvement of real processes (not assumed processes) by extracting information from event logs. Process mining techniques can generally be grouped into three types: (1) discovery of process knowledge such as process models [4, 5, 6]; (2) conformance checking, i.e. measuring the conformance between modeled behavior (defined process models) and observed behavior (process executions present in logs) [7, 8]; and (3) extension of a process model with information extracted from event logs (such as identifying bottlenecks in a process model).

The main application of process mining is the discovery of process models, and much research has been performed in order to improve the models produced. However, there are still issues that complicate the discovery of comprehensible models and that need to be addressed. For processes with many different cases and a high diversity of behavior, the models generated tend to be very confusing and difficult to understand. These models are usually called spaghetti models [9]. Clustering techniques have been investigated as a means to deal with this complexity by dividing cases into clusters, leading to less confusing models. However, results still suffer from the presence of certain unusual cases that include noise and ad-hoc behavior, which are common in real-world environments. This type of behavior can have different origins, such as human error in executing a given process, incomplete executions of a process, or errors produced by the systems. Known types of noise are, for example, the inversion of the order of activities, the presence of unrelated activities, or the absence of required activities. Usually this type of behavior is not relevant to understanding a process, and it unnecessarily complicates the discovered models.

    1.2 Motivation

In this dissertation we present an approach that is able to deal with these problems by means of sequence clustering techniques. This is a kind of model-based clustering that partitions the cases according to the order in which events occurred. For the purpose of this work, the model used to represent each cluster is a first-order Markov chain. The fact that this clustering is probabilistic makes it suitable for logs containing many different types of behavior, possibly including non-recurrent behavior. When sequence clustering is applied, the log is divided into a number of clusters and the corresponding Markov chains are generated.
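As a rough illustration of the underlying idea (not the plug-in's actual code), a first-order Markov chain for a cluster can be estimated by counting transition frequencies over its sequences, with artificial start and end states; a case is then assigned to the cluster whose chain gives it the highest probability. The function names below are our own:

```python
def estimate_chain(sequences):
    """Estimate first-order Markov transition probabilities from sequences.

    Artificial start ("<s>") and end ("</s>") states are added so that the
    probability of starting or ending with a given event is also modeled.
    """
    counts = {}
    for seq in sequences:
        states = ["<s>"] + list(seq) + ["</s>"]
        for a, b in zip(states, states[1:]):
            counts.setdefault(a, {})
            counts[a][b] = counts[a].get(b, 0) + 1
    # normalize each row of counts into probabilities
    return {a: {b: n / sum(row.values()) for b, n in row.items()}
            for a, row in counts.items()}

def sequence_probability(chain, seq):
    """Probability the chain assigns to a sequence (0 if a transition is unseen)."""
    states = ["<s>"] + list(seq) + ["</s>"]
    p = 1.0
    for a, b in zip(states, states[1:]):
        p *= chain.get(a, {}).get(b, 0.0)
    return p

chain = estimate_chain(["ACE", "ACE", "ABCE"])
print(chain["A"])  # A is followed by C two times out of three, by B once
```

In the probabilistic (expectation-maximization style) clustering, this assignment and re-estimation would be repeated until the cluster memberships stabilize.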

Additionally, the approach comprises a preprocessing stage, whose goal is to clean the log of certain events that would only complicate the clustering method and its results. If, after both techniques are applied, the models are still confusing, sequence clustering can be re-applied hierarchically within each cluster until understandable results are obtained. The approach has been implemented in ProM [1], an extensible framework for process mining that already includes many techniques to address challenges in this area.

    1.3 Organization

This document is organized as follows: Chapter 2 provides an overview of existing work involving clustering and process mining. The framework in which the work was developed is presented, along with some of the most important techniques implemented in that framework.

Chapter 3 presents the proposed approach, including the preprocessing stage and the sequence clustering algorithm. The implementation of these techniques in ProM is discussed, including the inputs needed and the outputs produced.

Chapter 4 presents three experiments that demonstrate the use of the implemented techniques and compare the results with other clustering methods.

Chapter 5 demonstrates the approach in a real-world case study where the goal was to understand the typical behavior of faults in an application server.

Finally, in Chapter 6 we draw conclusions about this work and suggest directions for future work.


CHAPTER 2

    Process Mining Tools

Process mining techniques aim at the analysis of business processes by extracting information from event logs, and are especially useful when little information is available about a given process and obtaining that information is complicated. Most of these techniques are available in the ProM framework.

In this chapter we present work related to the concepts approached in this dissertation, focusing on process mining techniques available in the ProM framework. First we present the framework, then we give an overall view of the different types of process mining tools available in ProM, and finally we explore in greater detail two of those tools that apply clustering methods to process mining.

    2.1 ProM

The environment on which this dissertation is based is the ProM Framework¹ [1, 10]. ProM is an extensible framework aimed at process mining that is issued under an open source license; therefore the development of plug-ins is possible and encouraged. Many plug-ins resulting from research work have been developed, in three major categories: mining, analysis and conversion. Figure 2.1 presents an overview of the ProM Framework architecture, showing the relations between the framework, the plug-ins and the event log.

Figure 2.1: Overview of the ProM Framework (adapted from [1])

The event log that usually serves as input to the plug-ins has a specific format based on XML, defined for this framework, called MXML [11]. This format follows a specified schema definition, which means the log does not consist of random and disorganized information; rather, it contains all the elements needed by the plug-ins at a known location. In Figure 2.2 a snapshot of an MXML log is presented. Each ProcessInstance corresponds to one execution of a given process and has a set of AuditTrailEntry elements associated with it. These entries correspond to the events that occurred during the execution of the process instance and are composed of several attributes, such as the WorkflowModelElement, which holds the name of the event, and the EventType, which classifies the event according to its state (start or complete). There are also other attributes that identify the originator of a given event and the timestamp at which the event was executed.

¹ For more information and to download ProM visit www.processmining.org
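To make this structure concrete, the sketch below parses a minimal hand-written fragment in the spirit of MXML and recovers the event sequence of each process instance. The fragment is illustrative only (real MXML logs carry further attributes, such as timestamps and originators, and follow the full schema definition); the element names match the description above:

```python
import xml.etree.ElementTree as ET

SAMPLE = """\
<WorkflowLog>
  <Process id="patient_care">
    <ProcessInstance id="1">
      <AuditTrailEntry>
        <WorkflowModelElement>A</WorkflowModelElement>
        <EventType>complete</EventType>
      </AuditTrailEntry>
      <AuditTrailEntry>
        <WorkflowModelElement>C</WorkflowModelElement>
        <EventType>complete</EventType>
      </AuditTrailEntry>
      <AuditTrailEntry>
        <WorkflowModelElement>E</WorkflowModelElement>
        <EventType>complete</EventType>
      </AuditTrailEntry>
    </ProcessInstance>
  </Process>
</WorkflowLog>"""

def event_sequences(xml_text):
    """Map each ProcessInstance id to its list of completed event names."""
    root = ET.fromstring(xml_text)
    seqs = {}
    for instance in root.iter("ProcessInstance"):
        seqs[instance.get("id")] = [
            entry.findtext("WorkflowModelElement").strip()
            for entry in instance.iter("AuditTrailEntry")
            if entry.findtext("EventType", "").strip() == "complete"
        ]
    return seqs

print(event_sequences(SAMPLE))  # {'1': ['A', 'C', 'E']}
```

Extracting the per-instance event sequences in this way is exactly the starting point for the clustering techniques discussed later.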

As shown in Figure 2.1, these event logs are generated by Process-Aware Information Systems (PAIS) [2] and are read by the ProM framework using the Log Filter, which can also perform some filtering on the logs before any other task is performed. The Import plug-ins are used to load many different kinds of models, such as ARIS graphs, while the Mining plug-ins perform some kind of mining, storing the results as Frames. These Frames can be used to visualize, for example, a Petri net [12] or a Social Network [13]. Analysis plug-ins can perform further analysis, such as checking the conformance between a process model and a log. Finally, Conversion plug-ins can transform a mining result into another format, and the Export plug-ins can store the results outside the framework in different formats.

Next we further explore the mining and analysis plug-ins available in ProM, in particular the ones related to our work.


(Figure 2.2 shows an excerpt of an MXML log with the events of two process instances, A, C, E and C, A, B, D, each event recorded with its WorkflowModelElement name and an EventType of "complete".)

Figure 2.2: MXML Snapshot

    2.2 Mining Tools

Mining tools are implementations of mining algorithms in ProM. They can be divided into three major types: (1) control-flow discovery, (2) organizational perspective and (3) data perspective. Some of the control-flow discovery tools include:

• α-algorithm plug-in – it implements the α-algorithm [4], constructing a Petri net that models the workflow of the process. It establishes a set of relations between tasks and assumes that the log is complete (all possible behavior is present). This algorithm presents some shortcomings, namely it is not robust to noise and it cannot mine processes with short loops or duplicate tasks. Some work has been done to extend this algorithm, for instance to mine short loops [14] and to detect implicit dependencies [15].

• Heuristics miner plug-in [5] – it implements a heuristics-driven algorithm that is especially useful for dealing with noise, by expressing only the main behavior present in a log. This means that not all details are shown to the user and exceptions are ignored. To illustrate what kind of graph is presented by this tool, we created a simple example log² shown in Table 2.1 and used the ProM implementation of this algorithm to obtain the result shown in Figure 2.3. This tool can also be used when searching for long-distance dependency relations.

  Id   Process Instance   Frequency
  1    ABCE               20
  2    ACE                12
  3    CABD               10
  4    CAB                14
  5    CDBE               14

Table 2.1: Example of an event log with 70 process instances, for the process of patient care in a hospital (A: Register patient, B: Contact family doctor, C: Treat patient, D: Give prescription to the patient, E: Discharge patient)

² Real-life logs generate much more complex and confusing models; this example is only used to present some concepts.

• Genetic algorithm plug-in [6] – it uses genetic algorithms to find the best possible process model for a log. Every individual is assigned a fitness measure that evaluates how well that individual can reproduce the behavior present in the input log; in this context, individuals are candidate process models. New candidates are generated using genetic operators such as crossover and mutation, and then the fittest are selected. This algorithm was proposed to deal with issues in the logs such as noise and incompleteness.

The organizational perspective aims at understanding the different types of business relations established within an organization. Some of the mining tools available in ProM that approach this subject are:

• Social network miner plug-in [16] – it takes a log file and derives a social network of people. Using this tool we can identify roles and interactions in an organization, for example who usually works together or who hands over work to whom.

• Organizational miner plug-in – from an event log containing originator information, it presents to the user a graph associating activities and originators.

Tools that deal with the data perspective make use of additional data attributes present in logs; one example is:

• Decision miner [17] – this tool analyzes how data attributes of process instances or activities (such as timestamps or performance indicators) influence the routing of a process instance. To accomplish this, every decision point in the process model is analyzed and, if possible, linked to properties of individual cases (process instances) or activities.

    There are also some plug-ins that deal with less structured processes:

• Fuzzy miner [18] – the process models of less structured processes tend to be very confusing and hard to read (usually referred to as spaghetti models). The objective of this tool is to emphasize graphically the most relevant behavior, by calculating the relevance of activities and their relations. To achieve this, two metrics are used: (1) significance, which measures the level of interest we have in events (for example by calculating their frequency in the log), and (2) correlation, which determines how closely related two events that follow each other are, so that highly related events can be aggregated.

Figure 2.3: Process Model of the example log

    2.3 Analysis Tools

Analysis plug-ins have a variety of purposes, such as performing property analysis on a previously obtained mining result or comparing a process log with a predefined model of how the process should be executed. Next we present a few of those that we consider most relevant:

• Conformance checker – one important question that organizations would like to have answered is: are our processes being executed as we planned? Answering this question has been an active field of research [7, 8], and this tool was implemented in ProM to address it. It analyzes the gap between a model and the real world, detecting violations (bad executions of a process) and ensuring transparency (the model might be outdated). To measure conformance, this tool uses two concepts: (1) fitness, which checks whether the event log complies with the control flow specified by the process model, and (2) appropriateness, which checks whether the model describes the behavior present in the event log.


• Basic performance analysis – the objective of this tool is to calculate performance measures such as the execution time of a process or the waiting time. The tool then presents the results with several different kinds of graphs.

• LTL checker [19] – this plug-in checks whether a log satisfies a given Linear Temporal Logic (LTL) formula. For example, it can check whether a given activity is executed by the person who should be executing it, or whether an activity A that has to be executed after B is indeed always executed at the correct moment.
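As a rough sketch of the semantics of one such property (not the plug-in's actual formula syntax), the following checks a response-style constraint on a single trace: every occurrence of one activity must eventually be followed by another. The function name is our own:

```python
def always_followed_by(trace, a, b):
    """True if every occurrence of `a` has an occurrence of `b` somewhere after it.

    A trace with no occurrence of `a` satisfies the property vacuously.
    """
    return all(b in trace[i + 1:] for i, event in enumerate(trace) if event == a)

# Using traces from the example log in Table 2.1 (E: Discharge patient):
print(always_followed_by("ABCE", "A", "E"))  # True: registration is followed by discharge
print(always_followed_by("CAB", "A", "E"))   # False: A is never followed by E
```

The real LTL checker evaluates arbitrary temporal formulas over all process instances in the log; this sketch only mirrors the idea for one fixed pattern.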

    2.4 Process Mining with Clustering

When generating process models like the one in Figure 2.3, conventional control-flow techniques tend to over-generalize. In the attempt to represent all the different behavior present in the log, these techniques create models that allow for more behavior than what was actually observed. When a log contains very different process instances, the generated models are even more complex and confusing. Clustering has been approached as a way to overcome this problem [20].

The approach was implemented in ProM as the Disjunctive Workflow Schema (DWS) mining plug-in. In the methodology developed, the complete log is first examined and a model is generated using the Heuristics Miner [5]. The log is then compared to the model to measure the model's quality. If the generated model is optimal and no over-generalization is detected, the approach stops; otherwise the log is divided into clusters using the k-means clustering method and the cluster models are tested. If the cluster models still allow for too much behavior, the clusters are repartitioned, and so on until optimal models are achieved. The result of this methodology is the set of all the models created and the over-generalization points.
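The splitting step relies on standard k-means. A self-contained sketch of the algorithm on small numeric feature vectors is shown below; the data is hypothetical and the deterministic seeding is for illustration only (real implementations typically seed randomly and restart):

```python
def kmeans(points, k, iterations=20):
    """Plain k-means: assign each point to its nearest centroid, recompute centroids."""
    centroids = [points[i] for i in range(k)]  # deterministic seeding for illustration
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # index of the centroid with the smallest squared distance to p
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # recompute each centroid as the mean of its cluster (keep old one if empty)
        centroids = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters, centroids

# Two clearly separated groups of 2D points
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
clusters, _ = kmeans(pts, 2)
```

In the DWS setting, each point would be a feature vector describing one process instance, and each resulting cluster would be mined again with the Heuristics Miner.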

Let us apply this methodology to the example log described in Table 2.1. By analyzing the model shown in Figure 2.3, we can conclude that it allows for behavior not present in the log, for example the sequence BCA. Running the DWS plug-in available in ProM produces the result shown in Figure 2.4: the model is presented (right), along with the navigational tree of generated models, where we can choose the one to view (top left), and the detected over-generalization points, where the first one refers to the sequence we had identified, stating that A was never executed after BC (bottom left, marked in red). Other clustering methods investigated in the process mining area are presented in the next section.

    2.5 Trace Clustering Approach for Process Mining

Trace Clustering [21] is another approach investigated in the process mining area as a way to partition the log, grouping similar sequences together. The motivation behind this work was the existence of flexible environments, where the execution of processes does not follow a rigid set of rules; although the notion of a process is present, the actors are able to execute it differently according to each case. An example of such an environment is healthcare, where strictly following a process is not a priority compared to providing the best care for patients.


Figure 2.4: DWS mining result

In these environments, and particularly when a large number of cases (process instances) is recorded in the log, the main problem is diversity: single process instances differ significantly from one another, so there are several different types of sequences and the models generated by conventional techniques are very confusing (spaghetti-like models).

This approach addresses the issue using distance-based clustering along with profiles, with the purpose of reducing the diversity by lowering the number of cases analyzed at once. Each profile is formed by a set of items that describe a case from a particular perspective. Every item is a metric that assigns a numeric value to each case, and therefore a profile can be viewed as a vector containing the values of all the different items (profiles can be combined, resulting in aggregate vectors). These vectors are then used to calculate the distance between two cases, using distance metrics (such as the Euclidean distance or the Hamming distance). Examples of such profiles are:

• Transition – the items in this profile are the direct-following relations of the sequence (that forms a process instance). For any two events (A, B) there is an item measuring how often B has directly followed A.

• Case Attributes – the items in this profile are the data attributes of the process instance. When process instances are annotated with meta-information, comparing that information can be an efficient way to compare the instances.
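A minimal sketch of how a transition profile turns a trace into a vector that distance metrics can compare follows; the item ordering and helper names here are our own choices, not those of the plug-in:

```python
from itertools import product
from math import sqrt

def transition_profile(trace, alphabet):
    """Vector with one item per ordered event pair (a, b): how often b directly follows a."""
    pairs = list(zip(trace, trace[1:]))
    return [pairs.count(item) for item in product(alphabet, repeat=2)]

def euclidean(u, v):
    """Euclidean distance between two profile vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

alphabet = "ABCDE"
# Traces ABCE and ACE differ only in the items (A,B), (B,C) and (A,C),
# so their Euclidean distance over transition profiles is sqrt(3)
print(euclidean(transition_profile("ABCE", alphabet),
                transition_profile("ACE", alphabet)))  # 1.7320508075688772
```

A Hamming distance over the same vectors, or an aggregate vector combining several profiles, would be computed in exactly the same way.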

Figure 2.5: Trace Clustering result

Finally, clustering methods such as the ones presented next can be applied to group closely related cases in the same cluster:

    • K-means Clustering – It is one of the most used clustering methods and constructs k clustersby dividing the data into k groups.

• Self-Organizing Map (SOM) – it is a neural network technique used to map high-dimensional data onto low-dimensional spaces. Similar cases are mapped close to each other in the SOM. In Figure 2.5 we can see the resulting output of this method (in ProM) when applied to our example log. Three clusters were generated, and in the map we can analyze the similarity between different cases and different clusters: if two cases are close together in the map and the color separating them is light, then those cases are very similar.

These algorithms are available in the ProM framework via the trace clustering plug-in. Figure 2.6 shows the process model (generated by the Heuristics Miner) for one of the clusters created when applying the SOM method. We can now clearly identify a type of sequence (there is no diversity in that subset of the original log): it refers to a case (CDBE) where the patient was not registered. In other words, by analyzing the clusters we can discover different types of behavior, including types of sequences that are not being executed as they should be.

    Recent work has been done to improve the results produced by trace clustering. A context-aware approach based on generic edit distance was presented in [22]. In this work a method was defined to automatically derive a cost function to calculate the costs of edit operations, which takes into account the context of an event within a sequence. Considering the context can be valuable given that the events present in the sequences and the order in which they occur have a semantic relevance.

    Figure 2.6: Process Model for cluster (1,1)

    To understand the usefulness of this kind of clustering, a comparison was made between different trace clustering methods [22]. Comparing the results produced by one trace clustering approach to another is not trivial, due to the difficulty in understanding whether one cluster is better formed than another. A better formed cluster is one in which the sequences have a higher degree of similarity, and consequently the models for those clusters are easier to understand. Therefore a process mining perspective was proposed to evaluate the goodness of clusters by analyzing the models of those clusters. Fitness and comprehensibility metrics were used to evaluate the complexity of the models. By comparing these metrics the approach proved to generate less complex cluster models, indicating that better formed clusters were achieved.

    2.6 Conclusion

    In this chapter we have introduced some important concepts relating to our work. The framework used throughout this dissertation was presented, as well as the types of tools available in that framework. The continuous growth of ProM is due to the importance that process mining has gained in recent years, resulting in numerous research efforts by people around the world.

    We presented in greater detail two solutions involving the application of clustering techniques to process mining. The Trace Clustering approach is particularly relevant to our work, given that they share the common goal of subdividing the initial log into smaller logs, so as to facilitate the detection of patterns.


    The difference between the two approaches lies in the techniques used to achieve that goal. The emphasis of our solution is to approach the problem of noise and ad-hoc behavior that complicates the identification of patterns in logs originated by real-world information systems, and the results produced by the clustering methods presented. To accomplish this we combine different techniques that are presented in the next chapter.


    CHAPTER 3

    Sequence Clustering for ProM

    Like the clustering techniques described in the previous chapter, sequence clustering can take a set of sequences and group them into clusters, so that similar types of sequence are placed in the same cluster. However, this type of clustering is performed directly on the sequences, as opposed to being performed on features extracted from those sequences. Sequence clustering has been extensively used in the field of bioinformatics, for example to classify large protein datasets into different families [23]. Process mining also deals with sequences, but instead of amino acids the sequences contain events that have occurred during the execution of a given process. Sequence clustering techniques are therefore a natural candidate to perform clustering on workflow logs. In this chapter the techniques that form our solution are explored and the way these techniques were implemented is presented, including the outputs produced.

    3.1 Sequence Clustering

    The sequence clustering algorithm used here is based on first-order Markov chains [24, 25]. Each cluster is represented by the corresponding Markov chain and by all the sequences assigned to it.

    A Markov chain is composed of a set of states and of the transition probabilities between them. In first-order Markov chains the probability of a given transition to a future state depends only on the current state. For the purpose of process mining it becomes useful to augment the simple Markov chain model with two dummy states: the input and the output state. This is necessary in order to represent the probability of a given event being the first or the last event of the chain, which may become useful to distinguish between some types of sequences.

    Figure 3.1 shows a simple example of such a chain depicted in ProM via the sequence clustering plug-in developed in this work. In this model, darker elements (both states and transitions) are more recurrent than lighter ones. By analyzing the color of the elements and the probability associated with each transition it is possible to decide which elements should be kept for analysis, and which elements can be discarded. For example, one may choose to remove transitions that have very low probabilities, so that only the most typical behavior can be analyzed.

    Figure 3.1: Example of a cluster model (Markov chain) displayed in the sequence clustering plug-in

         ◦    a    b      c      d    e      •
    ◦   0.0  1.0  0.0    0.0    0.0  0.0    0.0
    a   0.0  0.0  0.892  0.108  0.0  0.0    0.0
    b   0.0  0.0  0.0    0.0    1.0  0.0    0.0
    c   0.0  0.0  0.0    0.0    1.0  0.0    0.0
    d   0.0  0.0  0.0    0.0    0.0  0.368  0.632
    e   0.0  0.0  0.0    0.0    0.0  0.0    1.0
    •   0.0  0.0  0.0    0.0    0.0  0.0    0.0

    Figure 3.2: Markov chain – Matrix representation

    Figure 3.2 corresponds to the matrix representation of the Markov chain shown in Figure 3.1, where each column and each line correspond to an event and the matrix is ordered alphabetically from a to e (considering that the first and the last state correspond to the two dummy states added). In this representation the matrix values are the transition probabilities; for example, the transition from a to c has a 10.8% probability of occurring. Every line in the matrix must be normalized; i.e. the sum of all the transition probabilities originating on a given state must equal one. Notice that in the first column and in the last line all values are zero and will be so in every Markov chain generated by our solution, because the input state is a dummy first state that is never transitioned to and the output state is a dummy final state that never transitions anywhere.
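These properties are easy to check programmatically. The sketch below reproduces the matrix of Figure 3.2 (the names 'o' and 'x' stand in for the ◦ and • symbols, a choice of ours) and asserts the two invariants just described:

```python
# Transition matrix from Figure 3.2; state order: input (o), a, b, c, d, e, output (x).
states = ["o", "a", "b", "c", "d", "e", "x"]
P = [
    [0.0, 1.0, 0.0,   0.0,   0.0, 0.0,   0.0],   # o: every sequence starts with a
    [0.0, 0.0, 0.892, 0.108, 0.0, 0.0,   0.0],   # a -> b (89.2%) or a -> c (10.8%)
    [0.0, 0.0, 0.0,   0.0,   1.0, 0.0,   0.0],   # b -> d
    [0.0, 0.0, 0.0,   0.0,   1.0, 0.0,   0.0],   # c -> d
    [0.0, 0.0, 0.0,   0.0,   0.0, 0.368, 0.632], # d -> e or d -> end
    [0.0, 0.0, 0.0,   0.0,   0.0, 0.0,   1.0],   # e -> end
    [0.0, 0.0, 0.0,   0.0,   0.0, 0.0,   0.0],   # output state: no outgoing transitions
]

# Every row except the output state's must sum to one (normalization),
# and the first column must be all zeros (nothing transitions into the input state).
for state, row in zip(states[:-1], P[:-1]):
    assert abs(sum(row) - 1.0) < 1e-9, state
assert all(row[0] == 0.0 for row in P)
print("matrix is a valid cluster model")
```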


    As said before, these are first-order Markov chains; there are also nth-order chains where the probability of transition to a future state depends on the previous n states. An example of recent work developed with higher-order Markov chains can be found in [26].

    The assignment of sequences to clusters is based on the probability of each cluster producing the given sequence. In general, any given sequence will be assigned to the cluster that is able to produce it with the highest probability. Let ◦ and • denote the input and output states, respectively. To calculate the probability of a sequence x = {◦, x1, x2, ..., xL, •} being produced by cluster ck the following formula is used:

    p(x | ck) = p(x1 | ◦; ck) · [ ∏ (i = 2..L) p(xi | xi−1; ck) ] · p(• | xL; ck)        (3.1)

    where p(xi | xi−1; ck) is the transition probability from xi−1 to xi in the Markov chain associated with cluster ck. This formula handles the input and output states in the same way as any other regular state that corresponds to an event.
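Equation (3.1) can be read directly into code. The sketch below uses the matrix of Figure 3.2; the names 'o' and 'x' stand for the dummy input and output states, and the function name is ours:

```python
# Probability of a sequence under a cluster's Markov chain, per equation (3.1).
STATES = ["o", "a", "b", "c", "d", "e", "x"]
INDEX = {s: i for i, s in enumerate(STATES)}
P = [  # the matrix of Figure 3.2
    [0.0, 1.0, 0.0,   0.0,   0.0, 0.0,   0.0],
    [0.0, 0.0, 0.892, 0.108, 0.0, 0.0,   0.0],
    [0.0, 0.0, 0.0,   0.0,   1.0, 0.0,   0.0],
    [0.0, 0.0, 0.0,   0.0,   1.0, 0.0,   0.0],
    [0.0, 0.0, 0.0,   0.0,   0.0, 0.368, 0.632],
    [0.0, 0.0, 0.0,   0.0,   0.0, 0.0,   1.0],
    [0.0, 0.0, 0.0,   0.0,   0.0, 0.0,   0.0],
]

def sequence_probability(sequence):
    # One factor per transition, with the dummy input and output states
    # treated exactly like any other state.
    path = ["o"] + list(sequence) + ["x"]
    prob = 1.0
    for prev, cur in zip(path, path[1:]):
        prob *= P[INDEX[prev]][INDEX[cur]]
    return prob

print(sequence_probability("abde"))  # 1.0 * 0.892 * 1.0 * 0.368 * 1.0 = 0.328256
```

A sequence containing a transition the chain never produces (say a directly to d) gets probability zero, which is exactly why very uncommon sequences distort a cluster's model.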

    The goal of sequence clustering is to estimate these parameters for all clusters ck with k = 1, 2, ..., K based on a set of input sequences. For that purpose, the algorithm relies on an Expectation–Maximization procedure [27] to improve the model parameters iteratively. For a given number of clusters K the algorithm proceeds as follows:

    1. Randomly initialize the state transition probabilities of the Markov chains associated with each cluster.

    2. Assign each sequence to the cluster that can produce it with the highest probability according to equation (3.1).

    3. Recompute the state transition probabilities of the Markov chain of each cluster, considering the sequences that were assigned to that cluster in the previous step.

    4. Repeat steps 2 and 3 until the assignment of sequences to clusters does not change, and hence the cluster models do not change either.

    In other words, first we randomly distribute the sequences into the clusters (steps 1 and 2), then in step 3 we re-estimate the cluster models (Markov chain and its probabilities) according to the sequences assigned to each cluster. After this first iteration we re-assign the sequences to clusters and again re-estimate the cluster models (steps 2 and 3). These two steps are executed repeatedly until the algorithm converges. The result is a set of Markov models that describe the behavior of each cluster.
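The iterative procedure can be sketched as follows. This is a simplified illustration, not the plug-in's code: it collapses steps 1 and 2 into a random initial assignment, and adds a small additive smoothing (our own choice, not described in the text) so that unseen transitions do not force zero probabilities:

```python
import random
from collections import defaultdict

STATES = ["o", "a", "b", "c", "d", "x"]  # includes dummy input 'o' and output 'x'

def seq_prob(seq, chain):
    # Equation (3.1): product of transition probabilities, dummy states included.
    path = ["o"] + list(seq) + ["x"]
    p = 1.0
    for a, b in zip(path, path[1:]):
        p *= chain[(a, b)]
    return p

def estimate(sequences, smoothing=0.1):
    # Step 3: re-estimate transition probabilities from the assigned sequences.
    counts = defaultdict(lambda: smoothing)
    for seq in sequences:
        path = ["o"] + list(seq) + ["x"]
        for a, b in zip(path, path[1:]):
            counts[(a, b)] += 1.0
    chain = {}
    for a in STATES:
        total = sum(counts[(a, b)] for b in STATES)
        chain.update({(a, b): counts[(a, b)] / total for b in STATES})
    return chain

def sequence_clustering(sequences, k, iterations=20, seed=1):
    rnd = random.Random(seed)
    assignment = [rnd.randrange(k) for _ in sequences]   # steps 1-2 collapsed
    for _ in range(iterations):
        chains = [estimate([s for s, c in zip(sequences, assignment) if c == i])
                  for i in range(k)]                     # step 3
        new = [max(range(k), key=lambda i: seq_prob(s, chains[i]))
               for s in sequences]                       # step 2 again
        if new == assignment:                            # step 4: converged
            break
        assignment = new
    return assignment

logs = ["abd", "abd", "abcd", "acd", "acd", "acbd"]
print(sequence_clustering(logs, k=2))  # one cluster label per input sequence
```

Because of the random initialization, different seeds may yield different partitions of the same sequences, which is exactly the uncertainty discussed next.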

    The random initialization of the transition probabilities is an important feature of this algorithm that introduces a certain level of uncertainty in the results achieved. The sequence clustering algorithm can therefore generate a different set of clusters for the same sequences. Throughout this dissertation we minimized the impact of this uncertainty by applying the algorithm several times (usually five) and choosing the result that occurred most often.


    3.2 Applications of Sequence Clustering

    Sequence clustering algorithms have been an active field of investigation in the area of bioinformatics [23, 28], as mentioned earlier. Although this has been the area primarily associated with sequence clustering, some work has been done with this type of algorithm in other areas.

    In [24], the goal was to analyze the navigation patterns on a website; these patterns consisted of sequences of URL categories followed by users. Sequence clustering was the approach chosen to partition site users, placing users with similar navigation paths in the same cluster. The behavior of the users present in each cluster is then displayed and can be analyzed to understand the particular interests of different types of user.

    In the field of process mining, sequence clustering has also been investigated [25]. The motivation behind that work was the fact that an event log can contain events originating from different processes; i.e. the idea was not to make the assumption that an event log only contains events of one process, but that it can instead be a mixture of different processes without any information stating which events correspond to which processes. The goal was to develop an approach that would be able to extract sequences of related events (relating to the same case) from those chaotic logs. After identifying the sequences, the Microsoft Sequence Clustering algorithm (available in SQL Server [29]) is applied to group similar sequences in the same cluster, without the need for any business logic information.

    The environment developed to test this approach was an application that executed sequences of actions over a database and recorded these actions in logs. After extracting the sequences from the event log with some methodology, the sequence clustering algorithm was applied with a specific number of clusters so as to generate a new cluster for each of the different types of sequences identified. Consequently, the model generated for each cluster constitutes a deterministic graph (the transition probabilities all equal 1.0) and the visualization of each model leads to the identification of a sequence type executed in the environment tested.

    3.3 Preprocessing

    Although the sequence clustering algorithm described above is robust to noise, all sequences must ultimately be assigned to a cluster. However, if a sequence is very uncommon and different from all the others, it will affect the probabilistic model of that cluster and in the end will make it harder to interpret the model of that cluster. To avoid this problem, some preprocessing must be done to the input sequences prior to applying sequence clustering. This preprocessing can be seen as a way to clean the dataset of undesired states (events) and also as a way to eliminate undesirable sequences. For example, undesired events can be events that occur rarely, and undesired sequences can be single-step sequences that only have one state.

    Some of the steps that can be performed during preprocessing are described in [30] and include, for example, dropping events and sequences with low support. In this work we have extended these steps by allowing not only the least but also the most recurring events and sequences to be discarded. This was motivated by the fact that in some real-world applications the log is filled with some very frequent but irrelevant events (such as debug messages) that must be removed in order to allow the analysis to focus on the relevant behavior. Spaghetti models are often cluttered with events that occur very often but only contribute to obscure the process model one aims to discover.

    The preprocessing steps implemented within the sequence clustering plug-in are optional and configurable. They focus on the following features:

    1. Event type – The events recorded in an MXML log file may represent different points in the lifetime of workflow activities, such as the start or completion of a given activity. For sequence clustering what is important is the order of activity execution, so we retain only one type of event, usually the completion event for each activity. Therefore only events of type “complete” are kept after this step.

    2. Event support – Some events may be so infrequent that they are not relevant for the purpose of discovering typical behavior. These events should be removed in order to facilitate analysis. On the other hand, some events may be so frequent that they too become irrelevant and even undesirable if they hide the behavior one aims to discover. Therefore, this preprocessing can remove events with both very low and very high support.

    3. Consecutive repetitions – Sequence clustering is a means to analyze the transitions between states in a process. If an event is followed by an equal event then it should be considered only once, since the state of the process has not changed. Consecutive repetitions are therefore removed; for example, the sequence A → C → C → D becomes A → C → D.

    4. Sequence length – After the previous preprocessing steps, it may happen that some sequences collapse to only a few events or even to a single event. This preprocessing step provides the possibility to discard those sequences. It also provides the possibility to discard exceedingly long sequences, which can have undesirable effects in the analysis results. Sequence length can therefore be limited to a certain range.

    5. Sequence support – Some sequences may be rather unique, so that they hardly contribute to the discovery of typical behavior. In principle the previous preprocessing steps will prevent the existence of such sequences at this stage but, as with events, sequences that occur very rarely can be removed from the dataset. In some applications such as fault detection it may be useful to actually discard the most common sequences and focus instead on the less frequent ones, so sequence support can also be limited to a certain range.

    The order presented is the order in which the preprocessing steps should be applied, because if the steps are applied in a different order the results may differ. For example, rare sequences should only be removed at the final stage, because previous steps may transform them into common sequences. Imagine we have the rare sequence A → B → C → D, but in step 2 state B is considered to have low support and is removed; it then becomes A → C → D. This new sequence might not be a rare sequence and therefore should not be removed.
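The five steps, applied in this fixed order, can be sketched as a small pipeline. All function names and default thresholds below are illustrative, and events are modeled as simple (name, type) pairs standing in for MXML audit trail entries:

```python
from collections import Counter

def preprocess(instances, min_evt=0.0, max_evt=1.0, min_len=1, max_len=100,
               min_seq=1, max_seq=10**6):
    """Apply the five preprocessing steps, in order, to a list of instances."""
    # Step 1: keep only "complete" events.
    seqs = [[name for name, etype in inst if etype == "complete"] for inst in instances]
    # Step 2: drop events whose relative frequency is outside [min_evt, max_evt].
    freq = Counter(e for s in seqs for e in s)
    total = sum(freq.values())
    keep = {e for e, n in freq.items() if min_evt <= n / total <= max_evt}
    seqs = [[e for e in s if e in keep] for s in seqs]
    # Step 3: collapse consecutive repetitions (A C C D -> A C D).
    seqs = [[e for i, e in enumerate(s) if i == 0 or e != s[i - 1]] for s in seqs]
    # Step 4: keep only sequences within the allowed length range.
    seqs = [s for s in seqs if min_len <= len(s) <= max_len]
    # Step 5: keep only sequences whose number of occurrences is in range.
    support = Counter(tuple(s) for s in seqs)
    return [s for s in seqs if min_seq <= support[tuple(s)] <= max_seq]

log = [[("A", "complete"), ("C", "complete"), ("C", "complete"), ("D", "complete")]]
print(preprocess(log))  # [['A', 'C', 'D']]
```

Reordering these steps changes the outcome, as in the A → B → C → D example above: sequence support must be computed only after event filtering has reshaped the sequences.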


    Figure 3.3: Sequence Clustering plug-in in the ProM framework

    3.4 Implementation within ProM

    The above preprocessing steps and the sequence clustering algorithm have been implemented as a new plug-in for the process mining framework ProM [1], which offers an environment suitable for extension. Figure 3.3 presents a general view of our solution inserted in that environment and is discussed in detail throughout this section. In particular, we address the interaction between the techniques developed and ProM, including the inputs needed and the outputs produced.

    The starting point, as for the majority of the plug-ins in ProM, is an event log. We assume that the log contains a variety of process instances corresponding to one process and that each of these process instances contains a set of audit trail entries. These entries correspond to a given event executed within the process instance and have some attributes, like the name, the type and the entity responsible. An event usually marks the beginning or the end of an activity, so these two concepts are closely related and are therefore used interchangeably throughout this dissertation. The set of all the entries of a given process instance is considered the sequence of events that were performed in that process instance. Different process instances (of the same process) may be composed of different sequences, which represent alternative ways in which the process was executed. The main goal of our solution is to group sequences that are somehow related; with an event log as input we now have several sequences available, so we can start the implementation of the techniques described.


    3.4.1 Preprocessing Stage

    In this stage the log is cleaned of certain elements that might negatively influence the usefulness of the final results. The objective is not to alter the format of the log; it is to prepare the sequences for the sequence clustering algorithm to group them afterwards. The preprocessing stage receives an input log in MXML format [11], which means that the elements we need are already available at a known location. If the log we intend to analyze is not in the format mentioned, a framework called ProM Import1 [31] can be used to restructure and convert the log to the accepted format.

    The other input at this stage is a set of options provided by the user, which specify the parameters to be used in the preprocessing steps described earlier. Figure 3.4 presents a screenshot of this stage and the options can be seen in the top-right corner: (1) the minimum percentage of occurrence of an event, (2) the maximum percentage of occurrence of an event, (3) the minimum size of a sequence, (4) the maximum size of a sequence, (5) the minimum occurrence of a sequence and (6) the maximum occurrence of a sequence. After the user specifies which elements to keep, the sequences present in the log (left in the figure) are altered or removed. The view is then refreshed to show the sequences of the preprocessed log. When implementing this technique and the sequence clustering technique in ProM, the original log is never changed; instead, filters are used to create a new log. Rather than modifying the original log, what filters do is construct a new log based on the original one and on the results produced by those techniques. This is an existing component of ProM that includes different types of filters that can be used prior to any mining tool. To support our techniques new filters were created.

    The result produced at this stage is a filtered log. This log is made available to the ProM framework, so that it may be analyzed with other plug-ins if desired; so instead of acting just as a first stage for sequence clustering, the preprocessing stage can also be used together with other types of analysis available in the framework.

    3.4.2 Sequence Clustering Stage

    The sequence clustering stage receives the filtered log as input from the preprocessing stage, and also the desired number of clusters. In general the plug-in will generate a solution with the provided number of clusters, except when some clusters turn out to be empty. Each cluster can be used again as an event log in ProM, so it becomes possible to further subdivide it into clusters, or for another process mining plug-in to analyze it. These features allow the user to drill down through the behavior of clusters.

    The plug-in provides special functionalities for visualizing the results, both in terms of the sequences that belong to each cluster and in terms of the Markov chain for each cluster.

    In the first type of visualization, depicted in Figure 3.5, the clusters generated are shown (left-hand side), the set of instances present in each cluster is presented (middle) and finally the sequence of events that compose each type of instance can also be inspected (right-hand side). As shown in Figure 3.5, the sequences within each cluster are aggregated according to their types, which means we can immediately identify how many different types of sequence were assigned to a cluster and how many sequences there are of each type. As an example we can see in the figure the inspection of “Cluster 2”, to which fifteen types of process instances were assigned. There are forty instances of the type highlighted in the image, and the sequence of events that compose this type of process instance is the one shown on the right-hand side, formed by six events. This visualization is especially useful to identify the frequency of occurrence for different types of behavior; for example, one can conclude that the last type of process instance seen in the figure, which occurs only once, is a rare sequence of events, probably originating from noise or ad-hoc behavior and therefore not relevant to understand the behavior of the process being analyzed.

    1 Both the schema definition for MXML and the ProM Import framework can be downloaded from www.processmining.org

    Figure 3.4: Preprocessing stage for the Sequence Clustering plug-in

    The second type of visualization available makes this plug-in a mixture between a mining plug-in and an analysis plug-in. On one hand, sequence clustering is a mining plug-in that extracts models of behavior for the different behavioral patterns found in an input event log. Figure 3.6 shows the type of results that the plug-in is able to present. When visualizing the results, the user can adjust thresholds that correspond to the minimum and maximum probability of both edges and nodes (right-hand side of Figure 3.6). This allows the user to adjust what is shown in the graphical model by removing elements (both states and transitions) that are either too frequent or too rare. This feature facilitates the understanding of spaghetti models by taking advantage of the probabilistic nature of sequence clustering, and without having to re-run the algorithm. This can be seen as a post-processing of the cluster models achieved. An example of the usefulness of this feature is the difference between Figure 3.6, which represents all the behavior present in a cluster2, and Figure 3.7, which only represents transitions occurring above the threshold of 0.06. The difference in the complexity of the two models after eliminating less recurrent behavior is noticeable.

    2 This cluster results from the division into two clusters of a log available in the ProM website.
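This threshold-based post-processing amounts to a simple filter over the transition probabilities of a cluster model. A minimal sketch, with a function name and example chain of our own:

```python
def filter_edges(chain, low=0.0, high=1.0):
    """Keep only transitions whose probability lies within [low, high],
    mirroring the edge threshold sliders of the plug-in."""
    return {edge: p for edge, p in chain.items() if low <= p <= high}

# Toy cluster model: transition (state pair) -> probability.
chain = {("a", "b"): 0.892, ("a", "c"): 0.108, ("d", "e"): 0.03}
print(filter_edges(chain, low=0.06))  # drops the 0.03 edge
```

Because this only changes what is displayed, raising or lowering the threshold never requires re-running the clustering algorithm itself.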


    Figure 3.5: Cluster Inspection in the Sequence Clustering plug-in

    Figure 3.6: Cluster model with no threshold


    Figure 3.7: Cluster model with an edge threshold of 0.06

    On the other hand, sequence clustering can also be regarded as an analysis plug-in, since it generates new event logs that can be analyzed by other plug-ins available in the ProM framework. This is also useful for analyzing spaghetti models, which are hard to understand at first but can be made simpler by dividing their complete behavior into a set of clusters that can be analyzed separately by other algorithms.

    3.5 Hierarchical Clustering

    After applying the previous techniques to a set of sequences, the cluster models generated might still be too general, with no pattern clearly identified; in such cases reapplying sequence clustering to the clusters can be useful to extract valuable information from the models. From each cluster originated with sequence clustering a log is constructed. These logs are smaller than the one that served as input, have a set of sequences assigned by the sequence clustering algorithm, and can be viewed and used as regular logs.

    Therefore we can re-apply sequence clustering as many times as needed, getting smaller and smaller logs that have fewer and increasingly similar sequences (less variation within a cluster). This eventually leads to the discovery of some behavioral pattern in the cluster models.
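The hierarchical re-application can be sketched as a short recursion, assuming a clustering step and a user-defined stopping criterion (both of which are hypothetical placeholders here, not the plug-in's actual interface):

```python
def hierarchical(seqs, cluster, too_diverse, k=2):
    """Re-apply `cluster` to any group of sequences that is still too diverse."""
    if not too_diverse(seqs):
        return [seqs]                      # this cluster is homogeneous enough
    result = []
    for sub in cluster(seqs, k):           # each sub-cluster is a smaller log
        result.extend(hierarchical(sub, cluster, too_diverse, k))
    return result

# Toy usage: group by first event until every group shares one.
split = lambda seqs, k: [[s for s in seqs if s[0] == f] for f in {s[0] for s in seqs}]
diverse = lambda seqs: len({s[0] for s in seqs}) > 1
print(hierarchical(["ab", "ac", "bd"], split, diverse))
```

Each recursive call operates on a smaller log, so the drill-down stops as soon as the stopping criterion declares a cluster homogeneous.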

    3.6 Conclusion

    In this chapter the techniques that form our approach were presented. These techniques were implemented in ProM and generate different types of output. The most important outputs are the cluster models in the form of Markov chains, which given their probabilistic and simple nature allow the user to understand the patterns present in a cluster.

    The techniques implemented are of great value when analyzing complex logs. Before applying the sequence clustering algorithm, the preprocessing stage can be used as a way to eliminate unnecessary behavior that has no relevance to the execution of the process. After running the sequence clustering algorithm, if the models generated are still confusing the algorithm can be re-applied hierarchically on the clusters. Using the probabilistic nature of sequence clustering and Markov chains is also important to enhance the quality of the models generated. To accomplish this, the thresholds relating to node and edge occurrence can be adjusted so that the model only presents the behavior relevant to the user's goals.

    After presenting the approach and its implementation, in the next chapter we perform some experiments to understand how the techniques can be used and the value obtained by using them.



    CHAPTER 4

    Experiments and Evaluation

    In this chapter we present three experiments made to demonstrate the capabilities of our solution and the ways in which the techniques can be applied. The first case relates to an issue handling process, where many connections exist between the activities, and the second case relates to a hospital treatment process, where there are several behavioral variations that are difficult to visualize as a whole. In both cases sequence clustering and preprocessing techniques are used to enable the visualization of behavioral patterns and the extraction of useful conclusions about the execution of the processes. The third experiment relates to a telephone repair process and, unlike the previous two, the objective is to evaluate the sequence clustering algorithm by comparing it with other clustering methods using a set of metrics.

    4.1 Issue Handling Process

    This experiment is based on the case study presented in [30] and was adapted so as to better demonstrate the use of the techniques that were implemented in ProM. The main characteristic of this experiment is that the order in which the activities are executed tends to be constantly changing. This is a typical real-world type of problem that complicates the analysis of process models, and therefore is a valuable way to demonstrate how this sequence clustering based solution performs in these situations.

    The experiment involves an issue handling process, which begins with a customer reporting a problem; a set of activities is then executed to resolve that problem. Figure 4.1 shows the Markov chain generated for the initial log without dividing it into clusters. Although the number of events is small, the numerous connections between those events complicate the analysis of the model and no pattern of behavior can be clearly identified.

    Figure 4.1: Model for the initial log of the issue handling process

    There are 5 different events present in the log: (1) New – the problem reported enters the system and is registered as being a new problem; (2) Assigned – the problem is assigned to a person; (3) Open – the resolution of the problem begins; (4) Resolved – the problem has been solved; (5) Duplicated – the problem has been reported in the past and the same resolution is applied. The sequences formed by these events and present in the log are shown in Figure 4.2; by analyzing them we can conclude that the last 3 sequences have a low occurrence when compared with the others. Low occurrences like this usually point to ad-hoc behavior, noise or some kind of mistake made when executing a process, all of which are irrelevant to understanding the usual patterns associated with the execution of a process. As an example, the last sequence shown in Figure 4.2 does not make logical sense, since the event “Open” occurred after the issue was considered duplicated and consequently handled. Given the irrelevance of these sequences we start handling this case by eliminating them in the preprocessing stage.

    To accomplish this, the preprocessing option concerning the frequency of occurrence of sequences is used. After setting the minimum occurrence of a sequence to six, the sequences are removed. The sequences present in the filtered log can now be divided into clusters. After experimenting with different numbers of clusters, we obtained some understandable results by dividing the sequences into four clusters. Figures 4.3 to 4.6 present the models of the clusters generated.

    A significant improvement of the models can be immediately identified, especially when considering the number of possible transitions between states. Another important fact about sequence clustering is visible in these models, concerning the difference in complexity between the cluster models generated. This is a result of the diversity present in the logs and of the grouping executed by the algorithm. There are clusters with more different types of instances than others.

    By further analyzing the cluster models we can understand the different types of behavior present in the log.

    Cluster 1 represents a specific type of behavior where the process is opened before being assigned, and in some cases is not opened at all; this usually points to a simple problem that was immediately resolved by the person who registered the issue.


    Figure 4.2: Sequences present in the log of the Issue Handling Process

    Figure 4.3: Cluster 1: Issue Handling Process

    Figure 4.4: Cluster 2: Issue Handling Process

    Figure 4.5: Cluster 3: Issue Handling Process


    Figure 4.6: Cluster 4: Issue Handling Process

    Figure 4.7: Cluster 3.1: Issue Handling Process

    Cluster 2 represents the opposite type of behavior, where the assignment is done before the opening. Another fact visible in this model is that the problem was not registered as a New problem, and it should have been as soon as it entered the system.

    Cluster 4 is a clear example of the above-mentioned complexity difference that sometimes occurs between clusters; in this case there was only one type of sequence assigned to the cluster.

    On the other hand, Cluster 3 is the hardest one to analyze, especially due to the existence of a loop between the states New, Assigned and Open. Several options are available to handle this type of cluster model: (1) rerun the algorithm with a different number of clusters, which is the most drastic option and is not guaranteed to improve the results; (2) apply the sequence clustering algorithm hierarchically, which means further subdividing the cluster; or (3) use the probabilistic capabilities of sequence clustering, by adjusting the thresholds discussed in Section 3.4.2. These thresholds will remove certain types of behavior (for example transitions occurring with low probability) and will simplify the cluster models. In this case we chose option 2, mainly because, other than the transition between Open and Duplicated, the rest of the elements (both nodes and transitions) are frequent enough to be relevant. Therefore, to avoid losing important behavior we applied sequence clustering hierarchically, dividing the cluster into two clusters.

    Cluster 3.1, presented in Figure 4.7, is now easier to understand and represents the most common behavior, which is analyzing the issue and labeling it as a new problem. After that the issue is assigned to a person, who opens the process of resolution and resolves the problem. Also present in this cluster is a less frequent behavior where only after beginning the resolution of the problem is it identified as Duplicated and therefore resolved.

Cluster 3.2, presented in Figure 4.8, is another example of a cluster with only one type of sequence, which corresponds to a behavior where the resolution is opened before the problem is identified as a new problem and before the assignment of that problem to a team member.

With this small experiment we demonstrated the use of the techniques and the capabilities of sequence clustering. Starting from a confusing model, we were able to partition the log and extract cleaner models. In these models the different patterns of behavior were identified and


  • Figure 4.8: Cluster 3.2: Issue Handling Process

conclusions regarding the significance of those behaviors (what they meant in the context of the issue handling process) were drawn.

    4.2 Patient Treatment Process

Process mining is a useful tool when studying the execution of processes, especially when dealing with environments where a process can be executed in several different ways. One such environment is the clinical domain, where the technical requirements of some tasks and the importance of having up-to-date information have increased the number of information systems across the units of a hospital. These systems are responsible for a great deal of the processes executed in a hospital and are the key component when process optimization is pursued. Before optimizing there is a need to understand how processes are being executed and supported; however, the analysis of processes becomes complicated, especially due to process variations, the lack of rigid clinical pathways, and the execution of a process across different units and systems. The inherent difficulties of clinical workflows have made them an interesting subject of investigation [32, 33, 34].

Some of the challenges of analyzing the execution of clinical processes can be addressed by using the principles of process mining, in order to extract process models, understand the structure of a process and obtain other knowledge provided by process mining techniques, such as social networks.

For this experiment we developed an event log based on high-level activities usually executed in the treatment of patients. The log contains common procedures followed from the arrival of the patient to the discharge. We also introduced some noise and incomplete executions of a process to better depict the real scenarios that occur in hospitals. The objective is to employ the techniques implemented to understand the different types of executions present in the log and the context in which they occur.

Figure 4.9 shows the Markov chain for the entire event log. There are a total of sixteen events with several connections between them. By analyzing the model we can conclude that the first event is always the arrival of the patient, but other than that it is not clear what kind of patterns the execution of the process follows.


• Figure 4.9: Model for the initial log of the Patient Treatment Process

• Letter   Event
    A      Arrival at Hospital
    B      Check Patient Identity
    C      Register Patient
    D      Fill out Patient History
    E      Analyze Patient Records
    F      Access Patient Condition
    G      Admit Patient
    H      Administer Treatment
    I      Discharge Patient
    J      Refer to Family Doctor
    L      Order Tests
    M      Analyze Results
    N      Refer to Specialist
    O      Transfer Patient
    P      Contact Family
    Q      Contact Authorities

    Table 4.1: Correspondence between letters and events

To improve the comprehension of the log, the first action taken was to analyze the occurrence of the different events (Figure 4.10). To facilitate the visualization of some figures we represent the events with a letter; the correspondence between each letter and each event is shown in Table 4.1. In Figure 4.10 it is clear that some events (namely the last four) rarely occur and can therefore be eliminated in the preprocessing stage, considering that the recurrent patterns of behavior are what we intend to determine.

After determining the most relevant events we turned our attention to the sequences recorded in the log. In Figure 4.11 we present the occurrences of the different sequences. Even with the removal of rare events there are unusual process instances, mainly due to incomplete executions of a process or to the execution of events in unusual orders. Given the occurrences of the sequences, we established that it would be acceptable and helpful to remove sequences occurring in less than 1% of the cases; therefore the preprocessing parameter for the minimum sequence occurrence was set to 20 (from the initial 2000 instances, 1926 were kept).
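The minimum-occurrence filter just described can be sketched as follows. This is a simplified stand-in for the preprocessing plug-in, with illustrative names: count how often each distinct sequence occurs, then keep only the instances of sufficiently frequent sequences.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the minimum-sequence-occurrence preprocessing step: keep
// only process instances whose event sequence occurs at least
// minOccurrence times in the log.
public class SequenceFilter {

    public static List<List<String>> filter(List<List<String>> instances,
                                            int minOccurrence) {
        // Count how many times each distinct sequence occurs.
        Map<List<String>, Integer> counts = new HashMap<>();
        for (List<String> seq : instances) {
            counts.merge(seq, 1, Integer::sum);
        }
        // Keep only instances of sufficiently frequent sequences.
        List<List<String>> kept = new ArrayList<>();
        for (List<String> seq : instances) {
            if (counts.get(seq) >= minOccurrence) {
                kept.add(seq);
            }
        }
        return kept;
    }
}
```

With `minOccurrence` set to 20 this would reproduce the filtering step above, reducing the 2000 instances to the 1926 that were kept.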

Finally, after preprocessing the process instances, the sequence clustering algorithm was applied with the number of clusters set to three. Figures 4.12 to 4.14 present the cluster models obtained. The comprehensibility of the models increased and conclusions were drawn about the significance of the behavior present in each cluster.
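At its core, the assignment step of sequence clustering computes, for each cluster, the probability of an instance's sequence under that cluster's first-order Markov chain and assigns the instance to the most likely cluster. The sketch below is illustrative, not the plug-in's actual code; log-probabilities are used to avoid numerical underflow and a small smoothing constant stands in for unseen transitions, both of which are assumptions.

```java
import java.util.List;
import java.util.Map;

// Sketch of the cluster-assignment step of sequence clustering: each
// cluster is a first-order Markov chain, and an instance is assigned
// to the cluster under which its sequence is most probable.
public class ClusterAssigner {

    // chain.get(a).get(b) = P(next = b | current = a); missing entries
    // are treated as a small smoothing probability.
    static double logLikelihood(List<String> seq,
                                Map<String, Map<String, Double>> chain) {
        double logP = 0.0;
        for (int i = 0; i + 1 < seq.size(); i++) {
            double p = chain
                .getOrDefault(seq.get(i), Map.of())
                .getOrDefault(seq.get(i + 1), 1e-6); // smoothing constant
            logP += Math.log(p);
        }
        return logP;
    }

    public static int assign(List<String> seq,
                             List<Map<String, Map<String, Double>>> clusters) {
        int best = 0;
        double bestLogP = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < clusters.size(); k++) {
            double logP = logLikelihood(seq, clusters.get(k));
            if (logP > bestLogP) {
                bestLogP = logP;
                best = k;
            }
        }
        return best;
    }
}
```

In the full algorithm this assignment alternates with re-estimating each cluster's transition probabilities until the clusters stabilize.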


  • Figure 4.10: Events present in the log of the Patient Treatment Process

    Figure 4.11: Sequences present in the log of the Patient Treatment Process


• Figure 4.12: Cluster 1: Patient Treatment Process

    Figure 4.13: Cluster 2: Patient Treatment Process

    Figure 4.14: Cluster 3: Patient Treatment Process

• In Cluster 1 the first conclusion is that after the arrival at the hospital the identity of the patient is checked. If it is the first time the patient comes to the hospital, he is registered and his medical history is filled out; otherwise the existing patient records are analyzed. After this, the condition of the patient is examined and either the patient is admitted or referred to a specialist. In either case the situation is probably not simple. We can also conclude that tests are often needed to determine the condition of the patient.

The main behavioral difference between Cluster 1 and Cluster 2 is that in the latter the patient is not admitted. This indicates simpler problems that can be treated without the patient staying in the hospital. Another important fact that stands out from this model is that after analyzing the results of some medical test there is on certain occasions the need to order more tests. This can indicate the need to further analyze the situation with other tests, but it can also indicate errors in the choice of the tests ordered or even in the execution of those tests.

Finally, the main characteristic of the behavior present in Cluster 3 can be seen in the events beginning the process. Instead of performing the usual identity analysis, the patient's condition is immediately checked or, in some cases, the patient is immediately treated. This pattern suggests the presence of emergency situations, where treating the patient as fast as possible takes priority. The fact that the process always ends with the patient being admitted and later discharged also suggests that the condition of the patient was serious.

This experiment has shown the value of using our solution in flexible and complex environments. The importance of the preprocessing stage is noticeable, and the sequence clustering algorithm proved to perform an effective division of the log, given that the instances assigned to each cluster followed a specific pattern of behavior.

4.3 Telephone Repair Process: Comparing Clustering Methods

In the previous experiments the goal was to understand the importance of sequence clustering in process mining and to demonstrate the application of the techniques. In this third experiment the goal is to evaluate the sequence clustering algorithm in comparison with other clustering methods available in ProM.

For this purpose we use the Telephone Repair Process from the ProM tutorial [35], also available on the website of the framework. The process starts by registering a telephone device sent by a customer. After that the Problem Detection department analyzes and categorizes the defect. The problem is then sent to the Repair department where one of two teams will handle it, depending on whether it is a simple problem or a complex one. After the repair is finished, the Quality Assurance department makes sure that the problem was indeed fixed. If it was not fixed, the problem is sent back to the Repair department; otherwise the case is archived and the telephone is sent to the customer. If after a certain number of tries the problem is still not fixed, the case is archived and a new telephone is sent to the customer. Twelve different activities can be found in the log that registers the executions of this process. Different types of process instances


• Cluster i    Method 1          Method 2          Method 3
               ni    ci   si     ni    ci   si     ni    ci   si
    0          271   32   14     156   23   15     302    8    7
    1          213   13   10     213   13   10     249   13    9
    2          299    7    5     211    8    6     269   12    9
    3          321   21   14     524    2    2     284    3    3
    Sum       1104   73   43    1104   46   33    1104   36   28

Table 4.2: Complexity metrics of the process models from the clusters generated by the three different clustering methods

are present in the log, mainly due to the flexibility of the process when considering the order in the execution of activities.

The objective of clustering algorithms when applied to process mining is to ease the discovery of process models by grouping together instances that follow similar execution patterns. Therefore, to evaluate the quality of the results we analyze the process models discovered for the clusters. Better formed clusters tend to group instances such that the cluster models are less complex.

    The following clustering methods are studied:

    • Method 1: K-means Clustering

    • Method 2: Agglomerative Hierarchical Clustering

    • Method 3: Sequence Clustering

The first two methods are available in the trace clustering approach present in ProM, and the last method is the sequence clustering algorithm implemented. The objective of this experiment is to compare clustering algorithms; therefore, when testing the sequence clustering algorithm the preprocessing stage was not used, making the test conditions equal for the three algorithms. The method of evaluation followed to compare the results of the different clustering methods is based on the methodology found in [22]. The first step of this methodology is to apply each of these methods to the set of instances present in the telephone repair process event log. Then the α-algorithm discussed earlier is used to generate models for the clusters formed, and finally the Petri-Net Complexity Analysis plug-in, also available in ProM, is used to evaluate the complexity of the resulting models. This plug-in generates metrics such as the number of control flows, and-joins, and-splits, xor-joins and xor-splits. When the value of these metrics is higher the models are more complex; therefore better formed clusters originate models with lower values for these metrics. The relationship between these metrics and the comprehensibility of models has been studied in [36].

The results achieved in this experiment are presented in Table 4.2, in which ni denotes the number of instances in cluster i, ci denotes the number of control flows in the process model of cluster i, and si denotes the sum of and/xor joins/splits. When comparing the results, we can conclude that the sequence clustering algorithm has the best performance, generating clusters with less complex models. The agglomerative hierarchical clustering has the next best performance and K-means


• clustering has the worst performance amongst the three. The direct clustering of sequences and the probabilistic nature that characterize the sequence clustering algorithm prove to be valuable when clustering sequences in the domain of process mining, where the execution of a process usually presents several variations.

In this experiment the α-algorithm was used to generate the process models; however, other mining techniques that generate process models capable of being studied by the metrics employed could be used. Also, other clustering methods could be compared with sequence clustering. K-means and agglomerative clustering were chosen for being amongst the most studied in the literature related to our work and also because the only input needed is the number of clusters (as with the sequence clustering algorithm).

In this chapter three experiments were presented, with the intent of demonstrating the use of our approach and understanding the usefulness of its application. The first two experiments consisted of two types of processes with characteristics that can usually be found in common real-world situations. These situations are known to cause difficulties in the analysis of processes, and it was shown that our approach is capable of dealing with them, making it possible to perform a valuable analysis. In the third experiment a known log from the ProM tutorial was used to compare the results achieved with different clustering methods. The sequence clustering algorithm proved to be a good choice to apply in the process mining area. After the results achieved in these experiments, the approach was ready to be applied in a real-world scenario.


  • CHAPTER 5

    Case Study: Application Server Logs

The previous chapter allowed us to understand the usefulness of applying sequence clustering techniques to confusing processes, as well as the methodology used to apply both the preprocessing technique and the sequence clustering technique. In this chapter we demonstrate the use of our solution in a real-world situation, to understand the benefits of applying these techniques to real logs.

    5.1 Case study description

Public administration and public services often have large-scale IT systems that serve thousands of users. These systems are usually backed by an infrastructure that involves replication, redundancy and load-balancing. Due to the large number of replicated software applications and the large number of simultaneously connected users, it becomes exceedingly difficult to determine the cause of some malfunctions that produce instabilities that propagate across the system and negatively affect the experience of several users at the same time.

We present one such case study based on the experience at a public institution. At the time, the institution was struggling with complaints about a situation in which the applications would freeze or crash unexpectedly for several users at the same time. The applications are Java-based and were developed according to a client/server architecture where the end users have a fat client and the back-end was implemented as a set of Enterprise JavaBeans hosted in an application server that has been replicated across a server farm.

Figure 5.1 illustrates the system infrastructure. There are three server machines, each one running ten instances of an application server where the server-side application code is installed. The server-side application code connects to and operates over a large, common database. All


  • Figure 5.1: System infrastructure of a public institution

application server instances share a common database connection pool, so that there is a limited number of database connections at any point in time. When users place requests through their client-side applications, the request is forwarded to one of the three server machines according to the policy configured in the load-balancer. When the request reaches a machine, an internal load-balancer (not shown) assigns the request to one of the available application server instances.

When a client-side application froze or crashed it was possible to determine that the application server instance that was servicing that client had indeed thrown one or more exceptions. But it was unclear whether these exceptions were the cause or the effect of another problem. Besides, this behavior seemed to occur in several instances at the same time, i.e. at some point in time several instances would become unstable, making all their clients freeze or crash. These instabilities corresponded to bursts of exceptions being thrown on the server side during a period of time until the application server instance was automatically restarted.

In general, every application server instance is constantly logging its own debug messages and exceptions, most of which are handled correctly by the server-side application code or, in the worst case, by the application server itself. During these periods of instability it was noticeable that different patterns of exceptions emerged. However, since they were mixed up with normal behavior, and since the amount of exceptions being recorded at all times is overwhelming, it was


  • server.log_2009-03-13T09-40-29 2009-03-12T19:12:54.135+0000 _ThreadID=23; WSException

    server.log_2009-03-13T09-40-29 2009-03-12T19:13:18.145+0000 _ThreadID=23; BusinessException

    server.log_2009-03-13T09-40-29 2009-03-12T19:13:18.145+0000 _ThreadID=173; SystemException

    server.log_2009-03-13T09-40-29 2009-03-12T19:14:20.189+0000 _ThreadID=21; BusinessException

    server.log_2009-03-13T09-40-29 2009-03-12T19:14:20.189+0000 _ThreadID=21; EmptyResultException

    server.log_2009-03-13T09-40-29 2009-03-12T19:18:34.128+0000 _ThreadID=17; EmptyResultException

    server.log_2009-03-13T09-40-29 2009-03-12T19:18:34.128+0000 _ThreadID=17; NoRecordsFoundException

    server.log_2009-03-13T09-40-29 2009-03-12T19:19:00.785+0000 _ThreadID=155; Exception

    server.log_2009-03-13T09-40-29 2009-03-12T19:19:00.785+0000 _ThreadID=155; NoRecordsFoundException

    server.log_2009-03-13T09-40-29 2009-03-12T19:20:20.339+0000 _ThreadID=410; AFException

    server.log_2009-03-13T09-40-29 2009-03-12T19:21:09.291+0000 _ThreadID=172; ExtServiceException

    server.log_2009-03-13T09-40-29 2009-03-12T19:21:09.292+0000 _ThreadID=172; BusinessLogicException

    Figure 5.2: Application Server Logs Snapshot

quite difficult to establish any causal relationship between those exceptions. We therefore turned our attention to using sequence clustering as a way to identify and characterize the different patterns of exceptions that occurred during periods of instability.

    5.2 Log Structure

The first step was to understand the structure of the logs generated by the application servers, in order to be able to convert them to the MXML format accepted by ProM. In Figure 5.2 a small portion of those original logs is presented. Each line corresponds to a different event and has four associated elements. The first element is the name of the file that contains the entry for this event; to control the size of the logs, a new file is created each time the log reaches 50MB. The second element corresponds to the timestamp of the event and the third element identifies the thread in which the event was executed. Finally, the fourth element corresponds to the name of the event. The application server logs contain different types of events. For the purpose of this case study we only consider the events related to exceptions that occurred during runtime.

To convert this type of log to the MXML format, the name and the timestamp of the event were directly converted to the corresponding fields. The threads were then used to form the process instances. Each thread contains a sequence of events, as does a process instance; therefore, for each thread a process instance was created.
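The conversion just described can be sketched as follows, parsing each log line into its four elements and grouping events by thread identifier. This is a simplified stand-in for the converter we implemented, and the parsing assumes the line layout shown in Figure 5.2.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: group application server log lines into process instances
// by thread id. Line format (see Figure 5.2):
//   <file> <timestamp> _ThreadID=<id>; <EventName>
public class LogConverter {

    public static Map<String, List<String>> groupByThread(List<String> lines) {
        Map<String, List<String>> instances = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            // parts[0] = file name, parts[1] = timestamp,
            // parts[2] = "_ThreadID=<id>;", parts[3] = event name
            String thread = parts[2].replace("_ThreadID=", "").replace(";", "");
            String event = parts[3];
            instances.computeIfAbsent(thread, t -> new ArrayList<>()).add(event);
        }
        return instances;
    }
}
```

Each entry of the resulting map (one per thread) then becomes one process instance in the MXML output.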

After this conversion we attempted to analyze the log with different techniques available in ProM, and we realized that the overwhelming size of the process instances (in terms of sequence length) was affecting the performance of those techniques, causing most of them to crash (such as the heuristics miner). With further analysis of the instances recorded in the logs, we were able to determine that one of the causes of their abnormal size was the repetition of events in the same instance. To address this problem, we returned to the original logs and, instead of directly transforming each thread into a process instance, we split each thread into a different instance whenever a repetition of an event occurred. An example of such an operation is splitting the sequence “ABCDBCDA” into the sequences “ABCD” and “BCDA”. The result was a larger set of sequences with more acceptable sizes, preserving the order in which events occurred.
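The splitting rule can be sketched as: walk the thread's sequence and, whenever the next event has already occurred in the current fragment, close the fragment and start a new one. The sketch below uses single-character event names as in the example above; it is illustrative, not the actual converter code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: split a thread's event sequence into shorter instances
// whenever an event repeats, preserving the original event order.
public class InstanceSplitter {

    public static List<String> split(String sequence) {
        List<String> fragments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Set<Character> seen = new HashSet<>();
        for (char event : sequence.toCharArray()) {
            if (seen.contains(event)) {
                // Repetition: close the current fragment, start a new one.
                fragments.add(current.toString());
                current.setLength(0);
                seen.clear();
            }
            current.append(event);
            seen.add(event);
        }
        if (current.length() > 0) {
            fragments.add(current.toString());
        }
        return fragments;
    }
}
```

Applied to "ABCDBCDA" this yields "ABCD" and "BCDA", matching the example in the text.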


• With this set of instances the process mining techniques started to produce some results. Figure 5.3 depicts the result of an attempt to analyze them using the heuristics miner [37]. Even with an extensive analysis of the figure, the attempt to identify some kind of pattern in the way exceptions occurred proved to be pointless. One thing that stands out in the model is that the event “Exception” (the event on the top) occurs frequently, preceded and succeeded by many different events. The next step towards understanding the behavior present in the log was to apply our approach.

    5.3 Preprocessing stage

A first attempt to cluster the log did not result in the identification of any useful patterns, so we turned to the preprocessing technique. When applying this technique with different parameters, two types of behavior that were affecting the identification of patterns were identified. The first corresponds to rare behavior, mainly due to noise or ad-hoc behavior. These types of events can be removed in order to improve the comprehensibility of the results. The second is composed of recurrent events like the one mentioned earlier (“Exception”). These events occurred in most of the process instances and at different moments (followed and preceded by most of the remaining events), which consequently complicated the identification of relations between different instances. Figure 5.4 presents the exception-type events present in the log and their frequencies. By analyzing the figure, the types of behavior mentioned can be clearly identified. At the top of the graph there are events with a high occurrence (especially the first two events) and in the bottom part of the graph there are several events that rarely occur.

By analyzing this graph and the results of the first attempt to cluster the whole log, we concluded that in order to identify some kind of pattern we had to analyze the log from different angles, using the preprocessing technique to keep only some of the events. By using the maximum and minimum event occurrence options, we separated the most recurrent behavior, the middle recurrent behavior and the least recurrent behavior. After this we applied the sequence clustering algorithm in order to understand what kind of patterns guide the execution of these three different categories of events. Given that two of these categories correspond to the highly recurrent or highly rare behavior mentioned earlier, we paid special attention to the middle recurrent behavior, where we expected to find the behavioral patterns we were searching for.
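The separation by occurrence can be sketched as a pair of thresholds over the per-event counts: only events whose total count falls within the band are kept. The names below are illustrative, not the plug-in's actual interface.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: select the events whose total number of occurrences in the
// log lies between a minimum and a maximum threshold, so that highly
// recurrent, middle recurrent and rare events can be analyzed separately.
public class FrequencyBands {

    public static Set<String> eventsInBand(List<List<String>> instances,
                                           int min, int max) {
        // Count total occurrences of each event across all instances.
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> seq : instances) {
            for (String event : seq) {
                counts.merge(event, 1, Integer::sum);
            }
        }
        // Keep events whose count lies within [min, max].
        Set<String> band = new HashSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= min && e.getValue() <= max) {
                band.add(e.getKey());
            }
        }
        return band;
    }
}
```

Running this three times with non-overlapping bands would produce the three categories of events analyzed in the text.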

When analyzing the least and the middle recurrent behavior, a certain amount of caution must be taken, because the most recurrent behavior might be essential to the execution of a process. Therefore this should be done carefully and only if the situation calls for it, as in this case, where a specific type of behavioral pattern is what we want to identify and, for that, a reduced level of accuracy is acceptable.

    5.4 Sequence Clustering results

After the previous preprocessing steps the sequence clustering algorithm was applied, along with the possibility of visually adjusting the cluster models according to certain thresholds. It was


• [Figure 5.3 (fragment): heuristics net mined from the application server log, showing exception events with their frequencies (Exception: 187, EstabelecimentoNotFoundException: 187, GREJBPersistencyException: 179, PGWSException: 168, ITPTExternalServiceException: 183, SIPSCNoRecordsFoundException: 160, PessoaSingularNotFoundException: count truncated) connected by arcs labeled with dependency values]