Advanced Techniques for Mining Structured Data: Process Mining
PROM
Dr A. Appice
Scuola di Dottorato in Informatica e Matematica XXXII
Software
• Download and install PROM 6.6
• Requires JRE 7 (64-bit or 32-bit)
Repair.xes
Import a log file
1. How many cases (or process instances) are in the log?
2. How many tasks (or audit trail entries) are in the log?
3. How many originators are in the log?
4. Are there running cases in the log?
5. Which originators work in which tasks?
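The five questions above can also be answered with a few lines of code once the log is in memory. The following is a minimal sketch on a hypothetical toy log, not on Repair.xes itself; the task names, the originators and the rule that a case is finished once "Archive Repair" completes are assumptions made for illustration.

```python
from collections import defaultdict

# Toy event log (hypothetical data): (case id, task, event type, originator).
LOG = [
    ("1", "Register", "complete", "System"),
    ("1", "Analyze Defect", "start", "Tester3"),
    ("1", "Analyze Defect", "complete", "Tester3"),
    ("1", "Archive Repair", "complete", "System"),
    ("2", "Register", "complete", "System"),
    ("2", "Analyze Defect", "start", "Tester2"),
]

# Assumption: a case is finished once this task has completed.
FINAL_TASK = "Archive Repair"


def summarize(log):
    """Answer questions 1 to 5 for a log given as a list of event tuples."""
    traces = defaultdict(list)
    for case, task, etype, who in log:
        traces[case].append((task, etype, who))
    # A case is "running" if its last event is not the completion of FINAL_TASK.
    running = sorted(c for c, t in traces.items()
                     if t[-1][:2] != (FINAL_TASK, "complete"))
    who_does_what = defaultdict(set)
    for _, task, _, who in log:
        who_does_what[who].add(task)
    return {
        "cases": len(traces),
        "events": len(log),
        "originators": len(who_does_what),
        "running_cases": running,
        "tasks_by_originator": {w: sorted(t) for w, t in who_does_what.items()},
    }


summary = summarize(LOG)
```

In PROM the same figures are shown on the log summary screen after the import.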
Import File log
Cleaning the Log
• Since our focus is on the process as a whole, we will base our analysis on the completed process instances only.
• It does not make much sense to talk about the most frequent path if it is not complete, or reason about throughput time of cases when some of them are still running. In short, we need to pre-process (or clean or filter) the logs.
Cleaning the Log
• A log can be filtered by applying the provided Log Filters.
1. Select «use resource»
2. Actions that only take a log as input are colored green
Cleaning the Log
• A log can be filtered by applying the provided Log Filters.
3. Select the “Filter Log using Simple Heuristics” action and select the “Start” button. This will start the “Filter Log using Simple Heuristics” action on the example log. This action combines a number of log filters, that can be configured individually using a wizard.
Cleaning the Log
• The first log filter to configure is the event type filter, which allows us to select the type of events (or tasks or audit trail entries) that we want to consider while mining the log. For our running example, the log has tasks with two event types: complete and start (lifecycle:transition).
• To keep all tasks of a certain event type, select the option “keep”.
• To omit the tasks with a certain event type from a trace, select the option “remove”.
• To discard all traces with a certain event type, select the option “discard instance”. This last option may be useful when you have aborted cases etc.
• Options can be selected by clicking on an event type. When done, select the “Next” button.
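The three options of the event type filter can be sketched as follows. This is an illustration of the behaviour described on the slide, not PROM's implementation; the function name, the data layout and the "abort" event type are hypothetical.

```python
def filter_event_types(traces, actions):
    """Apply 'keep' / 'remove' / 'discard instance' per event type.

    traces:  {case id: [(task, event type), ...]}
    actions: {event type: 'keep' | 'remove' | 'discard instance'}
    Event types that are not listed in `actions` are kept.
    """
    result = {}
    for case, events in traces.items():
        # 'discard instance': drop the whole trace if it contains the type.
        if any(actions.get(etype, "keep") == "discard instance"
               for _, etype in events):
            continue
        # 'remove': drop only the events of that type, keep the trace.
        result[case] = [(task, etype) for task, etype in events
                        if actions.get(etype, "keep") != "remove"]
    return result


example_traces = {
    "1": [("Analyze Defect", "start"), ("Analyze Defect", "complete")],
    "2": [("Analyze Defect", "start"), ("Repair", "abort")],
}
filtered = filter_event_types(
    example_traces, {"complete": "remove", "abort": "discard instance"})
```

Here case 2 disappears entirely because it contains an aborted event, while case 1 keeps only its start event.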
Cleaning the Log
• e.g. remove the events with “complete”
Cleaning the Log
• The second filter is the start event filter, which filters the log in such a way that only the traces (or cases) that start with the indicated tasks are kept.
• The slider at the bottom allows us to select the most frequent start events. For example, if this slider is set to “80%”, then the most frequent start events will be selected until at least 80% of the traces are covered. As all traces (after removing events with “complete”) start with “Analyze defect+start”, this step is straightforward. When done, select the “Next” button.
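The slider logic can be sketched as a greedy selection over start-event frequencies. This is a minimal sketch under the assumption that events are selected in descending order of frequency; how PROM breaks ties is not stated on the slide, and the function names and toy traces are hypothetical.

```python
from collections import Counter


def select_start_events(traces, coverage):
    """Pick the most frequent start events until `coverage` of traces is reached."""
    starts = Counter(events[0] for events in traces.values() if events)
    total = sum(starts.values())
    selected, covered = set(), 0
    for event, count in starts.most_common():
        if covered >= coverage * total:
            break
        selected.add(event)
        covered += count
    return selected


def filter_by_start(traces, selected):
    """Keep only the traces whose first event is one of the selected events."""
    return {c: ev for c, ev in traces.items() if ev and ev[0] in selected}


toy = {
    "1": ["Analyze Defect+start", "Repair+start"],
    "2": ["Analyze Defect+start", "Test Repair+start"],
    "3": ["Analyze Defect+start"],
    "4": ["Analyze Defect+start", "Repair+start"],
    "5": ["Test Repair+start"],
}
chosen = select_start_events(toy, 0.8)
kept = filter_by_start(toy, chosen)
```

With the slider at 80%, the single most frequent start event already covers four of the five toy traces, so the fifth trace is filtered out.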
Cleaning the Log
• e.g. remove the traces that do not start with “Analyze defect + start”
Cleaning the Log
• The third filter is the end event filter, which filters the log in such a way that only the traces (or cases) that end with the indicated tasks are kept.
Cleaning the Log
• The fourth filter is the event filter, which filters all unselected events from the log.
Cleaning the Log
• To save the cleaned log, use the “Export to disk” button in the main workspace
Mining control-flow perspective
The control-flow perspective of a process establishes the dependencies among its tasks.
• Which tasks precede which other ones?
• Are there concurrent tasks?
• Are there loops?
• In short, what is the process model that summarizes the flow followed by most/all cases in the log?
This information is important because it gives you feedback about how cases are actually being executed in the organization.
Alpha algorithm
b → c: NO, since both b > c and c > b hold (b is sometimes directly followed by c, and c by b)
b || c: YES, since both b > c and c > b hold
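The ordering relations used by the Alpha algorithm can be derived from the "directly follows" relation (x > y). The sketch below computes them for a small log; the three traces and the function name are illustrative assumptions, and the sketch covers only this first step, not the construction of the Petri net.

```python
def footprint(traces):
    """Return the ordering relation for every ordered pair of activities.

    '->' causality   (x > y and not y > x)
    '<-' reverse causality
    '||' parallel    (x > y and y > x)
    '#'  choice      (neither x > y nor y > x)
    """
    activities = sorted({a for t in traces for a in t})
    # x > y: x is directly followed by y in at least one trace.
    follows = {(t[i], t[i + 1]) for t in traces for i in range(len(t) - 1)}
    relation = {}
    for x in activities:
        for y in activities:
            if (x, y) in follows and (y, x) in follows:
                relation[x, y] = "||"
            elif (x, y) in follows:
                relation[x, y] = "->"
            elif (y, x) in follows:
                relation[x, y] = "<-"
            else:
                relation[x, y] = "#"
    return relation


# Each trace is a string of single-letter activities (toy data).
fp = footprint(["abcd", "acbd", "aed"])
```

In this toy log b and c occur in both orders, so they are classified as parallel rather than causally related.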
Alpha algorithm: simple idea
• Simple patterns
Alpha algorithm: example
Alpha algorithm: another example
Running Alpha algorithm
Alternatively, consider the plug-ins “Flexible Heuristics Miner”, “Inductive Miner” or “Fuzzy Miner” to mine noisy real-life logs.
Alpha miner (classic)
Wil M. P. van der Aalst, Ton Weijters, Laura Maruster: Workflow Mining: Discovering Process Models from Event Logs. IEEE Trans. Knowl. Data Eng. 16(9): 1128-1142 (2004)
This process has a loop!!!
Conformance checking
• Conformance checking techniques investigate how well an event log L (a bag of cases) and a process model M (e.g. a Petri net) fit together.
• It may be used to see whether reality conforms to a normative or descriptive model.
• Deviations may point to fraud, inefficiencies, and poorly designed or outdated procedures.
• It may be used to evaluate process discovery results.
Conformance checking
• Conformance checking requires an alignment of event log and process model, i.e., events in the event log need to be related to model elements and vice versa.
• Such an alignment shows how the event log can be replayed on the process model.
• !!!! The log may deviate from the model and not all activities may have been modelled or recorded
Alignment
• To establish an alignment between process model and event log we need to relate “moves” in the log to “moves” in the model
• It may be the case that some of the moves in the log cannot be mimicked by the model and vice versa
• Denote «no move» as «≫»
• A_L: x ∈ A_L corresponds to “move x in L”
• A_M: x ∈ A_M corresponds to “move x in M”
• One step in an alignment is represented by a pair (x, y) defined as follows:
• (x, y) is a move in log if x ∈ A_L and y = ≫ (legal move)
• (x, y) is a move in model if x = ≫ and y ∈ A_M (legal move)
• (x, y) is a move in both if x ∈ A_L and y ∈ A_M (legal move)
• (x, y) is an illegal move if x = ≫ and y = ≫
Alignment
• Let us consider:
• σ(L) a case of log L,
• σ(M) a full execution of a model M
• An alignment of σ(L) and σ(M) is a sequence of legal moves such that the projection on the first element (ignoring the «no move» symbol ≫) yields σ(L) and the projection on the second element (again ignoring ≫) yields σ(M).
• Let us consider σ(L) = abdeg and σ(M) = acdeh:
the alignment is:
Distance function
where:
Alignment has distance equal to 4.
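The distance of 4 can be reproduced with a small dynamic program. The sketch assumes a unit cost function (a move in log only or in model only costs 1, a move in both on the same activity costs 0, and moves in both on different activities are not allowed); this choice is an assumption that is consistent with the stated distance. It also aligns σ(L) against one given execution σ(M), whereas a real conformance checker searches over all executions of the model. The string ">>" stands for «no move».

```python
NO_MOVE = ">>"  # stands for the «no move» symbol


def align(trace_l, trace_m):
    """Return (distance, moves) of a cheapest alignment under unit costs."""
    n, m = len(trace_l), len(trace_m)
    # cost[i][j]: minimal distance for aligning trace_l[i:] with trace_m[j:].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        cost[i][m] = n - i          # only moves in log remain
    for j in range(m - 1, -1, -1):
        cost[n][j] = m - j          # only moves in model remain
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            best = 1 + min(cost[i + 1][j], cost[i][j + 1])
            if trace_l[i] == trace_m[j]:
                best = min(best, cost[i + 1][j + 1])  # move in both
            cost[i][j] = best
    # Walk through the table again to read off one cheapest alignment.
    moves, i, j = [], 0, 0
    while i < n or j < m:
        if (i < n and j < m and trace_l[i] == trace_m[j]
                and cost[i][j] == cost[i + 1][j + 1]):
            moves.append((trace_l[i], trace_m[j]))
            i, j = i + 1, j + 1
        elif i < n and cost[i][j] == 1 + cost[i + 1][j]:
            moves.append((trace_l[i], NO_MOVE))   # move in log
            i += 1
        else:
            moves.append((NO_MOVE, trace_m[j]))   # move in model
            j += 1
    return cost[0][0], moves


distance, moves = align("abdeg", "acdeh")
```

For σ(L) = abdeg and σ(M) = acdeh the four non-synchronous moves are b and g (moves in log) and c and h (moves in model).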
• Various cost functions can be defined. For example, the weights of activities may vary. Skipping a payment may have a much higher cost than skipping some minor check.
• The cost function can also be used to compare activities which are not identical but similar, e.g., in the example we may assign the pairs (b, c) and (c, b) a distance of 0.8, because both b and c examine the request and are exchangeable to some degree
Fitness
• A model has a perfect fitness (f=1) if all traces in the log can be replayed by the model from beginning to end
van der Aalst, W.M.P., Adriansyah, A., & Dongen, B.F. van. Replaying History on Process Models for Conformance Checking and Performance Analysis. WIREs Data Mining Knowl Discov 2012, 2: 182-192. doi: 10.1002/widm.1045.
Optimal alignment provided by the oracle
Replay a Log on Petri Net for Conformance Analysis
Avoid under fitting - precision
• A model is precise if it does not allow for “too much” (alternative) behaviour. A model that is not precise is “under fitting”.
• Under fitting is the problem that the model over-generalizes the example behaviour seen in the log.
A model with low precision
Precision
• To compute precision we have to identify situations where too many transitions are possible
The log represented as a collection of unique events
Precision
• state_M(e) ∈ S is the state in M just before the occurrence of e.
• context_L(e), a sequence over A_M, is the activity prefix of the process instance just before the occurrence of e, i.e., the sequence of all activities that happened before event e.
• en_M(e) ⊆ A_M is the set of activities enabled in state_M(e) (i.e. that can be performed from this state).
• en_L(e) ⊆ A_M is the set of activities that were executed in the same context.
Precision
• By taking the average over all events, we automatically take frequencies into account.
• If the model has an activity that is enabled on a frequent path but the activity is never executed, then this is more severe than an unused activity enabled along an infrequent path.
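A minimal sketch of this computation follows. It assumes that precision is the average over all events of |en_L(e)| / |en_M(e)|, which is how the measure is read here from the definitions above and from the cited paper. The model is given as a hypothetical transition system (state, then activity, then next state) instead of a Petri net, and every trace is assumed to fit the model.

```python
from collections import defaultdict


def precision(traces, model, initial):
    """Average of |en_L(e)| / |en_M(e)| over all events in the log."""
    # en_L: activities observed in the log after each activity prefix (context).
    en_log = defaultdict(set)
    for trace in traces:
        for i, act in enumerate(trace):
            en_log[tuple(trace[:i])].add(act)
    total, events = 0.0, 0
    for trace in traces:
        state = initial
        for i, act in enumerate(trace):
            en_model = set(model[state])      # en_M(e): enabled in state_M(e)
            total += len(en_log[tuple(trace[:i])]) / len(en_model)
            events += 1
            state = model[state][act]         # assumes the trace fits the model
    return total / events


# Hypothetical model: after a, one of b, c or d, then e.
MODEL = {
    "s0": {"a": "s1"},
    "s1": {"b": "s2", "c": "s2", "d": "s2"},
    "s2": {"e": "s3"},
    "s3": {},
}
p = precision(["abe", "ace", "abe"], MODEL, "s0")
```

Activity d is enabled after a but never executed in the toy log, so every event in that state contributes 2/3 instead of 1 and the precision drops to 8/9.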
Measure Precision/Generalization
The result of the plugin (PNRepResult object) can be used as an input to the plugin “Measure Precision/Generalization”
Avoid over fitting: Generalization
• A process model should not restrict behaviour to just the examples seen in the log.
• A model that does not generalize is “over fitting”.
• Over fitting is the problem that a very specific model is generated whereas it is obvious that the log only holds example behaviour.
• The model explains the particular sample log, but it is unlikely that another sample log of the same process can be explained well by the current model.
Overfitting
Generalization
where:
• pnew(w,n) is the estimated probability that a next visit to state s = stateM(e) will reveal a new path not seen before
• w=|dif(e)| is the number of unique activities observed leaving state s
• n=|sim(e)| is the number of times s was visited by the event log.
otherwise
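The formula itself did not survive on the slide. The sketch below uses the estimator as recalled from the cited paper (van der Aalst, Adriansyah and van Dongen, 2012): pnew(w,n) = w(w+1) / (n(n−1)) if n ≥ w+2 and 1 otherwise, with generalization equal to 1 minus the average of pnew over all events. Treat both formulas as assumptions to be checked against the paper. The model is a hypothetical transition system and all traces are assumed to fit it.

```python
from collections import defaultdict


def p_new(w, n):
    """Estimated probability that the next visit to a state reveals a new path.

    w: number of unique activities observed leaving the state
    n: number of times the state was visited
    """
    return w * (w + 1) / (n * (n - 1)) if n >= w + 2 else 1.0


def generalization(traces, model, initial):
    """1 minus the average of p_new over all events (assumed definition)."""
    leaving = defaultdict(list)   # state -> activities executed from that state
    state_of_event = []
    for trace in traces:
        state = initial
        for act in trace:
            leaving[state].append(act)
            state_of_event.append(state)
            state = model[state][act]   # assumes the trace fits the model
    total = sum(p_new(len(set(leaving[s])), len(leaving[s]))
                for s in state_of_event)
    return 1 - total / len(state_of_event)


# Hypothetical model: after a, one of b, c or d, then e.
MODEL = {
    "s0": {"a": "s1"},
    "s1": {"b": "s2", "c": "s2", "d": "s2"},
    "s2": {"e": "s3"},
    "s3": {},
}
g = generalization(["abe"] * 6 + ["ace"] * 4, MODEL, "s0")
```

States that are visited often with few distinct outgoing activities give a small pnew, so the toy log of ten traces yields a generalization close to 1; a state visited only once gives pnew = 1.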
Mining case-related information
• What are the most frequent paths in the process?
• Are there any loop patterns in the process?
• What is the distribution of all cases over the different paths through the process?
• Can I select a subset of traces where particular paths were executed?
• Can I simplify the log by abstracting the most frequent paths?
Mining case-related information: how-to?
• Show the log in the view resource
• Select the “Visualizer” button. By default, you will notice the “Log visualizer”. Change the visualizer to “Pattern Abstractions (New)”
“Log visualizer” “Pattern Abstractions (New)”
Pattern Abstractions
• Choose the “tandem arrays” (loop patterns) and click “Find Patterns”.
• In order to find frequent paths not involving loops, choose “(Maximal) Repeat Patterns” and click the “Find Pattern Frequency” button
• Uncheck the “Ignore Duplicate Traces” checkbox in the “Filter Patterns” panel
In 90% of cases, «ArchiveRepair-Complete» follows «TestRepair-Start»
• What are the most frequent paths in the process?
• What is the distribution of all cases over the different paths through the process?
Number of times the alphabet set occurs in the log
Conservedness
• The degree to which the individual activities involved in the pattern alphabet manifest as the patterns defined by the alphabet
CON(P) = (NOAC / μ) · (1 − σ / μ), where
• NOAC is the non-overlapping alphabet count
• μ and σ are the mean and standard deviation of the frequency of the activities in pattern P
Li, J., Bose, R.J.C., van der Aalst, W.M.: Mining Context-Dependent and Interactive Business Process Maps using Execution Patterns. Technical report, University of Technology, Eindhoven (2010)
Base patterns
• A pattern that does not contain any other pattern within it
Tandem array
• A tandem array in a trace T is a sub-sequence (i, α, k) such that α occurs k consecutive times in T starting from position i
T= gd abc abc abc abc afi
(3,abc,4) is a tandem ARRAY
Maximal Tandem array
• A tandem array (i, α, k) with no additional copies of α immediately before or after it
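Maximal tandem arrays can be found by brute force on short traces. The sketch below is a quadratic illustration of the definition, not the algorithm used by the PROM plug-in; the function name and the minimum of two repetitions are assumptions. Positions are 1-based as in the example (3, abc, 4).

```python
def maximal_tandem_arrays(trace, min_repeats=2):
    """Return all triples (i, alpha, k) that are maximal tandem arrays."""
    found = []
    n = len(trace)
    for length in range(1, n // 2 + 1):
        for i in range(n - length + 1):
            alpha = trace[i:i + length]
            # Not maximal if a copy of alpha ends right before position i.
            if i >= length and trace[i - length:i] == alpha:
                continue
            # Count consecutive copies of alpha starting at position i.
            k = 1
            while trace[i + k * length:i + (k + 1) * length] == alpha:
                k += 1
            if k >= min_repeats:
                found.append((i + 1, alpha, k))   # report 1-based positions
    return found


arrays = maximal_tandem_arrays("gdabcabcabcabcafi")
```

For the trace of the example the result contains (3, abc, 4). It also contains rotations such as (4, bca, 4), which a practical implementation would typically group under the same loop.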
Pattern filters
Filter out patterns that are supported by fewer than 60% of cases
Loop discovery
Are there any loop patterns in the process?
Pattern preference
• Pattern preference resolves conflicts that arise when a region of a trace can contribute to the count of more than one pattern: either the shorter or the longer pattern is preferred.
• Conflicts may occur with NOAC (Non-Overlapping Alphabet Counting)
abxcdxedfxgdxeh
• Consider the subsequence dxe at t(4,6): it may contribute to both (dxe) and (dxed)
• Both (dxe) and (dxed) are defined on the alphabet set {d,x,e}
• With the pattern preference “Shorter”, dxe is counted for (dxe), i.e. (dxe) is preferred to (dxed)
Pattern abstraction
• Subprocess abstractions can be discovered by considering a partial ordering on the repeat alphabet (cover relation).
• The idea is that the repeated occurrences of the manifestation of the loop may be replaced by an abstracted entity (activity) that encodes the notion of a loop
• Let r be a repeat alphabet
• r denotes the set of symbols/activities that appear in the repeat.
• {a,b} is the repeat alphabet of the repeat abba
• Different repeats can share a common repeat alphabet.
• abdgh and adgbh share the same repeat alphabet: {a, b, d, g, h}
• We can define the equivalence class E(r) = {R | R is a repeat with alphabet r}
• E({a,b,d,g,h}) = {abdgh, adgbh}
• The equivalence class under repeat alphabet will capture any variations in the manifestation of a process execution due to parallelism.
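Grouping repeats by their alphabet is a one-pass operation. The following minimal sketch reproduces the example above; the function name is hypothetical and repeats are written as strings of single-letter activities.

```python
from collections import defaultdict


def equivalence_classes(repeats):
    """Group repeats that share the same repeat alphabet."""
    classes = defaultdict(set)
    for repeat in repeats:
        classes[frozenset(repeat)].add(repeat)   # alphabet = set of symbols
    return dict(classes)


classes = equivalence_classes(["abdgh", "adgbh", "abba"])
```

The two orderings abdgh and adgbh end up in the same class, which is what allows interleavings caused by parallelism to be treated as one pattern.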
Jagadeesh et al. Abstractions in Process Mining: A Taxonomy of Patterns, BPM 2009
Pattern abstraction
• A repeat alphabet ra1 is said to cover another repeat alphabet ra2 if ra2 ⊂ ra1
• e.g. {a,b,c} covers {a,b}, since {a,b} ⊂ {a,b,c}
• For example, consider the repeat types abcd and abd. It is most likely for activity c to represent a functionality similar to that of a, b, and d, since c occurs within the context of a, b and d.
• By defining a partial order on the repeat alphabets and generating a Hasse diagram on the partial ordering, one can form abstractions by considering the maximal elements in the poset.
Hasse diagram
• A mathematical diagram used to represent a finite partially ordered set (poset), in the form of a drawing of its transitive reduction.
• For a partially ordered set (S, ≤) one represents each element of S as a vertex in the plane and draws a line segment or curve that goes upward from x to y whenever y covers x (that is, whenever x < y and there is no z such that x < z < y).
• These curves may cross each other but must not touch any vertices other than their endpoints. Such a diagram, with labeled vertices, uniquely determines its partial order.
Hasse diagram: example
• The power set of a 2-element set ordered by inclusion
Pattern abstraction
• Poset: {{a,b} ⊂ {a,b,c}, {a,c} ⊂ {a,b,c}, {b,c} ⊂ {a,b,c}, {a,c} ⊂ {a,c,d}, {a,d} ⊂ {a,c,d}}
• Maximal elements:{a,b,c} DEFINE Abstraction A ; {a,c,d} DEFINE Abstraction B
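The maximal elements of the example can be computed directly from the subset relation. This is a minimal sketch with a hypothetical function name; it uses Python's proper-subset operator `<` on frozensets.

```python
def maximal_elements(alphabets):
    """Return the alphabets that are not a proper subset of any other one."""
    sets = [frozenset(a) for a in alphabets]
    return [s for s in sets if not any(s < other for other in sets)]


# Repeat alphabets of the example, written as strings of activities.
ALPHABETS = ["ab", "ac", "bc", "ad", "abc", "acd"]
maximals = maximal_elements(ALPHABETS)
```

The two maximal alphabets {a,b,c} and {a,c,d} are the candidates for the abstract activities A and B.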
Pattern abstraction
• It may happen that maximals share a lot of activities.
• In order to reduce the total number of abstract activities introduced, one can define extended joins on the maximal elements.
Criteria to perform extended joins
• Extend two maximal elements provided they share a set of common elements above a particular threshold and the differences between them are small.
Abstraction
maximals
Can I simplify the log by abstracting the most frequent paths?
Export selected log
Can I select a subset of traces where particular paths were executed?
Mining Organizational-Related Information about a Process
• How many people are involved in a specific case?
• What is the communication structure and dependencies among people?
• How many transfers happen from one role to another role?
• Who are important people in the communication flow? (the most frequent flow)
• Who subcontracts work to whom?
• Who works on the same tasks?
Resources How many people are involved in a specific case?
Social network miner
• What is the communication structure and dependencies among people?
• How many transfers happen from one role to another role?
• Who are important people in the communication flow? (the most frequent flow)
• Who subcontracts work to whom?
• Who works on the same tasks?
28 April 2014
Social networks
Process Mining Metrics
For each pair of resources: handover of work, subcontracting, similar tasks, working together …
• A set of resources connected via social links
• Represented as a graph
Link weights ∈ [0,1]
Handover of work
• Direct or indirect handover of work from one resource to another resource
• Defined on a directed graph with self-loops
[Figure: handover graph over the resources Sue, Pete and John]
Handover of work
• There is a handover of work from resource R1 to resource R2 in a case C if there are two consecutive events (A1,R1,t1), (A2,R2,t2) with t1<t2 and no event (A,R,t) with t1<t<t2
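Direct handovers can be counted by walking through each case in timestamp order. In the sketch below the weights are normalized by the total number of handovers so that they fall in [0,1]; this normalization is an assumption, since the miner offers several variants, and the toy case is hypothetical.

```python
from collections import Counter


def handover_of_work(cases):
    """Relative frequency of direct handovers between resources.

    cases: {case id: [(activity, resource, timestamp), ...]}
    """
    counts = Counter()
    for events in cases.values():
        ordered = sorted(events, key=lambda e: e[2])
        # Consecutive events in a case define a handover R1 -> R2.
        for (_, r1, _), (_, r2, _) in zip(ordered, ordered[1:]):
            counts[r1, r2] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: c / total for pair, c in counts.items()}


CASES = {
    "1": [("A", "Sue", 1), ("B", "Pete", 2), ("C", "John", 3), ("D", "John", 4)],
}
handover = handover_of_work(CASES)
```

Two consecutive events by the same resource produce a self-loop, here John to John.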
Sub-contracting
[Figure: sub-contracting example with the resources John and Mike]
Similar tasks
• Resources involved in the same activities may be strongly related.
• Defined based on the activity metrics, as an undirected graph
• Measures: Minkowski distance, Hamming distance, Pearson correlation coefficient
• Resource/Activity matrix (columns A, B, C, D, E): each cell is the number of cases in which the resource performed the activity, e.g. the number of cases in which John performed activity A
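The Pearson variant compares the rows of the resource/activity matrix. The sketch below uses hypothetical counts for three resources over the activities A to E; the values of the slide's matrix are not available in this transcript.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical profiles: number of cases per activity A, B, C, D, E.
PROFILES = {
    "John": [3, 0, 2, 0, 1],
    "Sue":  [6, 0, 4, 0, 2],
    "Mike": [0, 3, 0, 2, 0],
}
john_sue = pearson(PROFILES["John"], PROFILES["Sue"])
john_mike = pearson(PROFILES["John"], PROFILES["Mike"])
```

John and Sue perform the same activities in the same proportions, so their correlation is 1 and they would be linked in the similar-task network, while John and Mike are negatively correlated.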
Working together
• Resources working together in the same case are strongly connected
• Measured as the number of cases where R1 and R2 work together divided by the number of cases where resource R1 works
• Directed graph
• Example: Sue works in cases 3, 4, 5; Clare works in case 5
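The example of Sue and Clare can be worked out as follows. This is a minimal sketch with a hypothetical function name; it shows why the resulting graph is directed.

```python
def working_together(cases_by_resource, r1, r2):
    """Cases shared by r1 and r2 divided by the cases in which r1 works."""
    shared = cases_by_resource[r1] & cases_by_resource[r2]
    return len(shared) / len(cases_by_resource[r1])


CASES_BY_RESOURCE = {"Sue": {3, 4, 5}, "Clare": {5}}
sue_to_clare = working_together(CASES_BY_RESOURCE, "Sue", "Clare")
clare_to_sue = working_together(CASES_BY_RESOURCE, "Clare", "Sue")
```

Sue shares one of her three cases with Clare (weight 1/3), whereas Clare shares her only case with Sue (weight 1), so the two directions carry different weights.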
Metrics for resources in a network
• Degree: the number of connections the node has to other nodes
• Degree-in
• Degree-out
• Closeness: a measure of centrality in a network, calculated as the reciprocal of the sum of the lengths of the shortest paths between the node and all other nodes in the graph. Thus the more central a node is, the closer it is to all other nodes
• d(y,x): length of the shortest path between y and x
• N number of nodes
normalized
Metrics for resources in a network
• Degree: the number of connections held by a node
• Degree-in (projected on in-links)
• Degree-out (projected on out-links)
• What it tells us: How many direct, ‘one hop’ connections each node has to other nodes within the network.
• When to use it: For finding very connected individuals, popular individuals, individuals who are likely to hold most information or individuals who can quickly connect with the wider network
Metrics for resources in a network
• Closeness: a measure of centrality in a network, calculated as the reciprocal of the sum of the lengths of the shortest paths between the node and all other nodes in the graph. Thus the more central a node is, the closer it is to all other nodes
d(y,x) = length of the shortest path between y and x, and N = number of nodes
• When to use it: For finding the resources who are best placed to influence the entire network most quickly (e.g. find influencers within a single cluster)
normalized
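The normalized formula is not legible in this transcript. The sketch below assumes the common normalization (N − 1) / Σ d(y,x), computed with a breadth-first search on an unweighted, connected graph; the toy network is hypothetical.

```python
from collections import deque


def shortest_path_lengths(graph, source):
    """Breadth-first search: number of hops from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def closeness(graph, node):
    """Normalized closeness (N - 1) / sum of shortest path lengths."""
    dist = shortest_path_lengths(graph, node)
    total = sum(d for other, d in dist.items() if other != node)
    return (len(graph) - 1) / total if total else 0.0


# Undirected toy network written as adjacency lists: Sue - John - Pete.
NETWORK = {"Sue": ["John"], "John": ["Sue", "Pete"], "Pete": ["John"]}
closeness_john = closeness(NETWORK, "John")
closeness_sue = closeness(NETWORK, "Sue")
```

John is one hop away from everyone and reaches the maximum value 1, while Sue needs two hops to reach Pete and scores 2/3.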
Metrics for resources in a network
• Betweenness: a measure of centrality that measures the number of times a node lies on the shortest path between other nodes.
• What it tells us: This measure shows which nodes act as ‘bridges’ between nodes in a network. It does this by identifying all the shortest paths and then counting how many times each node falls on one.
• When to use it: For finding the individuals who influence the flow around a system.
• Betweenness is useful for analyzing communication dynamics.
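For small networks betweenness can be computed directly from its definition. The sketch below counts, for every ordered pair (s, t), the fraction of shortest paths that pass through each other node; it is an unnormalized illustration for unweighted graphs, not the algorithm used by the plug-in, and the toy network is hypothetical.

```python
from collections import deque


def bfs_counts(graph, source):
    """Return hop distances and numbers of shortest paths from source."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                sigma[nxt] = 0
                queue.append(nxt)
            if dist[nxt] == dist[node] + 1:
                sigma[nxt] += sigma[node]   # extend every shortest path to node
    return dist, sigma


def betweenness(graph):
    """Unnormalized betweenness over all ordered pairs (s, t)."""
    info = {v: bfs_counts(graph, v) for v in graph}
    score = dict.fromkeys(graph, 0.0)
    for s in graph:
        dist_s, sigma_s = info[s]
        for t in graph:
            if t == s or t not in dist_s:
                continue
            for v in graph:
                if v in (s, t) or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                # v lies on a shortest s-t path iff the distances add up.
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    score[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return score


NETWORK = {"Sue": ["John"], "John": ["Sue", "Pete"], "Pete": ["John"]}
scores = betweenness(NETWORK)
```

John is the only bridge between Sue and Pete, so he receives the whole score: 2.0, because the pair is counted once in each direction.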
Mining for a similar-Task Social Network
• Pearson correlation coefficient
Mining for Handover work Social Network
Size by ranking (betweenness) Size by ranking (closeness)