advanced techniques for mining structured data: process mining

Advanced Techniques for Mining Structured Data: Process Mining PROM Dr A. Appice Scuola di Dottorato in Informatica e Matematica XXXII

Post on 16-Apr-2022

Page 1: Advanced Techniques for Mining Structured Data: Process Mining

Advanced Techniques for Mining Structured Data: Process Mining

PROM

Dr A. Appice

Scuola di Dottorato in Informatica e Matematica XXXII

Page 2: Advanced Techniques for Mining Structured Data: Process Mining

Software

• Download and install PROM 6.6

• Requires JRE 7 (64-bit or 32-bit)

Page 3: Advanced Techniques for Mining Structured Data: Process Mining

Repair.xes

Page 4: Advanced Techniques for Mining Structured Data: Process Mining

Import a log file

1. How many cases (or process instances) are in the log?

2. How many tasks (or audit trail entries) are in the log?

3. How many originators are in the log?

4. Are there running cases in the log?

5. Which originators work in which tasks?
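The five questions above can also be answered outside PROM once the log is loaded into memory. The following is a minimal sketch, not PROM's implementation: the toy log, the tuple layout `(case, task, event type, originator)`, the function name `summarize`, and the rule "a case is running if its last task is not a final task" are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy log: (case id, task, event type, originator).
LOG = [
    ("1", "Register", "complete", "System"),
    ("1", "Analyze Defect", "start", "Tester3"),
    ("1", "Analyze Defect", "complete", "Tester3"),
    ("1", "Archive Repair", "complete", "System"),
    ("2", "Register", "complete", "System"),
    ("2", "Analyze Defect", "start", "Tester2"),
]


def summarize(log, final_tasks):
    """Count cases, events, and originators, and list the running cases."""
    tasks_per_case = defaultdict(list)
    tasks_per_originator = defaultdict(set)
    for case, task, _event_type, originator in log:
        tasks_per_case[case].append(task)
        tasks_per_originator[originator].add(task)
    # Assumption: a case is still running if it does not end with a final task.
    running = sorted(
        case for case, tasks in tasks_per_case.items()
        if tasks[-1] not in final_tasks
    )
    return {
        "cases": len(tasks_per_case),
        "events": len(log),
        "originators": len(tasks_per_originator),
        "running_cases": running,
        "tasks_per_originator": dict(tasks_per_originator),
    }


summary = summarize(LOG, final_tasks={"Archive Repair"})
```

Here `summary["tasks_per_originator"]` answers question 5: it maps each originator to the set of tasks that originator worked on.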

Page 5: Advanced Techniques for Mining Structured Data: Process Mining

Import File log

Page 7: Advanced Techniques for Mining Structured Data: Process Mining

Cleaning the Log

• Since our focus is on the process as a whole, we will base our analysis on the completed process instances only.

• It does not make much sense to talk about the most frequent path if it is not complete, or reason about throughput time of cases when some of them are still running. In short, we need to pre-process (or clean or filter) the logs.

Page 8: Advanced Techniques for Mining Structured Data: Process Mining

Cleaning the Log

• A log can be filtered by applying the provided Log Filters.

1. Select «Use resource».

2. Actions that take only a log as input are colored green.

Page 9: Advanced Techniques for Mining Structured Data: Process Mining

Cleaning the Log

• A log can be filtered by applying the provided Log Filters.

3. Select the “Filter Log using Simple Heuristics” action and select the “Start” button. This will start the “Filter Log using Simple Heuristics” action on the example log. This action combines a number of log filters that can be configured individually using a wizard.

Page 11: Advanced Techniques for Mining Structured Data: Process Mining

Cleaning the Log

• The first log filter to configure is the event type filter, which allows us to select the types of events (or tasks or audit trail entries) that we want to consider while mining the log. For our running example, the log has tasks with two event types: complete and start (lifecycle:transition).
  • To keep all tasks of a certain event type, select the option “keep”.
  • To omit the tasks with a certain event type from a trace, select the option “remove”.
  • To discard all traces with a certain event type, select the option “discard instance”. This last option may be useful when the log contains aborted cases, etc.

• Options can be selected by clicking on an event type. When done, select the “Next” button.
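The three options of the event type filter can be sketched as follows. This only illustrates the semantics described above and is not the code of the PROM plug-in; the function name `event_type_filter`, the trace layout, and the `abort` event type are hypothetical.

```python
def event_type_filter(traces, actions):
    """Apply "keep" / "remove" / "discard instance" per event type.

    traces:  list of traces, each a list of (task, event_type) pairs.
    actions: event_type -> action; event types not listed are kept.
    """
    result = []
    for trace in traces:
        types_in_trace = {event_type for _task, event_type in trace}
        if any(actions.get(t, "keep") == "discard instance" for t in types_in_trace):
            continue  # the whole trace is dropped
        result.append([
            (task, event_type)
            for task, event_type in trace
            if actions.get(event_type, "keep") != "remove"
        ])
    return result


traces = [
    [("Analyze Defect", "start"), ("Analyze Defect", "complete")],
    [("Analyze Defect", "start"), ("Repair", "abort")],  # hypothetical aborted case
]
filtered = event_type_filter(
    traces, {"complete": "remove", "abort": "discard instance"}
)
```

With these settings the first trace keeps only its start event and the second trace is discarded entirely.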

Page 12: Advanced Techniques for Mining Structured Data: Process Mining

Cleaning the Log

• e.g. remove the events with “complete”

Page 13: Advanced Techniques for Mining Structured Data: Process Mining

Cleaning the Log

• The second filter is the start event filter, which filters the log in such a way that only the traces (or cases) that start with the indicated tasks are kept.

• The slider at the bottom allows us to select the most frequent start events. For example, if this slider is set to “80%”, then the most frequent start events will be selected until at least 80% of the traces are covered. As all traces (after removing events with “complete”) start with “Analyze defect+start”, this step is straightforward. When done, select the “Next” button.
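The behaviour of the slider can be approximated by a greedy selection over start-event frequencies. This is a sketch under assumptions: the helper names `frequent_start_events` and `start_event_filter` are hypothetical, and the toy traces are invented.

```python
from collections import Counter


def frequent_start_events(traces, coverage):
    """Pick the most frequent start events until `coverage` of the traces is covered."""
    counts = Counter(trace[0] for trace in traces if trace)
    total = sum(counts.values())
    chosen, covered = [], 0
    for event, frequency in counts.most_common():
        if covered >= coverage * total:
            break
        chosen.append(event)
        covered += frequency
    return chosen


def start_event_filter(traces, allowed):
    """Keep only the traces whose first event is one of the allowed start events."""
    return [trace for trace in traces if trace and trace[0] in allowed]


# Hypothetical log: 8 traces start with A+start, 2 start with C+start.
traces = [["A+start", "B+start"]] * 8 + [["C+start", "B+start"]] * 2
chosen = frequent_start_events(traces, 0.8)
kept = start_event_filter(traces, set(chosen))
```

With the threshold at 80%, the single start event `A+start` already covers 8 of the 10 traces, so the two traces starting with `C+start` are filtered out.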

Page 14: Advanced Techniques for Mining Structured Data: Process Mining

Cleaning the Log

• e.g. remove the traces that do not start with “Analyze defect + start”

Page 15: Advanced Techniques for Mining Structured Data: Process Mining

Cleaning the Log

• The third filter is the end event filter, which filters the log in such a way that only the traces (or cases) that end with the indicated tasks are kept.

Page 16: Advanced Techniques for Mining Structured Data: Process Mining

Cleaning the Log

• The fourth filter is the event filter, which filters all unselected events from the log.

Page 17: Advanced Techniques for Mining Structured Data: Process Mining

Cleaning the Log

• To save the cleaned log, use the “Export to disk” button in the main workspace.

Page 18: Advanced Techniques for Mining Structured Data: Process Mining

Mining control-flow perspective

The control-flow perspective of a process establishes the dependencies among its tasks.

• Which tasks precede which other ones?

• Are there concurrent tasks?

• Are there loops?

• In short, what is the process model that summarizes the flow followed by most/all cases in the log?

This information is important because it gives you feedback about how cases are actually being executed in the organization.

Page 19: Advanced Techniques for Mining Structured Data: Process Mining

Alpha algorithm

b → c: NO, since b > c and c > b

b || c: YES, since b > c and c > b
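The ordering relations that the Alpha algorithm starts from can be derived from the "directly follows" relation x > y. The sketch below computes them for a small invented log; the function name `footprint` and the example traces are assumptions, and the rest of the algorithm (building places and the Petri net) is not shown.

```python
def footprint(log):
    """Classify every pair of tasks as "->", "<-", "||" or "#".

    x > y holds when y directly follows x in at least one trace.
    """
    tasks = sorted({task for trace in log for task in trace})
    follows = {(x, y) for trace in log for x, y in zip(trace, trace[1:])}
    relation = {}
    for x in tasks:
        for y in tasks:
            xy, yx = (x, y) in follows, (y, x) in follows
            if xy and yx:
                relation[x, y] = "||"   # parallel: x > y and y > x
            elif xy:
                relation[x, y] = "->"   # causality: x > y and not y > x
            elif yx:
                relation[x, y] = "<-"
            else:
                relation[x, y] = "#"    # never directly follow each other
    return relation


# Hypothetical log with three traces over tasks a..e.
rel = footprint(["abcd", "acbd", "aed"])
```

In this log b and c occur in both orders, so they are classified as parallel rather than causally related.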

Page 20: Advanced Techniques for Mining Structured Data: Process Mining

Alpha algorithm: simple idea

• Simple patterns

Page 21: Advanced Techniques for Mining Structured Data: Process Mining

Alpha algorithm: example

Page 22: Advanced Techniques for Mining Structured Data: Process Mining

Alpha algorithm : example

Page 23: Advanced Techniques for Mining Structured Data: Process Mining

Alpha algorithm: another example

Page 24: Advanced Techniques for Mining Structured Data: Process Mining

Running Alpha algorithm

Alternatively, consider the plug-ins “Flexible Heuristics Miner”, “Inductive Miner”, or “Fuzzy Miner” to mine noisy real-life logs.

Page 25: Advanced Techniques for Mining Structured Data: Process Mining

Alpha miner (classic)

Wil M. P. van der Aalst, Ton Weijters, Laura Maruster: Workflow Mining: Discovering Process Models from Event Logs. IEEE Trans. Knowl. Data Eng. 16(9): 1128-1142 (2004)

This process has a loop!

Page 26: Advanced Techniques for Mining Structured Data: Process Mining

Conformance checking

• Conformance checking techniques investigate how well an event log L (a bag of cases) and a process model M (e.g., a Petri net) fit together.

• It may be used to see whether reality conforms to a normative or descriptive model.
  • Deviations may point to fraud, inefficiencies, and poorly designed or outdated procedures.

• It may be used to evaluate process discovery results.

Page 27: Advanced Techniques for Mining Structured Data: Process Mining

Conformance checking

• Conformance checking requires an alignment of event log and process model, i.e., events in the event log need to be related to model elements and vice versa.

• Such an alignment shows how the event log can be replayed on the process model.

• Note: the log may deviate from the model, and not all activities may have been modelled or recorded.

Page 28: Advanced Techniques for Mining Structured Data: Process Mining

Alignment

• To establish an alignment between process model and event log we need to relate “moves” in the log to “moves” in the model

• It may be the case that some of the moves in the log cannot be mimicked by the model, and vice versa.
  • Denote «no move» as «≫».
  • A_L: x ∈ A_L corresponds to “move x in L”.
  • A_M: x ∈ A_M corresponds to “move x in M”.

• One step in an alignment is represented by a pair (x, y), defined as follows:
  • (x, y) is a move in log if x ∈ A_L and y = ≫ (legal move);
  • (x, y) is a move in model if x = ≫ and y ∈ A_M (legal move);
  • (x, y) is a move in both if x ∈ A_L and y ∈ A_M (legal move);
  • (x, y) is an illegal move if x = ≫ and y = ≫.

Page 29: Advanced Techniques for Mining Structured Data: Process Mining

Alignment

• Let us consider:
  • σ(L), a case of log L;
  • σ(M), a full execution of a model M.

• An alignment of σ(L) and σ(M) is a sequence of legal moves such that the projection on the first element (ignoring ≫) yields σ(L) and the projection on the second element (again ignoring ≫) yields σ(M).

• Let us consider σ(L) = abdeg and σ(M) = acdeh:

the alignment is:

Page 30: Advanced Techniques for Mining Structured Data: Process Mining

Distance function

where:

Alignment has distance equal to 4.

• Various cost functions can be defined. For example, the weights of activities may vary. Skipping a payment may have a much higher cost than skipping some minor check.

• The cost function can also be used to compare activities which are not identical but similar, e.g., in the example we may define δ(b, c) = δ(c, b) = 0.8 because both b and c examine the request and are exchangeable to some degree.
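The example σ(L) = abdeg and σ(M) = acdeh can be checked with a small dynamic program. This is a sketch under stated assumptions, not the replay algorithm used by PROM: it assumes a unit cost for every move in log and every move in model, zero cost for a move in both on the same activity, and it forbids pairing different activities unless a cost is supplied through the hypothetical `similar` argument. It also aligns the trace with one fixed model execution, whereas a conformance checker searches over all executions of the model. The string `">>"` stands for «no move».

```python
NO_MOVE = ">>"  # stands for the "no move" symbol


def align(trace, run, similar=None):
    """Return (cost, moves) of a cheapest alignment of a trace with one model run."""
    similar = similar or {}
    inf = float("inf")
    n, m = len(trace), len(run)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            options = []
            if i > 0:  # move in log
                options.append((cost[i - 1][j] + 1, (i - 1, j), (trace[i - 1], NO_MOVE)))
            if j > 0:  # move in model
                options.append((cost[i][j - 1] + 1, (i, j - 1), (NO_MOVE, run[j - 1])))
            if i > 0 and j > 0:  # move in both
                x, y = trace[i - 1], run[j - 1]
                step = 0.0 if x == y else similar.get((x, y), inf)
                options.append((cost[i - 1][j - 1] + step, (i - 1, j - 1), (x, y)))
            if options:
                best = min(options, key=lambda option: option[0])
                cost[i][j], back[i][j] = best[0], best[1:]
    # Walk back from the end to recover the sequence of moves.
    moves, i, j = [], n, m
    while (i, j) != (0, 0):
        (i, j), move = back[i][j]
        moves.append(move)
    moves.reverse()
    return cost[n][m], moves


distance, moves = align("abdeg", "acdeh")
relaxed, _ = align("abdeg", "acdeh", similar={("b", "c"): 0.8})
```

Under the unit-cost assumption the distance is 4: b and g are moves in log only, and c and h are moves in model only. Allowing b to be paired with c at cost 0.8 lowers the distance to 2.8.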

Page 31: Advanced Techniques for Mining Structured Data: Process Mining

Fitness

• A model has a perfect fitness (f=1) if all traces in the log can be replayed by the model from beginning to end

van der Aalst, W.M.P., Adriansyah, A., & Dongen, B.F. van. Replaying History on Process Models for Conformance Checking and Performance Analysis. WIREs Data Mining Knowl Discov 2012, 2: 182-192. doi: 10.1002/widm.1045.

Optimal alignment provided by the oracle

Page 32: Advanced Techniques for Mining Structured Data: Process Mining

Replay a Log on Petri Net for Conformance Analysis

Page 33: Advanced Techniques for Mining Structured Data: Process Mining

Avoid under fitting - precision

• A model is precise if it does not allow for “too much” (alternative) behaviour. A model that is not precise is “under fitting”.

• Under fitting is the problem that the model over-generalizes the example behaviour seen in the log.

A model with low precision

Page 34: Advanced Techniques for Mining Structured Data: Process Mining

Precision

• To compute precision we have to identify situations where too many transitions are possible

The log represented as a collection of unique events

van der Aalst, W.M.P., Adriansyah, A., & Dongen, B.F. van. Replaying History on Process Models for Conformance Checking and Performance Analysis. WIREs Data Mining Knowl Discov 2012, 2: 182-192. doi: 10.1002/widm.1045.

Page 35: Advanced Techniques for Mining Structured Data: Process Mining

Precision

• state_M(e) ∈ S is the state in M just before the occurrence of e.

• context_L(e) is the activity prefix of the process instance just before the occurrence of e, i.e., the sequence (over A_M) of all activities that happened before event e.

• en_M(e) ⊆ A_M is the set of activities enabled in state_M(e) (i.e., that can be performed from this state).

• en_L(e) ⊆ A_M is the set of activities that were executed in the same context.

Page 36: Advanced Techniques for Mining Structured Data: Process Mining

Precision

• By taking the average over all events, we automatically take frequencies into account.

• If the model has an activity that is enabled on a frequent path but the activity is never executed, then this is more severe than an unused activity enabled along an infrequent path.
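The averaging described above can be sketched in a few lines, assuming that precision is the mean over all events e of |en_L(e)| / |en_M(e)| as in the cited paper. The model is given here as a hypothetical deterministic transition system (`state -> {activity: next state}`) rather than a Petri net, and every trace is assumed to fit the model; the names `precision`, `MODEL`, and `LOG` are illustrative.

```python
from collections import defaultdict


def precision(log, model, initial):
    """Average of |en_L(e)| / |en_M(e)| over all events e of a fitting log."""
    executed = defaultdict(set)  # context (activity prefix) -> activities seen next
    events = []                  # one (state, context) pair per event
    for trace in log:
        state, context = initial, ()
        for activity in trace:
            executed[context].add(activity)
            events.append((state, context))
            state = model[state][activity]
            context += (activity,)
    ratios = [len(executed[context]) / len(model[state]) for state, context in events]
    return sum(ratios) / len(ratios)


# Hypothetical model: a, then one of b/c/d, then e.
MODEL = {
    "s0": {"a": "s1"},
    "s1": {"b": "s2", "c": "s2", "d": "s2"},
    "s2": {"e": "end"},
    "end": {},
}
LOG = ["abe", "abe", "ace"]
value = precision(LOG, MODEL, "s0")
```

Activity d is enabled after a but never executed, so each of the three events in that context contributes 2/3 instead of 1, and the precision is 8/9.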

Page 37: Advanced Techniques for Mining Structured Data: Process Mining

Measure Precision/Generalization

The result of the replay plug-in (a PNRepResult object) can be used as input to the plug-in “Measure Precision/Generalization”.

Page 39: Advanced Techniques for Mining Structured Data: Process Mining

Avoid over fitting: Generalization

• A process model should not restrict behaviour to just the examples seen in the log.

• A model that does not generalize is “over fitting”.

• Over fitting is the problem that a very specific model is generated whereas it is obvious that the log only holds example behaviour (the model explains the particular sample log, but it is unlikely that another sample log of the same process can be explained well by the current model).

Page 40: Advanced Techniques for Mining Structured Data: Process Mining

Overfitting

Page 41: Advanced Techniques for Mining Structured Data: Process Mining

Generalization

van der Aalst, W.M.P., Adriansyah, A., & Dongen, B.F. van. Replaying History on Process Models for Conformance Checking and Performance Analysis. WIREs Data Mining Knowl Discov 2012, 2: 182-192. doi: 10.1002/widm.1045.

generalization(L, M) = 1 − (1/|E|) Σ_{e ∈ E} pnew(|dif(e)|, |sim(e)|)

where:

• pnew(w,n) is the estimated probability that a next visit to state s = state_M(e) will reveal a new path not seen before

• w = |dif(e)| is the number of unique activities observed leaving state s

• n = |sim(e)| is the number of times s was visited by the event log.

pnew(w,n) = w(w+1) / (n(n−1)) if n ≥ w+2, and 1 otherwise
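A short numeric sketch of this measure follows. It assumes the estimator from the cited paper, pnew(w, n) = w(w+1) / (n(n−1)) when n ≥ w+2 and 1 otherwise, with generalization defined as 1 minus the average of pnew over all events. As a simplification, the model is a hypothetical deterministic transition system and every trace is assumed to fit it; `MODEL` and `LOG` are invented.

```python
from collections import defaultdict


def p_new(w, n):
    """Estimated probability that the next visit to a state reveals a new path."""
    return w * (w + 1) / (n * (n - 1)) if n >= w + 2 else 1.0


def generalization(log, model, initial):
    """1 - average of p_new(|dif(e)|, |sim(e)|) over all events e."""
    leaving = defaultdict(set)  # state -> unique activities observed leaving it
    visits = defaultdict(int)   # state -> number of times the log visited it
    states = []                 # state_M(e) for every event e
    for trace in log:
        state = initial
        for activity in trace:
            leaving[state].add(activity)
            visits[state] += 1
            states.append(state)
            state = model[state][activity]
    penalty = sum(p_new(len(leaving[s]), visits[s]) for s in states)
    return 1 - penalty / len(states)


MODEL = {
    "s0": {"a": "s1"},
    "s1": {"b": "s2", "c": "s2", "d": "s2"},
    "s2": {"e": "end"},
    "end": {},
}
LOG = ["abe", "abe", "ace"]
value = generalization(LOG, MODEL, "s0")
```

State s1 is visited only three times while two different activities leave it, so a new path is still considered likely there (pnew = 1). With so few observations the overall generalization is low, 4/9 in this example.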

Page 42: Advanced Techniques for Mining Structured Data: Process Mining

Mining case-related information

• What are the most frequent paths in the process?

• Are there any loop patterns in the process?

• What is the distribution of all cases over the different paths through the process?

• Can I select a subset of traces where particular paths were executed?

• Can I simplify the log by abstracting the most frequent paths?

Page 43: Advanced Techniques for Mining Structured Data: Process Mining

Mining case-related information: how-to?

• Show the log in the view resource

• Select the “Visualizer” button. By default, you will notice the “Log visualizer”. Change the visualizer to “Pattern Abstractions (New)”

Page 44: Advanced Techniques for Mining Structured Data: Process Mining

Pattern Abstractions

• Choose the “tandem arrays” (loop patterns) and click “Find Patterns”.

• In order to find frequent paths not involving loops, choose “(Maximal) Repeat Patterns” and click the “Find Pattern Frequency” button.

• Uncheck the “Ignore Duplicate Traces” checkbox in the “Filter Patterns” panel.

Page 45: Advanced Techniques for Mining Structured Data: Process Mining

In 90% of cases, «ArchiveRepair-Complete» follows «TestRepair-Start»

• What are the most frequent paths in the process?

• What is the distribution of all cases over the different paths through the process?

Number of times the alphabet set occurs in the log

Page 46: Advanced Techniques for Mining Structured Data: Process Mining

Conservedness

• The degree to which the individual activities involved in the pattern alphabet manifest as the patterns defined by the alphabet

CON(P) = (NOAC / m) × (1 − s / m), where
  • NOAC is the non-overlapping alphabet count
  • m and s are the mean and standard deviation of the frequency of the activities in pattern P

Li, J., Bose, R.J.C., van der Aalst, W.M.: Mining Context-Dependent and Interactive Business Process Maps using Execution Patterns. Technical report, University of Technology, Eindhoven (2010)

Page 47: Advanced Techniques for Mining Structured Data: Process Mining

Base patterns

• A pattern that does not contain any other pattern within it

Page 48: Advanced Techniques for Mining Structured Data: Process Mining

Tandem array

• A tandem array in a trace T is a sub-sequence described by a triple (i, α, k) such that α occurs k times consecutively in T, starting at position i

T = gd abc abc abc abc afi

(3, abc, 4) is a tandem array

Page 50: Advanced Techniques for Mining Structured Data: Process Mining

Maximal Tandem array

• A tandem array (i, α, k) with no additional copies of α immediately before or after it
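Maximal tandem arrays can be found by brute force on short traces. The sketch below is only an illustration of the definition (the function name `maximal_tandem_arrays` is hypothetical, and efficient implementations use suffix trees instead); positions are 1-based to match the triple (3, abc, 4).

```python
def maximal_tandem_arrays(trace, min_repeats=2):
    """Return all (i, alpha, k) where alpha repeats k times starting at position i.

    A start position is skipped when a copy of alpha sits right before it,
    so only maximal arrays are reported.
    """
    n = len(trace)
    found = []
    for length in range(1, n // 2 + 1):
        for start in range(n - length + 1):
            alpha = trace[start:start + length]
            if start >= length and trace[start - length:start] == alpha:
                continue  # not maximal: alpha also occurs immediately before
            k = 1
            while trace[start + k * length:start + (k + 1) * length] == alpha:
                k += 1  # extend while the next block is another copy of alpha
            if k >= min_repeats:
                found.append((start + 1, alpha, k))
    return found


arrays = maximal_tandem_arrays("gdabcabcabcabcafi")
```

For the trace gd abc abc abc abc afi the result contains (3, abc, 4) but not (6, abc, 3), because the latter is preceded by another copy of abc.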

Page 51: Advanced Techniques for Mining Structured Data: Process Mining

Pattern filters

Filter out patterns which are supported by less than 60% of cases

Page 52: Advanced Techniques for Mining Structured Data: Process Mining

Loop discovery

[Screenshot: numbered steps 1 to 5, with the discovered pattern labelled LOOP]

Are there any loop patterns in the process?

Page 53: Advanced Techniques for Mining Structured Data: Process Mining

Pattern preference

• Pattern preference resolves conflicts that arise when a region of a trace can contribute to the count of more than one pattern.
  • Choose shorter or longer to resolve the conflict.

• Conflicts may occur with NOAC (Non-Overlapping Alphabet Counting)

abxcdxedfxgdxeh

Let us consider the subsequence dxe = t(4,6): it may contribute to both (dxe) and (dxed).

• Both (dxe) and (dxed) are defined on the alphabet set {d,x,e}

With the pattern preference “Shorter”, dxe is counted for (dxe), i.e., (dxe) is preferred to (dxed).

Page 54: Advanced Techniques for Mining Structured Data: Process Mining

Pattern abstraction

• Subprocess abstractions can be discovered by considering a partial ordering on the repeat alphabet (cover relation).
  • The idea is that the repeated occurrences of the manifestation of the loop may be replaced by an abstracted entity (activity) that encodes the notion of a loop.

• Let r be a repeat alphabet.
  • r denotes the set of symbols/activities that appear in the repeat.
  • {a,b} is the repeat alphabet of the repeat abba.

• Different repeats can share a common repeat alphabet.
  • abdgh and adgbh share the same repeat alphabet: {a, b, d, g, h}.

• We can define the equivalence class E(r) = {R | R is a repeat with alphabet r}.
  • E({a,b,d,g,h}) = {abdgh, adgbh}
  • The equivalence class under repeat alphabet will capture any variations in the manifestation of a process execution due to parallelism.

Jagadeesh et al. Abstractions in Process Mining: A Taxonomy of Patterns, BPM 2009

Page 55: Advanced Techniques for Mining Structured Data: Process Mining

Pattern abstraction

• A repeat alphabet ra1 is said to cover another repeat alphabet ra2 if ra2 ⊂ ra1, e.g., {a,b} ⊂ {a,b,c}

• For example, consider the repeat types abcd and abd. It is most likely for activity c to represent a functionality similar to that of a, b, and d, since c occurs within the context of a, b and d.

• By defining a partial order on the repeat alphabets and generating a Hasse diagram on the partial ordering, one can form abstractions by considering the maximal elements in the poset.

Page 56: Advanced Techniques for Mining Structured Data: Process Mining

Hasse diagram

• A mathematical diagram used to represent a finite partially ordered set (poset), in the form of a drawing of its transitive reduction.

• For a partially ordered set (S, ≤) one represents each element of S as a vertex in the plane and draws a line segment or curve that goes upward from x to y whenever y covers x (that is, whenever x < y and there is no z such that x < z < y).

• These curves may cross each other but must not touch any vertices other than their endpoints. Such a diagram, with labeled vertices, uniquely determines its partial order.

Page 57: Advanced Techniques for Mining Structured Data: Process Mining

Hasse diagram: example

• The power set of a 2-element set ordered by inclusion

Page 58: Advanced Techniques for Mining Structured Data: Process Mining

Pattern abstraction

• Poset: {{a,b} ⊂ {a,b,c}, {a,c} ⊂ {a,b,c}, {b,c} ⊂ {a,b,c}, {a,c} ⊂ {a,c,d}, {a,d} ⊂ {a,c,d}}

• Maximal elements: {a,b,c} defines Abstraction A; {a,c,d} defines Abstraction B
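The maximal elements and the edges of the Hasse diagram for this example can be computed directly from the set-inclusion order. This is a minimal sketch; the function names `maximal_elements` and `hasse_edges` are hypothetical, and repeat alphabets are written as strings of single-letter activities for brevity.

```python
def maximal_elements(alphabets):
    """Repeat alphabets that are not strictly contained in any other alphabet."""
    sets = [frozenset(a) for a in alphabets]
    return [s for s in sets if not any(s < other for other in sets)]


def hasse_edges(alphabets):
    """Pairs (lower, upper) where upper covers lower with nothing in between."""
    sets = [frozenset(a) for a in alphabets]
    return [
        (low, up)
        for low in sets
        for up in sets
        if low < up and not any(low < mid < up for mid in sets)
    ]


ALPHABETS = ["ab", "ac", "bc", "ad", "abc", "acd"]
maximal = maximal_elements(ALPHABETS)
edges = hasse_edges(ALPHABETS)
```

The two maximal elements are {a,b,c} and {a,c,d}, and the five edges are exactly the inclusions listed in the poset above.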

Page 59: Advanced Techniques for Mining Structured Data: Process Mining

Pattern abstraction

• It may happen that maximals share a lot of activities.

• In order to reduce the total number of abstract activities introduced, one can define extended joins on the maximal elements.

Page 60: Advanced Techniques for Mining Structured Data: Process Mining

Criteria to perform extended joins

• Extend two maximal elements provided that the set of elements they share is above a particular threshold and the differences between them are small.

Page 61: Advanced Techniques for Mining Structured Data: Process Mining

Abstraction

maximals

Can I simplify the log by abstracting the most frequent paths?

Page 62: Advanced Techniques for Mining Structured Data: Process Mining

Export selected log

Can I select a subset of traces where particular paths were executed?

Page 63: Advanced Techniques for Mining Structured Data: Process Mining

Mining Organizational-Related Information about a Process

• How many people are involved in a specific case?

• What is the communication structure and dependencies among people?

• How many transfers happen from one role to another role?

• Who are important people in the communication flow? (the most frequent flow)

• Who subcontracts work to whom?

• Who works on the same tasks?

Page 64: Advanced Techniques for Mining Structured Data: Process Mining

Resources

How many people are involved in a specific case?

Page 65: Advanced Techniques for Mining Structured Data: Process Mining

Social network miner

• What is the communication structure and dependencies among people?

• How many transfers happen from one role to another role?

• Who are important people in the communication flow? (the most frequent flow)

• Who subcontracts work to whom?

• Who works on the same tasks?

Page 66: Advanced Techniques for Mining Structured Data: Process Mining

28 April 2014

Social networks

Process Mining Metrics

For each pair of resources: handover of work, subcontracting, similar tasks, working together, …

• A set of resources connected via social links

• Represented as a graph

Link weights ∈ [0, 1]

Page 67: Advanced Techniques for Mining Structured Data: Process Mining

Handover of work

• Direct or indirect handover of work from one resource to another resource

• Defined on a directed graph with self-loops

[Figure: example graph with resources Sue, Pete, and John]

Page 68: Advanced Techniques for Mining Structured Data: Process Mining

Handover of work

• There is a handover of work from resource R1 to resource R2 in a case C if there are two consecutive events (A1,R1,t1), (A2,R2,t2) with t1 < t2 and no event (A,R,t) with t1 < t < t2
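Direct handovers can be counted by sorting the events of each case by timestamp and looking at consecutive pairs. The following is a sketch with invented data: the function name `handover_of_work`, the event layout `(case, activity, resource, timestamp)`, and the toy events are assumptions. It returns raw counts and leaves any normalisation of the link weights open.

```python
from collections import Counter, defaultdict


def handover_of_work(events):
    """Count direct handovers (R1, R2) between consecutive events of a case."""
    by_case = defaultdict(list)
    for case, _activity, resource, timestamp in events:
        by_case[case].append((timestamp, resource))
    counts = Counter()
    for entries in by_case.values():
        entries.sort()  # order the events of the case by timestamp
        for (_t1, r1), (_t2, r2) in zip(entries, entries[1:]):
            counts[r1, r2] += 1  # r1 == r2 yields a self-loop
    return counts


# Hypothetical events of two cases.
EVENTS = [
    ("c1", "A", "Sue", 1),
    ("c1", "B", "Pete", 2),
    ("c1", "C", "John", 3),
    ("c1", "D", "John", 4),
    ("c2", "A", "Sue", 1),
    ("c2", "B", "Pete", 5),
]
handovers = handover_of_work(EVENTS)
```

Here Sue hands work over to Pete in both cases, and the two consecutive events performed by John produce a self-loop.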

Page 69: Advanced Techniques for Mining Structured Data: Process Mining

Sub-contracting

[Figure: sub-contracting example with resources John and Mike]

Page 70: Advanced Techniques for Mining Structured Data: Process Mining

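Similar tasks

• Resources involved in the same activities may be strongly related.

• Defined based on the activity metrics, on an undirected graph.

• Candidate measures: Minkowski distance, Hamming distance, Pearson correlation coefficient.

Resource/Activity matrix with activity columns A, B, C, D, E: each cell holds the number of cases in which a resource performed an activity (e.g., the number of cases in which resource John performed activity A).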
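As an illustration of the similar-tasks metric, the Pearson correlation coefficient can be computed between the rows of the resource/activity matrix. The profiles below are invented, and the sketch assumes that no profile is constant (otherwise the denominator would be zero).

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation coefficient of two equally long activity profiles."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    covariance = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    spread_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    spread_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return covariance / (spread_u * spread_v)


# Hypothetical profiles: number of cases per activity A, B, C, D, E.
PROFILES = {
    "John": [3, 0, 2, 0, 1],
    "Sue": [6, 0, 4, 0, 2],
    "Mike": [0, 3, 0, 2, 1],
}
john_sue = pearson(PROFILES["John"], PROFILES["Sue"])
john_mike = pearson(PROFILES["John"], PROFILES["Mike"])
```

John and Sue perform the same activities in the same proportions, so their correlation is close to 1, while John and Mike mostly work on different activities and are negatively correlated.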

Page 71: Advanced Techniques for Mining Structured Data: Process Mining

Working together

• Resources working together in the same case are strongly connected

• Measured as the number of cases where R1 and R2 work together divided by the number of cases where resource R1 works

• Directed graph

Sue works in cases 3, 4, 5; Clare works in case 5
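The working-together weight for the Sue and Clare example can be written as a ratio of case sets. This is a sketch of the definition above; the function name `working_together` and the dictionary layout are assumptions.

```python
def working_together(cases_of, r1, r2):
    """Cases shared by r1 and r2, divided by the number of cases of r1."""
    return len(cases_of[r1] & cases_of[r2]) / len(cases_of[r1])


cases_of = {"Sue": {3, 4, 5}, "Clare": {5}}
sue_to_clare = working_together(cases_of, "Sue", "Clare")
clare_to_sue = working_together(cases_of, "Clare", "Sue")
```

The weight from Sue to Clare is 1/3, while the weight from Clare to Sue is 1, which is why the resulting graph is directed.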

Page 72: Advanced Techniques for Mining Structured Data: Process Mining

Metrics for resources in a network

• Degree: the number of connections the node has to other nodes
  • Degree-in
  • Degree-out

• Closeness: a measure of centrality in a network, calculated as the sum of the length of the shortest paths between the node and all other nodes in the graph. Thus the more central a node is, the closer it is to all other nodes

• d(y,x) shortest path between y and x

• N number of nodes

normalized

Page 73: Advanced Techniques for Mining Structured Data: Process Mining

Metrics for resources in a network

• Degree: the number of connections held by a node
  • Degree-in (counted on incoming links)
  • Degree-out (counted on outgoing links)

• What it tells us: How many direct, ‘one hop’ connections each node has to other nodes within the network.

• When to use it: For finding very connected individuals, popular individuals, individuals who are likely to hold most information or individuals who can quickly connect with the wider network

Page 74: Advanced Techniques for Mining Structured Data: Process Mining

Metrics for resources in a network

• Closeness: a measure of centrality in a network, calculated as the sum of the length of the shortest paths between the node and all other nodes in the graph. Thus the more central a node is, the closer it is to all other nodes

d(y,x) = shortest path between y and x and N=number of nodes

• When to use it: For finding the resources who are best placed to influence the entire network most quickly (e.g. find influencers within a single cluster)

normalized
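Degree and closeness can be sketched on a small undirected, unweighted network. The normalised closeness used here, (N − 1) divided by the sum of the shortest-path lengths d(y, x), is an assumption about the formula on the slide; the graph, the names, and the helper functions are hypothetical, and the graph is assumed to be connected.

```python
from collections import deque


def shortest_path_lengths(graph, source):
    """Breadth-first search: number of hops from source to every reachable node."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


def degree(graph, node):
    """Number of direct, one-hop connections of a node."""
    return len(graph[node])


def closeness(graph, node):
    """Assumed normalisation: (N - 1) / sum of shortest-path lengths."""
    distance = shortest_path_lengths(graph, node)
    total = sum(d for other, d in distance.items() if other != node)
    return (len(graph) - 1) / total if total else 0.0


# Hypothetical star-shaped network with John at the centre.
GRAPH = {
    "John": ["Sue", "Pete", "Mike"],
    "Sue": ["John"],
    "Pete": ["John"],
    "Mike": ["John"],
}
central = closeness(GRAPH, "John")
peripheral = closeness(GRAPH, "Sue")
```

John reaches everyone in one hop and gets the maximum closeness of 1, whereas Sue needs two hops to reach Pete and Mike and scores 0.6.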

Page 75: Advanced Techniques for Mining Structured Data: Process Mining

Metrics for resources in a network

• Betweenness: a measure of centrality that measures the number of times a node lies on the shortest path between other nodes.

• What it tells us: This measure shows which nodes act as ‘bridges’ between nodes in a network. It does this by identifying all the shortest paths and then counting how many times each node falls on one.

• When to use it: For finding the individuals who influence the flow around a system.
  • Betweenness is useful for analyzing communication dynamics.

Page 76: Advanced Techniques for Mining Structured Data: Process Mining

Mining for a Similar-Task Social Network

• Pearson correlation coefficient

Page 77: Advanced Techniques for Mining Structured Data: Process Mining

Mining for a Handover-of-Work Social Network

Size by ranking (betweenness) Size by ranking (closeness)