Distributed Process Discovery and Conformance Checking

Post on 07-Nov-2014


prof.dr.ir. Wil van der Aalst

www.processmining.org

PAGE 1

On the different roles of (process) models …

Play-Out

PAGE 2

Play-Out (Classical use of models)

PAGE 3

[Figure: a process model generating traces such as A B C D, A C B D, and A E D]

Play-In

PAGE 4

Play-In

PAGE 5

[Figure: example traces such as A B C D, A C B D, and A E D from which a process model is constructed]

Example Process Discovery (Vestia, Dutch housing agency, 208 cases, 5987 events)

PAGE 6

Example Process Discovery (ASML, test process lithography systems, 154966 events)

PAGE 7

Example Process Discovery (AMC, 627 gynecological oncology patients, 24331 events)

PAGE 8

Replay

PAGE 9

Replay

PAGE 10

A B C D

Replay

PAGE 11

A E D

Replay can detect problems

PAGE 12

A C D

Problem! Missing token

Problem! Token left behind

Conformance Checking (WOZ objections Dutch municipality, 745 objections, 9583 events, f = 0.988)

PAGE 13

Replay can extract timing information

PAGE 14

[Figure: trace A B C D replayed with timestamps (A at 5, B at 8, C at 9, D at 13) to annotate the model with timing information]

PAGE 15

Performance Analysis Using Replay (WOZ objections Dutch municipality, 745 objections, 9583 events, f = 0.988)

PAGE 16

Big Data

Big Data

PAGE 17

Source: “Big Data: The Next Frontier for Innovation, Competition, and Productivity” McKinsey Global Institute, 2011.

“Enterprises globally stored more than 7 exabytes of new data on disk drives in 2010, while consumers stored more than 6 exabytes of new data on devices such as PCs and notebooks.”

“All of the world's music can be stored on a $600 disk drive.”

“Indeed, we are generating so much data today that it is physically impossible to store it all. Health care providers, for instance, discard 90 percent of the data that they generate.”

PAGE 18

Hilbert and Lopez. The World's Technological Capacity to Store, Communicate, and Compute Information. Science, 332(6025):60-65, 2011.

PAGE 19

www.olifantenpaadjes.nl

PAGE 20

PAGE 21

Evidence-Based Computer Science

PAGE 22

PAGE 23

How good is my model?

Four Competing Quality Criteria

PAGE 24

fitness: “able to replay event log”

simplicity: “Occam’s razor”

generalization: “not overfitting the log”

precision: “not underfitting the log”

Example: one log four models

PAGE 25

Activities: a = register request, b = examine thoroughly, c = examine casually, d = check ticket, e = decide, f = reinitiate request, g = pay compensation, h = reject request

N1 : fitness = +, precision = +, generalization = +, simplicity = +

N2 : fitness = -, precision = +, generalization = -, simplicity = +

N3 : fitness = +, precision = -, generalization = +, simplicity = +

N4 : fitness = +, precision = +, generalization = -, simplicity = -

[Figure: Petri nets N1 to N4 between start and end. N2 is the sequence a, c, d, e, h. N4 lists all 21 variants seen in the log.]

Event log

#      trace
455    acdeh
191    abdeg
177    adceh
144    abdeh
111    acdeg
82     adceg
56     adbeh
47     acdefdbeh
38     adbeg
33     acdefbdeh
14     acdefbdeg
11     acdefdbeg
9      adcefcdeh
8      adcefdbeh
5      adcefbdeg
3      acdefbdefdbeg
2      adcefdbeg
2      adcefbdefbdeg
1      adcefdbefbdeh
1      adbefbdefdbeg
1      adcefdbefcdefdbeg
1391   (total)

[Figure: process discovery balances fitness, precision, generalization, and simplicity]
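The gap between N2 and N4 on this log can be made concrete with a rough sketch. It assumes, as read from the figures, that N2 allows only the sequence a, c, d, e, h and that N4 allows exactly the variants in the log. The helper `fitting_fraction` is a hypothetical name, and the case-level share it computes is a simpler notion than the fitness values used later in the slides.

```python
from collections import Counter

# Variant frequencies copied from the event log on this slide.
LOG = Counter({
    "acdeh": 455, "abdeg": 191, "adceh": 177, "abdeh": 144, "acdeg": 111,
    "adceg": 82, "adbeh": 56, "acdefdbeh": 47, "adbeg": 38, "acdefbdeh": 33,
    "acdefbdeg": 14, "acdefdbeg": 11, "adcefcdeh": 9, "adcefdbeh": 8,
    "adcefbdeg": 5, "acdefbdefdbeg": 3, "adcefdbeg": 2, "adcefbdefbdeg": 2,
    "adcefdbefbdeh": 1, "adbefbdefdbeg": 1, "adcefdbefcdefdbeg": 1,
})


def fitting_fraction(log, allowed_traces):
    """Share of cases whose trace is in the set the model can replay."""
    total = sum(log.values())
    fitting = sum(n for trace, n in log.items() if trace in allowed_traces)
    return fitting / total


# Assumption: N2 allows only the sequence a, c, d, e, h.
N2_LANGUAGE = {"acdeh"}
# Assumption: N4 allows exactly the 21 variants seen in the log.
N4_LANGUAGE = set(LOG)

print(sum(LOG.values()), len(LOG))
print(round(fitting_fraction(LOG, N2_LANGUAGE), 3))
print(fitting_fraction(LOG, N4_LANGUAGE))
```

Under these assumptions N2 replays only the most frequent variant, while N4 replays every case but cannot replay any variant outside the log, which is why the slide rates its generalization as "-".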

Model N1

PAGE 26

[Figure: Petri net N1 next to the event log of slide 25 (21 variants, 1391 traces)]

Model N2

PAGE 27

[Figure: Petri net N2 next to the event log of slide 25 (21 variants, 1391 traces)]

Model N3

PAGE 28

[Figure: Petri net N3 next to the event log of slide 25 (21 variants, 1391 traces)]

Model N4

PAGE 29

[Figure: Petri net N4, which lists all 21 variants seen in the log, next to the event log of slide 25 (1391 traces)]

N4 : fitness = +, precision = +, generalization = -, simplicity = -

PAGE 30

Process Discovery

Process Discovery (small selection)

PAGE 31

α algorithm

α++ algorithm

α# algorithm

language-based regions

state-based regions

genetic mining

heuristic mining

hidden Markov models

neural networks

automata-based learning

stochastic task graphs

conformal process graph

mining block structures

multi-phase mining

partial-order based mining

fuzzy mining

LTL mining

ILP mining

distributed genetic mining

Petri net view: Just discover the places …

PAGE 32

Adding a place limits behavior:

• overfitting ≈ adding too many places

• underfitting ≈ adding too few places

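Example: Process Discovery Using State-Based Regions

PAGE 33

[Figure: transition system with states [ ], [a], [a,b], [a,c], [a,e], [a,b,c], [a,d,e], [a,b,c,d] and the Petri net derived from it, with transitions a to e and places start, p1, p2, p3, p4, end]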

Example of Region

PAGE 34

[Figure: the same transition system and Petri net, with one region highlighted]

enter: b, e

leave: d

do-not-cross: a, c

Example: Process Discovery Using Language-Based Regions

PAGE 35

A place is feasible if it can be added without disabling any of the traces in the event log.
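The feasibility condition can be sketched as a simple replay check. This is a minimal illustration, not the region-based algorithm itself: it assumes all arcs have weight 1, and `place_is_feasible`, `producers`, and `consumers` are hypothetical names for the candidate place's input and output transitions.

```python
def place_is_feasible(traces, producers, consumers, initial_tokens=0):
    """Return True if the candidate place never blocks a trace.

    producers: activities that put a token into the place.
    consumers: activities that take a token from the place.
    Assumption: every arc has weight 1.
    """
    for trace in traces:
        tokens = initial_tokens
        for activity in trace:
            if activity in consumers:
                if tokens == 0:
                    return False  # the place would disable this trace
                tokens -= 1
            if activity in producers:
                tokens += 1
    return True


log = ["acdeh", "abdeg", "adceh"]
print(place_is_feasible(log, producers={"a"}, consumers={"b", "c"}))
print(place_is_feasible(log, producers={"b"}, consumers={"c"}))
```

The first candidate place sits between a and the choice of b or c and never blocks a trace. The second would require b before every c, which the trace acdeh violates.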

PAGE 36

Conformance Checking

Replaying trace “abeg”

PAGE 37

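[Figure: trace a b e g replayed on the model; one token is missing (m = 1) and one token remains (r = 1), with p = 6 produced and c = 6 consumed tokens]

\[
\mathit{fitness} = \frac{1}{2}\left(1 - \frac{m}{c}\right) + \frac{1}{2}\left(1 - \frac{r}{p}\right) = \frac{1}{2}\left(1 - \frac{1}{6}\right) + \frac{1}{2}\left(1 - \frac{1}{6}\right) = 0.83333
\]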
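The replay of “abeg” can be reproduced with a small token-replay sketch. The net structure below is an assumption reconstructed from the figure of N1 (places start, p1 to p5, end). The counters p, c, m, r stand for produced, consumed, missing, and remaining tokens, and `token_replay` is a hypothetical helper name.

```python
from collections import Counter

# Assumed structure of N1: transition -> (input places, output places).
N1 = {
    "a": (["start"], ["p1", "p2"]),
    "b": (["p1"], ["p3"]),
    "c": (["p1"], ["p3"]),
    "d": (["p2"], ["p4"]),
    "e": (["p3", "p4"], ["p5"]),
    "f": (["p5"], ["p1", "p2"]),
    "g": (["p5"], ["end"]),
    "h": (["p5"], ["end"]),
}


def token_replay(trace, net=N1):
    """Replay a trace and return (p, c, m, r, fitness)."""
    marking = Counter({"start": 1})
    p, c, m = 1, 0, 0  # the environment produces the initial token
    for activity in trace:
        inputs, outputs = net[activity]
        for place in inputs:
            if marking[place] == 0:
                m += 1  # token is missing, pretend it was there
            else:
                marking[place] -= 1
            c += 1
        for place in outputs:
            marking[place] += 1
            p += 1
    # The environment consumes the token from the end place.
    if marking["end"] == 0:
        m += 1
    else:
        marking["end"] -= 1
    c += 1
    r = sum(marking.values())  # tokens left behind
    fitness = 0.5 * (1 - m / c) + 0.5 * (1 - r / p)
    return p, c, m, r, fitness


print(token_replay("abeg"))
print(token_replay("acdeh"))
```

For “abeg” the token in p4 is missing when e fires and the token in p2 is left behind, which gives the counts on the slide.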

Can be lifted to log level

PAGE 38

[Figure: Petri net N1 with places start, p1, p2, p3, p4, p5, end, next to the event log of slide 25 (21 variants, 1391 traces)]

From “playing the token game” to optimal alignments …

a b » e g
a b d e g

PAGE 39

observed trace: “abeg”

move in model only

Another alignment

a b c d e g
a b » d e g

PAGE 40

observed trace: “abcdeg”

move in log only

Moves in an alignment

PAGE 41

a b » d e g   (trace in event log)
a » c d e g   (possible run of model)

(b,») is a move in log, (»,c) is a move in model, and the remaining columns are moves in both.

Optimal alignment describes modeled behavior closest to observed behavior

Moves have costs

• Standard cost function:

− c(x,») = 1
− c(»,y) = 1
− c(x,y) = 0, if x = y
− c(x,y) = ∞, if x ≠ y

PAGE 42

… a …
… » …

… » …
… a …

… a …
… a …

… a …
… b …

Non-fitting trace: abefdeg

PAGE 43

observed trace: abefdeg

a b » e f d » e g
a b d e f d b e g     (cost 2)

a b e f d e g
a b » » d e g         (cost 2)
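Under the standard cost function, the cost of aligning a trace with one given model run can be computed by dynamic programming. The sketch below is only that inner step: it takes the model run as input and does not search over all runs of the model. `alignment_cost` is a hypothetical helper name.

```python
def alignment_cost(trace, run):
    """Cheapest alignment of a trace with one given model run.

    Standard cost function: a move in log or a move in model costs 1,
    a move in both on the same activity costs 0, and pairing two
    different activities is not allowed (infinite cost).
    """
    n, m = len(trace), len(run)
    # cost[i][j] is the cheapest alignment of trace[:i] with run[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i  # only moves in log
    for j in range(1, m + 1):
        cost[0][j] = j  # only moves in model
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(cost[i - 1][j] + 1,   # move in log
                       cost[i][j - 1] + 1)   # move in model
            if trace[i - 1] == run[j - 1]:
                best = min(best, cost[i - 1][j - 1])  # move in both
            cost[i][j] = best
    return cost[n][m]


print(alignment_cost("abefdeg", "abdefdbeg"))
print(alignment_cost("abefdeg", "abdeg"))
```

Both model runs from the slide give cost 2 for the observed trace abefdeg. An optimal alignment is the minimum of this cost over all runs the model allows.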

Any cost structure is possible

PAGE 44

… send-letter(John, 2 weeks, $400)

… send-email(Sue, 3 weeks, $500)

• Similar activities (more similarity implies lower costs).

• Resource conformance (done by someone that does not have the specified role).

• Data conformance (path is not possible for this customer).

• Time conformance (missed the legal deadline).

Fitness

PAGE 45

[Figure: the four models N1 to N4 and the event log of slide 25 (21 variants, 1391 traces), annotated with their fitness values]

fitness(N1) = 1.0

fitness(N2) = 0.8

fitness(N3) = 1.0

fitness(N4) = 1.0

Our A* algorithm exploits the Petri net marking equation and uses other “tricks” to prune the search space.

Aligned event log is starting point for other types of analysis.

PAGE 46

Distributing/Decomposing “Big Data” Process Mining Problems

PAGE 47

What if?

PAGE 48

[Figure: car rental process model (book car, add extra insurance, skip extra insurance, change booking, confirm, initiate check-in, select car, check driver’s license, charge credit card, supply car) with activities a to l and places in, c1 to c11, out, linked to an event log by process discovery and conformance checking]

Example traces: acefgijkl, acddefhkjil, abdefjkgil, acdddefkhijl, acefgijkl, abefgjikl, ...

What if ...

• there are more than 1000 different activities?

• there are more than 1,000,000 cases?

• there are more than 100,000,000 events?

Distributed computing

• multicore CPU

• manycore GPU

• cluster computing

• grid computing

• cloud computing

• …

PAGE 49

How to distribute process discovery?

PAGE 50

How to distribute conformance checking?

PAGE 51

[Figure: event log replayed on a model with transitions a to g and places in, c1 to c6, out. Fitting traces such as abcdeg and abdceg are separated from deviating traces such as acdefg and adcfeg.]

b is often skipped

f occurs too often

Classification based on partitioning of event log: vertical and horizontal

PAGE 52

sets of cases

sets of activities

Replication: Same event log on all computing nodes

PAGE 53

Only makes sense if the algorithm has random elements, e.g., genetic process mining.

Vertical distribution I: Split cases arbitrarily

PAGE 54

[Figure: the cases of the event log (traces such as abcdeg, abdceg, abdcefbcdeg) are split arbitrarily into sets of cases]
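An arbitrary vertical split can be sketched as a round-robin assignment of cases to computing nodes. The function name `split_cases` and the round-robin rule are illustrative assumptions; any rule that puts each case in exactly one part would do.

```python
def split_cases(cases, n_parts):
    """Distribute cases over n_parts sublogs in round-robin order."""
    parts = [[] for _ in range(n_parts)]
    for index, trace in enumerate(cases):
        parts[index % n_parts].append(trace)
    return parts


cases = ["abcdeg", "abdceg", "abdcefbcdeg", "abcdeg", "abdceg"]
for part in split_cases(cases, 2):
    print(part)
```

Each case stays intact, so every sublog is itself a valid event log over the same set of activities.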

Vertical distribution II: Split cases based on a specific feature

PAGE 55

[Figure: the cases of the event log are split into three sets of cases: traces without f (abcdeg, abdceg), traces with one f (such as abdcefbcdeg and abcdefbdceg), and traces with two f's (abdcefbdcefbdceg, abcdefbcdefbdceg)]
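Splitting on a feature can be sketched as grouping cases by a function of the trace. The feature used here, the number of times f occurs, is an assumption read off the example groups. `split_by_feature` and `loop_count` are hypothetical names.

```python
from collections import defaultdict


def split_by_feature(cases, feature):
    """Group cases by the value of a feature function."""
    groups = defaultdict(list)
    for trace in cases:
        groups[feature(trace)].append(trace)
    return dict(groups)


def loop_count(trace):
    """Assumed feature: how often activity f occurs in the trace."""
    return trace.count("f")


cases = ["abcdeg", "abdcefbcdeg", "abdceg", "abdcefbdcefbdceg"]
for value, group in sorted(split_by_feature(cases, loop_count).items()):
    print(value, group)
```

In practice the feature could also be a case attribute such as customer type or region, if the log records one.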

Horizontal distribution

PAGE 56

sets of activities

Horizontal distribution: The key idea

PAGE 57

projected on a,b,e,f,g

projected on b,c,d,e
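Projection keeps only the events of a chosen set of activities and preserves their order. A minimal sketch, with traces written as strings of single-letter activities and `project` and `project_log` as hypothetical helper names:

```python
def project(trace, activities):
    """Keep only the events whose activity is in the given set."""
    return "".join(a for a in trace if a in activities)


def project_log(log, activities):
    """Project every trace of the log onto the activity set."""
    return [project(trace, activities) for trace in log]


log = ["abcdeg", "abdcefbcdeg"]
print(project_log(log, set("abefg")))
print(project_log(log, set("bcde")))
```

Each computing node then only needs the sublog for its own activity set, which is typically much smaller than the full log.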

PAGE 58

Passages for Horizontal Distribution

Passages

PAGE 59

Passage P=(X,Y)

PAGE 60

causal dependency: may trigger or enable

Minimal passages

PAGE 61

a passage is minimal if it does not contain smaller passages

Passages define an equivalence relation on the edges in the graph

PAGE 62

Minimal passage 1: ({a},{b,c})

PAGE 63

Minimal passage 2: ({b,c,d},{d,e,f})

PAGE 64

Minimal passage 3: ({e},{g})

PAGE 65

Minimal passage 4: ({f},{h})

PAGE 66

Minimal passage 5: ({g,h},{i})
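The five minimal passages can be recomputed from a causal graph by grouping edges: two edges belong to the same minimal passage if they share a source or a target, directly or through a chain of such edges. The edge list below is a hypothetical graph chosen to be consistent with the passages listed on the slides, since the actual graph is only shown as a figure. `minimal_passages` is an illustrative name.

```python
# Hypothetical causal graph, chosen to match the passages on the slides.
EDGES = [
    ("a", "b"), ("a", "c"),
    ("b", "e"), ("b", "f"), ("c", "f"), ("c", "d"), ("d", "d"), ("d", "e"),
    ("e", "g"), ("f", "h"), ("g", "i"), ("h", "i"),
]


def minimal_passages(edges):
    """Return the minimal passages (X, Y) as pairs of frozensets."""
    parent = {edge: edge for edge in edges}

    def find(edge):
        while parent[edge] != edge:
            parent[edge] = parent[parent[edge]]  # path halving
            edge = parent[edge]
        return edge

    def union(first, second):
        parent[find(first)] = find(second)

    by_source, by_target = {}, {}
    for edge in edges:
        source, target = edge
        if source in by_source:
            union(edge, by_source[source])  # same source, same passage
        else:
            by_source[source] = edge
        if target in by_target:
            union(edge, by_target[target])  # same target, same passage
        else:
            by_target[target] = edge

    groups = {}
    for edge in edges:
        groups.setdefault(find(edge), []).append(edge)
    return {
        (frozenset(x for x, _ in group), frozenset(y for _, y in group))
        for group in groups.values()
    }


for x, y in sorted(minimal_passages(EDGES), key=lambda p: sorted(p[0])):
    print(sorted(x), sorted(y))
```

Because every edge ends up in exactly one group, the minimal passages partition the edges of the graph, which matches the statement that passages define an equivalence relation on the edges.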

PAGE 67

So What?

• Any process model can be partitioned in minimal passages.

• Discovery and conformance checking can be done per passage!

PAGE 68

[Figure: process model with activities a to p, partitioned into passages connected through clouds]

Clouds may contain arbitrary subprocesses not explicitly recorded in the event log (invisible activities or small networks used for routing, e.g. XOR/AND/OR-split/joins).

Example result for Petri nets

PAGE 69

“The event log fits all passages if and only if the event log fits the whole model.”

[Figure: the same process model with activities a to p, partitioned into passages]

Key insight: interface transitions controlled by event log

causal structure obtained using heuristics & domain knowledge

Discovery example

PAGE 70

[Figure: causal structure over activities a to g with in and out, split into passages, and the Petri net discovered from them, with places in, c1 to c6, out]

Conformance checking

PAGE 71

[Figure: the car rental process model of slide 48 with the event log projected onto the activities a, b, c, d, e, f, l]

Projected traces: acefl, acddefl, abdefl, acdddefl, acefl, abefl, ...

Create Skeleton

PAGE 72

Net fragments per passage

PAGE 73

Initial implementation in ProM

PAGE 74

Superlinear speedups are possible (decomposition helps even when using a single computer)

PAGE 75

PAGE 76

Conclusion

Conclusion

PAGE 77

“Big Data”

[Figure: process model with activities a to p, partitioned into passages]

PAGE 78

www.processmining.org

www.win.tue.nl/ieeetfpm/