Novatica 223 - Process Mining (English Edition)


Novática, founded in 1975, is the oldest periodical publication amongst those specialized in Information and Communication Technology (ICT) existing today in Spain. It is published by ATI (Asociación de Técnicos de Informática), which also publishes REICIS (Revista Española de Innovación, Calidad e Ingeniería del Software).

<http://www.ati.es/novatica/>
<http://www.ati.es/reicis/>

ATI is a founding member of CEPIS (Council of European Professional Informatics Societies), Spain's representative in IFIP (International Federation for Information Processing), and a member of CLEI (Centro Latinoamericano de Estudios en Informática) and CECUA (Confederation of European Computer User Associations). It has a collaboration agreement with ACM (Association for Computing Machinery) as well as with diverse Spanish organisations in the ICT field.

Editorial Board
Guillem Alsina González, Rafael Fernández Calvo (Chair of the Board), Jaime Fernández Martínez, Luís Fernández Sanz, José Antonio Gutiérrez de Mesa, Silvia Leal Martín, Dídac López Viñas, Francesc Noguera Puig, Joan Antoni Pastor Collado, Víktu Pons i Colomer, Moisés Robles Gener, Cristina Vigil Díaz, Juan Carlos Vigo López

Chief Editor
Llorenç Pagés Casas <[email protected]>
Layout
Jorge Llácer Gil de Ramales
Translations
Grupo de Lengua e Informática de ATI <http://www.ati.es/gt/lengua-informatica/>
Administration
Tomás Brunete, María José Fernández, Enric Camarero

Section Editors

Artificial Intelligence
Vicente Botti Navarro, Julián Inglada (DSIC-UPV), <{vbotti,viglada}@dsic.upv.es>
Computational Linguistics
Xavier Gómez Guinovart (Univ. de Vigo), <[email protected]>
Manuel Palomar (Univ. de Alicante), <[email protected]>
Computer Architecture
Enrique F. Torres Moreno (Universidad de Zaragoza), <[email protected]>
José Flich Cardo (Universidad Politécnica de Valencia), <[email protected]>
Computer Graphics
Miguel Chover Sellés (Universitat Jaume I de Castellón), <[email protected]>
Roberto Vivó Hernando (Eurographics, sección española), <[email protected]>
Computer Languages
Óscar Belmonte Fernández (Univ. Jaime I de Castellón), <[email protected]>
Inmaculada Coma Tatay (Univ. de Valencia), <[email protected]>
e-Government
Francisco López Crespo (MAE), <[email protected]>
Sebastià Justicia Pérez (Diputación de Barcelona), <[email protected]>
Free Software
Jesús M. González Barahona (GSYC-URJC), <[email protected]>
Israel Herráiz Tabernero (Universidad Politécnica de Madrid), <[email protected]>
Human-Computer Interaction
Pedro M. Latorre Andrés (Universidad de Zaragoza, AIPO), <[email protected]>
Francisco L. Gutierrez Vela (Universidad de Granada, AIPO), <[email protected]>
ICT and Tourism
Andrés Aguayo Maldonado, Antonio Guevara Plaza (Universidad de Málaga), <{aguayo,guevara}@lcc.uma.es>
Informatics and Philosophy
José Ángel Olivas Varela (Escuela Superior de Informática, UCLM), <[email protected]>
Roberto Feltrero Oreja (UNED), <[email protected]>
Informatics Profession
Rafael Fernández Calvo (ATI), <[email protected]>
Miquel Sarriès Griñó (ATI), <[email protected]>
Information Access and Retrieval
José María Gómez Hidalgo (Optenet), <[email protected]>
Enrique Puertas Sanz (Universidad Europea de Madrid), <[email protected]>
Information Systems Auditing
Marina Touriño Troitiño, <[email protected]>
Sergio Gómez-Landero Pérez (Endesa), <[email protected]>
IT Governance
Manuel Palao García-Suelto (ATI), <[email protected]>
Miguel García-Menéndez (iTTi), <[email protected]>
Knowledge Management
Joan Baiget Solé (Cap Gemini Ernst & Young), <[email protected]>
Language and Informatics
M. del Carmen Ugarte García (ATI), <[email protected]>
Law and Technology
Isabel Hernando Collazos (Fac. Derecho de Donostia, UPV), <[email protected]>
Elena Davara Fernández de Marcos (Davara & Davara), <[email protected]>
Networking and Telematic Services
Juan Carlos López López (UCLM), <[email protected]>
Ana Pont Sanjuán (UPV), <[email protected]>
Personal Digital Environment
Andrés Marín López (Univ. Carlos III), <[email protected]>
Diego Gachet Páez (Universidad Europea de Madrid), <[email protected]>
Software Modeling
Jesús García Molina (DIS-UM), <[email protected]>
Gustavo Rossi (LIFIA-UNLP, Argentina), <[email protected]>
Students' World
Federico G. Mon Trotti (RITSI), <[email protected]>
Mikel Salazar Peña (Área de Jóvenes Profesionales, Junta de ATI Madrid), <[email protected]>
Real Time Systems
Alejandro Alonso Muñoz, Juan Antonio de la Puente Alfaro (DIT-UPM), <{aalonso,jpuente}@dit.upm.es>
Robotics
José Cortés Arenas (Sopra Group), <[email protected]>
Juan González Gómez (Universidad Carlos III), <[email protected]>
Security
Javier Areitio Bertolín (Univ. de Deusto), <[email protected]>
Javier López Muñoz (ETSI Informática-UMA), <[email protected]>
Software Engineering
Luis Fernández Sanz, Daniel Rodríguez García (Universidad de Alcalá), <{luis.fernandezs,daniel.rodriguez}@uah.es>
Technologies and Business
Didac López Viñas (Universitat de Girona), <[email protected]>
Alonso Álvarez García (TID), <[email protected]>
Technologies for Education
Juan Manuel Dodero Beardo (UC3M), <[email protected]>
César Pablo Córcoles Briongo (UOC), <[email protected]>
Teaching of Computer Science
Cristóbal Pareja Flores (DSIP-UCM), <[email protected]>
J. Ángel Velázquez Iturbide (DLSI I, URJC), <[email protected]>
Technological Trends
Juan Carlos Vigo (ATI), <[email protected]>
Gabriel Martí Fuentes (Interbits), <[email protected]>
Web Standards
Encarna Quesada Ruiz (Virati), <[email protected]>
José Carlos del Arco Prieto (TCP Sistemas e Ingeniería), <[email protected]>

Copyright © ATI 2014
The opinions expressed by the authors are their exclusive responsibility.

Editorial Office, Advertising and Madrid Office
Plaza de España 6, 2ª planta, 28008 Madrid
Phone: 914029391; fax: 913093685 <[email protected]>
Layout and Comunidad Valenciana Office
Av. del Reino de Valencia 23, 46005 Valencia
Phone: 963740173 <[email protected]>
Accounting, Subscriptions and Catalonia Office
Calle Ávila 48-50, 3ª planta, local 9, 08005 Barcelona
Phone: 934125235; fax: 934127713 <[email protected]>
Andalucía Office <[email protected]>
Galicia Office <[email protected]>
Subscriptions and Sales <[email protected]>
Advertising
Plaza de España 6, 2ª planta, 28008 Madrid
Phone: 914029391; fax: 913093685 <[email protected]>
Legal deposit: B 15.154-1975 -- ISSN: 0211-2124; CODEN NOVAEC
Cover Page: "Mineral, Vegetable, Animal" - Concha Arias Pérez / © ATI
Layout Design: Fernando Agresta / © ATI 2003

Special English Edition 2013-2014 - Annual Selection of Articles

editorial
ATI: Boosting the Future > 02

From the Chief Editor's Pen
Process Mining: Taking Advantage of Information Overload > 02
Llorenç Pagés Casas

monograph
Process Mining
Guest Editors: Antonio Valle-Salas and Anne Rozinat

Presentation. Introduction to Process Mining > 04Antonio Valle-Salas, Anne Rozinat

Process Mining: The Objectification of Gut Instinct - Making Business Processes More Transparent Through Data Analysis > 06
Anne Rozinat, Wil van der Aalst

Process Mining: X-Ray Your Business Processes > 10Wil van der Aalst

The Process Discovery Journey > 18Josep Carmona

Using Process Mining in ITSM > 22Antonio Valle-Salas

Process Mining-Driven Optimization of a Consumer Loan Approvals Process > 30Arjel Bautista, Lalit Wangikar, S.M. Kumail Akbar

Detection of Temporal Changes in Business Processes Using Clustering Techniques > 39
Daniela Luengo, Marcos Sepúlveda

summary


novática Special English Edition - 2013-2014 Annual Selection of Articles

ATI: Boosting the Future

editorial

From the Chief Editor's Pen

Process Mining: Taking Advantage of Information Overload

In its 2014 General Assembly, ATI, the 47-year-old Spanish association which publishes Novática, set the foundations for a new stage in the development of its activities. The big changes that have occurred in the profession's structure and the huge advances in information and communication technology make necessary a transition towards a new model for the functioning of our association, one which combines agility, efficiency and new ways of participation for our members. Special mention goes to our aim of international reach and presence, reinforced by the recent approval of a new category of international membership at ATI.

The action plan approved in the General Assembly includes the following main priorities:

Internationalization: This is going to be one of our cornerstones for the future, with a special focus on ATI's expansion in Spanish-speaking Latin American countries.

Services for Members: ATI members will continue to be the centre of our attention. For this purpose we are planning to create a variety of offers of professional activities and training intended to be attractive both in content and cost, all of them carrying the ATI trust mark as a guarantee.

Brand and Communication: We aim to further develop and enhance our communication channels: the newsletter we publish on a weekly basis, increasing participation in social networks, and the promotion and spreading of our activities. Special emphasis will be placed on Novática, the improvement of its digital version and the search for collaborative partnerships to increase its reach.

Sustainability: We will continue to streamline our financial structure to ensure the viability of our activities by creating a model which allows us to keep them attractive without a further impact on costs and membership fees.

Professionalism: We are planning to define a structure and concept of the IT profession and of the quality of its development as a framework for our international activities.

Networking: This is without a doubt one of the main reasons for the existence of our association, and therefore we should keep improving our capabilities to establish relationships among our members based on their interests and preferences.

Changes in ATI Bylaws: We will create a working group to prepare a proposal for a new associative framework that aligns with the aforementioned goals of gaining relevance and influence in national and international contexts, while our organization becomes more agile and responsive to IT professionals.

To sum up, we can say that after our General Assembly, ATI is fully prepared to continue in its role as the leading IT association in Spain, representing our professionals worldwide.

Dídac López Viñas
President of ATI

In the so-called "information era" almost every simple event can be digitized. In fact, speaking in business terms, we could say that all of them will be digitized, without a doubt, because doing so is easy and cheap.

But is it really cheap to do that? Not necessarily. Being able to browse through thousands of millions of event records is not synonymous with success in business. On the contrary, it can lead you to failure if you don't have appropriate tools to manage the glut of information.

In other words, we are not going to discuss here the value of having that plethora of information, but how to take advantage of it while avoiding its associated problems.

Data Mining and Business Intelligence are (relatively) new approaches to tackling this issue, and we have covered these topics broadly in Novática in recent years. Nevertheless, let me say that "mining data" or "applying intelligence to business" sound like incomplete approaches. "Do you aim to extract just 'data' from all that information?" you could object to the former. "You apply 'intelligence' to business, but how systematically?" you could argue with the partisans of the latter.

In that context, Process Mining appears to be the missing link, or at least one of the most important of them. This is the feeling you get when you are introduced to this new discipline in business terms. Indeed, it looks plausible to think that all the business events whose historical data are being stored on a daily basis respond to business models and patterns. And, besides, it looks really feasible to think that artificial intelligence techniques could help you to explore how your data is connected in order to discover process models and patterns you are not aware of.

The deeper you go into the subject, the more you feel this is definitely a very promising approach. Especially considering that, in the business arena, the concept of "process" is increasingly placed at a higher level than "data" or even "data collection". Nowadays, the emerging trend of setting out corporate governance goals and business compliance requirements uses the concept of 'business process' as its elementary unit of work.

All those reasons have led us to select this monograph entitled "Process Mining", whose guest editors have been Antonio Valle-Salas (Managing Partner of G2) and Anne Rozinat (Co-founder of Fluxicon), for the present Special English Edition of Novática.

We warmly thank the guest editors and the authors for their highly valuable contribution to the monograph. Sincere thanks must also be extended to Arthur Cook, Jim Holder, Roger Shlomo-Harris and William Centrella for their excellent work in editing the articles.

I hope you enjoy this issue as much as we enjoyed editing it.

Llorenç Pagés Casas
Chief Editor of Novática


International Federation for Information Processing
Hofstrasse 3, A-2361 Laxenburg, Austria
Phone: +43 2236 73616; Fax: +43 2236 736169
E-mail: [email protected]; http://www.ifip.org

IFIP is the global forum of ICT experts that plays an active role in developing and offering expertise to all stakeholders of the information and knowledge society.



monograph Process Mining


Guest Editors

Presentation. Introduction to Process Mining

During the last few decades information technology has touched every part of our lives. From cellular phones to the most advanced medical information processing systems, from vending machines to PLCs in production lines, computerized components are everywhere. All these components generate vast amounts of information that are growing exponentially. Relatively few years ago the challenge was finding digitized information, whereas now the problem is being able to process and give meaning to all the information we generate.

In recent years we have seen how the information analysis industry has proposed various approaches to this problem. Some of them have been covered in one way or another in previous editions of Novática: starting with VLDB (Very Large Databases) in volume 91 (1991), followed by data warehouse approaches attempting to discover patterns in these data stores with Data Mining in volume 138 (1999), and then Knowledge Management in volume 155 (2002). We realized how complex the problem was in the monograph on The Internet of Things in volume 209 (2011), and saw how we could exploit this information in volume 211 on Business Intelligence (2011). Finally, the industry is also moving in a direction not yet covered in Novática but certain to be addressed in the near future: Big Data.

In the present volume of Novática we address a particularly interesting topic within this broad range of techniques for data analysis: Process Mining. This is a variant of data mining in which we focus on analyzing the information generated by processes that have been computerized and whose executions have been traced. As Anne Rozinat and Wil van der Aalst explain in the opening article, the first traces are found in the late nineteenth century, although in terms of modern science we refer to the seminal work of Myhill/Nerode (1958), or the Viterbi algorithm (1978).

In the late 90s there were already specific research teams at universities around the world, especially the University of Colorado and the Technische Universiteit Eindhoven. These teams developed their research defining algorithms and methods that allow the treatment of process execution traces for the discovery, analysis and representation of the underlying processes.

However, no tools that implemented these algorithms with an appropriate degree of usability had yet reached the market. By the end of 2003 the specialized community processmining.org (a working group of the TU/e) was created, and in early 2004 the first version of ProM was developed: a generic and open-source framework for process mining that has become the primary tool for researchers and analysts, now at version 6.3 and including more than 500 plugins that implement the state of the art in this field.

In 2009 an IEEE Task Force focused on process mining was created; it now has members from over 20 countries, including software vendors (such as Software AG, HP, IBM or Fluxicon, among many others), consulting firms and analysts (ProcessSphere, Gartner and Deloitte, among others) and a wide range of educational and research institutions (TU/e, Universitat Politècnica de Catalunya or Universität zu Berlin, to name but a few). One of the key objectives of this task force is to spread the concepts, techniques and benefits of process mining. In 2011 they published the Process Mining Manifesto, a document signed by more than 50 professionals and translated into 12 languages. We cannot include the full text of the manifesto here, but the reader will find the reference in the links section of this monograph.

For this present edition of Novática we have been privileged to have a group of authors that give us different perspectives on the matter. We begin with an introductory article in which Anne Rozinat and Wil van der Aalst set the context for process mining concepts and state, in a very enlightening message, that process mining allows us to get an objective vision of our processes.

In the second paper Wil van der Aalst guides us through the different uses we can make of process mining: to create a model of the process, to check the compliance of the model, or to improve an existing model. Here another key message is presented: the use of process mining as X-rays that allow us to see "inside" the process, based on the analysis of real data from the execution of all cases (as opposed to the statistical sampling we would do in an audit, for example).

In the next article you will find Josep Carmona's vision of the task of discovering a process from its traces. Here Josep takes an entertaining approach to how we could use process mining to decrypt the message of an alien explaining his visit to planet Earth, while showing us the anatomy of the discovery process.

The introductory papers will be followed by a set of articles focusing on case studies. First, Antonio Valle-Salas' article presents an application of process mining in a specific industry, focusing on the processes in an IT

Antonio Valle-Salas1, Anne Rozinat2

Managing Partner of G2; Co-founder ofFluxicon

<[email protected]>, <[email protected]>

Antonio Valle-Salas is Managing Partner of G2 and a specialist consultant in ITSM (Information Technology Service Management) and IT Governance. He graduated as a Technical Engineer in Management Informatics from UPC (Universitat Politècnica de Catalunya) and holds a number of methodology certifications such as ITIL Service Manager from EXIN (Examination Institute for Information Science), Certified Information Systems Auditor (CISA) from ISACA, and COBIT Based IT Governance Foundations from IT Governance Network, plus more technical certifications in the HP OpenView family of management tools. He is a regular collaborator with itSMF (IT Service Management Forum) Spain and its Catalan chapter, and combines consulting and project implementation activities with frequent collaborations in educational activities in a university setting (such as UPC or the Universitat Pompeu Fabra) and in the world of publishing, in which he has collaborated on such publications as IT Governance: a Pocket Guide, Metrics in IT Service Organizations, Gestión de Servicios TI. Una introducción a ITIL, and the translations into Spanish of the books ITIL V2 Service Support and ITIL V2 Service Delivery.

Anne Rozinat has more than ten years of experience with process mining technology and obtained herPhD cum laude in the process mining group of Prof. Wil van der Aalst at the Eindhoven University ofTechnology in the Netherlands. Currently, she is a co-founder of Fluxicon and blogs at <http://www.fluxicon.com/blog/>.


Useful References of "Process Mining"

Department and showing the different uses we can make of these techniques in the world of IT Service Management (ITSM).

Then Arjel D. Bautista, Lalit M. Wangikar and Syed Kumail Akbar present the work done to optimize the loan approval process of a Dutch bank, a remarkable piece of work that was awarded the BPI Challenge 2012 prize.

Finally, Daniela Luengo and Marcos Sepúlveda give us a research perspective on one of the challenges stated in the manifesto: dealing with concept drift. This term refers to the situation in which a process changes while it is being used. Detecting these changes and including these features in the analysis is essential when working in rapidly changing environments because, otherwise, the analysis could lead to erroneous conclusions.

These authors have contributed with their articles to giving a clearer vision of what process mining is, what it is useful for, and what its future is. Process mining is a relatively new science, but it is already reaching the level of maturity required to become standard practice in companies and organizations, as reflected in the articles' practical uses. However, there are still many challenges ahead and a long way to go: Will we be able to overcome the problems introduced by concept drift? Can we use process mining not only to know the past of a process but also to predict its future? Will we implement these techniques in the management systems of business processes in order to provide them with predictive systems or operator support?

We are sure we will see great advances in thisarea in the near future.

In addition to the materials referenced by the authors in their articles, we offer the following ones for those who wish to dig deeper into the topics covered by the monograph:

W.M.P. van der Aalst. Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer Verlag, 2011. ISBN 978-3-642-19344-6.
IEEE Task Force on Process Mining. Process Mining Manifesto (in 12 languages). <http://www.win.tue.nl/ieeetfpm/doku.php?id=shared:process_mining_manifesto>.
Fluxicon / TU/e Process Mining Group. Introduction to Process Mining: turning (big) data into value (video). <http://www.youtube.com/watch?v=7oat7MatU_U>.
Fluxicon. Process Mining News. <http://fluxicon.com/s/newsarchive>.
TU/e Workgroup. <http://www.processmining.org>.
Fluxicon. Process Mining Blog. <http://fluxicon.com/blog/>.
IEEE Task Force on Process Mining. <http://www.win.tue.nl/ieeetfpm/doku.php?id=start>.
LinkedIn. Process Mining (community). <http://www.linkedin.com/groups/Process-Mining-1915049>.
TU/e. Health Analytics Using Process Mining. <http://www.healthcare-analytics-process-mining.org>.


1. Introduction
The archive of the United States Naval Observatory stored all the naval logbooks of the US Navy in the 19th century. These logbooks contained daily entries relating to position, winds, currents and other details of thousands of sea voyages. These logbooks lay ignored, and it had even been suggested that they be thrown away, until Matthew Fontaine Maury came along.

Maury (see Figure 1) was a sailor in the US Navy and from 1842 was the director of the United States Naval Observatory. He evaluated the data systematically and created illustrated handbooks which visually mapped the winds and currents of the oceans. These were able to serve ships' captains as a decision-making aid when they were planning their route.

In 1848, Captain Jackson of the W. H. D. C. Wright was one of the first users of Maury's handbooks on a trip from Baltimore to Rio de Janeiro, and he returned more than a month earlier than planned. Only seven years after the production of the first edition, Maury's Sailing Directions had saved the sailing industry worldwide about 10 million dollars per year [1].

The IT systems in businesses also conceal invaluable data, which often remain completely unused. Business processes create the modern-day equivalent of "logbook entries", which detail exactly which activities were carried out, when, and by whom (see Figure 2). If, for example, a purchasing process is started in an SAP system, every step in the process is recorded in the corresponding SAP tables. Similarly, CRM systems, ticketing systems and even legacy systems record historical data about the processes.

These digital traces are the byproduct of the increasing automation and IT support of business processes [2].
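As an illustration, such digital traces can be thought of as a simple event log: one row per executed activity, carrying a case identifier, an activity name, a timestamp and the person who performed the step. The sketch below groups such rows into one ordered trace per case; the data and column names are illustrative, not taken from any specific SAP or CRM schema.

```python
from collections import defaultdict
from datetime import datetime

# A minimal event log: one row per executed activity.
# Column names (case_id, activity, timestamp, resource) are illustrative.
event_log = [
    {"case_id": "PO-1001", "activity": "Create Purchase Order", "timestamp": "2014-03-01 09:15", "resource": "alice"},
    {"case_id": "PO-1001", "activity": "Approve Order", "timestamp": "2014-03-01 11:40", "resource": "bob"},
    {"case_id": "PO-1002", "activity": "Create Purchase Order", "timestamp": "2014-03-01 10:05", "resource": "carol"},
    {"case_id": "PO-1001", "activity": "Receive Goods", "timestamp": "2014-03-04 08:30", "resource": "dave"},
]

def traces_by_case(log):
    """Group events into one timestamp-ordered trace (activity sequence) per case."""
    cases = defaultdict(list)
    for event in log:
        cases[event["case_id"]].append(event)
    return {
        case_id: [e["activity"] for e in sorted(
            events, key=lambda e: datetime.strptime(e["timestamp"], "%Y-%m-%d %H:%M"))]
        for case_id, events in cases.items()
    }

print(traces_by_case(event_log))
# PO-1001 yields the trace: Create Purchase Order -> Approve Order -> Receive Goods
```

These per-case traces are the raw material that process mining techniques work on.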

2. From Random Samples to Comprehensive Analysis
Before Maury's manual on currents and tides, sailors were restricted to planning a route based solely on their own experience. This is also the case for most business processes: nobody really has a clear overview of how the processes are actually executed. Instead,

Figure 1. Matthew Fontaine Maury (Source: Wikipedia).

Process Mining: The Objectification of Gut Instinct - Making Business Processes More Transparent Through Data Analysis

Anne Rozinat1, Wil van der Aalst2

1Co-founder of Fluxicon, The Netherlands; 2Technical University of Eindhoven, The Netherlands

<[email protected]>, <[email protected]>

Abstract: Big Data existed in the 19th century. At least, that might be the conclusion you would draw from reading the story of Matthew Maury. We draw a parallel with the first systematic evaluations of seafaring logbooks, and we show how you can quickly and objectively map processes based on the evaluation of log files in IT systems.

Keywords: Big Data, Case Study, Log Data, Process Mining, Process Models, Process Visualization,Systematic Analysis.

Authors

Anne Rozinat has more than eight years of experience with process mining technology and obtainedher PhD cum laude in the process mining group of Prof. Wil van der Aalst at the Eindhoven University ofTechnology in the Netherlands. Currently, she is a co-founder of Fluxicon and blogs at <http://www.fluxicon.com/blog/>.

Wil van der Aalst is a professor at the Technical University in Eindhoven and, with an H-index of over 90, the most cited computer scientist in Europe. Well known for his work on the Workflow Patterns, he is the widely recognized "godfather" of process mining. His personal website is <http://www.vdaalst.com>.

there are anecdotes, gut feeling and many subjective (potentially contradicting) opinions which have to be reconciled.

The systematic analysis of digital log tracesthrough so-called Process Mining techniques[3] offers enormous potential for all


Figure 2. IT-supported Processes Record in Detail Which Activities Were Executed When and by Whom.

organizations currently struggling with complex processes. Through an analysis of the sequence of events and their timestamps, the actual processes can be fully and objectively reconstructed and weaknesses uncovered. The information in the IT logs can be used to automatically generate process models, which can then be further enriched with process metrics also extracted directly from the log data (for example execution times and waiting times).
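As a rough sketch of what such automatic model generation can build on, the snippet below derives a "directly-follows" relation from a set of traces: how often one activity is immediately followed by another. This is one simple building block used by many discovery approaches; the trace data and function name are illustrative, not taken from any specific tool.

```python
from collections import Counter

def directly_follows(traces):
    """Count how often activity a is directly followed by activity b
    across all traces; the edges of this relation sketch the process map."""
    df = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            df[(a, b)] += 1
    return df

# Illustrative traces (one activity sequence per case):
traces = [
    ["Register", "Check", "Approve", "Notify"],
    ["Register", "Check", "Reject", "Notify"],
    ["Register", "Check", "Approve", "Notify"],
]

for (a, b), n in sorted(directly_follows(traces).items()):
    print(f"{a} -> {b}: {n}")
# e.g. "Register -> Check" occurs 3 times, "Check -> Approve" 2 times
```

Drawing each counted pair as an arrow weighted by its frequency already yields a crude, data-derived process map.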

Typical questions that Process Mining can answer are:

What does my process actually look like?
Where are the bottlenecks?
Are there deviations from the prescribed or described process?
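For instance, the bottleneck question can be approached by measuring the elapsed time between consecutive events of each case. The following is a minimal sketch under that idea; the events and field names are illustrative, not from any specific system.

```python
from collections import defaultdict
from datetime import datetime

def avg_transition_times(events):
    """Average elapsed hours between consecutive activities of each case,
    keyed by (from_activity, to_activity); large averages hint at bottlenecks."""
    fmt = "%Y-%m-%d %H:%M"
    cases = defaultdict(list)
    for e in events:
        cases[e["case_id"]].append(e)
    gaps = defaultdict(list)
    for case_events in cases.values():
        case_events.sort(key=lambda e: datetime.strptime(e["timestamp"], fmt))
        for a, b in zip(case_events, case_events[1:]):
            delta = (datetime.strptime(b["timestamp"], fmt)
                     - datetime.strptime(a["timestamp"], fmt))
            gaps[(a["activity"], b["activity"])].append(delta.total_seconds() / 3600)
    return {pair: sum(hours) / len(hours) for pair, hours in gaps.items()}

# Illustrative log with two cases:
events = [
    {"case_id": "C1", "activity": "Submit", "timestamp": "2014-05-01 09:00"},
    {"case_id": "C1", "activity": "Review", "timestamp": "2014-05-01 17:00"},
    {"case_id": "C2", "activity": "Submit", "timestamp": "2014-05-02 10:00"},
    {"case_id": "C2", "activity": "Review", "timestamp": "2014-05-02 14:00"},
]

print(avg_transition_times(events))
# ("Submit", "Review") averages 6.0 hours in this toy log (8h and 4h)
```

Sorting the resulting averages in descending order points directly at the slowest hand-overs in the process.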

In order to optimize a process, one must first understand the current process reality - the ‘As-is’ process. This is usually far from simple, because business processes are


Figure 3. Process Visualization of the Refund Process for Cases that Were Started Via the Call Center (a) and Via the Internet Portal (b). For the internet-initiated cases, missing information has to be requested too often. In the call center-initiated process, however, the problem does not exist.



performed by multiple persons, often distributed across different organizational units or even companies.

Everybody only sees a part of the process. The manual discovery through classical workshops and interviews is costly and time-consuming, remaining incomplete and subjective. With Process Mining tools it is possible to leverage existing IT data from operational systems to quickly and objectively visualize the As-Is processes as they are really taking place.

In workshops with process stakeholders one can then focus on the root cause analysis and the value-adding process improvement activities.

3. A Case Study

In one of our projects we have analyzed a refund process of a big electronics manufacturer. The following process description has been slightly changed to protect the identity of the manufacturer. The starting point for the project was the feeling of the process manager that the process had severe problems. Customer complaints and the inspection of individual cases indicated that there were lengthy throughput times and other inefficiencies in the process.

The project was performed in the following phases: First, the concrete questions and problems were collected, and the IT logs of all cases from the running business year were extracted from the corresponding service platform. Then, in an interactive workshop involving the process managers, the log data were analyzed.

For example, in Figure 3 you see a simplified fragment of the beginning of the refund process. On the left side (a) is the process for all cases that were initiated via the call center. On the right side (b) you see the same process fragment for all cases that were initiated through the internet portal of the manufacturer. Both process visualizations were automatically constructed using Fluxicon’s process mining software Disco based on the IT log data that had been extracted.

The numbers, the thickness of the arcs, and the coloring all illustrate how frequently each activity or path has been performed. For example, the visualization of the call center-initiated process is based on 50 cases (see left in Figure 3). All 50 cases start with activity Order created. Afterwards, the request is immediately approved in 47 cases. In 3 cases missing information has to be requested from the customer. For simplicity, only the main process flows are displayed here.

What becomes apparent in Figure 3 is that, although missing information should only occasionally be requested from the customer,


Figure 4. Screenshot of the Process Mining Software Disco in the Performance Analysis View. It becomes apparent that the shipment through the forwarding company causes a bottleneck.



this happens a lot for cases that are started via the internet portal: for 93% of all cases (77 out of 83 completed cases) this additional process step was performed. For 12 of the 83 analyzed cases (ca. 14%) this happened even multiple times (in total 90 times for 83 cases).

This process step costs a lot of time because it requires a call or an email from the service provider. In addition, through the external communication that is required, the process is delayed for the customer, who in a refund process has already had a bad experience. Therefore, the problem needs to be solved. An improvement of the internet portal (with respect to the mandatory information in the form that submits the refund request) could ensure that information is complete when the process is started.

Another analysis result was a detected bottleneck in connection with the pick-ups that were performed through the forwarding company. The process fragment in Figure 4 shows the average waiting times between the process steps based on the timestamps in the historical data.

Such waiting time analyses are also created automatically by the process mining software. You can see that a lot of time passes before and after the process step Shipment via forwarding company. For example, it takes on average ca. 16 days between Shipment via forwarding company and Product received. The company discovered that the root cause for the long waiting times was that products were collected in a palette and the palette was shipped only when it was full, which led to delays particularly for those products that were placed in an almost empty palette. The actual refund process at the electronics manufacturer was also taking too long (on average ca. 5 days). For the customer, the process is only completed when she has her money back.
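A waiting-time computation of this kind reduces to timestamp arithmetic per case. The sketch below is a simplified illustration, not the vendor's implementation; the two example cases, their timestamps, and the helper name `avg_waiting_days` are invented.

```python
from datetime import datetime
from statistics import mean

def avg_waiting_days(traces, from_act, to_act):
    """Average waiting time (in days) between two process steps.

    `traces` maps a case id to its (activity, ISO timestamp) pairs in
    execution order. For each case we measure the gap between the first
    occurrence of `from_act` and the next occurrence of `to_act`.
    """
    gaps = []
    for steps in traces.values():
        pending = None
        for act, ts in steps:
            t = datetime.fromisoformat(ts)
            if act == from_act and pending is None:
                pending = t
            elif act == to_act and pending is not None:
                gaps.append((t - pending).total_seconds() / 86400)
                break
    return mean(gaps) if gaps else None

traces = {
    "c1": [("Shipment via forwarding company", "2013-03-01T10:00:00"),
           ("Product received", "2013-03-17T10:00:00")],   # 16-day gap
    "c2": [("Shipment via forwarding company", "2013-03-02T08:00:00"),
           ("Product received", "2013-03-16T08:00:00")],   # 14-day gap
}
print(avg_waiting_days(traces, "Shipment via forwarding company",
                       "Product received"))  # 15.0
```

Averages, variances, and other statistics over the same gaps follow directly.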

As a last result of the process mining analysis, deviations from the required process were detected. It is possible to compare the log data (and therewith the actual process) objectively and completely against required business rules, and to isolate those cases that show deviations. Specifically, we found that (1) in one case the customer received the refund twice, (2) in two cases the money was refunded without ensuring that the defective product had been received by the manufacturer, and (3) in a few cases an important and mandatory approval step in the process had been skipped.
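Such business-rule checks can be expressed directly over the activity sequence of each case. Below is a minimal sketch of the three findings as rules; the activity names are illustrative placeholders, since the manufacturer's real step names are not given.

```python
def check_rules(trace):
    """Flag rule violations in one case.

    `trace` is the ordered list of activities for one case. The three
    rules mirror the findings in the text; the activity names
    ("Refund paid", "Product received", "Approval") are assumptions.
    """
    violations = []
    # Rule 1: the refund must be paid at most once.
    if trace.count("Refund paid") > 1:
        violations.append("refund paid more than once")
    # Rule 2: the product must be received before the refund is paid.
    if "Refund paid" in trace and (
            "Product received" not in trace
            or trace.index("Product received") > trace.index("Refund paid")):
        violations.append("refund paid before product was received")
    # Rule 3: the mandatory approval step must not be skipped.
    if "Refund paid" in trace and "Approval" not in trace:
        violations.append("mandatory approval step skipped")
    return violations

print(check_rules(["Order created", "Refund paid"]))
```

Running every case through `check_rules` isolates exactly the deviating cases for follow-up.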

4. State of the Art

Process mining, which is still a young and relatively unknown discipline, is being made available by the first professional software tools on the market and supported by

References

[1] Tim Zimmermann. The Race: Extreme Sailing and Its Ultimate Event: Nonstop, Round-the-World, No Holds Barred. Mariner Books, 2004. ISBN-10: 0618382704.
[2] W. Brian Arthur. The Second Economy. McKinsey Quarterly, 2011.
[3] Wil M.P. van der Aalst. Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer-Verlag, 2011. ISBN-10: 3642193447.
[4] Alberto Manuel. Process Mining - Ana Aeroportos de Portugal, 2012. BPTrends, <www.bptrends.com>.
[5] IEEE Task Force on Process Mining. <http://www.win.tue.nl/ieeetfpm/>.
[6] IEEE Task Force on Process Mining. Process Mining Manifesto. Business Process Management Workshops 2011, Lecture Notes in Business Information Processing, Vol. 99, Springer-Verlag, 2011.
[7] Anne Rozinat. How to Reduce Waste With Process Mining, 2011. BPTrends, <www.bptrends.com>.
[8] Mark A. Thornton. General Circulation and the Southern Hemisphere, 2005. <http://www.lakeeriewx.com/Meteo241/ResearchTopicTwo/ProjectTwo.html>.

published case studies [4]. The IEEE Task Force on Process Mining [5] was founded in 2009 to increase the visibility of process mining. In autumn 2011, it published a Process Mining Manifesto [6], which is available in 13 languages.

Companies already generate vast quantities of data as a byproduct of their IT-enabled business processes. This data can be directly analyzed by process mining tools. Like Maury did with the naval log books, objective process maps can be derived that show how processes actually work in the real world [7]. Developments in the field of Big Data are helping to store and access this data to analyze it effectively.

Matthew Fontaine Maury’s wind and current books were so useful that by the mid-1850s their use was even made compulsory by insurers [8] in order to prevent marine accidents and to guarantee plain sailing. Likewise, in business process analysis and optimization, there will come a point when we cannot imagine a time when we were ever without it and were left to rely on our gut feeling.


Process Mining: X-Ray Your Business Processes

Wil van der Aalst
Technical University of Eindhoven, The Netherlands
<[email protected]>

Abstract: Recent breakthroughs in process mining research make it possible to discover, analyze, and improve business processes based on event data. Activities executed by people, machines, and software leave trails in so-called event logs. Events such as entering a customer order into SAP, checking in for a flight, changing the dosage for a patient, and rejecting a building permit have in common that they are all recorded by information systems. Over the last decade there has been a spectacular growth of data. Moreover, the digital universe and the physical universe are becoming more and more aligned. Therefore, business processes should be managed, supported, and improved based on event data rather than subjective opinions or obsolete experiences. The application of process mining in hundreds of organizations has shown that both managers and users tend to overestimate their knowledge of the processes they are involved in. Hence, process mining results can be viewed as X-rays showing what is really going on inside processes. Such X-rays can be used to diagnose problems and suggest proper treatment. The practical relevance of process mining and the interesting scientific challenges make process mining one of the "hot" topics in Business Process Management (BPM). This article provides an introduction to process mining by explaining the core concepts and discussing various applications of this emerging technology.

Keywords: Business Intelligence, Business Process Management, Data Mining, Management, Measurement, Performance, Process Mining.

Author

Wil van der Aalst is a professor at the Technical University in Eindhoven and, with an H-index of over 90, the most cited computer scientist in Europe. Well known through his work on the Workflow Patterns, he is the widely recognized "godfather" of process mining. His personal website is <http://www.vdaalst.com>.

© 2013 ACM, Inc. Van der Aalst, W.M.P. 2012. Process Mining. Communications of the ACM (CACM), Volume 55, Issue 8 (August 2012), Pages 76-83, <http://doi.acm.org/10.1145/2240236.2240257>. Included here by permission.

1. Process Mining Spectrum

Process mining aims to discover, monitor and improve real processes by extracting knowledge from event logs readily available in today’s information systems [1][2].

Although event data are omnipresent, organizations lack a good understanding of their actual processes. Management decisions tend to be based on PowerPoint diagrams, local politics, or management dashboards rather than a careful analysis of event data. The knowledge hidden in event logs cannot be turned into actionable information. Advances in data mining made it possible to find valuable patterns in large datasets and to support complex decisions based on such data. However, classical data mining problems such as classification, clustering, regression, association rule learning, and sequence/episode mining are not process-centric.

Therefore, Business Process Management (BPM) approaches tend to resort to hand-made models. Process mining research aims to bridge the gap between data mining and BPM. Metaphorically, process mining can be seen as taking X-rays to diagnose/predict problems and recommend treatment.

An important driver for process mining is the incredible growth of event data [4][6]. Event data is everywhere – in every sector, in every economy, in every organization, and in every home one can find systems that log events. For less than $600, one can buy a disk drive with the capacity to store all of the world’s music [6]. A recent study published in Science shows that storage space grew from 2.6 optimally compressed exabytes (2.6 × 10^18 bytes) in 1986 to 295 compressed exabytes in 2007. In 2007, 94 percent of all information storage capacity on Earth was digital. The other 6 percent resided in books, magazines and other non-digital formats. This is in stark contrast with 1986, when only 0.8 percent of all information storage capacity was digital. These numbers illustrate the exponential growth of data.

The further adoption of technologies such as RFID (Radio Frequency Identification), location-based services, cloud computing, and sensor networks will further accelerate the growth of event data. However, organizations have problems effectively using such large amounts of event data. In fact, most organizations still diagnose problems based on fiction (PowerPoint slides, Visio diagrams, etc.) rather than facts (event data). This is illustrated by the poor quality of process models in practice, e.g., more than 20% of the 604 process diagrams in SAP’s reference model have obvious errors and their relation to the actual business processes supported by SAP is unclear [7]. Therefore, it is vital to turn the massive amounts of event data into relevant knowledge and reliable insights. This is where process mining can help.

The growing maturity of process mining is illustrated by the Process Mining Manifesto [5] recently released by the IEEE Task Force on Process Mining. This manifesto is supported by 53 organizations and 77 process mining experts contributed to it. The active contributions from end-users, tool vendors, consultants, analysts, and researchers illustrate the significance of process mining as a bridge between data mining and business process modeling.

The starting point for process mining is an event log. Each event in such a log refers to an activity (i.e., a well-defined step in some process) and is related to a particular case (i.e., a process instance). The events belonging to a case are ordered and can be seen as one "run" of the process. Event logs may store additional information about events. In fact, whenever possible, process mining techniques use extra information such as the resource (i.e., person or device) executing or initiating the activity, the timestamp of the event, or data elements recorded with the event (e.g., the size of an order).
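This event log structure can be made concrete with plain Python. The field names and the small example below are assumptions for illustration; only the idea (events grouped per case and ordered into one "run") comes from the text.

```python
from collections import defaultdict

# Each event carries a case id, an activity name, and optional extras
# such as resource and timestamp (all field names here are illustrative).
events = [
    {"case": "992564", "activity": "register request", "resource": "John",
     "time": "2011-11-30T14:55", "amount": 650},
    {"case": "992564", "activity": "decide", "resource": "Sara",
     "time": "2011-12-01T09:10", "amount": 650},
    {"case": "992565", "activity": "register request", "resource": "Mary",
     "time": "2011-11-30T15:12", "amount": 220},
]

def as_traces(events):
    """Group events by case and order them by timestamp: each ordered
    list is one 'run' of the process."""
    cases = defaultdict(list)
    for e in events:
        cases[e["case"]].append(e)
    return {c: sorted(es, key=lambda e: e["time"]) for c, es in cases.items()}

traces = as_traces(events)
print([e["activity"] for e in traces["992564"]])  # ['register request', 'decide']
```

Everything beyond the case id and activity name (resource, timestamp, data elements) simply travels along as extra attributes that later analyses can exploit.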

Event logs can be used to conduct three types of process mining, as shown in Figure 1 [1]. The first type of process mining is discovery. A discovery technique takes an event log and produces a model without using any a-priori information. Process discovery is the most prominent process


Figure 1. The Three Basic Types of Process Mining Explained in Terms of Input and Output.


mining technique. For many organizations it is surprising to see that existing techniques are indeed able to discover real processes merely based on example behaviors recorded in event logs.

The second type of process mining is conformance. Here, an existing process model is compared with an event log of the same process. Conformance checking can be used to check if reality, as recorded in the log, conforms to the model and vice versa. The third type of process mining is enhancement. Here, the idea is to extend or improve an existing process model using information about the actual process recorded in some event log. Whereas conformance checking measures the alignment between model and reality, this third type of process mining aims at changing or extending the a-priori model. For instance, by using timestamps in the event log one can extend the model to show bottlenecks, service levels, throughput times, and frequencies.

2. Process Discovery

As shown in Figure 1, the goal of process discovery is to learn a model based on some event log. Events can have all kinds of attributes (timestamps, transactional information, resource usage, etc.). These can all be used for process discovery. However, for simplicity, we often represent events by activity names only. This way, a case (i.e., process instance) can be represented by a trace describing a sequence of activities.

Consider for example the event log shown in Figure 1 (the example is taken from [1]). This event log contains 1,391 cases, i.e., instances of some reimbursement process. There are 455 process instances following the trace acdeh. Activities are represented by a single character: a = register request, b = examine thoroughly, c = examine casually, d = check ticket, e = decide, f = reinitiate request, g = pay compensation, and h = reject request. Hence, trace acdeh models a reimbursement request that was rejected after a registration, examination, check, and decision step. 455 cases followed this path consisting of five steps, i.e., the first line in the table corresponds to 455 × 5 = 2,275 events. The whole log consists of 7,539 events.
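The bookkeeping in this paragraph is easy to reproduce if the log is stored as trace variants with case counts. Only the 455 cases of acdeh come from the text; the other variants and their counts below are invented placeholders.

```python
from collections import Counter

# Event log as trace variants (activity letters as in the article).
# Only the first count (455 cases of 'acdeh') is from the text; the
# other two variants and counts are illustrative.
log = Counter({"acdeh": 455, "abdeg": 300, "adceh": 200})

cases = sum(log.values())
events = sum(len(variant) * n for variant, n in log.items())
print(cases, events)       # 955 4775
print(455 * len("acdeh"))  # 2275 events for the first variant alone
```

The last line checks the arithmetic from the text: 455 five-step cases account for 2,275 events.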

Process discovery techniques produce process models based on event logs such as the one shown in Figure 2. For example, the classical α-algorithm produces model M1 for this log. This process model is represented as a Petri net. A Petri net consists of places and transitions. The state of a Petri net, also referred to as marking, is defined by the distribution of tokens over places.

A transition is enabled if each of its input places contains a token. For example, a is enabled in the initial marking of M1, because the only input place of a contains a token (black dot). Transition e in M1 is only enabled if both input places contain a token. An enabled transition may fire, thereby consuming a token from each of its input places and producing a token for each of its output places. Firing a in the initial marking corresponds to removing one token from start and producing two tokens (one for each output place). After firing a, three transitions are enabled: b, c, and d.

Firing b will disable c because the token is removed from the shared input place (and vice versa). Transition d is concurrent with b and c, i.e., it can fire without disabling another transition. Transition e becomes enabled after d and b or c have occurred. After executing e, three transitions become enabled: f, g, and h. These transitions are competing for the same token, thus modeling a choice. When g or h is fired, the process ends with a token in place end. If f is fired, the process returns to the state just after executing a.

Note that transition d is concurrent with b and c. Process mining techniques need to be able to discover such more advanced process patterns and should not be restricted to simple sequential processes.

It is easy to check that all traces in the event log can be reproduced by M1. This does not hold for the second process model in Figure 2: M2 is only able to reproduce the most frequent trace acdeh. The model does not fit the log well because observed traces such as abdeh are not possible according to M2. The third model, M3, is able to reproduce the entire event log, but also allows for traces that never appear in the log.



Figure 2. One Event Log and Four Potential Process Models (M1, M2, M3 and M4) Aiming to Describe the Observed Behavior.

Therefore, we consider M3 to be "underfitting"; too much behavior is allowed because M3 clearly overgeneralizes the observed behavior. Model M4 is also able to reproduce the event log. However, the model simply encodes the example traces in the log. We call such a model "overfitting", as the model does not generalize behavior beyond the observed examples.

In recent years, powerful process mining techniques have been developed that can automatically construct a suitable process model given an event log. The goal of such techniques is to construct a simple model that is able to explain most of the observed behavior without "overfitting" or "underfitting" the log.

3. Conformance Checking

Process mining is not limited to process discovery. In fact, the discovered process is merely the starting point for deeper analysis. As shown in Figure 1, conformance checking and enhancement relate model and log. The model may have been made by hand or discovered through process discovery. For conformance checking, the modeled behavior and the observed behavior (i.e., the event log) are compared. When checking the conformance of M2 with respect to the log shown in Figure 2, it is easy to see that only the 455 cases that followed acdeh can be replayed from begin to end. If we try to replay trace abdeh, we get stuck after executing a because b is not enabled. If we try to replay trace adceh, we get stuck after executing the first step because d is not (yet) enabled.

There are various approaches to diagnose and quantify conformance. One approach is to find an optimal alignment between each trace in the log and the most similar behavior in the model. Consider for example process model M2, a fitting trace, a non-fitting trace, and the three alignments shown in Table 1.

The first alignment shows a perfect alignment between the fitting trace and M2: all moves of the trace in the event log (top part of the alignment) can be followed by


moves of the model (bottom part of the alignment). The second alignment is an optimal alignment for the non-fitting trace in the event log and model M2. The first two moves of the trace in the event log can be followed by the model. However, the third activity in the trace is not enabled after executing just the first two. In the third position of this alignment we therefore see a move of the model that is not synchronized with a move in the event log (a "move in just the model"). In the next three moves, model and log agree. In the seventh position of the alignment there is again just a move of the model and not a move in the log. The third alignment shows another optimal alignment for the same trace. Here there are two other situations where log and model do not move together. The second and third alignments are both optimal if the penalties for a "move in log" and a "move in model" are the same: in both alignments there are two such non-synchronized steps, and there are no alignments with less than two such steps.

Conformance can be viewed from two angles: (a) the model does not capture the real behavior ("the model is wrong") and (b) reality deviates from the desired model ("the event log is wrong"). The first viewpoint is taken when the model is supposed to be descriptive, i.e., capture or predict reality. The second viewpoint is taken when the model is normative, i.e., used to influence or control reality.

There are various types of conformance, and creating an alignment between log and model is just the starting point for conformance checking [1]. For example, there are various fitness (the ability to replay) metrics. A model has fitness 1 if all traces can be replayed from begin to end. A model has fitness 0 if model and event log "disagree" on all events. Process models M1, M3 and M4 have a fitness of 1 (i.e., perfect fitness) with respect to the event log shown in Figure 2. Model M2 has a fitness of 0.8 for the event log consisting of 1,391 cases. Intuitively, this means that 80% of the events in the log can be explained by the model. Fitness is just one of several conformance metrics.

Experiences with conformance checking in dozens of organizations show that real-life processes often deviate from the simplified Visio or PowerPoint representations used by process analysts.

4. Model Enhancement

It is also possible to extend or improve an existing process model using the alignment between event log and model. A non-fitting process model can be corrected using the diagnostics provided by the alignment. If the alignment contains many "move in model" steps for a particular activity, then it may make sense to allow for the skipping of that activity in the model. Moreover, event logs may contain information about resources, timestamps, and case data. For example, an event referring to activity "register request" and case "992564" may also have attributes describing the person that registered the request (e.g., "John"), the time of the event (e.g., "30-11-2011:14.55"), the age of the customer (e.g., "45"), and the claimed amount (e.g., "650 euro"). After aligning model and log it is possible to replay the event log on the model. While replaying, one can analyze these additional attributes.

For example, as Figure 3 shows, it is possible to analyze waiting times in-between activities: simply measure the time difference between causally related events and compute basic statistics such as averages, variances, and confidence intervals. This way it is possible to identify the main bottlenecks. Information about resources can be used to discover roles, i.e., groups of people frequently executing related activities. Here, standard clustering techniques can be used. It is also possible to construct social networks based on the flow of work and analyze resource performance (e.g., the relation between workload and service times).

Standard classification techniques can be used to analyze the decision points in the process model. For example, activity e ("decide") has three possible outcomes ("pay", "reject", and "redo"). Using the data known about the case prior to the decision, we can construct a decision tree explaining the observed behavior.

Figure 3 illustrates that process mining is not limited to control-flow discovery. Moreover, process mining is not restricted to offline analysis and can also be used for predictions and recommendations at runtime. For example, the completion time of a partially handled customer order can be predicted using a discovered process model with timing information.

5. Process Mining Creates Value in Several Ways

After introducing the three types of process mining using a small example, we now focus on the practical value of process mining. As mentioned earlier, process mining is driven by the exponential growth of event data. For example, according to MGI, enterprises stored more than 7 exabytes of new data on disk drives in 2010, while consumers stored more than 6 exabytes of new data on devices such as PCs and notebooks [6].

In the remainder, we will show that process mining can provide value in several ways. To illustrate this we refer to case studies where we used our open-source software package ProM [1]. ProM was created and is maintained by the process mining group at Eindhoven University of Technology. However, research groups from all over the world contributed to it, e.g., University of Padua, Universitat Politècnica de Catalunya, University of Calabria, Humboldt-Universität zu Berlin, Queensland University of Technology, Technical University of Lisbon, Vienna University of Economics and Business, Ulsan National Institute of Science and Technology, K.U. Leuven, Tsinghua University, and University of Innsbruck. Besides ProM, there are about 10 commercial software vendors providing process mining software (often embedded in larger tools), e.g., Pallas Athena, Software AG, Futura Process Intelligence, Fluxicon, Businesscape, Iontas/Verint, Fujitsu, and Stereologic.


Table 1. Examples of Alignment Between the Traces in the Event Log and the Model.




Figure 3. The Process Model Can Be Extended Using Event Attributes Such as Timestamps, Resource Information and Case Data. The model also shows frequencies, e.g. 1,537 times a decision was made and 930 cases were rejected.

5.1. Provide Insights

In the last decade, we have applied our process mining software ProM in over 100 organizations. Examples are municipalities (about 20 in total, e.g., Alkmaar, Heusden, and Harderwijk), government agencies (e.g., Rijkswaterstaat, Centraal Justitieel Incasso Bureau, and the Dutch Justice department), insurance related agencies (e.g., UWV), banks (e.g., ING Bank), hospitals (e.g., AMC hospital and Catharina hospital), multinationals (e.g., DSM and Deloitte), high-tech system manufacturers and their customers (e.g., Philips Healthcare, ASML, Ricoh, and Thales), and media companies (e.g., Winkwaves). For each of these organizations, we discovered some of their processes based on the event data they provided. In each discovered process, there were parts that surprised some of the stakeholders. The variability of processes is typically much bigger than expected. Such insights represent a tremendous value, as surprising differences often point to waste and mismanagement.

5.2. Improve Performance

As explained earlier, it is possible to replay event logs on discovered or hand-made process models. This can be used for conformance checking and model enhancement. Since most event logs contain timestamps, replay can be used to extend the model with performance information.

Figure 4 illustrates some of the performance-related diagnostics that can be obtained through process mining. The model shown was discovered based on 745 objections against the so-called WOZ ("Waardering Onroerende Zaken") valuation in a Dutch municipality. Dutch municipalities need to estimate the value of houses and apartments. The WOZ value is used as a basis for determining the real-estate property tax. The higher the WOZ value, the more tax the owner needs to pay. Therefore, many citizens appeal against the WOZ valuation and assert that it is too high.

Each of the 745 objections corresponds to a process instance. Together these instances generated 9,583 events, all having timestamps. Figure 4 shows the frequency of the different paths in the model. Moreover, the different stages of the model are colored to show where, on average, most time is spent. The purple stages of the process take most time, whereas the blue stages take the least time. It is also possible to select two activities and measure the time that passes in-between these activities.

As shown in Figure 4, on average 202.73 days pass in-between the completion of activity "OZ02 Voorbereiden" (preparation) and the completion of "OZ16 Uitspraak" (final judgment). This is longer than the average overall flow time, which is approx. 178 days. About 416 of the objections (approx. 56%) follow this route; the other cases follow the branch "OZ15 Zelf uitspraak" which, on average, takes less time.

Diagnostics as shown in Figure 4 can beused to improve processes by removing


Figure 4. Performance Analysis Based on 745 Appeals against the WOZ Valuation.

bottlenecks and rerouting cases. Since the model is connected to event data, it is possible to "drill down" immediately and investigate groups of cases that take more time than others [1].

5.3. Ensure Conformance

Replay can also be used to check conformance, as is illustrated by Figure 5. Based on 745 appeals against the WOZ valuation, we also compared the normative model and the observed behavior: 628 of the 745 cases can be replayed without encountering any problems. The fitness of the model and log is 0.98876214, indicating that almost all recorded events are explained by the model. Despite the good fitness, ProM clearly shows all deviations. For example, "OZ12 Hertaxeren" (reevaluate property) occurred 23 times while this was not allowed according to the normative model (indicated by the "-23" in Figure 5). Again it is easy to "drill down" and see what these cases have in common.

The conformance of the appeal process justdescribed is very high (about 99% of eventsare possible according to the model). Wealso encountered many processes with avery low conformance, e.g., it is notuncommon to find processes where only40% of the events are possible according tothe model. For example, process miningrevealed that ASML’s modeled test processstrongly deviated from the real process [9].The increased importance of corporategovernance, risk and compliance manag-ement, and legislation such as the Sarbanes-Oxley Act (SOX) and the Basel II Accord,i llustrate the practical relevance ofconformance checking. Process mining canhelp auditors to check whether processes areexecuted within certain boundaries set bymanagers, governments, and other stakeholders [3]. Violations discovered throughprocess mining may indicate fraud,malpractice, risks, and inefficiencies. Forexample, in the municipality where weanalyzed the WOZ appeal process, we

discovered misconfigurations of their eiStream workflow management system. People also bypassed the system. This was possible because system administrators could manually change the status of cases [8].

5.4. Show Variability
Hand-made process models tend to provide an idealized view of the business process being modeled. Often such a "PowerPoint reality" has little in common with the real processes, which have much more variability. However, to improve conformance and performance, one should not abstract away this variability.

In the context of process mining we often see Spaghetti-like models such as the one shown in Figure 6. The model was discovered based on an event log containing 24,331 events referring to 376 different activities. The event log describes the diagnosis and treatment of 627 gynecological oncology patients in the AMC hospital in Amsterdam. The Spaghetti-like structures are not caused by the discovery algorithm but by the true variability of the process.

Figure 5. Conformance Analysis Showing Deviations between Event Log and Process.

Figure 6. Process Model Discovered for a Group of 627 Gynecological Oncology Patients.

Although it is important to confront stakeholders with the reality as shown in Figure 6, we can also seamlessly simplify Spaghetti-like models. Just like using electronic maps, it is possible to seamlessly zoom in and out [1]. While zooming out, insignificant things are either left out or dynamically clustered into aggregate shapes, like streets and suburbs amalgamate into cities in Google Maps. The significance level of an activity or connection may be based on frequency, costs, or time.
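The frequency-based side of this zooming can be illustrated with a minimal sketch (a deliberate simplification of real map-style abstraction plug-ins): activities below a frequency threshold are dropped, and the traces are projected onto the remaining significant activities before building a directly-follows graph.

```python
# Minimal sketch of frequency-based abstraction: remove rare activities,
# project traces on the significant ones, then count directly-follows
# edges. Real tools also cluster nodes and weigh edges by time or cost.
from collections import Counter

def simplify(traces, min_freq):
    freq = Counter(a for t in traces for a in t)
    keep = {a for a, n in freq.items() if n >= min_freq}
    edges = Counter()
    for t in traces:
        filtered = [a for a in t if a in keep]  # project on significant activities
        edges.update(zip(filtered, filtered[1:]))
    return keep, edges

traces = [["a", "b", "c"], ["a", "c"], ["a", "x", "c"]]  # "b" and "x" occur once
keep, edges = simplify(traces, min_freq=2)
```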

5.5. Improve Reliability
Process mining can also be used to improve the reliability of systems and processes. For example, since 2007 we have been involved in an ongoing effort to analyze the event logs of the X-ray machines of Philips Healthcare using process mining [1]. These machines record massive amounts of events. For medical equipment it is essential to prove that the system was tested under realistic circumstances. Therefore, process discovery was used to construct realistic test profiles. Philips Healthcare also used process mining for fault diagnosis. By learning from earlier problems, it is possible to find the root cause of new problems that emerge. For example, using ProM, we have analyzed under which circumstances particular components are replaced. This resulted in a set of signatures. When a malfunctioning X-ray machine exhibits a particular "signature" behavior, the service engineer knows which component to replace.

5.6. Enable Prediction
The combination of historic event data with real-time event data can also be used to predict problems. For instance, Philips Healthcare can anticipate that an X-ray tube in the field is about to fail by discovering patterns in event logs. Hence, the tube can be replaced before the machine starts to malfunction.

Today, many data sources are updated in (near) real-time and sufficient computing power is available to analyze events as they occur. Therefore, process mining is not restricted to off-line analysis and can also be used for online operational support. For a running process instance it is possible to make predictions such as the expected remaining flow time [1].
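As a toy illustration of such predictions (far simpler than the annotated transition systems described in [1]), one can learn from historic cases the average remaining flow time after each activity and apply it to a running case. Activity names and timestamps below are invented.

```python
# Naive sketch of remaining-flow-time prediction: learn, from completed
# historic cases, the mean time still to go after each activity, then use
# the current activity of a running case to predict its remaining time.
from collections import defaultdict

def learn(history):
    """history: list of traces [(activity, timestamp), ...], timestamps in hours."""
    samples = defaultdict(list)
    for trace in history:
        end = trace[-1][1]
        for activity, ts in trace:
            samples[activity].append(end - ts)  # time still to go after this event
    return {a: sum(v) / len(v) for a, v in samples.items()}

def predict(model, current_activity):
    return model[current_activity]

history = [
    [("register", 0), ("assess", 2), ("decide", 10)],
    [("register", 0), ("assess", 4), ("decide", 8)],
]
model = learn(history)
# After "assess", remaining times observed were 8 and 4 hours.
```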

6. Conclusion
Process mining techniques enable organizations to X-ray their business processes, diagnose problems, and get suggestions for treatment. Process discovery often provides new and surprising insights. These can be used to redesign processes or improve management. Conformance checking can be used to see where processes deviate. This is very relevant as organizations are required to put more emphasis on corporate governance, risks, and compliance. Process mining techniques offer a means to more rigorously check compliance while improving performance.

This article introduced the basic concepts and showed that process mining can provide value in several ways. The reader interested in process mining is referred to the first book on process mining [1] and the Process Mining Manifesto [5], which is available in 12 languages. Also visit <www.processmining.org> for sample logs, videos, slides, articles, and software.

Acknowledgements
The author would like to thank the members of the IEEE Task Force on Process Mining and all who contributed to the Process Mining Manifesto and the ProM framework.

References
[1] W. van der Aalst. Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer-Verlag, Berlin, 2011. ISBN: 978-3-642-19345-3.
[2] W. van der Aalst. Using Process Mining to Bridge the Gap between BI and BPM. IEEE Computer 44, 12, pp. 77-80, 2011.
[3] W. van der Aalst, K. van Hee, J.M. van Werf, M. Verdonk. Auditing 2.0: Using Process Mining to Support Tomorrow's Auditor. IEEE Computer 43, 3, pp. 90-93, 2010.
[4] M. Hilbert, P. López. The World's Technological Capacity to Store, Communicate, and Compute Information. Science 332, 6025, pp. 60-65, 2011.
[5] IEEE Task Force on Process Mining. Process Mining Manifesto. In F. Daniel, K. Barkaoui, and S. Dustdar (Eds.), Business Process Management Workshops, Lecture Notes in Business Information Processing, vol. 99. Springer-Verlag, Berlin, pp. 169-194, 2012.
[6] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, A. Byers. Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute, 2011. <http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation>.
[7] J. Mendling, G. Neumann, W. van der Aalst. Understanding the Occurrence of Errors in Process Models Based on Metrics. In F. Curbera, F. Leymann, and M. Weske (Eds.), Proceedings of the OTM Conference on Cooperative Information Systems (CoopIS 2007), Lecture Notes in Computer Science, vol. 4803. Springer-Verlag, Berlin, pp. 113-130, 2007.
[8] A. Rozinat, W. van der Aalst. Conformance Checking of Processes Based on Monitoring Real Behavior. Information Systems 33, 1, pp. 64-95, 2008.
[9] A. Rozinat, I. de Jong, C. Günther, W. van der Aalst. Process Mining Applied to the Test Process of Wafer Scanners in ASML. IEEE Transactions on Systems, Man and Cybernetics, Part C 39, 4, pp. 474-479, 2009.


The Process Discovery Journey
Josep Carmona
Software Department, Technical University of Catalonia, Spain
<jcarmona@lsi.upc.edu>

Abstract: Process models are an invaluable element of an IT system: they can be used to analyze, monitor, or improve the real processes that provide the system's functionality. Technology has enabled IT systems to store in log files the footprints of process executions, which can be used to derive the process models corresponding to the real processes, a discipline called Process Discovery. We provide an overview of the discipline together with some of the alternatives that exist nowadays.

Keywords: Formal Methods, Process Discovery, Software Engineering.

Author

Josep Carmona received his MS and PhD degrees in Computer Science from the Technical University of Catalonia, in 1999 and 2004, respectively. He is an associate professor in the Software Department of the same university. His research interests include formal methods, concurrent systems, and process and data mining. He has co-authored more than 50 research papers in conferences and journals.

1. Introduction
The speed at which data grows in IT systems [1] makes it crucial to rely on automation in order to enable enterprises and institutions to manage their processes. Automated techniques open the door to dealing with large amounts of data, a task far beyond human capabilities. In this paper we discuss one of these techniques: the discovery of process models. We now illustrate the main task behind process discovery by means of a (hopefully) funny example.

2. A Funny Example: The Visit of an Alien
Imagine that an alien visits you (see Figure 1) and, by some means, it wants to communicate the plan it has for its visit to the Earth. For obvious reasons, we cannot understand the alien's messages, which look like the one shown in Figure 2.

Although we do not know the meaning of each individual letter in the message, one may detect that there are some patterns, e.g., a repetition of the sequence I A C D M E (the first and last six letters of the sequence). So the question is: how can we represent the behavior of the aliens without knowing exactly the meaning of each single piece of information?

Figure 1. Our Imaginary Alien.

Process discovery may be a good solution for this situation: a process discovery algorithm will try to produce a (formal) model of the behavior underlying a set of sequences. For instance, the formal model in the Business Process Modeling Notation (BPMN) [2] shown in Figure 3 represents very accurately the behavior expressed in the alien's sequences. For those not familiar with the BPMN notation, the model describes the following process: after I occurs, then ('x' gateway) either branch B followed by X occurs, or branch A followed by C and D in parallel ('+' gateway), and then M occurs. Both branches activate E, which in turn reactivates I. Clearly, even without knowing anything about the actions taken by the alien, the global structuring of these activities becomes very apparent from a simple inspection of the BPMN model.

Now imagine that at some point the meaning of each letter is decrypted: evaluate the amount of energy on the Earth (I), high energy (B), invade the Earth (X), low energy (A), gather some human samples (C), learn the human reproduction system (D), teach humans to increase their energy resources (M), communicate the situation to the aliens in the closest UFO (E). In the presence of this new information, the value of the model obtained increases significantly (although one may not be relaxed after realizing the global situation that the model brings to light).

Figure 2. A Message Sent by the Alien.

I A C D M E I B X E I A D C M E I B X E I A C D M E

3. Anatomy of a Simple Process Discovery Algorithm
The previous example illustrates one of the main tasks of a process discovery algorithm: given a set of traces (called a log) corresponding to a particular behavior under study, derive a formal model which faithfully represents the process producing these traces. In its simplest form, process discovery algorithms focus on the control-flow perspective of the process, i.e., the order in which activities are performed to carry out the process tasks. The previous example has considered this perspective.

A log must contain enough information to extract the sequencing of the activities that are monitored. Typically, a trace identifier, an activity name and a time stamp are required to enable the corresponding sequencing (by the time stamp) of the activities belonging to a given trace (determined by the trace identifier). Other information may be required if the discovery engine must take into account additional dimensions, like resources (what quantity was purchased?), activity originator (who performed that activity?), or activity duration (how long does activity X last?), among others. An example of a discovery algorithm that takes other dimensions into account is the social network miner [3], which derives the network of collaborators that carry out a given process.
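The minimal log structure just described can be sketched as follows; the field layout is illustrative, not a standard log format. Grouping events by trace identifier and sorting by time stamp yields the activity sequences that a discovery algorithm consumes.

```python
# Sketch of the minimal event log discussed above: each event carries a
# trace (case) identifier, an activity name and a time stamp. Grouping by
# trace id and sorting by time stamp recovers the per-case sequences.
from collections import defaultdict

def to_traces(events):
    by_case = defaultdict(list)
    for case_id, activity, timestamp in events:
        by_case[case_id].append((timestamp, activity))
    # sort each case's events by time stamp and keep only the activity names
    return {c: [a for _, a in sorted(evs)] for c, evs in by_case.items()}

events = [
    ("case1", "I", 1), ("case2", "I", 2), ("case1", "A", 3),
    ("case2", "B", 4), ("case1", "C", 5), ("case2", "X", 6),
]
traces = to_traces(events)
```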


The core of a process discovery algorithm is the ability to extract the necessary information required to learn a model that will represent the process. Process discovery is often an unsupervised learning task, since the algorithm is usually exposed only to positive examples, i.e., successful executions of the process under study: in the example of the introduction, we were only exposed to what the alien plans to do, but we do not know what the alien does not plan to do. This complicates the learning task, since process discovery algorithms are expected to produce models that are both precise (the model produced should not deviate much from the behavior seen) and general (the model should generalize the patterns observed in the log) [4]. Obviously, the presence of negative examples would help the discovery algorithm improve these two quality metrics, but negative information is often not available in IT logs.

How can we learn a process model from a set of traces? Various algorithms exist nowadays for various models (see Section 4). However, let us use the alien's example to reason about the discovery of the BPMN model above. If we focus on the first letter of the sequence (I), it is sometimes followed by A and sometimes by B, and always (except for the first occurrence) preceded by E. These observations can be expressed graphically as shown in Figure 4.

In BPMN notation, the exclusive-or relation between the occurrences of either A or B after I is modeled by using the 'x' gateway. The precedence between E and I is modeled by an edge connecting both letters in the model. Symmetrically, E is preceded either by M or by X. Also, following A, both C and D occur in any order. The well-known alpha algorithm [5] can find most of these pairwise ordering relations in the log, and one may use them to craft the BPMN model as Table 1 illustrates.
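A heavily simplified sketch of such pairwise relations (in the spirit of, but not identical to, the alpha algorithm) classifies the directly-follows pairs of the log as causal or parallel; note how C and D come out as parallel in the alien's log, matching the '+' gateway of the discovered model.

```python
# Sketch of alpha-style ordering relations: a pair (a, b) seen in the
# directly-follows relation is "causal" if (b, a) is never seen, and
# "parallel" if both orders occur (as with C and D in the alien's log).

def relations(traces):
    follows = {(a, b) for t in traces for a, b in zip(t, t[1:])}
    causal = {(a, b) for (a, b) in follows if (b, a) not in follows}
    parallel = {(a, b) for (a, b) in follows if (b, a) in follows}
    return causal, parallel

# The alien's message, split into its repeated sub-sequences:
log = [list("IACDME"), list("IBXE"), list("IADCME"), list("IBXE"), list("IACDME")]
causal, parallel = relations(log)
```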

Table 1 can be read as follows: if in the log A always precedes B and B is unique (there is no other letter preceded by A), then a directed arc between A and B is created. If, in contrast, there is always more than one letter preceded by A, then a '+' gateway is inserted between A and the letters preceded by A. The sometimes relation is read similarly, using the 'x' gateway.

Hence one can scan the log to extract these relations (worst-case quadratic in the size of the log) and use the table to create the BPMN model. However, this is a very restrictive way of doing discovery, since other relations available in the BPMN notation, like the inclusive-or relation, can also be hidden in the log, but the algorithm does not consider them. Process discovery algorithms always face a trade-off between the complexity of the algorithm and its modeling capacity: the algorithm proposed in this section could be extended to also consider inclusive-or gateways, but that may significantly complicate the algorithm. Below we address these and other issues informally.

4. Algorithms and Models
There are several models that can be obtained through different process discovery algorithms: Petri nets, Event-driven Process Chains, BPMN, C-Nets, Heuristic Nets, and Business Process Maps, among others. Remarkably, most of these models are supported by replay semantics that allow one to simulate the model in order to certify its adequacy in representing the log.

To describe each one of these models is out of the scope of this article, but I can briefly comment on Petri nets, a model often produced by discovery algorithms due to its formal semantics and ability to represent concurrency. For the model of our running example, the Petri net that would be discovered by most Petri net discovery algorithms is shown in Figure 5.

Those readers familiar with Petri nets will find a perfect match between the underlying behavior of the Petri net and the alien's trace. Notice that in the BPMN model, apart from the units of information (in this case letters of the alphabet), there are other model components (gateways) whose semantics define the way the model represents the log traces.

The same happens with the Petri net above, where the circles (called places) encode the global state of the model, distributed across the net (only some circles are initially marked). While the discovery algorithm for BPMN needs to find both the connections and the gateways, the analogous algorithm for Petri nets must compute the circles and the connections.
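The replay semantics mentioned above rest on the Petri net firing rule: a transition is enabled when all its input places hold a token, and firing it moves tokens from input to output places. The sketch below uses a hypothetical net fragment for the A, C, D, M part of the running example (not the exact net of Figure 5) to show how replay distinguishes fitting from non-fitting traces.

```python
# Sketch of the Petri net firing rule behind replay. The net below is a
# hypothetical fragment for "A forks into C and D in parallel, M joins",
# not the exact net of Figure 5. Places are p0..p5; a transition fires
# only when every input place holds a token.
from collections import Counter

# transition -> (input places, output places)
NET = {
    "A": (["p0"], ["p1", "p2"]),   # A forks into two parallel branches
    "C": (["p1"], ["p3"]),
    "D": (["p2"], ["p4"]),
    "M": (["p3", "p4"], ["p5"]),   # M joins the branches
}

def replay(trace, net, initial):
    marking = Counter(initial)
    for t in trace:
        pre, post = net[t]
        if any(marking[p] < 1 for p in pre):
            return False  # transition not enabled: the trace does not fit
        for p in pre:
            marking[p] -= 1
        for p in post:
            marking[p] += 1
    return True
```

Both interleavings of the parallel activities C and D replay correctly, while firing M before C and D does not, mirroring how the marked circles distribute the state of the net.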

Several techniques exist nowadays to accomplish the discovery of Petri nets, ranging from the log-ordering relations extracted by the alpha algorithm to very complex graph-based structures that are computed on top of an automaton representing the log traces.

Which process discovery algorithm and modeling notation to choose? This is in fact a very good question that can only be answered partially: there is no single model that is better than the rest, but rather models that are better than others only for particular types of behavior. Actually, deciding the best modeling notation for a log is a hard problem for which research must provide techniques in the next decade (a problem called representational bias selection). From a pragmatic point of view, one should select a process modeling notation one is familiar with, and expect the discovery algorithms for that notation to be good enough for the user's needs.

Figure 3. A Formal Model of Behavior in the Alien's Sequences in BPMN.

As said before, perspectives other than the control-flow view may be considered by process discovery algorithms: time, resources, organizational, etc.

The reference book [6] may be consulted in order to dig into these other process discovery algorithms.

5. Tools
Process discovery is a rather new discipline compared with related areas such as data mining or machine learning. In spite of this, one can find process mining tools both in academia (mostly) and in industry.

The following classification is by no means exhaustive, but reports some of the prominent tools one can use to experiment with process discovery:
ACADEMIA: the ProM Framework, from the Technical University of Eindhoven (TU/e), is the reference tool nowadays. It is the result of a great academic collaboration among several universities around the world to gather algorithmic support for process mining (i.e., not only process discovery). Additionally, different groups have developed several academic stand-alone tools that incorporate modern process discovery algorithms.
INDUSTRY: some important companies have invested effort into building process discovery tools, e.g., Fujitsu (APD), but also medium-sized companies or start-ups that are more focused on process mining practices, e.g.,

A precedes B, always, B unique: a directed arc from A to B.
A precedes B, always, general: a '+' gateway between A and the letters it precedes (B and C in parallel).
A precedes B, sometimes, B unique: an 'x' gateway between A and B.
A precedes B, sometimes, general: an 'x' gateway between A and the letters it precedes (either B or C).

Table 1. BPMN Model Built from Patterns in the Alien's Messages.

I A C D M E I B X E I A D C M E I B X E I A C D M E

Figure 4. Patterns Observed in the Alien’s Messages.

Figure 5. Petri Net for the Model of Our Running Example.

Pallas Athena (ReflectOne), Fluxicon (Disco), Perspective Software (BPMOne, Futura Reflect), Software AG (ARIS Process Performance Manager), among others.

6. Challenges
The task of process discovery may be aggravated if some of the aspects below are present:
- Log incompleteness: the log often contains only a fraction of the total behavior of the process. Therefore, the process discovery algorithm is required to guess the part of the behavior that is not present in the log, which may in general be a difficult task.
- Noise: logged behavior may sometimes represent infrequent exceptions that are not meant to be part of the process. Hence, process discovery algorithms may be hampered when noise is present; e.g., in control-flow discovery some relations between the activities may become contradictory. Separating noise from the valid information in a log is a current research direction.
- Complexity: due to the magnitude of current IT logs, it is often difficult to use complex algorithms that either require loading the log into memory in order to derive the process model, or apply techniques whose complexity is not linear in the size of the log. In those cases, high-level strategies (e.g., divide-and-conquer) are the only possibility for deriving a process model.
- Visualization: even if the process discovery algorithm does its job and can derive a process model, it may be hard for a human to understand if it has more than a hundred elements (nodes, arcs). In those cases, a hierarchical description, similar to the Google Maps application where one can zoom in and out of parts of a model, will enable the understanding of a complex process model.

Acknowledgements
I would like to thank David Antón for creating the alien's drawing used in this paper.

References
[1] S. Rogers. Data is Scaling BI and Analytics - Data Growth is About to Accelerate Exponentially - Get Ready. Information and Management - Brookfield, 21(5): p. 14, 2011.
[2] D. Miers, S.A. White. BPMN Modeling and Reference Guide: Understanding and Using BPMN. Future Strategies Inc., 2008. ISBN-10: 0977752720.
[3] W.M.P. van der Aalst, H. Reijers, M. Song. Discovering Social Networks from Event Logs. Computer Supported Cooperative Work, 14(6): pp. 549-593, 2005.
[4] A. Rozinat, W.M.P. van der Aalst. Conformance Checking of Processes Based on Monitoring Real Behavior. Information Systems, 33(1): pp. 64-95, 2008.
[5] W.M.P. van der Aalst, A. Weijters, L. Maruster. Workflow Mining: Discovering Process Models from Event Logs. IEEE Transactions on Knowledge and Data Engineering, 16(9): pp. 1128-1142, 2004.
[6] W.M.P. van der Aalst. Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer, 2011. ISBN-10: 3642193447.


Using Process Mining in ITSM
Antonio Valle-Salas
Managing Partner of G2
<avalle@gedos.es>

Abstract: When it comes to information systems, ranging from copiers to surgical equipment or enterprise management systems, all the information about the processes executed using those systems is frequently stored in logs. Specifically for IT Service Management (ITSM) processes, it is quite common for the information systems used to execute and control those processes to keep structured logs that maintain enough information to ensure traceability of the related activities. It would be interesting to use all that information to get an accurate idea of how the process looks in reality, to verify whether the real process flow matches the previous design, and to analyze the process in order to make it more effective and efficient. This is the main goal of process mining. This paper explores the different capabilities of process mining and its applicability in the IT Service Management area.

Keywords: Change Management, ITSM, Process Management Tools, Process Mining, Service Desk, Services.

Author

Antonio Valle-Salas is Managing Partner of G2 and a specialist consultant in ITSM (Information Technology Service Management) and IT Governance. He graduated as a Technical Engineer in Management Informatics from UPC (Universitat Politécnica de Catalunya) and holds a number of methodology certifications such as ITIL Service Manager from EXIN (Examination Institute for Information Science), Certified Information Systems Auditor (CISA) from ISACA, and COBIT Based IT Governance Foundations from IT Governance Network, plus more technical certifications in the HP Openview family of management tools. He is a regular collaborator with itSMF (IT Service Management Forum) Spain and its Catalan chapter, and combines consulting and project implementation activities with frequent collaborations in educational activities in a university setting (such as UPC or the Universitat Pompeu Fabra) and in the world of publishing, in which he has collaborated on such publications as IT Governance: a Pocket Guide, Metrics in IT Service Organizations, Gestión de Servicios TI. Una introducción a ITIL, and the translations into Spanish of the books ITIL V2 Service Support and ITIL V2 Service Delivery.

1. Roles and Responsibilities in ITSM Models
All models, standards or frameworks used in the ITSM industry are process oriented. This is because process orientation helps structure the related tasks and allows the organization to formalize the great variety of activities performed daily: which activities to execute and when, who should carry them out, who owns what responsibilities over those tasks, which tools or information systems to use, and what are the expected objectives and outcomes of the process.

One model commonly used to represent the different components of the process is the ITOCO model [1] (Figure 1), which represents the fundamental elements of a process: Inputs, Outputs, Tasks, Control parameters and Outcomes.

This model allows us to differentiate between three different roles needed for the correct execution of any process: process operators, who are responsible for executing the different tasks; process managers, who guarantee that the process execution meets the specifications and ensure that both inputs and outputs match the expectations (within the specified control parameters); and process owners, who use a governance perspective to define the process, its outcomes and the applicable controls and policies, as well as being responsible for obtaining and allocating the resources needed for the right execution of the process.

The process manager's job is the execution of the control activities (also called the control process) over the managed process, acting on the deviations or the quality variations of the results, and managing the allocated resources to obtain the best possible results. Therefore, this role requires a combination of skills from diverse professional disciplines such as auditing, consulting and, chiefly, continuous improvement.

2. ITSM Process Management
The ITSM industry has traditionally used a number of methodological tools to enable the process manager to do the job:
- Definition of metrics and indicators (usually standardized from the adopted frameworks).
- Usage of Balanced Scorecards to show and follow those indicators.
- Definition of management reports (daily, weekly, monthly).
- Usage of various kinds of customer and/or user satisfaction surveys.
- Performance of internal or external compliance audits.

These tools allow the process manager to gain knowledge about the behavior of the processes she is in charge of, and to make decisions to set the correct course of tasks and activities. However, these tools are commonly rather rigid, whereas the process manager needs a deeper analysis of the process behaviour.

Still, there are two key aspects of any continuous improvement model: to know what the current situation is - as the starting point for the improvement journey - and to understand what the impact of the improvement initiatives will be on the process and its current situation. Both aspects are represented in Figure 2.

At these initial stages many questions arise regarding the daily activities of the process manager, namely:
- Which is the most common flow?
- What happens in some specific type of request?
- How long do the different cases stay at each state of the flow?
- Can we improve the flow?
- Where does the flow get stuck?
- Which are the most repeated activities?
- Are there any bottlenecks?
- Are the process operators following the defined process?
- Is there segregation of duties in place?

Moreover, in ITSM we usually find that most processes defined using frameworks do not fully match the real needs of daily operations; a standard and rigid approach to processes does not meet the needs of those activities in which the next steps are not known in advance [2].

One clear case of this type of process in ITSM is the problem management process. Here, to be able to execute the diagnostics and identification of root causes, the operator will have to decide the next step according to the results of the previous analysis. Thus, problem management is, by nature, a non-structured process whose behavior will totally differ from that of a strict process such as request management.

3. Process Mining & ITSM
The first and most delicate task when using process mining techniques is obtaining a log of good quality, representative of the process we want to analyze, and with enough attributes to enable filtering and driving subsequent analysis steps, as shown in Figure 3.

Fortunately enough, most ITSM process management tools have logs that allow the actions executed by the various process actors to be traced. These logs (e.g. Figure 4) are usually between maturity levels IV and V on the scale proposed by the Process Mining Manifesto [3].

The following steps of discovery and representation are those in which the use of process mining techniques provides immediate value.

The designed processes are usually different from the real execution of activities. This is caused by various factors, amongst which we find too generalist process designs (trying to cover non-structured processes), flexibility of the management tools (which are frequently configured to allow free flows instead of closed flows) and process operators' creativity (they are not always comfortable with a strict process definition).

For this reason, both the process owner and the process manager usually have an idealized view of the process, and so are deeply surprised the first time they see a graphical representation of the process built from the analysis of the real and complete information.

For instance, as mentioned in USMBOK [4], the different types of request a user can log into a call center will be covered by a single concept of Service Request that will then follow a different flow or Pathway, as shown in Figure 5. This flow will be "fitted" within a common flow in the corresponding module of the management tool used by the Service Desk team.

In order to fit this wide spectrum of different types of requests into a relatively general flow, we usually avoid a closed definition of the process and its stages (in the form of a deterministic automaton) and instead allow an open flow, as shown in Figure 6, in which any operator decides at any given time the next state or stage of the corresponding life cycle [2].

That is why, when we try to discover and represent these types of activities, we find what in process mining jargon is called a "spaghetti model", as shown in Figure 7. In this model, even with a reduced number of cases, the high volume of heterogeneous transitions between states makes the diagram of little (if any) use.

Therefore, to facilitate analysis, we need to use some techniques to divide the problem into smaller parts [5]. We can use clustering, or simply filter the original log, in order to select the type of pathway we want to analyze.
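The simple filtering variant can be sketched as follows; the event fields and pathway values are illustrative, not those of any particular ITSM tool. Each resulting sub-log can then be fed to discovery separately.

```python
# Sketch of splitting an enriched event log by a "pathway" (request type)
# attribute, so that each sub-log can be discovered and analyzed on its
# own instead of producing one spaghetti model for everything.
from collections import defaultdict

def split_by_pathway(events):
    sublogs = defaultdict(list)
    for event in events:
        sublogs[event["pathway"]].append(event)
    return sublogs

log = [
    {"case": 1, "activity": "REGISTERED", "pathway": "incident"},
    {"case": 2, "activity": "REGISTERED", "pathway": "service request"},
    {"case": 1, "activity": "SOLVED", "pathway": "incident"},
]
sublogs = split_by_pathway(log)
```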

Prior to the discovery and representation tasks, it is recommended that the log be enriched with any available information that will later allow segmenting the data set according to the various dimensions of analysis.

For instance, in this case we will need an attribute indicating the request type or pathway to be able to break down the model by requests, segmenting the data set

The role of the process manager requires a combination of skills from diverse professional disciplines such as auditing, consulting and, chiefly, continuous improvement

Figure 1. The ITOCO Model.



24 novática Special English Edition - 2013/2014 Annual Selection of Articles

monograph Process Mining



Figure 2. Continuous Improvement Cycle.

Figure 3. Sequence of Process Mining Steps.

Figure 4. Sample Log.



The first and most delicate task when using process mining techniques is obtaining a log of good quality, representative of the process we want to analyze

Figure 5. Pathways, According to USMBOK.


Figure 6. Strict Flow vs. Relaxed Flow.



and carrying on the analysis of a specific kind of request (see Figure 8).

On the other hand, we need to remember that process mining techniques are independent of the process activity. Instead, they focus on analyzing the changes of state.

At this point, we can be creative and think about the process flow as any "change of state within our information system", so we can use these techniques to analyze any other

Figure 7. Spaghetti Model.

transitions, such as the task assignment flow amongst different actors, the escalation flow amongst different specialist groups or (even less related to the common understanding of a process) ticket priority changes and reclassifications (see Figure 9).
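A minimal sketch of one such analysis, assuming a toy log where each case records the ordered list of operators that touched it, counts handovers of work between operators (the raw material behind a social map like Figure 9):

```python
from collections import Counter

# Toy data: case id -> operators in the order they handled the case.
# Names and cases are invented for illustration.
traces = {
    "T1": ["ana", "bob", "ana"],
    "T2": ["ana", "bob"],
    "T3": ["carla"],
}

handovers = Counter()
for ops in traces.values():
    # Each consecutive pair of distinct operators is one handover.
    for src, dst in zip(ops, ops[1:]):
        if src != dst:
            handovers[(src, dst)] += 1

print(handovers[("ana", "bob")])  # 2
```

The resulting counts can be drawn as a weighted graph, with edge thickness proportional to handover volume.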

Finally, at the analysis stage it is time to answer questions about the process behaviour. To do this we have a broad array of tools:

Enriching the visual representation: for example, in Figure 9 we can observe that longer transactions between operators are represented with a thicker line, while in Figure 8 the most frequent states are shown in a darker color.

Graphs and histograms: to represent volume or time-related information. Typical examples of this kind of analysis are graphic representations of the number of open cases over time (backlog evolution) and histograms showing the distribution of duration and/or events per case.

In more analytic fields, we can obtain a

Figure 8. Filtered Spaghetti Model.


Figure 9. Social Map: How Cases Flow through the Different Process Operators.


diagram showing a Márkov chain for our process (see Figure 10). It will depict the probability for a particular transition to happen, helping to answer questions like "what is the probability that a closed ticket will be re-opened?" We can also complement this information with case attributes (affected item, contact person, request type, organization, etc.) so that the model for analysis is richer.
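Estimating such transition probabilities from traces amounts to counting outgoing transitions per source state and normalizing. A minimal sketch with made-up ticket traces (not the data behind Figure 10):

```python
from collections import Counter, defaultdict

# Toy traces of ticket states; the state names mirror Figure 10.
traces = [
    ["Registered", "In Progress", "Solved", "Closed"],
    ["Registered", "In Progress", "Solved", "In Progress", "Solved", "Closed"],
    ["Registered", "In Progress", "Solved", "Closed"],
]

# counts[src][dst] = number of observed src -> dst transitions
counts = defaultdict(Counter)
for trace in traces:
    for src, dst in zip(trace, trace[1:]):
        counts[src][dst] += 1

def p(src, dst):
    """Empirical probability of moving from src to dst."""
    total = sum(counts[src].values())
    return counts[src][dst] / total if total else 0.0

# "What is the probability that a solved ticket is re-opened?"
print(round(p("Solved", "In Progress"), 2))  # 0.25
```

For real logs the same counts are usually broken down per case attribute (request type, group, etc.) to obtain richer, segmented models.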

So far we have covered methodological tools and mechanisms intended for quantitative and statistical analysis of processes and their behaviour. However, there is another side of the analysis focusing on the specific area of execution, answering questions such as "are there clear patterns of behavior in my process?" or "is the process execution meeting previous definitions or corporate policies?" [6].

To answer the first question we will use the concept of "variant". We can describe a variant as the set of cases executed following the same trace or sequence of events. Thus, it is possible that some types of requests are always completed in a common pattern. We can easily check this by analyzing the variants of our process, as shown in Figure 11 (right side), where we see 79% of cases following the same flow: Registered → Completed / Validation → Closed.
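A variant analysis of this kind can be sketched by grouping cases on their full event sequence (toy data, not the log behind Figure 11):

```python
from collections import Counter

# Toy cases: case id -> ordered sequence of events (the trace).
cases = {
    "A1": ("Registered", "Completed", "Validation", "Closed"),
    "A2": ("Registered", "Completed", "Validation", "Closed"),
    "A3": ("Registered", "Reassigned", "Completed", "Closed"),
    "A4": ("Registered", "Completed", "Validation", "Closed"),
}

# Each distinct trace is one variant; count how many cases follow it.
variants = Counter(cases.values())
top_variant, freq = variants.most_common(1)[0]
share = freq / len(cases)

print(round(share, 2))  # 0.75
```

A heavily dominant variant (like the 79% flow in Figure 11) suggests a stable common pattern; the long tail of rare variants is where exceptions live.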

To answer the second question, about process conformance, we must have a formal model of the process to compare with its real execution. Once we have this piece, we can take different approaches to the problem of validating conformance, as described by Anne Rozinat in her paper Conformance Checking of Processes Based on Monitoring Real Behavior [7]:

Fitness analysis: answers the question "is the observed process complying with the process flow specified in the model?"

Appropriateness analysis: answers the question "does the process model describe the observed process appropriately?"

Nevertheless, calculating some fitness index against a particular model will not be enough when doing analysis or audits; in those situations we will need ways to run complex queries over the log [8]. Being able to know in which situations activity A is executed before activity B, or when operator X executed activities A and B, will be of great importance to unveil violations of business rules or policies that govern the execution of the process.
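One simple query of this kind, "activity B must never occur before activity A", can be sketched as a linear scan over each trace (the activity names are hypothetical):

```python
def violates_order(trace, a, b):
    """True if some occurrence of `b` happens before any occurrence of `a`."""
    seen_a = False
    for act in trace:
        if act == a:
            seen_a = True
        elif act == b and not seen_a:
            return True
    return False

# Rule: "Approve change" must never occur before "Review change".
ok_trace  = ["Register", "Review change", "Approve change", "Close"]
bad_trace = ["Register", "Approve change", "Review change", "Close"]

print(violates_order(ok_trace,  "Review change", "Approve change"))  # False
print(violates_order(bad_trace, "Review change", "Approve change"))  # True
```

Richer rule languages (temporal logic over logs, as in [8], or compliance rule graphs, as in [9]) generalize this idea to arbitrary ordering and resource constraints.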

If these techniques are applied to ITSM processes, one interesting application is ensuring segregation of duties in change management for those organizations needing to comply with SOX or similar regulations. The next step would be continuous monitoring of these rules [9].

4. Conclusions
Process mining is presented as a set of tools that facilitate to a large degree process owners' and process managers' tasks, from acquiring knowledge about the real behaviour of the process to audits and continuous improvement. They allow many analyses that would be practically impossible or extremely costly to perform using traditional strategies like reporting, dashboarding or measuring indicators.

Although, generally speaking, one of the biggest difficulties we find in process mining is the lack of information or logs to analyze, in the specific area of IT service management



Figure 11. Process Variants.


Figure 10. Simplified Márkov Chain.


References

[1] Jan van Bon. IT Service Management Global Best Practices, Volume 1. NL: Van Haren Publishing, 2008.
[2] Rob England. Plus! The Standard+Case Approach. Wellington, NZ: CreateSpace, 2013.
[3] IEEE Task Force on Process Mining. Process Mining Manifesto (in 12 languages). <http://www.win.tue.nl/ieeetfpm/doku.php?id=shared:process_mining_manifesto>.
[4] Ian M. Clayton. USMBOK - The Guide to the Universal Service Management Body of Knowledge. CA, US: Service Management 101, 2012.
[5] Marco Aniceto Vaz, Jano Moreira de Souza, Luciano Terres, Pedro Miguel Esposito. A Case Study on Clustering and Mining Business Processes from a University, 2011.
[6] Wil M.P. van der Aalst et al. Auditing 2.0: Using Process Mining to Support Tomorrow's Auditor, 2010. <http://bpmcenter.org/wp-content/uploads/reports/2010/BPM-10-07.pdf>.
[7] Anne Rozinat, W.M.P. van der Aalst. Conformance Checking of Processes Based on Monitoring Real Behavior, 2008. <http://wwwis.win.tue.nl/~wvdaalst/publications/p436.pdf>.
[8] W.M.P. van der Aalst, H.T. de Beer, B.F. van Dongen. Process Mining and Verification of Properties: An Approach based on Temporal Logic, 2005.
[9] Linh Thao Ly, Stefanie Rinderle-Ma, David Knuplesch, Peter Dadam. Monitoring Business Process Compliance Using Compliance Rule Graphs, 2011. <http://dbis.eprints.uni-ulm.de/768/1/paper.pdf>.

this is not a problem. Here, ITSM process management tools keep logs that can be used for process mining and can define auditable fields to be traced, giving us different perspectives or dimensions of analysis.

Therefore, process mining stands out as a powerful and adequate tool to support ITSM practices in their permanent quest for improvement opportunities, both for the processes and the services of the management system.


1. Introduction
As the role of Big Data gains prevalence in this information-driven era [1][2][3], businesses the world over are constantly searching for ways to take advantage of these potentially valuable resources. The 2012 Business Process Intelligence Challenge (BPIC 2012) is an exercise in analyzing one such data set using a combination of commercial, proprietary, and open-source tools, and combining these with creative insights to better understand the role of process mining in the modern workplace.

1.1. Approach and Scope
The situation depicted in BPIC 2012 focuses on the loan and overdraft approvals process of a real-world financial institution in the Netherlands. In our analysis of this information, we sought to understand the underlying business processes in great detail and at multiple levels of granularity. We also sought to identify any opportunities for improving the efficiency and effectiveness of the overall process. Specifically, we attempted to investigate the following areas in detail:
- Develop a thorough understanding of the data and the underlying process.
- Understand critical activities and decision points.
- Map the lifecycle of a loan application from start to eventual disposition.
- Identify any resource-level differences in performance and opportunities for process interventions.

As newcomers to process mining, we at CKM Advisors wanted to use this opportunity to put into practice our learning in this discipline. We also attempted to combine process mining tools with traditional analytical methods to build a more complete picture. We are certain that with experience, our approach will become more refined and increasingly driven by methods developed specifically for process mining.

Our attempt was to be as broad as possible in our analysis and delve deep where we could. While we have done detailed analysis in a few areas, we have not covered all possible areas of process mining in our analysis. Any areas that we did not cover (for example, social network analysis) are driven solely by our own comfort and familiarity

Process Mining-Driven Optimization of a Consumer Loan Approvals Process

Arjel Bautista, Lalit Wangikar, S.M. Kumail Akbar
CKM Advisors, 711 Third Avenue, Suite 1806, New York, NY, USA

<{abautista,lwangikar,sakbar}@ckmadvisors.com>

Abstract: An event log (262,200 events; 13,087 cases) of the loan and overdraft approvals process from a bank in the Netherlands was analyzed using a number of analytical techniques. Through a combination of spreadsheet-based approaches, process mining capabilities and exploratory analytics, we examined the data in great detail and at multiple levels of granularity. We present our findings on how we developed a deep understanding of the process, assessed potential areas of efficiency improvement and identified opportunities to make knowledge-based predictions about the eventual outcome of a loan application. We also discuss unique challenges of working with such data, and opportunities for enhancing the impact of such analyses by incorporating additional data elements.

Keywords: Big Data, Business Process Intelligence, Data Analytics, Process Mining.

Authors

Arjel Bautista is a consultant at CKM Advisors, involved in the development of innovative process re-engineering and analytical research techniques within the firm. In his projects, Arjel has deployed a combination of state-of-the-art data mining tools and traditional strategic analysis to solve a variety of problems relating to business processes. He has also developed strategies for the analysis of unstructured text and other non-traditional data sources. Arjel holds Masters and Doctorate degrees in Chemistry from Yale University, and a Bachelor's degree in Biochemistry from UC San Diego.

Lalit Wangikar is a Partner at CKM Advisors. As a consultant, Lalit has advised clients primarily in the financial services, insurance, and payment services industries. His primary area of expertise is the use of Big Data and Analytics for driving business impact across all key business areas such as marketing, risk, operations and compliance. He has worked with clients in North America, the UK, Singapore and India. Prior to joining CKM Advisors, Lalit ran the Decision Analytics practice for EXL Service / Inductis. Prior to that he worked as a consultant with Deloitte Consulting and Mitchell Madison Group, where he advised clients in the banking and capital markets verticals.

Syed M. Kumail Akbar is a Consultant at CKM Advisors, where he is a member of the Analytics Team and assists in data mining, process mapping and predictive analytics. In the past, he has worked on strategy and operations projects in the financial services industry. Before joining CKM, Syed worked as a research assistant in both the Quantitative Analysis Center and the Physics Department at Wesleyan University. He also co-founded Possibilities Pakistan, a Non-Governmental Organization dedicated to providing access to college counseling for high school students in Pakistan. Syed holds a BA in Physics and Mathematics-Economics from Wesleyan University.

with the subject matter, and not necessarily a limitation of the data.

2. Materials and Methods
2.1. Understanding the Data
The data captures process events for 13,087 loan / overdraft applications over a six-month period, between October 2011 and March 2012. The event log comprises a total of 262,200 events within these cases, starting with a customer submitting an application and ending with the eventual conclusion of that application in an approval, cancellation or rejection (declined). Each application contains a single attribute, AMOUNT_REQ, which indicates the amount requested by the applicant. For each event, the extract shows the type of event, the lifecycle stage (Schedule, Start, Complete), a resource indicator and the time of completion.

The events themselves describe steps along the approvals process and are classified into three major types. Table 1 shows the event types and our understanding of what the events mean.

By itself, the event log is a complicated mass of information from which it is difficult to draw logical conclusions. Therefore, as other researchers have noted [4][5], it is necessary to subject the log to some degree of preprocessing in order to reduce its overall complexity, make visual connections between the steps contained within, and aid in analyzing and optimizing the business concepts at hand.


Although we were provided a rigorously pre-processed event log that could be analyzed in process mining tools quite readily, we processed the data further to build tailored extracts for various analytical purposes.

2.2. Tools Used for Analysis
Disco: We procured an evaluation version of Disco 1.0.0 (Fluxicon) and used it in the exportation of data into formats suitable for spreadsheet analysis. Disco was especially helpful in facilitating visualization of typical process flows and exceptions.

Microsoft Excel: We used Excel 2010 (Microsoft) to foster deeper exploration into the preprocessed data. Excel was especially helpful for performing basic and advanced mathematical functions and data sorting, two capabilities notably absent from the Disco application.

CART: We used an evaluation version of the CART implementation (Salford Systems) for conducting preliminary segmentation analysis of the loan applications to assess opportunities for prioritizing work effort.

3. Understanding the Process in Detail
3.1. Simplifying the Event Log
Upon obtaining the BPIC 2012 event log, we first attempted to reduce its overall complexity by identifying and removing redundant events. For the purposes of this analysis, an event is considered redundant if it occurs concurrently with or immediately after another event, such that the time between the two events is minimal (a few seconds at most) with respect to the time frame of the case as a whole.

Initial analysis of the raw data in Disco revealed a total of 4,366 event order variants among the 13,087 cases represented. We surmised that removal of even one sequence of redundant events could result in a


"A_" Application Events: Refers to states of the application itself. After a customer initiates an application, bank resources follow up to complete the application where needed and facilitate decisions on applications.

"O_" Offer Events: Refers to states of an offer communicated to the customer.

"W_" Work Events: Refers to states of work items that occur during the approval process. These events capture most of the manual effort exerted by the bank's resources during the application approval process. The events describe efforts during various stages of the application process:
- W_Afhandelen leads: Following up on incomplete initial submissions
- W_Completeren aanvraag: Completing pre-accepted applications
- W_Nabellen offertes: Following up after transmitting offers to qualified applicants
- W_Valideren aanvraag: Assessing the application
- W_Nabellen incomplete dossiers: Seeking additional information during the assessment phase
- W_Beoordelen fraude: Investigating suspected fraud cases
- W_Wijzigen contractgegevens: Modifying approved contracts

Table 1. Event Names and Descriptions.

Figure 1. Standardized Case Flow for Approved Applications.



Table 2. Potential Redundancies in the Event Log.

- A_PARTLYSUBMITTED: Occurs immediately after A_SUBMITTED in all 13,087 cases.
- O_SELECTED, O_CREATED: Both occur in quick succession prior to O_SENT for the 5,015 cases selected to receive offers. In certain cases, O_CANCELLED (974 instances), A_FINALIZED (2,907 instances) or W_Nabellen offertes-SCHEDULE (1 instance) occur between O_SELECTED and O_CREATED in the offer creation process.
- O_ACCEPTED, A_REGISTERED, A_ACTIVATED: All three occur, in random order, with A_APPROVED for the 2,246 successful applications. In certain cases, O_ACCEPTED is interspersed among these events.

significant reduction in the number of variants. This simplification is compounded further when the number of removed variants is multiplied by others occurring downstream of the initial event.

Additionally, we eliminated two O-type events (O_CANCELLED and O_DECLINED) which occur simultaneously with A_CANCELLED and A_DECLINED, respectively. W-type events were not considered for removal, as their transition phases are crucial for calculating the work time spent per case. With the redundant events removed from the event log, the number of variants was reduced to 3,346, an improvement of nearly 25% over the unfiltered data set. Such consolidation can aid in simplifying the process data and facilitating quicker analysis. The variant complexity could be further reduced by interviewing process experts at the bank to help consolidate events that occur together and sequencing variations not critical for business analysis.
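As an illustration of the mechanics (with invented traces, not the BPIC 2012 data), removing a set of events treated as redundant can collapse several order variants into one:

```python
from collections import Counter

# Toy traces reproducing the pattern in Table 2: A_REGISTERED and
# A_ACTIVATED occur in random order around A_APPROVED, inflating variants.
traces = [
    ["O_ACCEPTED", "A_APPROVED", "A_REGISTERED", "A_ACTIVATED"],
    ["O_ACCEPTED", "A_APPROVED", "A_ACTIVATED", "A_REGISTERED"],
    ["O_ACCEPTED", "A_REGISTERED", "A_APPROVED", "A_ACTIVATED"],
]

# Events considered redundant here (subsumed by A_APPROVED).
REDUNDANT = {"A_REGISTERED", "A_ACTIVATED"}

def simplify(trace):
    return [e for e in trace if e not in REDUNDANT]

simplified = [simplify(t) for t in traces]
variants_before = len(Counter(map(tuple, traces)))
variants_after = len(Counter(map(tuple, simplified)))

print(variants_before, variants_after)  # 3 1
```

The same counting, applied to the full log, is how the drop from 4,366 to 3,346 variants would be measured.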

3.2. Determining Standard Case Flow
We next sought to determine the standard case flow for a successful application, against which all other cases could then be compared. We did this by loading the simplified project into Disco and filtering all cases for the attribute A_APPROVED. We then set both the activities and paths thresholds to the most rigorous level (0%), which resulted in an idealized depiction of the path from submission to approval (see Figure 1).

3.3. Understanding Application Outcomes
Before launching into a more detailed review of the data, we found it necessary to define endpoint outcomes for all 13,087 applications. Using the standardized case flow (see Figure 1), we determined that all applications are subject to one of four fates at each stage of the approvals process:

- Advancement to next stage: The application proceeds to the next stage of the process.
- Approved: Applications that are approved and where the customer has accepted the bank's offer are considered a success and are tagged as Approved, with the endpoint depicted by the event A_APPROVED.
- Cancelled: The application is cancelled by the bank or at the request of the customer. Cancelled applications have a final endpoint of A_CANCELLED.
- Denied: The applicant, after having been subject to review, is deemed unfit to receive the requested loan or overdraft. Denied applications have a final endpoint of A_DECLINED.

We leveraged Disco's filtering algorithm to define a set of possible endpoint behaviors. 399 cases were classified as unresolved because they were in progress at the time the data was collected (i.e., did not contain endpoints of A_DECLINED, A_CANCELLED or A_APPROVED).
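The endpoint classification described above can be sketched as a scan of each trace for the three terminal events, with anything else labeled unresolved:

```python
# Map terminal events to outcome labels, per the definitions above.
ENDPOINTS = {
    "A_APPROVED": "Approved",
    "A_CANCELLED": "Cancelled",
    "A_DECLINED": "Denied",
}

def classify(trace):
    """Return the outcome of a case; in-flight cases are 'Unresolved'."""
    for event in reversed(trace):  # scan from the end of the trace
        if event in ENDPOINTS:
            return ENDPOINTS[event]
    return "Unresolved"

print(classify(["A_SUBMITTED", "A_PREACCEPTED", "A_APPROVED"]))  # Approved
print(classify(["A_SUBMITTED", "A_PREACCEPTED"]))                # Unresolved
```

Applied over all cases, this yields the outcome counts that feed the Figure 2 flow.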

Figure 2 shows a high-level process flow that marks how the cases are disposed of at each of the key process steps. This analysis provides useful insights on the overall business impact of this process, as well as the overall case flow through critical process steps.

We observe several baseline performance characteristics from Figure 2:

- ~26% of applications are instantly declined (3,429 out of 13,087), indicating tight screening criteria for moving an application beyond the starting point.
- ~24% of the remaining (2,290 out of 9,658) are declined after initial lead follow-up, indicating a continuous risk selection process at play.
- 754 of the 3,254 applications that go to the validation stage (~23%) are declined, indicating possibilities for tightening upfront scrutiny at the application or offer stages.

4. Assessing Process Performance

4.1. Case-Level Analysis
4.1.1. Case Endpoint vs. Overall Duration
In an effort to evaluate how the fate of a particular case changes with overall duration, we prepared a plot of these two variables and overlaid upon it the cumulative amount of work time amassed over the life of these cases. We excluded the 3,429 cases that are instantly declined on initial application submission, as no effort is spent on these. We endeavored to visualize the point at which exertion of additional effort yields minimal or no return in the form of completed (closed) applications.

Figure 3 shows a lifecycle view of all applications, indexed to the time of submission. As shown in the figure, within the first seven days applications continue to move forward or are declined. At Day 7, the number of approved cases begins to rise, suggesting this is the minimal number of days required to fulfill the steps in the standard case flow (see Figure 1).

Approvals continue until ~Day 23, at which point >80% of all cases that are eventually approved have been closed and registered. There is a significant jump in the number of cancelled applications at Day 30, as inactive cases receiving no response from the applicant after stalling in the bottleneck stages Completeren aanvraag or Nabellen offertes are cancelled, likely according to bank policies.

This raises the interesting question of when the bank should stop any proactive efforts to convert an application to a loan, and whether the bank should treat customers differently based on behaviors that indicate the likelihood of eventual approval. For example, the bank exerts an additional 380+ person-days of effort between Days 23 and 31, only to cancel a majority of pending cases at the conclusion of this period. With additional data on customer profitability or lifetime value and the comparative cost of additional effort, one can determine an optimal point


in the process where additional effort on cases that have not reached a certain stage carries no positive value.

4.1.2. Segmenting Cases by Amount Requested
As each case is associated with an amount requested by the applicant, we found it appropriate to arrange the cases into segments of roughly equal number, sorted by total requested value. We first removed the instantly declined cases by filtering them out in Disco, as these are resolved immediately upon submission and do not involve any additional effort or steps in the process. The resultant 9,658 cases (which include those in progress) were then split into deciles of 965-966 cases each. Each decile was further segmented by classifying the cases according to eventual outcome, and the ensuing trends were examined for correlation of approval percentage with amounts requested (see Figure 4).
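A sketch of this segmentation on toy data: sort cases by amount, cut them into equal-size buckets, and compute the approval rate per bucket (the article uses deciles; four buckets here for brevity):

```python
def approval_by_bucket(cases, n_buckets):
    """cases: list of (amount, outcome) pairs.
    Returns the approval rate per equal-size amount bucket."""
    ordered = sorted(cases, key=lambda c: c[0])
    size = len(ordered) // n_buckets
    rates = []
    for b in range(n_buckets):
        bucket = ordered[b * size:(b + 1) * size]
        approved = sum(1 for _, outcome in bucket if outcome == "Approved")
        rates.append(approved / len(bucket))
    return rates

# Invented cases: amount 1000..20000, approved when the index divides by 3.
cases = [(1000 * i, "Approved" if i % 3 == 0 else "Declined")
         for i in range(1, 21)]

print(approval_by_bucket(cases, 4))  # [0.2, 0.4, 0.4, 0.2]
```

Plotting these rates against the bucket ranges is the same comparison made in Figure 4.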

We immediately observed the highest approval percentages in deciles 3 and 6, whose cases contained request ranges of

Figure 2. Key Process Steps and Application Volume Flow.

Figure 3. Distribution of Cases by Eventual Outcome and Duration, with Cumulative Work Effort. Gray: Remaining In Progress, Blue: Cumulative Declined, Red: Cumulative Cancelled, Green: Cumulative Approved. Excludes 3,472 Instantly Declined Cases.

Key Process Steps and Distribution of Application Volume (Figure 2):
1. A_SUBMITTED: 13,087 cases (100% of submissions)
2. A_PARTLYSUBMITTED: 13,087 cases (100% of submissions)
   (Afhandelen leads: declined instantly 3,429; declined after call 2,290; cancelled 1; unresolved 0)
3. A_PREACCEPTED: 7,367 cases (56% of submissions)
   (Completeren aanvraag: declined 1,085; cancelled 1,100; unresolved 69)
4. A_ACCEPTED: 5,113 cases (39% of submissions)
   (Finalization of applications: declined 29; cancelled 66; unresolved 3)
5. A_FINALIZED: 5,015 cases (38% of submissions)
6. O_SELECTED, O_CREATED, O_SENT: 5,015 cases (38% of submissions)
   (Customer response to mailed offers: declined 48; cancelled 1,482; unresolved 231)
7. O_SENT_BACK: 3,254 cases (25% of submissions)
   (Valideren aanvraag: declined 754; cancelled 158; unresolved 96)
8. A_APPROVED, A_ACTIVATED, A_REGISTERED: 2,246 cases (17% of submissions)



Figure 4. Endpoints of Cases (Left Axis), Segmented by Amounts Requested by the Applicant. Green: Approved, Red: Cancelled, Blue: Declined, Violet: In Progress.

5,000-6,000 and 10,000-14,000, respectively. The exact reason for this pattern is unclear; however, we speculate that typical applicants will often choose a "round" number upon which to base their requests (indeed, this is reflected in the three most frequent request values in the data set: 5,000, 10,000 and 15,000). Perhaps a certain risk threshold change in the bank's approval process causes a step change in approval percentages.

4.2. Event-Level Analysis
4.2.1. Calculating Event Duration
We sought to gain a detailed understanding of the work activities embedded in the approvals process, specifically those that contribute a significant amount of time or resources toward resolution. The format of the data made available in this case was not readily amenable to this analysis.

We used Excel to manipulate the event-level data as provided and defined work time (presumably actual effort expended by human resources) for each event as the duration from start to finish (START / COMPLETE transitions, respectively). In contrast, wait time was defined as the latency between event scheduling and commencement (SCHEDULE / START), the time elapsed between two instances of a single activity type, or the time between the COMPLETE of one event and the START of another.
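The work/wait split can be sketched from lifecycle transitions as follows (toy timestamps; the accumulation rules reflect our reading of the definitions above):

```python
from datetime import datetime, timedelta

fmt = "%Y-%m-%d %H:%M"
# Toy lifecycle records for one work item: (transition, timestamp).
events = [
    ("SCHEDULE", datetime.strptime("2012-01-02 09:00", fmt)),
    ("START",    datetime.strptime("2012-01-02 11:00", fmt)),
    ("COMPLETE", datetime.strptime("2012-01-02 11:30", fmt)),
]

work = timedelta()
wait = timedelta()
last_transition, last_ts = None, None
for transition, ts in events:
    if last_transition == "START" and transition == "COMPLETE":
        work += ts - last_ts  # effort actually spent on the item
    elif last_transition in ("SCHEDULE", "COMPLETE") and transition == "START":
        wait += ts - last_ts  # latency before work starts or resumes
    last_transition, last_ts = transition, ts

print(work, wait)  # 0:30:00 2:00:00
```

Summed per activity over all cases, these two quantities give the work-versus-wait breakdown referred to in Table 3.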

As shown in Table 3, two activities, Completeren aanvraag and Nabellen offertes, contribute a significant amount to the total case time represented in the event log. The accumulated wait time attributed to each of these two events can reach as high as 30+ days per case, as the bank presumably makes numerous attempts to reach the applicant until contact is made.

On closer inspection of the data, we realized that the bank attempts to contact the customer multiple times per day until Day 30 in order to complete the application, as well as to follow up on offers that have been extended but not yet replied to.

4.2.2. Initial vs. Follow-Up Activities
The average work time spent performing each event changes depending on whether the bank is conducting it for the first time or following up on a previous step in a particular case (see Figure 5).

Some differences between initial and follow-up instances are minimal (Valideren aanvraag), while others are more pronounced (Beoordelen fraude). In the case of Valideren aanvraag, the bank is likely to be as thorough as possible during the validation process, regardless of how many times it has previously viewed an application. On the other hand, when investigating suspect cases for fraud, the bank may already have come to a preliminary conclusion regarding the application and is merely using the follow-up instance to justify its decision.

Follow-up instances of those events in which the bank must contact the applicant often have smaller average work times than their initial counterparts, as these activities are the most likely to become trapped in repeating loops, perhaps due to non-responsive customers. One can leverage such event data to understand customer behavior




and assess the potential usefulness of such data for work prioritization.

4.3. Resource-Level Analysis
4.3.1. Specialist vs. Generalist-Driven Work Activities
We profiled 48 resources that handled at least 100 total events (excluding resource 112, as this resource does not handle work events outside of scheduling) and computed work volume by the number of events handled by each. We observed nine resources that spent >50% of their effort on Valideren aanvraag, and a distinct group that mostly performed the activities Completeren aanvraag, Nabellen offertes and Nabellen incomplete dossiers. It appears validation is performed by a dedicated team of specialists focused on this work type, while customer-facing activities such as Completeren aanvraag, Nabellen offertes and Nabellen incomplete dossiers require similar skills and are performed by another specialized group.

We next examined the performance of resources identified as specialists (>50% of work events of one single type) or contributors (25-50%) and compared them with those who played only minor roles in similar activities. To do this, we took the total work time accumulated in an activity by resources belonging to a particular category and calculated averages based on the total number of work events performed in that category. Two activities, Nabellen offertes and Valideren aanvraag, did not contain specialists and contributors, respectively, and so these categories were omitted from the comparisons for these activities.

As depicted in Figure 6, specialists spent less time per event instance than their counterparts, in some cases performing tasks up to 80% more efficiently than minor players. The performance of contributors is far less consistent, however, exhibiting average work times per case that are both higher (Afhandelen leads, Nabellen offertes) and lower (Completeren aanvraag, Nabellen incomplete dossiers) than those of the minor players. These results suggest that an office of specialists performing single activities may be better suited to handle a larger amount of cases than an army of resources charged with a myriad of tasks.

4.4. Leveraging Behavioral Data for Work Effort Prioritization
One of the objectives of process mining is to identify opportunities for driving process effectiveness; that is, achieving better business outcomes for the same or less effort in a shorter or equal time period. In particular, we sought to use process event data collected on an application to better prioritize work efforts. Specifically, we set out to understand if this could be done on the fifth day since the application was submitted.

To do this, we created an application-level data set for 5,255 cases that lasted >4 days and where the end outcome is known. For these applications, we captured all events from submission until the end of day 4 and used them to calculate the following:
 What stage the application had reached, and if it had been completed.
 How much effort had already gone into the application.
 How many events of each kind had already been logged.
 If the application required lead follow up.
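The four per-case features listed above can be derived from raw events roughly as follows; the event names and tuple layout are illustrative assumptions rather than the exact BPIC 2012 format.

```python
from datetime import datetime, timedelta
from collections import Counter

def day4_features(case_events, submitted):
    """Features available at the end of day 4 for one application.

    `case_events` is a chronologically sorted list of
    (timestamp, event_name, work_minutes) tuples (illustrative layout).
    """
    cutoff = submitted + timedelta(days=4)
    visible = [e for e in case_events if e[0] <= cutoff]
    counts = Counter(name for _, name, _ in visible)
    return {
        "last_stage": visible[-1][1] if visible else None,   # stage reached
        "effort_so_far": sum(w for _, _, w in visible),      # work already spent
        "event_counts": dict(counts),                        # events of each kind
        "needed_lead_followup": counts["W_Afhandelen leads"] > 0,
    }

t0 = datetime(2012, 2, 1)
events = [
    (t0, "A_SUBMITTED", 0),
    (t0 + timedelta(days=1), "W_Afhandelen leads", 12),
    (t0 + timedelta(days=6), "A_APPROVED", 0),  # after the cutoff, ignored
]
f = day4_features(events, t0)
```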

Figure 5. Comparison of Average Work Times, Initial vs. Follow-Up Event Instances.

Page 39: Novatica 223 - Process Mining (English Edition)

                        Afhandelen  Beoordelen  Completeren  Nabellen      Nabellen Incomplete  Valideren
                        Leads       Fraude      aanvraag     Offertes (*)  Dossiers             Aanvraag
Work Time:  Approved    13,659      23          45,909       68,473        89,204               121,099
            Cancelled   14,601      2           119,497      94,601        25,633               7,775
            Declined    67,560      2,471       63,052       30,870        26,993               29,946
Wait Time:  Approved    198,916     8,456       1,873,537    34,972,224    5,980,887            10,537,938
            Cancelled   300,062     28,763      16,582,465   42,630,195    2,006,774            678,105
            Declined    986,421     236,115     3,294,367    13,542,054    1,001,354            3,227,252

Table 4. Potential Time Savings Associated with Conversion of Current Generalists to Single-Activity Specialists. (*) None of the resources performing Nabellen offertes were identified as specialists; therefore mean efficiency for area contributors was used instead.

We attempted to find key segments in this population that were highly likely to be approved and accepted OR highly likely to be cancelled or declined. We did this by subjecting the data to segmentation using the Classification and Regression Tree (CART) technique (see Figure 7).

The partial tree above shows two segments with <6% approval rates: Terminal Nodes 1 and 14, consisting of a total of 1,018 cases with only 49 eventual approvals. Node 14, consisting of 818 cases, shows incomplete applications where the bank could not prepare an offer for the customers by the end of Day 4. Such "slow-moving" applications had a <6% chance of being approved, compared to an average of 42% for the entire group of 5,255. Node 1 has applications that are touched by 3 or fewer resources, with 112 being one of them. This might be another indicator of a slow-moving application. Such applications have virtually no likelihood of being approved in the end.
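The CART idea behind such a tree is to repeatedly pick the split that minimizes class impurity. A minimal, single-split sketch using Gini impurity is shown below on toy data (not the paper's 5,255 cases); the feature name `n_resources` is an illustrative assumption.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels, feature):
    """Best threshold on one numeric feature, minimizing weighted Gini.

    One-step sketch of the CART splitting rule, not full tree induction.
    """
    best = (None, float("inf"))
    values = sorted({r[feature] for r in rows})
    for t in values[:-1]:
        left = [l for r, l in zip(rows, labels) if r[feature] <= t]
        right = [l for r, l in zip(rows, labels) if r[feature] > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if score < best[1]:
            best = (t, score)
    return best

# Toy data: applications touched by few resources are rarely approved.
rows = [{"n_resources": n} for n in [1, 2, 2, 3, 5, 6, 7, 8]]
labels = ["declined"] * 4 + ["approved"] * 4
threshold, score = best_split(rows, labels, "n_resources")
```

Here the split `n_resources <= 3` separates the classes perfectly, which is exactly the kind of segment boundary CART surfaces.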

One could repeat this analysis at different stages in the lifecycle of the application to help with effort prioritization. This preliminary analysis indicates significant potential to reduce effort on cases that might not reach the desired end state. Further analysis with customer demographics, application details, and additional information on resources who work on such cases will help refine the findings and suggest specific action steps to improve process effectiveness.

5. Discussion
5.1. Working with Data Challenges
5.1.1. Managing Event Complexity
The optimization of the loan approvals process is an exercise in streamlining each step of the end-to-end operation. One notable point that creates challenges in building a streamlined process view with automated process mining tools is the amount and complexity of data captured. If such data is not used with accompanying business judgment, one can get lost in apparent complexity (>4,000 process variants for a process that has 6-7 key steps). We illustrated this point above in our discussion regarding redundant events. We recommend dealing with such complexities at the time of analysis, using process knowledge and good business judgment, by performing additional data pre-processing steps.

It is also critical to scrutinize event data up front to understand all quirks and to build ways of addressing these. For example, a comparison of the number of START and COMPLETE transitions for W-type events in the data set reveals the existence of 1,037 more COMPLETE transitions than START transitions. As the time stamps for these events are unique with respect to others in the same Case ID, they have the potential to greatly confuse the summation of work and wait times for a particular case and for resources within the institution. We denoted these as system errors and worked with the first COMPLETE following a START as the "correct" one for a given work event type. In a real project, we would validate our assumption by deeper review of how such instances arise in the system and by using that understanding to treat these observations correctly in our analysis.
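The pairing rule described above (the first COMPLETE after a START is taken as the "correct" one, and surplus COMPLETEs are treated as system errors) can be sketched as follows; the tuple layout is an illustrative assumption.

```python
def pair_transitions(transitions):
    """Pair each START with the first following COMPLETE of the same
    work event type, discarding extra COMPLETEs as system errors.

    `transitions` is a chronologically sorted list of
    (timestamp, event_name, lifecycle) tuples for one case.
    """
    open_start = {}   # event_name -> start timestamp
    intervals = []
    for ts, name, lifecycle in transitions:
        if lifecycle == "START":
            open_start[name] = ts
        elif lifecycle == "COMPLETE" and name in open_start:
            intervals.append((name, open_start.pop(name), ts))
        # a COMPLETE with no open START is dropped as a system error
    return intervals

case = [
    (1, "W_Valideren aanvraag", "START"),
    (5, "W_Valideren aanvraag", "COMPLETE"),  # first COMPLETE: kept
    (7, "W_Valideren aanvraag", "COMPLETE"),  # orphan COMPLETE: dropped
]
work = pair_transitions(case)
```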

As described in Section 3.1, the event log would also benefit from consolidation of events that happen concurrently, such as those that occur when successful applications are approved (A_APPROVED, A_REGISTERED and A_ACTIVATED). This would not only decrease the overall file size (which becomes important as the volume of data grows), but also reduce the complexity of the initial log.

5.2. Potential Benefits of Resource Re-Deployment
5.2.1. Recasting Generalists as Specialists
As mentioned previously, the tasks involved in the loan approvals process are performed by a mixture of specialists and generalists. Through our analysis we concluded that the bank might benefit from specialization of labor, whereby current resources are reassigned to single posts in order to maximize efficiency. In Table 4, we show potential gains to be made through such restructuring. If the bank can improve the performance of everyone executing a task to the same levels as specialists, we estimate a substantial overall time saving.

We also evaluated the potential savings associated with downsizing the overall pool of resources assigned to these tasks. Using the average amount of work time for resources handling >100 total events (approximately 16,000 minutes; again excluding resource 112), we estimate an opportunity to reduce the work effort by 35%.

5.3. The Power of Additional Information
5.3.1. Case-Level Attributes
In its raw form, the BPIC 2012 event log is a gold mine of information that, once decoded, provides a detailed view of a consumer loan approvals process. However, this information would be greatly strengthened by the addition of a few key data points. As each case carries with it a single attribute – the amount requested by the applicant – we have no way of knowing why certain cases are approved while others with identical request amounts and paths are rejected. Therefore it would be useful to know customer demographics, current or past relationships with the customers, and additional details about the resources that execute these processes. With this information, we can build specific recommendations for changing the process and more accurately estimate the likely benefits of such changes.

Page 40: Novatica 223 - Process Mining (English Edition)

Figure 6. Work Time per Event, Specialists / Contributors vs. Minor Players.

5.3.2. Customer Profitability and Operating Costs
A final set of data notably absent from the provided BPIC 2012 log is the overall costs associated with the loan approvals process and the value of each loan application to the bank. It would be worthwhile to understand how much it costs to operate each resource, and whether this cost varies based on the activities they perform or the number of events they participate in. This information would also allow us to calculate an average acquisition cost for each applicant, and subsequently understand the minimum threshold below which it does not make economic sense to approve an incoming loan request.

6. Conclusions
Through comprehensive analysis of the BPIC 2012 event log, we converted a fairly complex data set into a clearly interpretable, end-to-end workflow for a loan and overdraft approvals process. We examined the data at multiple levels of granularity, uncovering interesting insights at all levels. Through our work we uncovered potential improvements in a number of areas, including revision of automated processes, restructuring of key resources, and evaluation of current case handling procedures. Indeed, future analysis would be greatly aided by the inclusion of additional data, such as customer information, governing policies, operating costs and relative customer value.

As part of our analysis, we performed a rudimentary predictive exercise whereby we determined the current status of cases at various days in the approvals process and quantified their chances of approval, cancellation, or denial. This allowed us to estimate the fate of a case based on its performance and tailor the overall process to minimize stalling at traditional case bottlenecks. While preliminary in nature, this opens the door to more elaborate future modeling exercises, perhaps driven by sophisticated computer algorithms.

While we covered several areas in this exercise, there are others where we did not conduct detailed analysis. The bank would find significant additional benefits from exploring such additional areas, for example, social network analysis.

In conclusion, the procedures highlighted by the BPIC 2012 elaborate the role and importance of process mining in the modern workplace. Steps that were previously elucidated only after years of practice and observation can now be examined using a sample set of existing data. As the era of Big Data continues its march toward the business world, we foresee process mining as a central player in the charge toward turning questions into solutions and problems into sustainable profit.

Acknowledgements
We are grateful to the financial institution that made this data available for study. Special thanks to Fluxicon for providing us with an evaluation copy of Disco and an accompanying copy of the BPIC 2012 data set. We also thank Tom Metzger, Nicholas Hartman, Rolf Thrane and Pierre Buhler for helpful discussions and insights. Our thanks also to Salford Systems, who made their software available in a demonstration version.

Page 41: Novatica 223 - Process Mining (English Edition)


References

[1] W. van der Aalst, A. Adriansyah, A.K. Alves de Medeiros, F. Arcieri, T. Baier et al. Process Mining Manifesto. Business Process Management Workshops 2011, Lecture Notes in Business Information Processing, vol. 99. Springer-Verlag, 2011.
[2] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, A. Byers. Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute, 2011. <http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation>.
[3] R. Adduci, D. Blue, G. Chiarello, J. Chickering, D. Mavroyiannis et al. Big Data: Big Opportunities to Create Business Value. Technical report, Information Intelligence Group, EMC Corporation, 2011.

Figure 7. Partial View, CART Segmentation Tree.


[4] R.P.J.C. Bose, W.M.P. van der Aalst. Analysis of Patient Treatment Procedures: The BPI Challenge Case Study. First International Business Process Intelligence Challenge, 2011. <http://bpmcenter.org/wp-content/uploads/reports/2011/BPM-11-18.pdf>.
[5] W.M.P. van der Aalst. Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer, 2011. ISBN-10: 3642193447.

Page 42: Novatica 223 - Process Mining (English Edition)


Detection of Temporal Changes in Business Processes Using Clustering Techniques

1. Introduction
In a globalized and hyper-connected world, organizations need to be constantly changing in order to adapt to the needs of their business environment, which implies that their business processes must also be constantly changing.

To illustrate this, consider the case of a toy store whose sales practices may change radically depending on whether they are executed over the Christmas period or during vacations, principally due to the changes in the volume of demand. For the Christmas season, they might employ a process that prioritizes efficiency and volume (throughput), while during the vacation season they might use a process which focuses on the quality of their customer service.

In this example, it is easy to identify the periods in which demand changes, and thus it is possible that the process manager (the person in charge of the process management) will have a clear understanding of the changes that the sales process undergoes over time. However, if an organization has a process that occurs in various independent offices, for example, offices that are located in different geographic locations, the evolution of the changes in the process at each office will not be so evident to the central process manager. Moreover, the changes in each office could be different and could happen at different moments in time.

Understanding the changes that are occurring in the different offices could help to better understand how to improve the overall design of the process. Being able to understand these changes and model the different versions of the process allows the process manager to have more accurate and complete information in order to make coherent decisions which result in better service or efficiency.

To achieve the aforementioned goals, various advances have been made in the discipline of Business Process Management (BPM), a discipline which combines knowledge about information technology and management techniques, which are then applied to business operation processes, with the goal of improving efficiency [1]. Within BPM, process mining has positioned itself as an emerging discipline, providing a set of tools that help to analyze and improve business processes [1], based on analyzing event logs stored by information systems during the execution of a process. However, despite the advances made in this field, there still exists a great challenge, which consists of incorporating the fact that processes change over time, a concept which is known in the literature as Concept Drift [2].

Depending on the nature of the change, it is possible to distinguish different types of Concept Drift, including: Sudden Drift (a sudden and significant change to the definition of the process), Gradual Drift (a gradual change to the definition of the process, allowing for the simultaneous existence of the two definitions), and Incremental Drift (the evolution of the process that occurs through small, consecutive changes to the definition of the model).

Despite the existence of all these Drift types, the existing techniques of process mining are limited to finding the points in time when the process changes, centering principally on Sudden Drift changes. The problem with this limitation is that in practice it is not that frequent for business processes to show a sudden change in definition.

If we apply the existing process mining approaches to processes that have different kinds of changes other than Sudden Drift, we could end up with results which make little sense to the business.

In this document we propose a new approach, which allows the discovery of the various versions of a process when there are different kinds of Drift, helping to understand how the process behaves over time. To accomplish this task, existing process mining clustering techniques are used, but with time incorporated as an additional factor to the

Abstract: Nowadays, organizations need to be constantly evolving in order to adjust to the needs of their business environment. These changes are reflected in their business processes, for example: due to seasonal changes, a supermarket's demand will vary greatly during different months of the year, which means product supply and/or re-stocking needs will be different during different times of the year. One way to analyze a process in depth and understand how it is really executed in practice over time is on the basis of an analysis of past event logs stored in information systems, known as process mining. However, currently most of the techniques that exist to analyze and improve processes assume that process logs are in a steady state, in other words, that the processes do not change over time, which in practice is quite unrealistic given the dynamic nature of organizations. This document presents in detail the proposed technique and a set of experiments that reflect how our proposal delivers better results than existing clustering techniques.

Keywords: Concept Drift, Clustering, Process Mining, Temporal Dimension.

Authors

Daniela Luengo was born in Santiago, Chile. She is an Industrial Civil Engineer with a major in Information Technology from Pontificia Universidad Católica de Chile. She also received the academic degree of Master of Science in Engineering from the same university. Additionally she has an academic certificate in Physical Activity, Sport, Health and Education. From 2009 to 2013 she worked at the Information Technology Research Center of the Pontificia Universidad Católica de Chile (CETIUC), as an analyst and consultant in the area of process management excellence, performing process mapping, process improvement and several research projects. Her current interests are focused on applying her knowledge in the public sector of her country.

Marcos Sepúlveda was born in Santiago, Chile. He received his Ph.D. in Computer Science from the Pontificia Universidad Católica de Chile in 1995. He did postdoctoral research at ETH Zürich, Switzerland. He has been an associate professor in the Computer Science Department at the Pontificia Universidad Católica de Chile since 2001. He is also the director of the Information Technology Research Center of his university (CETIUC). His research interests are Process Mining, Business Process Modeling, Business Intelligence, and Information Systems Management.

Daniela Luengo, Marcos Sepúlveda
Computer Science Department, School of Engineering, Pontificia Universidad Católica de Chile, Santiago (Chile)

<[email protected]>, <[email protected]>

Page 43: Novatica 223 - Process Mining (English Edition)


control-flow perspective to generate the different clusters. Trace Clustering techniques are used which, unlike other metrics-based techniques that measure the distance between complete sequences, have linear complexity, allowing results to be delivered in a shorter time span [3].

The focus of our work contributes to process analysis, allowing the process manager to have a more realistic vision of how the process behaves over different periods of time. With this approach it is possible to determine the different versions of the process, the characteristics of each version, and to identify at which moment the changes occur.

This article is organized in the following way. Section 2 presents the related work. Section 3 describes the base version and the modified version of the clustering method. Section 4 presents experiments and results, and finally the conclusions and future work are presented in Section 5.

2. Related Work
Process mining is a discipline that has attracted major interest recently. This discipline assumes that the historical information about a process stored in information systems can be found in a dataset, known as an event log [4]. This event log contains past information about the activities that were performed in each step of the process, where each row of the log is composed of at least one identifier (id) associated with each individual execution of the process, the name of the activity performed, the timestamp (day and time when the activity occurred), and, optionally, additional information like the person who carried out the activity or other such information. Additionally, in the literature [3] an ordered list of activities invoked by a specific execution of the process is defined as a trace of the execution.

Currently, one of the problems in process mining is that the developed algorithms assume the existence of information relative to a unique version of the process in the event log. However, this is often not the case, which is why applying the process mining algorithms to these logs leads to fairly unrepresentative and/or very complicated results, which contribute little to the job of analyzing and improving the processes.

2.1. Clustering of the Event Log
To resolve the aforementioned problems in process mining, clustering techniques have been proposed for dividing the event log before the process mining techniques are applied [5]. This consists of dividing the event log into homogeneous clusters, in order to later apply the process mining techniques independently to each cluster and thus obtain more representative models. Figure 1 shows the preprocessing stage of the log, and then uses a discovery technique as an example of a process mining technique.

To produce this clustering, it is necessary to define a way of representing the traces, so that it becomes possible to group them later according to previously determined criteria of similarity. There currently exist various clustering techniques in process mining [6][3]. The majority of them principally consider information about the control-flow of activities. These techniques can be classified into two categories:
1) Techniques that transform the traces into a vector space, in which each trace is converted into a vector. The dividing of the log can be done using a variety of clustering techniques in the vector space, for example: Bag of activities, the K-gram model [6], and Trace clustering [5]. However, these techniques have the problem of lacking contextual information, which some have attempted to correct with the Trace clustering based on conserved patterns technique [3].
2) Techniques that operate with the whole trace. These techniques use distance metrics like Levenshtein and Generic Edit Distance [6], together with standard clustering techniques, assigning a cost to the difference between traces.
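A minimal sketch of both categories: a bag-of-activities vectorization (category 1) and the Levenshtein edit distance over whole traces (category 2). Activities are single characters here purely for brevity.

```python
from collections import Counter

def bag_of_activities(trace, alphabet):
    """Category 1: map a trace to an activity-count vector."""
    c = Counter(trace)
    return [c[a] for a in alphabet]

def levenshtein(s, t):
    """Category 2: edit distance between two whole traces (row-by-row DP)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (a != b)))  # substitution
        prev = cur
    return prev[-1]

alphabet = ["a", "b", "c"]
v = bag_of_activities("abca", alphabet)
d = levenshtein("abc", "abd")
```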

Figure 1. Preprocessing Stage of the Event Log.

Page 44: Novatica 223 - Process Mining (English Edition)

The existing techniques for both categories, despite improving clustering through the creation of structurally similar trace clusters, do not consider the temporal dimension of the process's execution, nor how the process changes over time.

2.2. The Concept Drift Challenge
In BPM, Concept Drift refers to a situation in which a process has experienced changes in its design within an analyzed period (yet the exact moment at which the changes were produced is unknown). These changes can be due to a variety of factors, but are mainly due to the dynamic nature of the processes [7].

The study of Concept Drift in the area of process mining has centered on process changes in the control-flow perspective, which can be of two kinds, momentary changes or permanent changes, depending on the duration of the change.

When changes occur over short and infrequent periods, they are considered momentary changes. These changes are also known in process jargon as process noise or process anomalies.

On the other hand, permanent changes occur over more prolonged periods of time and/or when a considerable number of instances is affected by the changes, which signals changes in the process design.

Our interest centers on the permanent changes in the control-flow perspective, which can be divided into the following four categories:
 Sudden Drift: This refers to drastic changes, meaning the way in which the process is executed changes suddenly from one moment to the next.
 Recurring Drift: When a process suffers periodic changes, meaning a way of performing the process is repeated again later.
 Gradual Drift: This refers to changes which are not drastic, but rather a moment when two versions of the process overlap, which corresponds to the transition from one version of the process to another.
 Incremental Drift: This is when a process has small incremental changes. These types of changes are more frequent in organizations that adopt agile BPM methodologies.

To solve the Concept Drift problem, new approaches have evolved to analyze the dynamic nature of the processes.

Bose [2] proposes methods to manage Concept Drift by showing how process changes are indirectly reflected in the event log and that the detection of these changes is feasible by examining the relationship between activities. Different metrics have been defined to measure the relationship between activities. Based on these metrics, a statistical method was proposed whose basic idea is to consider a successive series of values and investigate if a significant difference between two series exists. If it does, this would correspond to a process change.

Stocker [8] also proposes a method to manage Concept Drift which considers the distances between pairs of activities of different traces as a structural feature, in order to generate chronologically subsequent clusters.

Bose and Stocker's approaches are limited to determining the moment in time when the process changes, and thus center on sudden changes and leave out other types of changes.

To resolve this, in an earlier article [9] we proposed an approach that makes use of clustering techniques to discover the changes that a process can experience over time, but without limiting ourselves to one particular kind of change. In that approach, the similarity between two traces is defined by the control-flow information and by the moment at which each trace begins to operate.

In this article, we present an extension of the earlier work [9], after incorporating a new form of measuring the distance between two traces.

3. Extending Clustering Techniques to Incorporate the Temporal Variable
As was explained in the last section, the existing approaches for dealing with Concept Drift are not sufficiently effective at finding the versions of a process when the process has undergone different types of changes. To solve this problem, we look to the Trace Clustering technique proposed by Bose [3] and based on conserved patterns, which allows clustering the event log considering each trace's sequence of activities.

Our work is based on this technique and extends it by incorporating an additional temporal variable alongside the control-flow variables used for clustering.

3.1. Trace Clustering Based on Conserved Patterns
The basic idea proposed by Bose [3] is to consider subsequences of activities that repeat in multiple traces as feature sets for the implementation of clustering. Unlike the K-gram approach that considers subsequences of fixed size, in this approach the subsequences can be of different lengths. When two instances have a significant number of subsequences in common, it is assumed that they have structural similarity, and these instances are assigned to the same cluster.

There are six types of subsequences; here we will only give a formal definition of MR, since these are the subsequences that we used to develop our approach. However, the work could be extended to use the other subsequences.

 Maximal Repeat (MR): A Maximal Repeat in a sequence T is defined as a subsequence that occurs in a Maximal Pair in T. Intuitively, an MR corresponds to a subsequence of activities that is repeated more than once in the log.

Table 1 shows an example where the existing MRs in a sequence are determined. This technique constructs a unique sequence starting from the event log, which is obtained by concatenating all the traces, with a delimiter placed between them. Then, the MR definition is applied to this unique sequence. The set of all MRs discovered in the sequence with more than one activity is called a Feature Set.
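The MR definition can be checked on Table 1's toy sequence with a brute-force sketch (adequate only for short sequences; practical implementations use suffix trees). A repeated subsequence is taken as maximal when some pair of its occurrences differs both in the character to its left and in the character to its right, with sequence boundaries counting as differing.

```python
def maximal_repeats(sequence, delimiter="-"):
    """Naive maximal-repeat finder, illustrating the definition only."""
    n = len(sequence)
    repeats = set()
    for length in range(1, n):
        for i in range(n - length + 1):
            sub = sequence[i:i + length]
            if delimiter in sub or sub in repeats:
                continue  # skip cross-trace substrings and known repeats
            occs = [j for j in range(n - length + 1)
                    if sequence[j:j + length] == sub]
            if len(occs) < 2:
                continue
            for p in occs:
                for q in occs:
                    if p >= q:
                        continue
                    left_differs = p == 0 or sequence[p - 1] != sequence[q - 1]
                    right_differs = (q + length == n or
                                     sequence[p + length] != sequence[q + length])
                    if left_differs and right_differs:
                        repeats.add(sub)
    return repeats

seq = "bbbcd-bbbc-caa"          # the concatenated log from Table 1
mrs = maximal_repeats(seq)      # {a, b, c, bb, bbbc}
feature_set = {m for m in mrs if len(m) > 1}   # {bb, bbbc}
```

Running this reproduces Table 1: the MRs are {a, b, c, bb, bbbc} and the Feature Set keeps only {bb, bbbc}.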


Sequence Maximal Repeat Feature Set

bbbcd-bbbc-caa {a, b, c, bb, bbbc} {bb, bbbc}

Table 1. Example of Maximal Repeat and Feature Set.

Page 45: Novatica 223 - Process Mining (English Edition)


Based on the Feature Set, a matrix is created that allows the calculation of the distance between the different traces. Each row of the matrix corresponds to a trace and each column corresponds to a feature of the Feature Set. The values of the matrix correspond to the number of times that each feature is found in the various traces (see Table 2). We will call this matrix the Structural Features Matrix.
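Building the Structural Features Matrix then amounts to counting the (possibly overlapping) occurrences of each feature in each trace, reproducing Table 2:

```python
def count_overlapping(trace, feature):
    """Number of (possibly overlapping) occurrences of `feature` in `trace`."""
    return sum(1 for i in range(len(trace) - len(feature) + 1)
               if trace[i:i + len(feature)] == feature)

def structural_features_matrix(traces, feature_set):
    """One row per trace, one column per feature (cf. Table 2)."""
    return [[count_overlapping(t, f) for f in feature_set] for t in traces]

traces = ["bbbcd", "bbbc", "caa"]
features = ["bb", "bbbc"]
matrix = structural_features_matrix(traces, features)
```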

This pattern-based clustering approach uses "Agglomerative Hierarchical Clustering" as a clustering technique, with the minimum variance criterion [15], and uses the Euclidean distance to measure the difference between two traces, defined as follows:

dist(A, B) = sqrt( Σ_{i=1..n} (T_Ai − T_Bi)² )

where:

dist(A, B) = distance between trace A and trace B.
n = number of features in the Feature Set.
T_Ai = number of times feature i appears in trace A.
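Applied to the rows of Table 2's matrix, the distance works out as follows:

```python
from math import sqrt

def dist(a, b):
    """Euclidean distance between two feature-count rows."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Rows of the Structural Features Matrix from Table 2.
row_bbbcd, row_bbbc, row_caa = [2, 1], [2, 1], [0, 0]
d_same = dist(row_bbbcd, row_bbbc)   # identical feature counts
d_far = dist(row_bbbcd, row_caa)     # sqrt(2**2 + 1**2)
```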

3.2. Clustering Technique to Include the Temporal Variable
To identify the various types of changes that can occur in business processes we must find a way to identify all the versions of a process. If we only look at the structural features (control-flow) then we leave out information regarding temporality. Both temporal and structural features are very important, since the structure indicates how similar one instance is to another and the temporality shows how close in time the two instances are. Our approach looks to identify the different forms of implementing the process using both features (structural and temporal) at the same time, as is illustrated in Figure 2.

In order to mitigate the effects of external factors that are difficult to control, we use only the beginning of each process instance as the temporal variable.

For each trace, we store in the time dimension the time that has elapsed since a reference timestamp, for example, the number of days (or hours, minutes or seconds, depending on the process) elapsed from January 1st, 1970, to the timestamp at which the trace's first activity begins (see Table 3).
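For example, using Python's datetime arithmetic with the January 1st, 1970 reference mentioned above:

```python
from datetime import datetime

def temporal_feature(first_activity_ts, reference=datetime(1970, 1, 1)):
    """Whole days elapsed from the reference timestamp to the trace's
    first activity; the day granularity is just one possible choice."""
    return (first_activity_ts - reference).days

days = temporal_feature(datetime(1970, 1, 11, 8, 30))
```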

In this new approach, "Agglomerative Hierarchical Clustering" with the minimum variance criterion [15] is also used as the clustering technique.

To calculate the distance between two traces we use the Euclidean distance, but modified in order to consider the structural and temporal features at the same time.

First, we define T_Ji as the feature i of the trace J. If the feature i cannot be found in the trace J, its value will be 0; otherwise its value will be the number of times that the feature i appears in the trace J.

T_J(n+1) corresponds to the temporal feature of the trace J, and its value is the number of days (or hours, minutes or seconds, depending on the process) that have elapsed since a reference timestamp. The index (n+1) indicates that it is appended to the structural features.

We define L as the set of all the log traces. The expression max_{J∈L}(T_Ji) represents the largest number of times the feature i appears in an event log trace; in the same way, min_{J∈L}(T_Ji) corresponds to the smallest number of times the feature i appears in an event log trace.

min_{J∈L}(T_J(n+1)) and max_{J∈L}(T_J(n+1)) correspond to the earliest and latest times at which an event log trace begins. We also define D_E(A,B) and D_T(A,B) as the structural and temporal distance between the trace A and the trace B, respectively.

The basic idea proposed by Bose is to consider subsequences of activities that repeat in multiple traces as feature sets for the implementation of clustering.

Figure 2. Example of the Relevance of Considering Time in the Analysis.

Trace \ Feature set   bb   bbbc
bbbcd                  2      1
bbbc                   2      1
caa                    0      0

Table 2. Structural Features Matrix.

$$D_E(A,B) = \sqrt{\sum_{i=1}^{n}\left(\frac{T_{Ai}-T_{Bi}}{\max_{J\in L}(T_{Ji})-\min_{J\in L}(T_{Ji})}\right)^{2}}$$

$$D_T(A,B) = \sqrt{\left(\frac{T_{A(n+1)}-T_{B(n+1)}}{\max_{J\in L}(T_{J(n+1)})-\min_{J\in L}(T_{J(n+1)})}\right)^{2}}$$

where:
n = number of features from the Structural Features Set.

Even though both distances, D_E and D_T, are normalized, since the domain of D_E is greater than that of D_T, we define E_Min, E_Max, T_Min and T_Max as follows.
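A sketch of the normalized distances D_E and D_T described above (the function names are ours; skipping features whose range is zero is our own guard against division by zero, a detail the article does not discuss):

```python
import math

def normalized_structural_distance(row_a, row_b, col_max, col_min):
    """D_E: Euclidean distance with every feature scaled by its log-wide range."""
    total = 0.0
    for a, b, hi, lo in zip(row_a, row_b, col_max, col_min):
        if hi > lo:  # skip features that are constant across the whole log
            total += ((a - b) / (hi - lo)) ** 2
    return math.sqrt(total)

def normalized_temporal_distance(time_a, time_b, t_max, t_min):
    """D_T: start-time difference scaled by the log-wide time range."""
    return abs(time_a - time_b) / (t_max - t_min) if t_max > t_min else 0.0

d_e = normalized_structural_distance([2, 1], [0, 0], col_max=[2, 1], col_min=[0, 0])
d_t = normalized_temporal_distance(10.0, 0.0, t_max=20.0, t_min=0.0)  # 0.5
```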


Base approach: trace clustering based on conserved patterns.
Our approach: extended trace clustering, where the temporal dimension is incorporated. In this technique, the weight of the temporal dimension, µ, can be given different values, which vary between 0 and 1.

3) For each of the clusters a discovery process is carried out, generating two new models (M_A and M_B).

The approach's performance can be measured at two points:
a) Conformance 1: metrics are measured between any of the original models and one of the generated clusters.
b) Conformance 2: metrics are measured between any of the original models and one of the generated models.

The metrics used to measure Conformance 1 are the following:
Accuracy: indicates the number of instances correctly classified in each cluster, according to what is known about the original event logs. These values are between 0% and 100%, where 100% indicates that the clustering was exact.
Fitness: indicates how much of the behavior observed in an event log (for example, cluster C_A) is captured by the original process model (for example, model M_2) [12]. These values are between 0 and 1, where 1 means the model is capable of representing all the log traces.
Precision: quantifies whether the original model allows for behavior completely unrelated to what is seen in the event log. These values are between 0 and 1, where 1 means that the model does not allow behavior beyond what the traces indicate [13].

The metrics used to measure Conformance 2 are the following [14]:
Behavioral Precision (B_P)
Structural Precision (S_P)
Behavioral Recall (B_R)
Structural Recall (S_R)

These metrics quantify the precision and generality of one model with respect to another. The values of these four metrics are between 0 and 1, where 1 is the best possible value.

Table 4 summarizes the results of applying the base approach and the new approach (varying the value of the parameter µ) to the different synthetic event logs created. This table shows the accuracy metric, which indicates the percentage of correctly classified traces.

For each log, the highest accuracy is reached with our approach, but with different µ values (varying between 0.2 and 0.9). The accuracy varies with different µ values because in each log the relevance of the temporal dimension versus the structure of the process is not the same.

We use Log (f) to make a more in-depth analysis of the results, measuring all the metrics defined in both Conformance 1 and Conformance 2 (see Table 5).

All metrics calculated for Log (f) show good results when the parameter µ has a value of 0.5 or 0.6. When the five metrics are averaged, the highest overall average is obtained with µ equal to 0.5.

5. Conclusions and Future Work
In this article we present the limitations of current clustering techniques for process mining, which center on grouping structurally similar executions of a process. By focusing just on the structure of the executions, the process's evolution over time (Concept Drift) is left out. New techniques have been developed to address this, but they also present limitations, since they focus on finding the points at which the process changes, which covers just one kind of change, Sudden Drift.

We present an approach that extends current clustering techniques in order to find the different versions of a process that changes over time (in multiple ways, i.e., different types of Concept Drift), allowing for a better understanding of the variations that occur in the process and how, in practice, it is truly being performed over time.

Our work focuses on the identification of models associated with each version of the process. The technique we propose is a tool that helps business managers to make decisions. For example, it can help determine whether the changes produced in the implementation of the process are really those that are expected, and, based on this, take the proper action if abnormal behaviors are discovered. Also, by understanding and comparing the different versions of a process, good and bad practices can be identified, which are highly useful when the time comes to improve or standardize the process.

In this document we present a set of metrics to measure the performance of our approach, i.e., whether the approach is able to cluster data in the same way the data was created. Thus, these metrics require a priori process information, which is not feasible for real cases.

Each metric measures a different aspect; when used together, they allow for a multi-

E_Min and E_Max correspond to the minimum and maximum (normalized) distance between all traces, considering only structural features.

T_Min and T_Max correspond to the minimum and maximum (normalized) distance between all traces, considering only temporal features.

The new way of measuring the distance between two traces, dist(A, B), incorporates the parameter µ, which we will call the weight of the temporal dimension, and which serves to weigh the structural and temporal features. Additionally, this new way of measuring the distance rescales D_E and D_T in such a way that the weights of both distances are equivalent.

$$dist(A,B) = (1-\mu)\,\frac{D_E(A,B)-E_{Min}}{E_{Max}-E_{Min}} + \mu\,\frac{D_T(A,B)-T_{Min}}{T_{Max}-T_{Min}}$$

The weight of the temporal dimension, µ, can have values between 0 and 1, according to the relevance given to the temporal feature.
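The combined distance can be sketched as a weighted sum of the two range-rescaled terms (the function name and guards against zero-width ranges are ours):

```python
def combined_distance(d_e, d_t, mu, e_min, e_max, t_min, t_max):
    """dist(A,B): (1 - mu) times the rescaled structural distance plus
    mu times the rescaled temporal distance."""
    e_term = (d_e - e_min) / (e_max - e_min) if e_max > e_min else 0.0
    t_term = (d_t - t_min) / (t_max - t_min) if t_max > t_min else 0.0
    return (1 - mu) * e_term + mu * t_term

# With mu = 0.5 both dimensions contribute equally:
d = combined_distance(d_e=0.8, d_t=0.3, mu=0.5,
                      e_min=0.0, e_max=1.0, t_min=0.0, t_max=1.0)
# d == 0.55
```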

4. Evaluation
We analyzed the proposed technique using six event logs obtained from different synthetic processes, which were created with CPN Tools [10][11]. In order to measure their performance we used the Guide Tree Miner plug-in [3] available in ProM 6.1, as well as a modified version of this plug-in that incorporates the proposed changes.

The evaluation was carried out using different metrics to measure the new approach's classification effectiveness versus the base approach.

4.1. Experiments and Results
Figure 3 shows the sequence of steps performed in the experiments.

1) To create the synthetic log, a simulation was used based on two designed models, M1 and M2, using CPN Tools.
2) The method starts by applying the clustering technique on the synthetic log received, which generates a given number of clusters; in this case, two clusters (C_A and C_B). Two clustering techniques are applied:
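Step 2 relies on agglomerative hierarchical clustering with Ward's minimum-variance criterion [15]. A compact pure-Python sketch using the Lance-Williams update for Ward linkage (the implementation and example data are ours, not the Guide Tree Miner plug-in):

```python
from itertools import combinations

def ward_clusters(points, k):
    """Agglomerative clustering with Ward's minimum-variance criterion:
    repeatedly merge the pair of clusters with the smallest Ward distance
    (tracked as squared distances via the Lance-Williams update) until
    k clusters remain."""
    clusters = {i: [i] for i in range(len(points))}   # cluster id -> members
    sizes = {i: 1 for i in range(len(points))}
    d2 = {}                                           # squared inter-cluster distances
    for i, j in combinations(range(len(points)), 2):
        d2[(i, j)] = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))

    def get(i, j):
        return d2[(i, j)] if (i, j) in d2 else d2[(j, i)]

    next_id = len(points)
    while len(clusters) > k:
        i, j = min(combinations(clusters, 2), key=lambda p: get(*p))
        ni, nj, dij = sizes[i], sizes[j], get(i, j)
        merged = clusters.pop(i) + clusters.pop(j)
        for m in clusters:                            # Lance-Williams (Ward)
            nm = sizes[m]
            d2[(next_id, m)] = ((ni + nm) * get(i, m) + (nj + nm) * get(j, m)
                                - nm * dij) / (ni + nj + nm)
        clusters[next_id] = merged
        sizes[next_id] = ni + nj
        next_id += 1
    return sorted(sorted(c) for c in clusters.values())

# Two well-separated groups of feature vectors:
groups = ward_clusters([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
# groups == [[0, 1], [2, 3]]
```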



Figure 3. Steps for Performing the Experimental Tests.

Table 4. Accuracy Metric Calculated for the Six Synthetic Test Logs.

Approach  µ    Log (a)  Log (b)  Log (c)  Log (d)  Log (e)  Log (f)
Base      -    57%      38%      55%      86%      99%      53%
New       0.0  59%      52%      49%      82%      59%      51%
New       0.1  65%      52%      49%      82%      59%      68%
New       0.2  87%      52%      63%      100%     59%      64%
New       0.3  87%      52%      63%      100%     59%      62%
New       0.4  95%      52%      63%      100%     59%      63%
New       0.5  100%     52%      63%      100%     59%      95%
New       0.6  100%     52%      73%      100%     100%     94%
New       0.7  96%      89%      73%      100%     100%     67%
New       0.8  78%      88%      73%      74%      84%      55%
New       0.9  79%      77%      77%      68%      77%      78%
New       1.0  81%      78%      68%      73%      72%      55%


faceted vision that makes the analysis more complete.

One key aspect of our clustering approach is the value given to the weight of the temporal dimension, parameter µ, which is closely related to the nature of the process. High µ values give greater importance to time when carrying out the clustering, whereas low µ values give more importance to the structural features of the process.

The experimental results show that the approach proposed in this document performs better, and that there exists at least one value of the parameter µ that gives better results in comparison to using only the structural clustering technique (trace clustering based on patterns). This is achieved because our approach is capable of grouping the log traces in such a way as to identify structural similarity and temporal proximity at the same time.

One of the metrics used is accuracy. In some experiments, this metric reached 100%, meaning all traces were classified correctly. When 100% accuracy was not reached, it was because there are processes whose different versions are similar amongst themselves; therefore there are traces that can correspond to more than one version of the process, which makes the classification not exactly match what is expected.

Our future work in this line of investigation is to test the new approach with real processes. We also want to work on developing the existing algorithms so that they are capable of automatically determining the optimal number of clusters. In order to do so, it will be necessary to define new metrics that will allow us to calculate the optimal number of clusters without a priori information about the process versions.

Table 5. Different Metrics for Analyzing Log (f).

Approach  µ    Accuracy  Fitness  Precision  Average  Average  Overall Average
Base      -    53%       0.93     0.78       0.77     0.81     0.76
New       0.0  51%       0.93     0.73       0.73     0.70     0.72
New       0.1  68%       0.93     0.81       0.86     0.86     0.83
New       0.2  64%       0.92     0.80       0.83     0.82     0.80
New       0.3  62%       0.92     0.80       0.80     0.78     0.78
New       0.4  63%       0.92     0.79       0.83     0.86     0.81
New       0.5  95%       0.95     0.85       0.97     0.96     0.94
New       0.6  94%       0.95     0.86       0.95     0.94     0.93
New       0.7  67%       0.92     0.84       0.84     0.89     0.83
New       0.8  55%       0.94     0.87       0.88     0.94     0.84
New       0.9  78%       0.94     0.81       0.89     0.95     0.87
New       1.0  55%       0.93     0.87       0.88     0.95     0.84


References

[1] W. van der Aalst. Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer, 2011. ISBN 978-3-642-19345-3.
[2] R.J. Bose, W. van der Aalst, I. Zliobaite, M. Pechenizkiy. Handling Concept Drift in Process Mining. 23rd International Conference on Advanced Information Systems Engineering. London, 2011.
[3] R.J. Bose, W. van der Aalst. Trace Clustering Based on Conserved Patterns: Towards Achieving Better Process Models. Business Process Management Workshops, pp. 170-181. Berlin: Springer Heidelberg, 2010.
[4] W. van der Aalst, B. van Dongen, J. Herbst, L. Maruster, G. Schimm, A. Weijters. Data & Knowledge Engineering, pp. 237-267, 2003.
[5] M. Song, C. Günther, W. van der Aalst. Trace Clustering in Process Mining. 4th Workshop on Business Process Intelligence (BPI 08), pp. 109-120. Milano, 2009.
[6] R.J. Bose, W. van der Aalst. Context Aware Trace Clustering: Towards Improving Process Mining Results. SIAM, pp. 401-412, 2009.
[7] W. van der Aalst, A. Adriansyah, A.K. Alves de Medeiros, F. Arcieri, T. Baier, T. Blickle et al. Process Mining Manifesto, 2011.
[8] T. Stocker. Time-based Trace Clustering for Evolution-aware Security Audits. Proceedings of the BPM Workshop on Workflow Security Audit and Certification, pp. 471-476. Clermont-Ferrand, 2011.
[9] D. Luengo, M. Sepúlveda. Applying Clustering in Process Mining to Find Different Versions of a Business Process that Changes over Time. Lecture Notes in Business Information Processing, pp. 153-158, 2011.
[10] A.V. Ratzer, L. Wells, H.M. Lassen, M. Laursen, J. Frank, M.S. Stissing et al. CPN Tools for Editing, Simulating, and Analysing Coloured Petri Nets. Proceedings of the 24th International Conference on Applications and Theory of Petri Nets, pp. 450-462. Eindhoven: Springer-Verlag, 2003.
[11] A.K. Alves de Medeiros, C. Günther. Process Mining: Using CPN Tools to Create Test Logs for Mining Algorithms. Proceedings of the Sixth Workshop and Tutorial on Practical Use of Coloured Petri Nets and the CPN Tools, pp. 177-190, 2005.
[12] A. Rozinat, W. van der Aalst. Conformance Testing: Measuring the Fit and Appropriateness of Event Logs and Process Models. Business Process Management Workshops, pp. 163-176, 2006.
[13] J. Muñoz-Gama, J. Carmona. A Fresh Look at Precision in Process Conformance. Proceedings of the 8th International Conference on Business Process Management (BPM'10), pp. 211-226, 2010.
[14] A. Rozinat, A.K. Alves de Medeiros, C. Günther, A. Weijters, W. van der Aalst. Towards an Evaluation Framework for Process Mining Algorithms. Genetics, 2007.
[15] J. Ward. Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association, pp. 236-244, 1963.


CLEI --- www.clei.org --- is a non-profit association with almost 40 years of existence, bringing together more than 100 Latin American universities, research centers and professional associations, plus some from outside the region (Spanish and US), interested in promoting research and teaching in Informatics.

Every year CLEI issues a call for papers reporting original research results and/or experiences for its Latin American Conference on Informatics (CLEI). The conference, typically held in October, gathers approximately 1,000 participants, and its symposia receive more than 500 papers in Spanish, English and Portuguese, of which 33% are accepted.

Since 2012, the CLEI proceedings have been published in IEEE Xplore. In addition, special issues of the CLEI Electronic Journal are devoted to selected papers from this conference.

CLEI Symposia

Simposio Latinoamericano de Ingeniería del Software
Simposio Latinoamericano de Informática y Sociedad
Simposio Latinoamericano de Investigación de Operaciones e Inteligencia Artificial
Simposio Latinoamericano de Infraestructura, Hardware y Software
Simposio Latinoamericano de Sistemas Innovadores de Datos
Simposio Latinoamericano de Teoría Computacional
Simposio Latinoamericano de Computación Gráfica, Realidad Virtual y Procesamiento de Imágenes
Simposio Latinoamericano de Sistemas de Información de Gran Escala

Associated Events (held in parallel, annually or biennially)

Congreso Iberoamericano de Educación Superior en Computación (CIESC)
Concurso Latinoamericano de Tesis de Maestría (CLTM)
Congreso de la Mujer Latinoamericana en la Computación (LAWCC)
Simposio de Historia de la Informática en América Latina y el Caribe (SHIALC)
Latin America Networking Conference (LANC)
Workshop en Nomenclatura y Acreditación en Programas de Computación