
Introduction to data mining

Lecture 1: prof. dr. Nada Lavrač

[ It is a great honor that I am the first speaker at the start of the programme 'New media and E-science' of the Jozef Stefan Institute Postgraduate School. In this classroom we have students of the 'New media and E-science' programme as well as some students from the statistics programme, which is held at the University of Ljubljana and was established jointly by several faculties. As far as I know, we have three or four students from there. This is my list of people who were supposed to be here, some invited people and some students of the 'New media and E-science' programme. Could these people raise their hands if they are here? ]

Part I. Introduction

Data Mining and the KDD process

• Why DM: Examples of discovered patterns and applications

• Classification of DM tasks and techniques

• Visualization and overview of DM tools

To get a quick grasp on what data mining (DM) is about, we first take a »bird's eye« view of DM and the KDD process, to see what DM borrows from other areas of computer science. Then we look at some examples of patterns that were discovered in data, introduce the techniques that were used to »mine« that data, and look at applications. We will try to provide a useful classification of data mining tasks and techniques, and show some mechanisms for visualising both the data and the results of data mining.

What is DM

• Extraction of useful information from data: discovering relationships that have not previously been known

• The viewpoint in this course: Data Mining is the application of Machine Learning techniques to “hard” real-life problems

Data mining can be simply defined as the extraction of useful information from data, trying to discover relationships that have been unknown until this discovery process occurred. More specifically, the point of view in this course is that data mining is the application of machine learning techniques to hard, real-life problems.



If we take just the two expressions, »data mining« here and »machine learning« there, the two areas basically have a large overlap. The expression »machine learning« (ML) comes from the area of artificial intelligence: it is the area of artificial intelligence which deals with learning from data, in the sense of extracting useful knowledge out of the data. In comparison with machine learning, data mining comes more from the application-oriented perspective, and is frequently concerned with exploring huge amounts of available data. This data is then examined (»mined«) for the presence of interesting information and knowledge which may be implicit in it. So, the difference between ML and DM is largely a question of emphasis, and in this course we will not make a very strong distinction between the two.

As the term suggests, machine learning was inspired by the idea that a machine would learn in a way similar to the way a human learns. One characteristic of human learning is that with more and more experience and information, you know more and more and get better at solving new problems that you encounter. The established term from artificial intelligence is machine learning: there are European machine learning conferences and also international machine learning conferences, and the whole area has been around for thirty years or so. The term »data mining« was coined much later; it is perhaps ten years old. It is much more attractive for commercial purposes and has therefore found its way into science too. Besides machine learning, DM has an extensive interaction with many other areas of science and practical information technologies.

Related Areas

Database technology

and data warehouses

• efficient storage, access and manipulation of data

DM

statistics

machinelearning

visualization

databases

text and Web mining

softcomputing pattern

recognition

We see data mining as having a large interaction with many other areas of science and practical information technologies. Of course, to mine data you have to have data: since we are inducing knowledge from data, we have to have some real data available in the first place. This is an obvious prerequisite for data mining, and it connects to areas such as database technology and data warehouses, which deal with the problem of efficient storage, access and manipulation of data.


Related Areas

Statistics, machine learning, pattern recognition and soft computing*

• classification techniques and techniques for knowledge extraction from data

[Diagram: DM at the centre, surrounded by statistics, machine learning, visualization, databases, text and Web mining, soft computing, pattern recognition]

* neural networks, fuzzy logic, genetic algorithms, probabilistic reasoning

In general, there are various techniques for dealing with data, depending on the kind of data. We will see that some techniques deal with simple data in a single relational table, some techniques deal with multirelational databases, some techniques deal with unstructured data in textual form, and some with images, and so on.


Related Areas

Text and Web mining

• Web page analysis
• text categorization
• acquisition, filtering and structuring of textual information
• natural language processing

[Diagram: DM at the centre, surrounded by statistics, machine learning, visualization, databases, text and Web mining, soft computing, pattern recognition]

One related area that may not be readily intelligible here is soft computing, by which we mean neural networks, fuzzy logic, genetic algorithms, probabilistic reasoning and similar methods.

On genetic algorithms there is a special course, given by our colleague Bogdan Filipič; it covers optimisation methods with genetic algorithms and is also a part of the 'New media and E-science' programme.


Related Areas

Visualization

• visualization of data and discovered knowledge

[Diagram: DM at the centre, surrounded by statistics, machine learning, visualization, databases, text and Web mining, soft computing, pattern recognition]

The bordering area of text and Web mining has to do with web page analysis, text categorization, and the acquisition, filtering and structuring of textual information. These tasks all tend to have a strong need for natural language processing, which is also an independent area of information processing,


and this is a prerequisite for successful text and Web mining. As to visualisation, we will explore several ways of visualising both the data and the knowledge discovered through data mining.


Point of view in this tutorial

Knowledge discovery using machine learning methods

Relation with statistics

[Diagram: KDD at the centre, surrounded by statistics, machine learning, visualization, databases, text and Web mining, soft computing, pattern recognition]

Our own expertise and research is largely located in the neighbouring area of machine learning (as I explained before, this is an expression coming from artificial intelligence), and we are also very interested in the relation between machine learning and statistics. If time allows, I will show some slides on that relation later on. There is obviously a very strong connection between all three: machine learning, data mining and statistics all have to do with inductive techniques for data analysis. You start from the data and then induce information or knowledge out of it. This induction involves reasoning that goes from properties of a data sample to properties which hold for a population of individuals.

Machine Learning and Statistics

• Both areas have a long tradition of developing inductive techniques for data analysis
  – reasoning from properties of a data sample to properties of a population

• KDD = statistics + marketing? No!

• KDD = statistics + ... + machine learning

• Statistics is particularly appropriate for hypothesis testing and data analysis when certain theoretical expectations about the data distribution, independence, random sampling, etc. are satisfied

• ML is particularly appropriate when requiring generalizations that consist of easily understandable patterns, induced both from small and large data samples

The relation between data mining or machine learning and statistics perhaps requires some additional comments. Some people might well ask: »Why do we need data mining and knowledge discovery in databases? Isn't this just statistics plus some good marketing around it?« Our answer is that much has been done in machine learning quite separately from statistics. As you will see, the representation formalisms in which the induced knowledge is represented are rather different, and sometimes the emphasis is different too. If we summarize very roughly: statistics is particularly appropriate for hypothesis testing and data analysis when certain theoretical expectations about data distribution, independence, random sampling and so on are satisfied. That is, statistics has strong requirements about what the data should be like if reasonable generalizations are to be induced which hold for a population.


Obviously, you should start with a sufficiently large data sample. If the sample is not large enough, statisticians will not touch it and will just say: »we can't do anything«. They have strong reasons for saying that, of course.

Machine learning people are much more daring: occasionally, we start with very few data and dare to say something about what is hidden in that data. A particularly interesting example of this is microarray data analysis in bioinformatics, where getting the samples is extremely expensive and obtaining another sample is a laborious process. It turns out that even with not that much data available, you can induce some interesting pieces of knowledge which can be useful for scientific purposes. So, with machine learning we occasionally dare to induce knowledge from less data than a statistician would demand. Furthermore, our main aim is to induce knowledge in a form which is understandable to humans; that is one of the strong points of machine learning and data mining. There is also another pair of expressions that sound similar but need to be distinguished: what is the relation between »data mining« and »knowledge discovery in databases«?

Data Mining and KDD

• Data Mining (DM) is a way of doing data analysis, aimed at finding patterns, revealing hidden regularities and relationships in the data.

• Knowledge Discovery in Databases (KDD) provides a broader view: it provides tools to automate the entire process of data analysis, including the statistician's art of hypothesis selection.

• DM is the key element in this much more elaborate KDD process.

• KDD is defined as "the process of identifying valid, novel, potentially useful and ultimately understandable patterns in data." *

* Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth: The KDD Process for Extracting Useful Knowledge from Volumes of Data. Comm. ACM, Nov. 1996, Vol. 39, No. 11

Data mining, like machine learning, is aimed at finding patterns, revealing hidden regularities and relationships in the data, whereas knowledge discovery in databases is seen as a broader process that includes data »cleaning« and data selection as well as the entire process of data analysis, including the statistician's art of hypothesis selection. We view data mining as only one of the elements in this process of knowledge discovery.


KDD Process

KDD Process: overall process of discovering useful knowledge from data

• KDD process involves several phases:
  – data preparation
  – data analysis (data mining, machine learning, statistics)
  – evaluation and use of discovered patterns

• Data analysis/data mining is the key phase, only 15%-25% of the entire KDD process

The KDD process starts from the data: we select the target data for analysis, preprocess it in an appropriate way to eliminate missing data, noisy data, irregularities and so on, and transform the data into a form appropriate for the particular data mining tools we have selected. These tools then produce the patterns or models induced from the data, and finally there is an evaluation step that hopefully leads to some useful piece of knowledge. The whole process is iterative and interactive: we often return to earlier steps, trying to improve both the data itself and the data analysis, before doing the final selection and evaluation.

Interestingly enough, although data mining is the key phase in KDD, it makes up only a minor part of the entire knowledge discovery process. In practice, it turns out that the phase of data cleaning and preparation is often much more laborious than the data mining phase itself.
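To make the phases concrete, here is a minimal, self-contained Python sketch of the iterative KDD loop just described. Everything in it is invented for illustration: the helper names (select, preprocess, mine, evaluate), the toy records, and the trivial majority-class »pattern« are hypothetical stand-ins, not a real DM tool.

```python
def select(raw):
    """1. Selection: pick the target data for the analysis."""
    return [r for r in raw if r["year"] == 1998]

def preprocess(rows):
    """2. Preprocessing: eliminate records with missing values."""
    return [r for r in rows if r["reads"] is not None]

def mine(rows):
    """3.-4. Transformation and mining: here, a trivial majority class."""
    answers = [r["reads"] for r in rows]
    return max(set(answers), key=answers.count)

def evaluate(pattern, rows):
    """5. Evaluation: keep the pattern only if it covers most records."""
    coverage = sum(r["reads"] == pattern for r in rows) / len(rows)
    return pattern if coverage > 0.6 else None

raw = [{"year": 1998, "reads": "yes"}, {"year": 1998, "reads": "yes"},
       {"year": 1998, "reads": None},  {"year": 1997, "reads": "no"}]

data = preprocess(select(raw))
result = evaluate(mine(data), data)  # None would mean: go back and iterate
print(result)                        # -> yes
```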

Part I. Introduction

• Data Mining and the KDD process

• Why DM: Examples of discovered patterns and applications

• Classification of DM tasks and techniques

• Visualization and overview of DM tools

Let us now take a closer look at some typical results of data mining: discovered patterns and applications. The examples will come from a large project we coordinated at Jozef Stefan Institute between 2000 and 2003.


The SolEuNet Project

• European 5FP project "Data Mining and Decision Support for Business Competitiveness: A European Virtual Enterprise", 2000-2003

• Scientific coordinator IJS, administrative FhG

• 3 MEuro, 12 partners (8 academic and 4 business) from 7 countries

• main project objectives:
  – development of prototype solutions for end-users
  – foundation of a virtual enterprise for marketing DM and DS expertise, involving business and academia

The project was named »Data Mining and Decision Support for Business Competitiveness: A European Virtual Enterprise«, and we coordinated its scientific side. It was a fairly large, three million euro project, with twelve partners altogether: eight academic and four business partners from seven European countries. The idea of the project was to use data mining (DM) and decision support (DS) technologies for building useful solutions for end users, developed up to the prototype stage. We also founded a virtual enterprise for marketing data mining and decision support expertise, involving both business and academia. The virtual enterprise model was developed to see how different partners across Europe would work collaboratively on data mining and decision support tasks. We developed various ideas and mechanisms for collaborative work in data mining projects, and a number of application prototypes.

Developed Data Mining application prototypes

• Mediana – analysis of media research data
• Kline & Kline – improved brand name recognition
• Australian financial house – customer quality evaluation, stock market prediction
• Czech health farm – predict the use of resources
• UK County Council – analysis of traffic accident data
• INE Portuguese statistical bureau – Web page access analysis for better INE Web page organization
• Coronary heart disease risk group detection
• Online Dating – understanding email dating promiscuity
• EC Harris – analysis of building construction projects
• European Commission – analysis of 5th Framework IST projects: better understanding of large amounts of text documents, and "clique" identification

To illustrate some data analysis techniques, let us look in more detail at some of these applications. There were a number of prototype developments within this project, starting from the analysis of media research data, where our client was the Mediana company. We worked for Kline & Kline, another marketing company, trying to improve brand name recognition. We worked on the data of an Australian financial house, doing quality evaluation of customers and stock market prediction. For a Czech health farm we were predicting the use of resources, like people going to the swimming pool or having a massage, and how the resources would be used each week by the new people coming to the health farm. Then we worked for the UK County Council, and I will illustrate the KDD process through this application: the analysis of data on UK traffic accidents which occurred over the last 20 years. We also did an application for the Portuguese statistical bureau on web page access analysis; the idea was that, based on an understanding of how people browse the web pages of the statistical bureau, we could learn how to improve the web page organization. We developed an application for the Zagreb Medical Centre on coronary heart disease risk group prediction and detection. We had an interesting application for a small English company doing online dating, where girls and boys try to meet their future husband or wife through the internet; the idea was: can we propose better dates based on the profiles of the people? That was quite an interesting application. There was an application for building construction projects, and a very interesting application for the European Commission, in which we analysed all the IST projects funded by the EC in the previous framework programme. The idea was to try to understand better where the large amount of money went; we used text mining and web mining to analyse this data, and I will illustrate this application as well.

MEDIANA - KDD process

• Questionnaires about journal/magazine reading, watching of TV programs and listening of radio programs, since 1992, about 1200 questions. Yearly publication: frequency of reading/listening/watching, distribution w.r.t. Sex, Age, Education, Buying power,..

• Data for 1998, about 8000 questionnaires, covering lifestyle, spare time activities, personal viewpoints, reading/listening/watching of media (yes/no/how much), interest for specific topics in media, social status

• good quality, "clean" data

• table of n-tuples (rows: individuals, columns: attributes, in classification tasks: selected class)

The Mediana case study might be especially interesting for those of you familiar with statistics; you are very well aware of analysing questionnaires with statistical methods. Our starting point were questionnaires about people's habits of reading journals and magazines, watching TV programmes and listening to radio programmes. For a number of years, Mediana has been collecting and publishing such information. These were extensive questionnaires, with about 1200 questions, though obviously not all the questions were actually posed to the people: some of the questions were derived, so by answering a certain question you automatically answered a couple of others as well. Still, it was quite a laborious process: students would go to homes, asking people about their habits of consuming media offers, and the results of analysing the responses were then published in a yearly publication. The analysis was mainly basic statistics about the frequency of reading, listening to and watching various programmes, and distributions with respect to sex, age, education, buying power and so on. We tried to do an alternative analysis. For this we obtained about 8000 questionnaires with data for the year 1998 about reading, listening and watching media contents, including information about lifestyle, spare time activities, personal viewpoints and so on.


The quality of the data was very good: it was collected by professionals and cleaned before we got it. Mediana itself cleaned the data, so we didn't have to bother with noise elimination, missing data and things like that. Basically, we got a large table of n-tuples, where each row corresponds to one person who was answering the questions, one individual, and the columns were the attributes, about lifestyle, spare time activities and so on.

MEDIANA - Pilot study

• Patterns uncovering regularities concerning:
  – Which other journals/magazines are read by readers of a particular journal/magazine?
  – What are the properties of individuals that are consumers of a particular media offer?
  – Which properties are distinctive for readers of different journals?

• Induced models: description (association rules, clusters) and classification (decision trees, classification rules)

(For classification purposes we needed to select one attribute as the target attribute for the classification.) After this preprocessing was done, we set out to mine the data, and here we were free to choose which data mining tasks to perform, because the client, Mediana, did not give us any specific requests in advance. So, we decided to answer certain »natural« questions, such as: which magazines and journals are read by the readers of a particular journal, what are the properties of the consumers of a particular media offer, and which properties are distinctive for readers of different journals. We induced several models for such questions, descriptive as well as classification models, and I will now go through some of them to illustrate the formalisms in which they were expressed. One such representation formalism, used in machine learning and data mining, is the so-called decision tree.

Decision trees

Finding reader profiles: decision tree for classifying people into readers and non-readers of a teenage magazine.


In decision trees, you have attributes in the internal nodes of the tree, in this case things such as age, gender, visiting disco clubs, or being interested in astrology. On the branches or arcs of the tree you have the values of these attributes, either numeric or discrete, such as gender (male or female) or interested in astrology (yes or no). In the leaves of the decision tree you have the corresponding class, readers or non-readers; in this case we had a binary classification problem, distinguishing readers from non-readers of a particular magazine.

The magazine in question was a teenage magazine called Antena, and the most informative attribute in this decision tree, which was automatically induced from the questionnaires, was age. If people were older than 25, they didn't read the magazine, which obviously agreed with our initial expectations. If the people were younger, the reading habits depended on whether they visited disco clubs; if they didn't visit disco clubs, on whether they were interested in music, astrology, travel and scandals. If they were interested in these things, they tended to read the magazine, otherwise not.

This looks like a very categorical decision scheme, but a leaf in the decision tree does not mean that every person with the properties on the corresponding branch indeed reads Antena, only that the majority of the people with those properties read it. In each leaf we would look at the distribution of people who read and don't read the magazine, and if the distribution was strongly in favour of one group, we would label the leaf with the corresponding class. We could continue building the tree by further splitting these nodes into subnodes, which would give us so-called pure leaves, containing people of only one class. We will see later in today's lecture that sometimes it is better to stop building such a decision tree before we come to pure leaves, because such »unfinished« decision trees can be more accurate classifiers when classifying new, unseen examples.
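As an illustration of the formalism (not the tree actually induced in the Mediana study), here is a minimal sketch of inducing and printing such a tree with scikit-learn; the tiny dataset and its attribute names (age, visits_disco, likes_astrology, reads_Antena) are invented to mimic the Antena example.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "age":             [16, 17, 19, 23, 30, 41, 15, 55],
    "visits_disco":    [1, 1, 0, 0, 0, 0, 1, 0],      # 1 = yes, 0 = no
    "likes_astrology": [0, 1, 1, 0, 1, 0, 1, 0],
    "reads_Antena":    ["yes", "yes", "yes", "no", "no", "no", "yes", "no"],
})

X, y = data.drop(columns="reads_Antena"), data["reads_Antena"]
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Internal nodes test attributes, branches carry attribute values, and
# leaves carry the majority class -- the structure described above.
print(export_text(tree, feature_names=list(X.columns)))
```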

Classification rules

Set of Rules: if Cond then Class

Interpretation: if-then ruleset, or if-then-else decision list

Class: Reading of daily newspaper EN (Evening News)

if a person does not read MM (Maribor Magazine) and rarely reads the weekly magazine "7Days"

then the person does not read EN (Evening News)

else if a person rarely reads MM and does not read the weekly magazine SN (Sunday News)

then the person reads EN

else if a person rarely reads MM

then the person does not read EN

else the person reads EN.

Another formalism for induced knowledge are classification rules. Such rules have the form »if condition then class«, meaning that if the condition is fulfilled, the example in question belongs to that class. We induce a set of rules, called an if-then ruleset or an if-then-else decision list. For the Mediana case, we were interested for instance in the question: which people read the Evening News (in Slovenian, the newspaper 'Večer', published in Maribor)? What are the properties of people who read Večer? Based on the other reading habits of these people, considering all the magazines and daily newspapers, rules like the following were induced: if a person doesn't read the Maribor Magazine and rarely reads the weekly magazine »7Days«, then the person doesn't read the Evening News; if a person rarely reads the Maribor Magazine and doesn't read the weekly magazine Sunday News (Nedeljski dnevnik), then this person reads the Evening News; and so on. Such if-then rules can be induced automatically from the data.
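The decision list on the slide can be read directly as an if-then-else chain. Here is a minimal sketch in Python, where the boolean parameters are hypothetical encodings of the survey answers:

```python
def reads_evening_news(reads_MM: bool, rarely_reads_MM: bool,
                       rarely_reads_7Days: bool, reads_SN: bool) -> bool:
    # Rules are tried top to bottom; the first condition that fires decides.
    if not reads_MM and rarely_reads_7Days:
        return False              # does not read EN
    elif rarely_reads_MM and not reads_SN:
        return True               # reads EN
    elif rarely_reads_MM:
        return False              # does not read EN
    else:
        return True               # default rule: reads EN

print(reads_evening_news(reads_MM=False, rarely_reads_MM=False,
                         rarely_reads_7Days=True, reads_SN=False))  # False
```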

Association rules

Rules X => Y, where X and Y are conjunctions of binary attributes

• Support: Sup(X,Y) = #XY/#D = p(XY)

• Confidence: Conf(X,Y) = #XY/#X = p(XY)/p(X) = p(Y|X)

Task: Find all association rules that satisfy minimum support and minimum confidence constraints.

Example association rule about readers of yellow press daily newspaper SloN (Slovenian News):

read_Love_Stories_Magazine => read_SloN

sup = 3.5% (3.5% of the whole dataset population reads both LSM and SloN)

conf = 61% (61% of those reading LSM also read SloN)

Another knowledge format are association rules. Again we have an »if« part and a »then« part, but the rule is not treated as a classifier: it expresses an association between the items on the left-hand side and the items on the right-hand side. The quality of such a rule is not evaluated by its classification properties, as was previously the case with decision trees and classification rules, where the main aim was a good classifier that distinguishes between the classes (does the person read the Evening News or not; does somebody read Antena or not) and we wanted to optimize the classification accuracy of the induced model. With association rules we are not optimizing classification accuracy; our goal is not even to distinguish between classes, but just to find some interesting associations between the items. An association rule X => Y is qualified by its support and confidence, where support is the probability that X and Y hold jointly, and confidence is the conditional probability of Y given X. We compute the probability of something by counting how many times it happens and dividing by the number of all instances. So, the support of X => Y is the number of examples for which both X and Y are true at the same time, divided by the size of the entire data set. The confidence of X => Y is the support of X => Y divided by the support of X itself. Instead of probabilities, we can also talk about percentages.

Let's look at an example association rule: if somebody reads the Love Stories Magazine, then this person reads Slovenian News (Slovenske novice). The support is 3.5%, which means that 3.5% of the whole population reads both the Love Stories Magazine and Slovenian News. The confidence is 61%, which means that 61% of those reading the Love Stories Magazine also read Slovenian News. That is an interesting piece of information.
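Following the definitions on the slide, here is a minimal sketch of computing support and confidence over a table of boolean answers; the five toy rows are invented, so the numbers differ from the real 3.5%/61%.

```python
def support(rows, items):
    """Sup: fraction of all rows in which every item in `items` is true."""
    hits = sum(all(row[i] for i in items) for row in rows)
    return hits / len(rows)

def confidence(rows, x_items, y_items):
    """Conf(X,Y) = Sup(X,Y) / Sup(X) = p(Y|X)."""
    return support(rows, x_items + y_items) / support(rows, x_items)

rows = [
    {"reads_LSM": True,  "reads_SloN": True},
    {"reads_LSM": True,  "reads_SloN": False},
    {"reads_LSM": False, "reads_SloN": True},
    {"reads_LSM": False, "reads_SloN": False},
    {"reads_LSM": True,  "reads_SloN": True},
]
print(support(rows, ["reads_LSM", "reads_SloN"]))       # 0.4
print(confidence(rows, ["reads_LSM"], ["reads_SloN"]))  # ~0.67
```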

In this example we chose a simple form of association rules, with just one item in the condition and one item in the conclusion, but in general there can be a conjunction of items in the condition and a conjunction of items in the conclusion. A well-known example of an association rule would be: if somebody buys beer and coke in the store, then he also buys diapers. This was a funny finding, and people who analyse buying habits would then put these things in the store close together: whenever you go and buy beer, you are also likely to buy peanuts, and that is how people organize their stores. Finding such associations is interesting for analysing the buying habits of people.

Association rules

Finding profiles of readers of the Delo daily newspaper

1. read_Marketing magazine 116 => read_Delo 95 (0.82)

2. read_Financial_News 223 => read_Delo 180 (0.81)

3. read_Views 201 => read_Delo 157 (0.78)

4. read_Money 197 => read_Delo 150 (0.76)

5. read_Vip 181 => read_Delo 134 (0.74)

Interpretation: Most readers of Marketing magazine, Financial News, Views, Money and Vip also read Delo.

This was another interesting case: we were trying to find the readers of Delo based on their other reading habits. The numbers in blue are absolute numbers, and the ones in red express confidence. We found that people read Delo if they also read the Marketing magazine, the Financial News (Finance), the Views, Money and Vip. All these rules have quite strong support and confidence. We can interpret this set of rules as showing that most readers of the Marketing magazine, Financial News, Views, Money and Vip are also readers of Delo, and that was pretty interesting for the client, the Mediana agency.

Question: About these decision trees and classification rules: aren't they just a transformation of one another?

Answer: You can always transform a decision tree into a set of classification rules, but vice versa is a bit more tricky. The standard way is to induce a decision tree and then transform it into a set of rules. We will see that in some cases learning rules is better, and in some cases learning decision trees is better.


Just to provide you with some intuition: rules give you so-called characteristic descriptions. When you learn a rule, you first decide which class you would like to describe with it; for example, »reads Antena« in the decision tree example would be the class, and you would like to build a rule which characterizes the readers of that magazine. On the other hand, with decision trees you are building a model which best separates readers from non-readers. The idea is slightly different: with rules you are building characteristic descriptions for each particular class, and with a decision tree you are building a model of the entire domain. Sometimes a property like »age below 25«, which was important in the Antena example, may not be that interesting as a characteristic of the readers, because it may appear there only in order to separate two subsets. Perhaps a bit later on I will be able to provide you with more intuition about this.

My basic answer would be: if you want to build a model of the domain that can be used for classification, then decision trees are fine, because they give you a very compact model at a glance. But if you would like to build characteristic descriptions for every particular class (note that here we have just two classes, »reads« and »doesn't read«, but occasionally we are dealing with five different classes), then decision rules give you a model of each class separately; rules are built for individual classes, by contrast with all the other classes.

So there is a slightly different point of view behind the two technologies. Certainly decision trees can be transformed into rules, whereas rules can hardly be transformed into a decision tree.

Analysis of UK traffic accidents

• End-user: Hampshire County Council (HCC, UK)
  – Can records of road traffic accidents be analysed to produce road safety information valuable to county surveyors?
  – HCC is sponsored to carry out a research project Road Surface Characteristics and Safety
  – Research includes an analysis of the STATS19 Accident Report Form Database to identify trends over time in the relationships between recorded road-user type/injury, vehicle position/damage, and road surface characteristics

In our next example, we were analysing UK traffic accidents. We had a quite large database: about 5 million accidents recorded over 20 years.


STATS19 Data Base

• Over 5 million accidents recorded in 1979-1999

• 3 data tables:
  – Accident ACC7999 (~5 mil. accidents, 30 variables): Where? When? How many?
  – Vehicle VEH7999 (~9 mil. vehicles, 24 variables): Which vehicles? What movement? Which consequences?
  – Casualty CAS7999 (~7 mil. injuries, 16 variables): Who was injured? What injuries? ...

We had three tables of relational data: one about the accidents, about 5 million of them, described by 30 variables; another about the vehicles involved in those accidents, about 9 million of them; and one about the casualties, the people injured in those accidents. There were about 7 million injuries.

Data understanding

[Chart: number of accidents per year of accident, 1979-1999; yearly counts between about 220000 and 270000]

First, we simply tried to understand the data by depicting how many accidents occurred in different years. We could see a slightly decreasing trend in the number of accidents, so an obvious step was to fit a regression line mimicking the downward trend; this trend has to do with better roads and better safety of vehicles, obviously. And if you do the data analysis in slightly more detail and distinguish between accidents according to severity, you see that the number of severe accidents drops (red line), the number of medium accidents also drops (green line), but the number of slight accidents stays the same.
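Fitting such a trend line takes one call to numpy's polyfit; here is a minimal sketch with invented yearly counts standing in for the real STATS19 figures.

```python
import numpy as np

years = np.arange(1979, 2000)
# Invented counts: a declining trend plus some noise.
accidents = (265000 - 1500 * (years - 1979)
             + np.random.default_rng(0).normal(0, 4000, years.size))

slope, intercept = np.polyfit(years, accidents, deg=1)  # least-squares line
print(f"trend: {slope:.0f} accidents per year")         # negative = decline
```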


Data quality: Accident location

Then it is interesting to see how good the data is. For some years, such as 1999, the data was good: here we simply depicted the locations where the accidents happened, and could see that in some parts of England there were many accidents. In other years, such as 1979, the location was not very reliable: you could see car accidents occurring in the sea, and you could also see that there were no accidents in London, which is hard to believe. Just by depicting the data you can see how reliable it is, and you can either decide to simply ignore the years with such unreliable data, or do something more clever with data cleaning. It would be very awkward to come up with the rule: in the year 1979 there were no accidents in London. We should always be a little bit careful with the results of data analysis. Then, we did some preparation of the data.

Data preparation

• There are 51 police force areas in UK

• For each area we count the number of accidents in each:
  – Year
  – Month
  – Day of Week
  – Hour of Day


The data was first divided by area, corresponding to the different police force areas, and then further divided by year, month, day of the week and hour.
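With pandas, this counting step is a short groupby; here is a minimal sketch, where the column names and the tiny table are hypothetical stand-ins for the STATS19 fields.

```python
import pandas as pd

accidents = pd.DataFrame({
    "police_force": ["a", "a", "b", "b", "c"],
    "date": pd.to_datetime(["1979-01-05 08:00", "1979-01-12 17:30",
                            "1980-06-03 23:10", "1980-06-04 08:45",
                            "1999-12-31 02:00"]),
})

# One count table per time granularity: area x year/month/day-of-week/hour.
for part in ("year", "month", "dayofweek", "hour"):
    counts = (accidents
              .groupby(["police_force", getattr(accidents["date"].dt, part)])
              .size()
              .unstack(fill_value=0))  # rows: areas, columns: time periods
    print(part, counts, sep="\n")
```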


Data preparation

YEAR (columns: 1979-1999)
pfc a: 10023 9431 9314 8965 8655 9014 9481 9069 8705 8829 9399 9229 8738 8199 7453 7613 7602 7042 7381 7362 6905
pfc b: 6827 6895 6952 7032 6778 6944 6387 6440 6141 5924 6331 6233 5950 6185 5910 6161 5814 6263 5881 5855 5780
pfc c: 2409 2315 2258 2286 2022 2169 2212 2096 1989 1917 2137 2072 2032 1961 1653 1526 1552 1448 1521 1408 1234

MONTH (columns: Jan-Dec)
pfc a: 72493 67250 77434 73841 78813 78597 80349 74226 79362 85675 84800 76282
pfc b: 2941 2771 3145 3317 3557 3668 3988 4048 3822 3794 3603 3481
pfc c: 9261 8574 9651 9887 10649 10590 10813 11299 10810 11614 10884 10306

DAY OF WEEK (columns: Sunday-Saturday)
pfc a: 96666 132845 137102 138197 142662 155752 125898
pfc b: 5526 5741 5502 5679 6103 7074 6510
pfc c: 15350 17131 16915 17116 18282 21000 18544

HOUR (columns: 0-8, …, 16-23)
pfc a: 794 626 494 242 166 292 501 1451 2284 … 3851 3538 2557 2375 1786 1394 1302 1415
pfc b: 2186 1567 1477 649 370 521 1004 4099 7655 … 11500 11140 7720 7129 5445 4396 3946 4777
pfc c: 2468 1540 1714 811 401 399 888 3577 8304 … 12112 12259 8701 7825 6216 4809 4027 4821


Simple visualization of short time series

• Used for data understanding

• Very informative and easy to understand format

• UK traffic accident analysis: Distributions of number of accidents over different time periods (year, month, day of week, and hour)

The next step was some very simple visualisation of the data.

Year/Month distribution

[Heatmap: months (Jan-Dec) vs. years; darker color - MORE accidents]

Here you see the year/month distribution and the other temporal distributions: the darker areas represent periods with more accidents, and the lighter ones periods with fewer accidents.


Day of Week/Month distribution

[Heatmap: days of the week (Sun-Sat) vs. months (Jan-Dec); darker = more accidents. All weekdays (Mon-Fri) are worse in deep winter, Friday the worst]

With these simple visualisations, which can be done in Excel, you can say: if we put the days of the week (Thursday, Friday, Saturday, ...) on the Y axis and the months of the year (October, September, December, ...) on the X axis, we can see that the largest number of accidents happened in the winter months, especially on Fridays.
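A heatmap like the ones shown here can also be drawn in a few lines of Python; here is a minimal sketch with matplotlib, using random counts in place of the real accident data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

days = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# Random stand-in for the real day-of-week x month accident counts.
counts = pd.DataFrame(
    np.random.default_rng(1).integers(5000, 22000, (7, 12)),
    index=days, columns=months)

plt.imshow(counts, cmap="Greys")  # darker cell = more accidents
plt.xticks(range(12), months)
plt.yticks(range(7), days)
plt.title("Day of Week / Month distribution")
plt.colorbar(label="number of accidents")
plt.show()
```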

Hour/Month distribution

[Heatmap: months (Jan-Dec) vs. hours of the day.
1. More accidents at "rush hour"; the afternoon rush hour is the worst.
2. More holiday traffic (less rush hour) in August]

Similarly, for the hour/month distribution: if we put the months on the Y and the hours of the day on the X axis, we see that most accidents happened in the morning and in the late afternoon, during rush hours. That is obviously what one would expect.

Day of Week/Hour distribution

[Heatmap: days of the week (Sun-Sat) vs. hours of the day.
1. More accidents at "rush hour"; the afternoon rush hour is the worst and lasts longer, with an "early finish" on Fridays.
2. More leisure traffic on Saturday/Sunday]

The day of week/hour distribution gives a similar view.

Some discovered association rules

• Association rules: Road number and Severity of accident
  – The probability of a fatal or serious accident on the "K8" road is 2.2 times greater than the probability of fatal or serious accidents in the county generally.
  – The probability of fatal accidents on the "K7" road is 2.8 times greater than the probability of fatal accidents in the county generally (when the road is dry and the speed limit = 70).

One of the association rules that we discovered was that the probability of a fatal or serious accident on a certain road (K8) is more than two times greater than the probability of fatal or serious accidents in the county in general. This is already an important indicator for the traffic authorities. There were similar rules for other roads.

Analysis of documents of European IST projects

Data source:
• List of IST project descriptions as 1-2 page text summaries from the Web (database www.cordis.lu/)
• IST 5FP has 2786 projects, in which 7886 organizations participate

Analysis tasks:
• Visualization of project topics
• Analysis of collaboration
• Connectedness between organizations
• Community/clique identification
• Thematic consortia identification
• Simulation of 6FP IST

Another example, illustrating various data analysis techniques, was the analysis of the Information Society Technologies (IST) projects funded by the European Commission during the last 5 years. The descriptions of the projects are available online, at the Cordis site. There were 2786 projects funded in the 5th Framework programme, involving nearly 8000 organizations. Here too, we were free to do whatever analysis we chose: we did some visualisation, some data mining and some web mining. Here are some of the results.


Analysis of documents of European IST project

This is what the available data looked like. Every project, such as our SolEuNet project, was described by an acronym, the project title, the URL, and a unique identifier. If you clicked on the project name, you got an abstract in textual form, and you could see the participants, their countries, and the project duration. That is the basic data which was available for the analysis. One of the things we did was clustering the projects based on their similarity. How was this done? We took the abstracts, which gave us textual representations of the projects. We transformed these texts into the so-called »bag of words« representation, which means that we put together all the words which occurred in the abstracts, leaving out the uninteresting tokens like dots, commas, and articles like »a« and »the«, which are very frequent but carry no information. In this way we obtained a vector of keywords representing the abstract, and thus the project itself; the bag-of-words representation contains not only the words themselves, but also counts of how many times each word occurs. On this basis we looked into the similarities between the projects: how similar one project is to another, according to the similarity of the words describing them. There is a well-known approach to computing the similarity of such vectors, called the »cosine similarity«: two projects are the more similar the smaller the angle between their word vectors, i.e. the larger the cosine of that angle. If many words occur in both abstracts, the two projects are very similar. A cluster of projects would then consist of those projects with the most words in common in their descriptions. Like this, all the projects were clustered.
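Here is a minimal sketch of the bag-of-words and cosine-similarity computation with scikit-learn; the three short »abstracts« are invented stand-ins for real Cordis project descriptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "data mining and decision support for business problems",
    "decision support tools for data analysis in business",
    "mobile computing protocols for wireless networks",
]

# Word-count vectors; English stop words (a, the, and, ...) are dropped.
vectors = CountVectorizer(stop_words="english").fit_transform(abstracts)

# Pairwise cosine similarity: larger value = smaller angle = more similar.
print(cosine_similarity(vectors).round(2))
```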


Visualization into 25 project groups

[Cluster map of projects; group labels include Health, Data analysis, Knowledge Management, Mobile computing]

Here we decided to cluster all the projects into 25 different groups, and we could see that some clusters represented data analysis projects, some represented knowledge management projects, and so on. For knowledge management projects, for example, the words appearing in the abstracts would be words such as »knowledge«, »knowledge management«, »portal«, »strategic portal« and so on. You could then click on a cluster and find the actual projects which were put together in a certain group.

Institutional Backbone of IST

[Collaboration graph between institutions; line thickness = no. of joint projects; area labels include Telecommunication, Transport, Electronics]

Another interesting application was looking at which institutions appeared in most projects and who collaborates with whom. The thickness of the line between two nodes represents the number of joint projects. Fraunhofer and GMD are basically the same institution, so it is no wonder that this is a very strong link: during the 5th framework programme, GMD became a part of Fraunhofer, a large set of research and development institutions in Germany. You can see from this that some data preprocessing had to be done, because the same institution does not always appear under the same name.

You can also see, for example, that the Centre national de la recherche scientifique has strong collaborations with very many institutions, especially other institutions in France, like the one in Grenoble, France Telecom and so on. We have only shown a link where two institutions have more than 10 joint projects; if we lowered the threshold to, say, 5, the graph would become much more elaborate, and much more difficult to understand.
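A graph like this can be sketched with networkx; the institution names, counts, and threshold handling below are illustrative only, not the project's actual data.

```python
import networkx as nx
import matplotlib.pyplot as plt

# Invented counts of joint projects between pairs of institutions.
joint_projects = {
    ("Fraunhofer", "GMD"): 40,
    ("CNRS", "France Telecom"): 18,
    ("CNRS", "INRIA Grenoble"): 12,
    ("CNRS", "Fraunhofer"): 4,   # below the threshold, will be hidden
}

THRESHOLD = 10
G = nx.Graph()
for (a, b), n in joint_projects.items():
    if n > THRESHOLD:            # show only strong collaborations
        G.add_edge(a, b, weight=n)

pos = nx.spring_layout(G, seed=0)
nx.draw(G, pos, with_labels=True,
        width=[G[u][v]["weight"] / 5 for u, v in G.edges()])  # edge width
plt.show()
```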
