AI Week 15 Machine Learning: Data Mining: Association Rule Mining, Associative Classification, Applications

Lee McCluskey, room 3/10, Email lee@hud.ac.uk, http://scom.hud.ac.uk/scomtlm/cha2555/


Page 1:

AI Week 15

Machine Learning: Data Mining: Association Rule Mining, Associative Classification, Applications

Lee McCluskey, room 3/10, Email lee@hud.ac.uk

http://scom.hud.ac.uk/scomtlm/cha2555/

Page 2:

Last Week

Data Mining: inducing rule classifiers from classified training examples.

Page 3:

Artform Research Group

Association Rule Mining (ARM)

This is an “unsupervised learning activity”: briefly, looking for strong associations between features in data.

Definitions:

A transactional database is a set of “transactions”, e.g. the details of individual sales.

A transaction can be thought of as an “item-set” where each item is an attribute-value pair, e.g.

{height = 6, temp = 20, weather = warm}

As a special case we could have nominal item-sets, e.g.

{bread, cheese, milk}
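The two kinds of item-set above can be sketched in Python. This is only an illustration: the choice of `frozenset`, and the variable names `reading`, `basket` and `database`, are assumptions and do not come from the slides.

```python
# Assumption: an item is any hashable value and a transaction is a frozenset.

# Attribute-value item-set: each item is an (attribute, value) pair.
reading = frozenset({("height", 6), ("temp", 20), ("weather", "warm")})

# Nominal item-set: each item is just a name.
basket = frozenset({"bread", "cheese", "milk"})

# A transactional database is then a collection of such transactions.
database = [basket, frozenset({"bread", "milk"}), frozenset({"cheese"})]

# "Transaction t contains item-set X" becomes the subset test X <= t.
bread_and_milk = frozenset({"bread", "milk"})
contains_bread_and_milk = [bread_and_milk <= t for t in database]
print(contains_bread_and_milk)
```

With this representation, "contains" is just a subset test, which is all that the support and confidence definitions need.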

Page 4:

Association Rule Mining (ARM): Important Definitions

An association rule is an expression

X => Y

where X and Y are item-sets.

The support of an association rule is defined as the proportion of transactions in the database that contain X U Y.

The confidence of an association rule is defined as the probability that a transaction contains Y given that it contains X, that is

confidence(X => Y) = (no. of transactions containing X U Y) / (no. of transactions containing X)
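The two definitions translate almost directly into code. The sketch below assumes transactions are stored as frozensets and that the database is non-empty. The function names, and the choice to return 0 when no transaction contains X, are assumptions made for illustration.

```python
from typing import FrozenSet, Hashable, Sequence

ItemSet = FrozenSet[Hashable]


def count_containing(database: Sequence[ItemSet], items: ItemSet) -> int:
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in database if items <= t)


def support(database: Sequence[ItemSet], x: ItemSet, y: ItemSet) -> float:
    """Proportion of transactions containing X U Y (database must be non-empty)."""
    return count_containing(database, x | y) / len(database)


def confidence(database: Sequence[ItemSet], x: ItemSet, y: ItemSet) -> float:
    """Transactions containing X U Y divided by transactions containing X."""
    n_x = count_containing(database, x)
    if n_x == 0:
        return 0.0  # assumption: an antecedent that never occurs gives confidence 0
    return count_containing(database, x | y) / n_x


# Toy database (hypothetical, not from the slides).
db = [frozenset(t) for t in (
    {"bread", "cheese", "milk"},
    {"bread", "milk"},
    {"bread", "cheese"},
    {"milk"},
)]
x, y = frozenset({"bread"}), frozenset({"milk"})
rule_support = support(db, x, y)        # 2 of 4 transactions contain bread and milk
rule_confidence = confidence(db, x, y)  # 2 of the 3 bread transactions contain milk
print(rule_support, rule_confidence)
```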

Page 5:

Aims of ARM

Given a transactional database D, the association rule problem is to find all rules whose support and confidence are greater than user-specified thresholds, denoted minimum support (MinSupp) and minimum confidence (MinConf) respectively.

The aim is the discovery of the most significant associations between the items in a transactional data set. This process involves primarily the discovery of so-called frequent item-sets, i.e. item-sets whose support in the transactional data set is above MinSupp. Rules are then formed from the frequent item-sets and kept if their confidence is above MinConf.
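The two-step process described above (frequent item-sets first, then rules) can be sketched as a brute-force search. This is not the algorithm used in the course and not an optimised miner such as Apriori. `mine_rules` and its parameters are hypothetical names, and items are assumed to be sortable (e.g. strings).

```python
from itertools import combinations


def mine_rules(database, min_supp, min_conf):
    """Brute-force ARM sketch: returns (X, Y, support, confidence) tuples."""
    n = len(database)
    items = sorted(set().union(*database))

    def count(itemset):
        return sum(1 for t in database if itemset <= t)

    # Step 1: frequent item-sets, i.e. support strictly above min_supp.
    frequent = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            c = count(candidate)
            if c / n > min_supp:
                frequent[candidate] = c
                found_any = True
        if not found_any:
            break  # no frequent set of this size, so none of any larger size

    # Step 2: split each frequent item-set into X => Y and test confidence.
    rules = []
    for itemset, c in frequent.items():
        for k in range(1, len(itemset)):
            for x_items in combinations(sorted(itemset), k):
                x = frozenset(x_items)
                # Every subset of a frequent item-set is itself frequent,
                # so frequent[x] is always present.
                conf = c / frequent[x]
                if conf > min_conf:
                    rules.append((x, itemset - x, c / n, conf))
    return rules


# Toy database (hypothetical, not from the slides).
db = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "milk", "cheese"},
    {"bread"},
    {"milk"},
)]
rules = mine_rules(db, min_supp=0.4, min_conf=0.6)
for x, y, supp, conf in rules:
    print(sorted(x), "=>", sorted(y), supp, round(conf, 3))
```

On the toy database only {bread}, {milk} and {bread, milk} are frequent, so the only rules returned are bread => milk and milk => bread.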

Page 6:

Example

A trader deals in the following currencies in a series of 8 transactions:

1 Sterling Yen Dollar Euro
2 Dollar Euro Rand Sterling Ruble
3 Pesos Euro Ruble Rupee Yen
4 Rupee Sterling Ruble Euro Dollar
5 Sterling Dinars Rand Yen
6 Pesos Kroner Sterling Dollar
7 Ruble Rupee Kroner Sterling Pesos
8 Dollar Euro Sterling

What is the SUPPORT and CONFIDENCE of the following rules?

{Ruble} → {Rupee}
{Sterling, Euro} → {Ruble}
{Sterling, Euro} → {Ruble, Pesos}

Find an association rule from the set of transactions that has
- at least 2 items in its antecedent,
- better support and better confidence than both rules above.
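The exercise can be checked mechanically. The sketch below encodes the eight transactions exactly as listed and evaluates the three rules. The fourth rule, {Sterling, Euro} → {Dollar}, is one candidate answer to the final question, chosen here for illustration and not taken from the slides. The helper names are assumptions.

```python
transactions = [
    {"Sterling", "Yen", "Dollar", "Euro"},
    {"Dollar", "Euro", "Rand", "Sterling", "Ruble"},
    {"Pesos", "Euro", "Ruble", "Rupee", "Yen"},
    {"Rupee", "Sterling", "Ruble", "Euro", "Dollar"},
    {"Sterling", "Dinars", "Rand", "Yen"},
    {"Pesos", "Kroner", "Sterling", "Dollar"},
    {"Ruble", "Rupee", "Kroner", "Sterling", "Pesos"},
    {"Dollar", "Euro", "Sterling"},
]


def count(items):
    """Number of transactions containing every currency in `items`."""
    return sum(1 for t in transactions if items <= t)


def support(x, y):
    return count(x | y) / len(transactions)


def confidence(x, y):
    n_x = count(x)
    return count(x | y) / n_x if n_x else 0.0


rules = [
    ({"Ruble"}, {"Rupee"}),
    ({"Sterling", "Euro"}, {"Ruble"}),
    ({"Sterling", "Euro"}, {"Ruble", "Pesos"}),
    ({"Sterling", "Euro"}, {"Dollar"}),  # candidate answer (an assumption)
]
results = [(support(x, y), confidence(x, y)) for x, y in rules]
for (x, y), (s, c) in zip(rules, results):
    print(f"{sorted(x)} => {sorted(y)}: support {s:.3f}, confidence {c:.2f}")
```

Run as written, this gives support 0.375 and confidence 0.75 for the first rule, 0.25 and 0.5 for the second, 0 and 0 for the third (no transaction contains all four currencies), and 0.5 and 1.0 for the candidate rule.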

Page 7:

Example

[Diagram: the eight transactions from the previous slide, marked to show which contain X and which contain X U Y for the rule X => Y: Ruble => Rupee]

Page 8:

Associative Classification

If we fuse ARM and classification rule mining we get “Associative Classification”: use the association technique, but learn about particular items or item-sets.

Associative Classification is a branch of data mining that combines classification and association rule mining. In other words, it utilises association rule discovery methods on classification data sets.

Typically:
• Find Association Rules using ARM
• Sift out the “Class Association Rules”: ones that have the class of interest on their Right Hand Sides
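The "sift out" step can be sketched as a filter over mined rules. Everything here is illustrative: the rules are hypothetical toy values, items are assumed to be (attribute, value) pairs, and the `classify` helper, which predicts with the highest-confidence matching rule, is one common way of using class association rules and is not taken from the slides.

```python
# Hypothetical mined rules: (antecedent, consequent, support, confidence).
rules = [
    (frozenset({("outlook", "sunny")}), frozenset({("class", "play")}), 0.30, 0.80),
    (frozenset({("outlook", "sunny")}), frozenset({("humidity", "low")}), 0.25, 0.70),
    (frozenset({("humidity", "high")}), frozenset({("class", "dont_play")}), 0.20, 0.90),
]


def class_association_rules(rules, class_attribute="class"):
    """Keep only rules whose right-hand side is a single class item."""
    kept = []
    for x, y, supp, conf in rules:
        if len(y) == 1 and next(iter(y))[0] == class_attribute:
            kept.append((x, y, supp, conf))
    return kept


def classify(instance, cars, default=None):
    """Predict with the highest-confidence rule whose antecedent matches."""
    for x, y, _, _ in sorted(cars, key=lambda r: r[3], reverse=True):
        if x <= instance:
            return next(iter(y))[1]
    return default


cars = class_association_rules(rules)
instance = frozenset({("outlook", "sunny"), ("humidity", "high")})
prediction = classify(instance, cars)
print(len(cars), prediction)
```

Both class rules match this instance, and the 0.90-confidence rule is tried first, so the sketch predicts "dont_play".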

Page 9:

Validation in Rule Discovery

• Multi-stage Data Mining “pipelines” are fraught with various kinds of errors / bias
• The integrity of the data at each stage of the DM process and the reliability of the results are particularly important.
• DM usually uses “cross validation”, where the data is split into a training set and a testing set, and the results of the data miner applied to the training set are checked against the testing set. This is not really applicable to rule discovery.

Key idea: Look for trends/associations in the data that are output from the process and that represent known associations in the application domain.
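One simple reading of the key idea is to check which associations already known in the domain reappear among the mined rules. The sketch below assumes that "reappear" means an exact match on antecedent and consequent, which is a simplification. The function name and the example rules are hypothetical.

```python
def check_known_associations(mined_rules, known_associations):
    """Split known (X, Y) associations into those found and those missed.

    Assumption: a known association counts as found only if a mined rule
    has exactly the same antecedent and consequent.
    """
    mined = {(frozenset(x), frozenset(y)) for x, y in mined_rules}
    found, missed = [], []
    for x, y in known_associations:
        key = (frozenset(x), frozenset(y))
        (found if key in mined else missed).append((x, y))
    return found, missed


# Hypothetical mined rules and domain knowledge, for illustration only.
mined_rules = [
    ({"long_duration"}, {"retinopathy"}),
    ({"drugX"}, {"low_DR"}),
]
known_associations = [
    ({"long_duration"}, {"retinopathy"}),
    ({"smoker"}, {"retinopathy"}),
]
found, missed = check_known_associations(mined_rules, known_associations)
print(len(found), "found,", len(missed), "missed")
```

A pipeline that misses many well-established associations is a warning sign that data or processing errors crept in at some stage.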

Page 10:

DM Application 1: Discovering trends from patient data in the area of Diabetic Retinopathy

Diabetic Retinopathy: damage to the eyes caused by diabetes, sometimes leading to blindness.

This is a HUGE problem, as diabetes is on the increase. If you are a long-term diabetic then you are very likely to suffer some retinal damage.

Clinics keep large amounts of data on patients who are treated in various ways, over long periods of time.

Page 11:

Diabetic Retinopathy Application

Data of 20,000 patients over 18 years.

Much data cleaning and inference precedes mining: replacing missing values, dealing with noise, anomalies etc.

Focus is on a smaller number of patients with a yearly screening (timestamp) over a period of 4+ years.

Attribute Examples (there are several hundred):
Age_at_Exam, Present_Treatment, calculated_age_at_diagnosis, Retinopathy_in_R_Eye (RE_RET), Retinopathy_in_L_Eye (RE_RET), calculated_diabetes_type, calculated_diabetes_duration

Page 12:

Trend Mining

Item-sets that have an increasing support over a series of time-stamped instances (events) are called “emerging patterns”.

The changing support for sets of items during each event can indicate trends in the data. For example, the presence of a particular treatment over a period of time may lead to the alleviation of a symptom.
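A minimal sketch of detecting an emerging pattern follows. It assumes the data is grouped by time-stamp into lists of transactions, and it takes "increasing" to mean strictly increasing from each event to the next. That is one possible reading; other definitions use a growth-rate threshold instead. The yearly data and the item names are hypothetical.

```python
def support_of(itemset, transactions):
    """Proportion of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def support_series(itemset, events):
    """Support of `itemset` in each time-stamped event, in time order."""
    return [support_of(itemset, events[ts]) for ts in sorted(events)]


def is_emerging(itemset, events):
    """True if support strictly increases from each event to the next."""
    series = support_series(itemset, events)
    return len(series) > 1 and all(a < b for a, b in zip(series, series[1:]))


# Hypothetical yearly screenings: one item-set per patient per year.
events = {
    2001: [{"drugX", "low_DR"}, {"drugX"}, {"drugX"}, {"low_DR"}],
    2002: [{"drugX", "low_DR"}, {"drugX", "low_DR"}, {"drugX"}, {"low_DR"}],
    2003: [{"drugX", "low_DR"}, {"drugX", "low_DR"}, {"drugX", "low_DR"}, {"drugX"}],
}
pattern = {"drugX", "low_DR"}
series = support_series(pattern, events)
print(series, is_emerging(pattern, events))
```

Here the support of {drugX, low_DR} rises from 0.25 to 0.5 to 0.75, so it counts as emerging, whereas {drugX} alone stays at 0.75 for the first two years and does not.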

Page 13:

Diabetic Retinopathy Application

Aim: to find trends in the data, e.g. (fictitious example):

calculated_diabetes_duration > Y &
Age_at_Exam in [60,70] &
Present_Treatment = drugX &
calculated_age_at_diagnosis in [50,60]
=>
Retinopathy_in_R_Eye (RE_RET) = low
Retinopathy_in_L_Eye (RE_RET) = low

Increasing trend: “people who have had diabetes for a certain length of time, whose age is in their 60’s, who were diagnosed in their 50’s, who have been taking treatmentX, often have low DR levels”

Increasing trend adds support for the association.

Page 14:

DM Application 2: Road Traffic Control

Page 15:

Example in Road Traffic Control

Page 16:

Example in Road Traffic Control

Data:
Numeric data records from individual CARS (date, time, position, actual speed, expected speed)
Textual data of INCIDENTS (date, time start, time cleared, position, severity, road type, area, incident category, cause, road-effect, traffic-effect, reporter ..)

Data Sources: ANPR, Mobile Phones, Road (Vehicle) Sensors, Environment Sensors

Page 17:

Applications in Road Traffic Control

• associations between variations in speeds and near-future incidents
• effect of a particular type of incident (e.g. roadworks) on average speeds on nearby trunk roads
• looking for predictors of "heavy/slow traffic" incidents: look for associations with speed variations or accidents on roads downstream from the incident position (hence causing the incident)
• looking for associations between speeds around a bypass and a later "heavy traffic" incident within the town bypassed

Page 18:

Conclusions

Data Mining is a powerful set of techniques to help discover hidden knowledge.

It can be supervised or unsupervised.

• Association Rule Mining
• Associative Classification

are important classes of technique used in DM.