Topic Maps for Association Rule Mining
Tomáš Kliegr, Jan Zemánek, Marek Ovečka
Department of Information and Knowledge Engineering
Faculty of Informatics and Statistics
University of Economics, Prague
Data Mining using CRISP-DM
The goal of data mining is to obtain useful non-trivial patterns from the data.
Analytical Report
Common data mining tasks
Clustering Classification
Sex(M) and Salary(Low) and District(Havlickuv Brod) => Quality(Bad)
Association rules
Association Rule Mining
Unlike clustering and classification, association rules provide true “nuggets”: rules meeting selected interest measures.
EXAMPLE
Duration(2y+) and District(Prague) => Loan Quality(good)
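The slides do not say which interest measures are meant. As a minimal sketch, assuming the two most commonly used ones (support and confidence), the example rule could be scored as follows. The records are invented for illustration and are not taken from the dataset.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def matches(record: Record, conditions: Dict[str, str]) -> bool:
    """True if the record has every attribute=value pair in conditions."""
    return all(record.get(attr) == value for attr, value in conditions.items())


def support_and_confidence(
    records: List[Record],
    antecedent: Dict[str, str],
    consequent: Dict[str, str],
) -> Tuple[float, float]:
    """Support = share of all records matching both sides of the rule.
    Confidence = share of antecedent-matching records that also match the consequent."""
    n_antecedent = sum(1 for r in records if matches(r, antecedent))
    n_both = sum(
        1 for r in records if matches(r, antecedent) and matches(r, consequent)
    )
    support = n_both / len(records) if records else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Toy records, invented for illustration only.
loans = [
    {"Duration": "2y+", "District": "Prague", "LoanQuality": "good"},
    {"Duration": "2y+", "District": "Prague", "LoanQuality": "good"},
    {"Duration": "2y+", "District": "Prague", "LoanQuality": "bad"},
    {"Duration": "1y", "District": "Brno", "LoanQuality": "good"},
]

support, confidence = support_and_confidence(
    loans,
    antecedent={"Duration": "2y+", "District": "Prague"},
    consequent={"LoanQuality": "good"},
)
```

A rule is reported only when both values exceed user-chosen thresholds, which is why loose thresholds produce many rules.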
THE QUEST FOR TOPIC MAPS
Antecedent Consequent
Select the really interesting rules from the rules output automatically. Help searching through the results.
THE PROBLEM WITH INTEREST MEASURES
It is usually not possible to tweak the interest measure thresholds so that only the really interesting rules are output. To be on the safe side, we often get (many!) more rules than desired.
The quest
- Past results
- Background knowledge
- Redundant rules
Discovered nuggets
More precise tasks
or
Automatic rule filtering
The lingua franca for exchange of data mining models is PMML
Predictive Model Markup Language
• XML Schema
• PMML is the leading standard for statistical and data mining models
• Supported by over 20 vendors and organizations
• Covers the technical part of the CRISP-DM Cycle
http://www.dmg.org/pmml_examples/index.html
PMML is “just” an XML Schema
• Developed for deploying mining models
• Good for migration from one data mining environment to another
But:
• No explicit links between nodes
• Verbose
• Self-contained. Lacks support for
– Interlinking multiple PMML documents
– Interlinking PMML with other information
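To illustrate the "no explicit links" point: in a PMML association model, a rule refers to its itemsets, and itemsets to their items, only through id strings that the consumer must resolve. The sketch below reads a hand-written, heavily simplified fragment. It uses the PMML element names Item, Itemset, ItemRef and AssociationRule, but omits the PMML namespace and other attributes a valid document would need, and the item values are invented.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written fragment for illustration; not a valid PMML document.
PMML_FRAGMENT = """
<AssociationModel>
  <Item id="1" value="Duration=2y+"/>
  <Item id="2" value="District=Prague"/>
  <Item id="3" value="LoanQuality=good"/>
  <Itemset id="10"><ItemRef itemRef="1"/><ItemRef itemRef="2"/></Itemset>
  <Itemset id="11"><ItemRef itemRef="3"/></Itemset>
  <AssociationRule antecedent="10" consequent="11"
                   support="0.5" confidence="0.67"/>
</AssociationModel>
"""


def read_rules(xml_text: str) -> list:
    """Resolve the id-based references of each AssociationRule into item values."""
    root = ET.fromstring(xml_text)
    # Item id -> item value
    items = {item.get("id"): item.get("value") for item in root.iter("Item")}
    # Itemset id -> list of item values, resolved through ItemRef/@itemRef
    itemsets = {
        itemset.get("id"): [items[ref.get("itemRef")] for ref in itemset.iter("ItemRef")]
        for itemset in root.iter("Itemset")
    }
    rules = []
    for rule in root.iter("AssociationRule"):
        rules.append(
            {
                "antecedent": itemsets[rule.get("antecedent")],
                "consequent": itemsets[rule.get("consequent")],
                "support": float(rule.get("support")),
                "confidence": float(rule.get("confidence")),
            }
        )
    return rules


rules = read_rules(PMML_FRAGMENT)
```

The two lookup dictionaries do by hand what explicit associations between topics would provide directly.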
Association Rule Mining Ontology
The ontology is a „semantization“ of PMML XML Schema
DESIGN GUIDELINES
The key design principle was to allow easy transformation of data from PMML to AROn.
SCOPE
The ontology is limited to the subset of PMML relevant to association rule mining. 60 topic types, 50 association types and 20 occurrence types.
USE
No automatic transformation is available yet, but we are working on one using the OKS framework. Currently, data can be input using Ontopoly.
Design guidelines: Elements
• xs:element is mapped to topic type
• Topics are assigned same names as PMML Nodes
– But respecting spaces between words and capitalization
• Superclasses are introduced for semantically similar XML Nodes
• Named elements used as children in other elements that carry most of the semantics of their parents are merged with parent
• If an XML element has a directly corresponding topic type in the ontology, the URI of the XML element within the schema is used as subject identifier
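The naming and subject identifier guidelines for elements can be sketched as below. This is only an illustration: the base URI and the `#name` fragment form are hypothetical placeholders, since the slides do not show how the URI of an element within the schema is built.

```python
import re

# Hypothetical base URI; the actual schema URI used by the authors is not given.
SCHEMA_BASE = "http://example.org/pmml-schema"


def topic_name(element_name: str) -> str:
    """Turn an XML element name into a topic name with spaces between words.

    A space is inserted wherever an uppercase letter follows a lowercase
    letter or a digit, so all-uppercase names are left unchanged.
    """
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", element_name)


def subject_identifier(element_name: str) -> str:
    """Build a subject identifier from the (assumed) schema URI and element name."""
    return f"{SCHEMA_BASE}#{element_name}"


name = topic_name("AssociationRule")
identifier = subject_identifier("AssociationRule")
```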
Design guidelines: Attributes
• Enumeration restriction on an attribute is mapped as a topic type with enumeration superclass (this is a workaround for missing TMCL support in OKS)
• Attributes that could be interpreted as reference to other elements become associations
• Other attributes become occurrence types
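The three attribute rules above amount to a small decision procedure. The sketch below is one possible reading: the `SchemaAttribute` structure is hypothetical, and checking the enumeration rule before the reference rule is an assumption, since the slides do not state a precedence.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SchemaAttribute:
    """Hypothetical, simplified description of an attribute in the XML Schema."""
    name: str
    enumeration: List[str] = field(default_factory=list)  # allowed values, if restricted
    refers_to: Optional[str] = None  # name of the element it references, if any


def map_attribute(attr: SchemaAttribute) -> str:
    """Return the Topic Maps construct an attribute is mapped to."""
    if attr.enumeration:
        # Enumeration restriction: a topic type under the enumeration superclass.
        return "topic type with enumeration superclass"
    if attr.refers_to is not None:
        # Reference to another element: an association.
        return "association type"
    # Everything else: an occurrence type.
    return "occurrence type"


# Illustrative attributes only.
examples = [
    SchemaAttribute("optype", enumeration=["categorical", "ordinal", "continuous"]),
    SchemaAttribute("antecedent", refers_to="Itemset"),
    SchemaAttribute("support"),
]
mapped = {attr.name: map_attribute(attr) for attr in examples}
```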
Design guidelines: Associations
• Names for association types are arbitrarily chosen so that they are most descriptive
• Introduce fewer rather than more associations
– Minimizes the effort when populating the ontology from PMML
– Avoids unnecessary inflation of the topic map
• Link only the semantically closest topics
– Additional „soft“ relations can be introduced with inference statements or derived with tolog
Design guidelines: Role types
• Topic types used to map PMML elements are used as role types
– Unless multiple topics are permitted at an association end. In that case, a superclass is used as the role type, or a new role type is introduced
Two alternative association rule representations
- Apriori based (Item-Itemset)
- GUHA based (Boolean Attributes)
Ongoing work
• Support for background knowledge „already known association rules“
• Support for schema mapping „linking of background knowledge with mining results“
• Already in the ontology, distinguished by base of subject identifier
Schema Mapping
• http://keg.vse.cz/sma/XXX
Background Knowledge
• http://keg.vse.cz/bko/xxx
Data Mining Use case
PREDICT LOAN QUALITY
Find client characteristics that could be used to predict their attitude to paying back a loan.
BASED ON PAST RECORDS
Input data: records on already given loans
The data
• 6181 clients in the PKDD’99 financial dataset
Data were preprocessed, i.e.
District -> district
Prague -> Prague
Brno -> Brno
… -> …
duration -> Duration
Many distinct values in <0;100> -> <0;12>, <13;23>, <24;inf>
status -> statusAgg
A -> Good
B -> Medium
C -> Bad
D
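The duration step in the preprocessing above is a plain discretization into three intervals. A minimal sketch, assuming durations are whole numbers of months and that the interval bounds are inclusive as written:

```python
import math

# Interval bounds taken from the slide; labels keep the slide's notation.
DURATION_BINS = [
    (0, 12, "<0;12>"),
    (13, 23, "<13;23>"),
    (24, math.inf, "<24;inf>"),
]


def bin_duration(months: int) -> str:
    """Map a loan duration in whole months to its interval label."""
    for low, high, label in DURATION_BINS:
        if low <= months <= high:
            return label
    raise ValueError(f"duration outside all intervals: {months}")


binned = [bin_duration(m) for m in (6, 12, 13, 54)]
```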
ID     sex      age   duration    district   Loan quality
5464   male     54    12 months   Prague     A
5489   female   20    6 months    Ostrava    E
…      ..       ..    ..          ..         ..
• … and perhaps 9997 other association rules
Preprocessed data
Association Rule Learner
WE CAN’T PRESENT ALL 10,000 RULES TO THE CLIENT
ASK THE CLIENT WHAT HE KNOWS
If loan duration is more than two years and the loan was given in Prague district, we can expect good loan quality.
…background knowledge
Semantize the results
Formalize Background Knowledge
Schema Mapping
• Background knowledge can use a different “vocabulary” than the data
• If we are to use background knowledge in querying, we need to interlink it with the data.
The same approach would apply if we interlink several mining models (PMMLs)
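In the topic map the interlinking is expressed with associations between topics. The idea itself can be sketched as a lookup that rewrites background-knowledge items into the vocabulary of the data. Every term in the mapping below is invented for illustration.

```python
from typing import Dict, FrozenSet, Iterable

# Hypothetical mapping from background-knowledge terms to data terms.
SCHEMA_MAPPING: Dict[str, str] = {
    "Loan duration(more than two years)": "duration(<24;inf>)",
    "District(Prague)": "district(Prague)",
    "Loan quality(good)": "statusAgg(Good)",
}


def to_data_vocabulary(items: Iterable[str], mapping: Dict[str, str]) -> FrozenSet[str]:
    """Translate background-knowledge items into data vocabulary.

    Raises KeyError for items without a mapping, so unmapped terms are not
    silently dropped.
    """
    items = list(items)
    missing = [item for item in items if item not in mapping]
    if missing:
        raise KeyError(f"no schema mapping for: {missing}")
    return frozenset(mapping[item] for item in items)


translated = to_data_vocabulary(
    ["Loan duration(more than two years)", "District(Prague)"], SCHEMA_MAPPING
)
```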
Deleting information with Topic Maps
• Find association rules that subsume background knowledge
Visualization of a tolog query
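The slides run this filtering as a tolog query over the topic map. The Python sketch below only illustrates the underlying idea, under an assumed definition: a mined rule subsumes a background rule when it has the same consequent and its antecedent contains every item of the background antecedent. The authors' exact criterion is not stated.

```python
from typing import FrozenSet, List, NamedTuple


class Rule(NamedTuple):
    antecedent: FrozenSet[str]
    consequent: FrozenSet[str]


def subsumes(mined: Rule, known: Rule) -> bool:
    """Assumed criterion: same consequent, and the mined antecedent is a
    superset of the known antecedent."""
    return known.antecedent <= mined.antecedent and known.consequent == mined.consequent


def filter_known(mined_rules: List[Rule], background: List[Rule]) -> List[Rule]:
    """Keep only the mined rules that subsume no background rule."""
    return [
        rule
        for rule in mined_rules
        if not any(subsumes(rule, known) for known in background)
    ]


background = [
    Rule(frozenset({"Duration(2y+)", "District(Prague)"}), frozenset({"LoanQuality(good)"})),
]
mined = [
    # Adds Sex(M) to what the client already knows: filtered out.
    Rule(
        frozenset({"Duration(2y+)", "District(Prague)", "Sex(M)"}),
        frozenset({"LoanQuality(good)"}),
    ),
    # Unrelated to the background rule: kept.
    Rule(
        frozenset({"Sex(M)", "Salary(Low)", "District(Havlickuv Brod)"}),
        frozenset({"Quality(Bad)"}),
    ),
]
remaining = filter_known(mined, background)
```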
Summary
• Methodology for transferring XML Schema to Topic Maps
• Association Rule Mining Ontology based on PMML
• Easily extensible to other data mining algorithms
• Initial attempts to formalize background knowledge
• Initial attempts to use Topic Maps for schema mapping
AROn On-Line: http://maiana.topicmapslab.de/u/lmaicher/tm/kliegr