Data Mining Jim King. What is Data Mining? A.k.a. knowledge discovery The search for previously...
Data Mining
Jim King
What is Data Mining?
A.k.a. knowledge discovery
• The search for previously unknown relationships in large data sets
Why?
• Improved technology allows for vast quantities of data to be gathered
• Those relationships can perhaps be used to make future decisions and strategies
How do we Data Mine?
Three considerations to be made
• Classification
• Association
• Sequential
Classification
Generate grouping rules
• Future data can then be classified quickly
Example: Disease classification based on symptoms may lead to better treatments
Association
Two conditions occur together
Presumptive Objective
With some probability (confidence)
Cond1 => Cond2
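As a minimal sketch of how the confidence of Cond1 => Cond2 could be estimated, assuming each record is a set of items (the basket data and the function name are made up for illustration and do not come from the slides):

```python
def confidence(records, cond1, cond2):
    """Of the records that satisfy cond1, the fraction that also satisfy cond2."""
    with_cond1 = [r for r in records if cond1 <= r]
    if not with_cond1:
        return 0.0  # cond1 never occurs, so the rule has no evidence
    return sum(1 for r in with_cond1 if cond2 <= r) / len(with_cond1)


# Hypothetical shopping baskets, invented for this example.
baskets = [
    {"cereal", "milk"},
    {"cereal", "milk", "eggs"},
    {"cereal", "bread"},
    {"eggs", "bread"},
]

# Two of the three baskets with cereal also contain milk.
print(confidence(baskets, {"cereal"}, {"milk"}))
```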
Sequential
Event B follows Event A
Ex. In e-commerce, what links do people follow?
• After following links to a product, how often do they buy?
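The e-commerce question can be sketched roughly as follows, assuming each visit is recorded as an ordered list of event names (the event names `view_product` and `buy` and the sample sessions are hypothetical):

```python
def follows(session, first, then):
    """True if event `then` occurs somewhere after event `first` in the session."""
    if first not in session:
        return False
    return then in session[session.index(first) + 1:]


def conversion_rate(sessions, first="view_product", then="buy"):
    """Of the sessions that contain `first`, the fraction where `then` follows it."""
    viewed = [s for s in sessions if first in s]
    if not viewed:
        return 0.0
    return sum(follows(s, first, then) for s in viewed) / len(viewed)


# Hypothetical click sessions, invented for this example.
sessions = [
    ["home", "view_product", "buy"],
    ["home", "search", "view_product"],
    ["home", "view_product", "cart", "buy"],
    ["home", "search"],
]

# Three sessions reach a product page; two of them go on to buy.
print(conversion_rate(sessions))
```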
Classification Algorithms
Hard clustering vs. Soft clustering
• Collection of classes { C1, C2, ..., Cn }
• Arbitrary Object O
• Soft Clustering: Classes may overlap where an object belongs to multiple classes
• Hard Clustering: Every object may belong to only one class. No overlap
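One way to picture the difference, as a sketch only: assume each class is summarized by a center on a number line, hard clustering picks the single nearest class, and soft clustering admits every class within some radius. The centers, the radius, and both function names are assumptions for illustration.

```python
def hard_assign(obj, centers):
    """Hard clustering: return the single class whose center is nearest to obj."""
    return min(centers, key=lambda name: abs(obj - centers[name]))


def soft_assign(obj, centers, radius):
    """Soft clustering: return every class whose center lies within `radius` of obj."""
    return {name for name, c in centers.items() if abs(obj - c) <= radius}


centers = {"C1": 0.0, "C2": 10.0}  # hypothetical 1-D class centers
o = 4.0                             # arbitrary object O

print(hard_assign(o, centers))            # exactly one class
print(soft_assign(o, centers, radius=6))  # classes may overlap
```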
Classification
One way: Agglomerative
• Every object is its own cluster
• Find the two clusters with the least distance
• Combine them into one cluster
• Repeat; stop when only one cluster remains
• Returns hierarchy of the clustering
Need to decide on some distance function
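The agglomerative steps can be sketched as below. This is an illustrative version that assumes single linkage (cluster distance is the distance between the closest pair of members); the slides do not specify a linkage, and the sample points are invented.

```python
from itertools import combinations


def single_link(a, b, dist):
    """Distance between two clusters: the closest pair of members (single linkage)."""
    return min(dist(x, y) for x in a for y in b)


def agglomerate(points, dist):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [(p,) for p in points]  # every object starts as its own cluster
    history = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: single_link(pair[0], pair[1], dist))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)         # combine into one cluster
        history.append((a, b))         # the sequence of merges is the hierarchy
    return history


merges = agglomerate([1.0, 2.0, 6.0, 7.5, 20.0], dist=lambda x, y: abs(x - y))
for left, right in merges:
    print(left, "+", right)
```

The distance function is passed in as a parameter, which reflects the point that it has to be chosen separately for each kind of data.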
Classification
Another way: Division method
• Everything initially in one cluster
• Split into two clusters
• Split each new cluster into two more clusters
• Stop when can’t divide any more
Requires more computational power, but usually worse results
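A toy sketch of the division method on one-dimensional data follows. The split rule used here (cut a sorted cluster at its widest gap) is an assumption chosen for brevity; real divisive methods have to search for a good split, which is where the extra computation comes from.

```python
def split(cluster):
    """Split a sorted 1-D cluster in two at its widest gap (one possible split rule)."""
    gaps = [cluster[i + 1] - cluster[i] for i in range(len(cluster) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return cluster[:cut], cluster[cut:]


def divide(cluster):
    """Recursively split until clusters cannot be divided; return a nested hierarchy."""
    if len(cluster) <= 1:
        return cluster  # a single object cannot be divided any more
    left, right = split(cluster)
    return [divide(left), divide(right)]


# Hypothetical sample values; everything starts in one cluster.
hierarchy = divide(sorted([7.5, 1.0, 20.0, 2.0, 6.0]))
print(hierarchy)
```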
Association Algorithms
Given constraints, minimize the criteria needed for a condition
Bought cereal & eggs -> Bought milk
• 80% confidence
Bought cereal -> Bought milk
• 90% confidence
Association
Pruning conditions that fall below a minimum improvement yields simplifications
Other constraints:
• Minimum confidence (e.g. 30% of records with A also include B)
• Minimum support (e.g. 2% of records have both A and B)
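The three constraints could be combined roughly as follows. This is a sketch under assumptions: records are sets of items, and "improvement" is taken to mean the rule's confidence minus the best confidence among rules with one condition dropped. The baskets are constructed so that the confidences match the cereal, eggs and milk figures quoted earlier; all names are illustrative.

```python
def support(records, items):
    """Fraction of records containing every item in `items`."""
    return sum(1 for r in records if items <= r) / len(records)


def confidence(records, lhs, rhs):
    """Of the records containing `lhs`, the fraction that also contain `rhs`."""
    base = support(records, lhs)
    return support(records, lhs | rhs) / base if base else 0.0


def keep_rule(records, lhs, rhs, min_support, min_confidence, min_improvement):
    """Return True if lhs -> rhs passes the support, confidence and improvement tests."""
    if support(records, lhs | rhs) < min_support:
        return False
    conf = confidence(records, lhs, rhs)
    if conf < min_confidence:
        return False
    # Simpler rules: drop one condition from lhs (none exist for a single condition).
    simpler = [lhs - {item} for item in lhs if len(lhs) > 1]
    best_simpler = max((confidence(records, s, rhs) for s in simpler), default=0.0)
    return conf - best_simpler >= min_improvement


# Hypothetical baskets: cereal -> milk is 90%, cereal & eggs -> milk is 80%.
baskets = ([{"cereal", "milk"}] * 5
           + [{"cereal", "eggs", "milk"}] * 4
           + [{"cereal", "eggs"}])

print(keep_rule(baskets, {"cereal", "eggs"}, {"milk"}, 0.02, 0.30, 0.0))  # pruned
print(keep_rule(baskets, {"cereal"}, {"milk"}, 0.02, 0.30, 0.0))          # kept
```

Adding the eggs condition lowers confidence from 90% to 80%, so the longer rule is dropped in favor of the simpler one.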
Sequential Algorithms
People buy basic camping equipment
Later buy other items related
Starting with basic item sets, try to concatenate and find the resulting set among customer behavior
Sequential
If the resulting item set is not supported (either not at all, or not above a threshold), drop it
Sequences do not have to be contiguous
• e.g. A customer buys A then B then C, sequence A then C is valid
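A minimal sketch of this idea, assuming each customer's purchases are stored as an ordered list: a pattern counts as supported by a customer if its items appear in order, with gaps allowed, and concatenated candidates below a support threshold are dropped. The purchase histories, item names, and function names are hypothetical.

```python
def is_subsequence(pattern, history):
    """True if the pattern's items appear in `history` in order, gaps allowed."""
    remaining = iter(history)
    # `item in remaining` advances the iterator, so order is respected.
    return all(item in remaining for item in pattern)


def sequence_support(pattern, histories):
    """Fraction of customer histories that contain the pattern."""
    return sum(is_subsequence(pattern, h) for h in histories) / len(histories)


def extend(frequent, items, histories, min_support):
    """Concatenate each frequent sequence with each item; drop unsupported results."""
    candidates = [seq + (item,) for seq in frequent for item in items]
    return [c for c in candidates if sequence_support(c, histories) >= min_support]


# A then B then C supports the non-contiguous sequence A then C.
print(is_subsequence(("A", "C"), ["A", "B", "C"]))

# Hypothetical camping purchases, invented for this example.
histories = [
    ["tent", "stove", "lantern"],
    ["tent", "lantern"],
    ["stove", "tent"],
]
print(extend([("tent",)], ["tent", "stove", "lantern"], histories, min_support=0.5))
```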
Case Study - SchulWeb
Search Site for schools in Germany
How to improve performance and user satisfaction?
Use log to track user navigation patterns (i.e. What URLs requested, what order?)
Extract Information from these
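As a rough sketch of how navigation patterns might be pulled out of a log, assume each entry is a (user, timestamp, URL) tuple. The log format, the URLs, and the function names are invented for illustration and are not SchulWeb's actual data or method.

```python
from collections import Counter, defaultdict


def navigation_paths(log):
    """Group (user, timestamp, url) entries into one time-ordered URL list per user."""
    paths = defaultdict(list)
    for user, _, url in sorted(log, key=lambda entry: (entry[0], entry[1])):
        paths[user].append(url)
    return dict(paths)


def transition_counts(paths):
    """Count how often each URL is requested directly after another."""
    counts = Counter()
    for urls in paths.values():
        counts.update(zip(urls, urls[1:]))
    return counts


# Hypothetical log entries; the URLs are made up.
log = [
    ("u1", 1, "/search"), ("u1", 2, "/results"), ("u1", 3, "/school"),
    ("u2", 1, "/search"), ("u2", 2, "/results"), ("u2", 3, "/search"),
]

paths = navigation_paths(log)
counts = transition_counts(paths)
print(counts.most_common(1))  # the most frequent page-to-page step
```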
Interpretations of Mining
Users don’t like to type text
Prefer to select from available choices
What were they looking for?
• Schools close to some region
• Used option to specify a state (for location)
• Used option to specify a school type (to limit search size)
Changes Made
Made “Near Town” Default
• Made option obvious, people started to use
• Limited region size further, short lists produced
• Shorter lists less intimidating, more people found what they need
Conclusions
Data mining is a useful tool with multiple algorithms that can be tuned for specific tasks
Can benefit business, medicine, science
More efficient algorithms needed to speed up data mining process
Conclusions
Making Data mining easier to use
• Data with rich descriptions (more fields)
• More Data/Records
• Controlled/Reliable Data Collection (automated vs. manual)
• Way to evaluate results
• Integrate information gained back into system
Final Questions?
www.cs.unr.edu/~king