WEKA Tutorial Summer Institute 2012
Transcript of WEKA Tutorial Summer Institute 2012
SAN DIEGO SUPERCOMPUTER CENTER
An Introduction to WEKA
As presented by PACE8/9/2012
SAN DIEGO SUPERCOMPUTER CENTER
Content
• What is WEKA?• The Explorer Application
• Preprocess• Classify• Cluster• Associate• Select Attributes• Visualize
• Weka on Trestles• References and Resources
2 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER
What is WEKA?• Weka is a bird found only in New Zealand. • Waikato Environment for Knowledge Analysis
• Weka is a data mining/machine learning tool developed by Department of Computer Science, University of Waikato, New Zealand.
• Weka is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from Java code.
• Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
• Weka is open source software in JAVA issued under the GNU General Public License
3 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER
Download and Install WEKA
• Website: http://www.cs.waikato.ac.nz/~ml/weka/index.html
• Support multiple platforms (written in java):• Windows, Mac OS X and Linux
• Datasets(iris.arff, weather.arff)• Available on Trestles at: /home/diag/opt/weka/data• Available with Download: …../weka/data/
4 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER
Main Features
• 49 data preprocessing tools• 76 classification/regression algorithms• 8 clustering algorithms• 3 algorithms for finding association rules• 15 attribute/subset evaluators + 10 search
algorithms for feature selection
5 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER
Main GUI• Three graphical user interfaces
• “The Explorer” (exploratory data analysis)• pre-process data• build “classifiers” • cluster data• find associations• attribute selection• data visualization
• “The Experimenter” (experimental environment)• used to compare performance of different learning
schemes
• “The KnowledgeFlow” (new process model inspired interface)
• Java-Beans-based interface for setting up and running machine learning experiments.
• Command line Interface (“Simple CLI”)
6 04/11/23More at: http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html
SAN DIEGO SUPERCOMPUTER CENTER
Content
• What is WEKA?• The Explorer:
• Preprocess• Classify• Cluster• Associate• Select Attributes• Visualize
• Weka on Trestles• References and Resources
7 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER8University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER
WEKA:: Explorer: Preprocess
• Data format• Uses flat text files to describe the data• Data can be imported from a file in various formats: ARFF, CSV, C4.5, binary• Data can also be read from a URL or from an SQL
database (using JDBC)
SAN DIEGO SUPERCOMPUTER CENTER
WEKA:: ARFF file format@relation heart-disease-simplified
@attribute age numeric@attribute sex { female, male}@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}@attribute cholesterol numeric@attribute exercise_induced_angina { no, yes}@attribute class { present, not_present}
@data63,male,typ_angina,233,no,not_present67,male,asympt,286,yes,present67,male,asympt,229,yes,present38,female,non_anginal,?,no,not_present...
A more thorough description is available here http://www.cs.waikato.ac.nz/~ml/weka/arff.html
SAN DIEGO SUPERCOMPUTER CENTER11
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER12
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER13
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER14
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER15
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER16
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER17
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER
WEKA:: Explorer: Preprocess
• Used to define filters to transform Data.
• WEKA contains filters for:• Discretization, normalization, resampling,
attribute selection, transforming, combining attributes, etc
SAN DIEGO SUPERCOMPUTER CENTER19
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER20
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER21
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER22
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER23
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER24
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER25
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER26
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER27
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER28
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER29
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER30
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER31
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER
WEKA:: Explorer: building “classifiers”
• Classifiers in WEKA are models for predicting nominal or numeric quantities
• Implemented learning schemes include:• Decision trees and lists, instance-based
classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes’ nets, …
• “Meta”-classifiers include:• Bagging, boosting, stacking, error-correcting
output codes, locally weighted learning, …
SAN DIEGO SUPERCOMPUTER CENTER
age income student credit_rating buys_computer<=30 high no fair no<=30 high no excellent no31…40 high no fair yes>40 medium no fair yes>40 low yes fair yes>40 low yes excellent no31…40 low yes excellent yes<=30 medium no fair no<=30 low yes fair yes>40 medium yes fair yes<=30 medium yes excellent yes31…40 medium no excellent yes31…40 high yes fair yes>40 medium no excellent no
33April 11, 2023
This follows an example of Quinlan’s ID3
Decision Tree Induction: Training Dataset
SAN DIEGO SUPERCOMPUTER CENTER34April 11, 2023
age?
overcast
student? credit rating?
<=30 >40
no yes yes
yes
31..40
no
fairexcellentyesno
Output: A Decision Tree for “buys_computer”
SAN DIEGO SUPERCOMPUTER CENTER36
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER37
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER38
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER39
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER40
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER41
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER42
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER43
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER44
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER45
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER46
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER47
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER48
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER49
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER50
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER51
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER52
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER53
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER54
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER55
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER56
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER
Explorer: Select Attributes
• Panel that can be used to investigate which (subsets of) attributes are the most predictive ones
• Attribute selection methods contain two parts:• A search method: best-first, forward selection, random,
exhaustive, genetic algorithm, ranking• An evaluation method: correlation-based, wrapper,
information gain, chi-squared, …
• Very flexible: WEKA allows (almost) arbitrary combinations of these two
57
04/11/23
SAN DIEGO SUPERCOMPUTER CENTER58
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER59
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER60
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER61
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER62
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER63
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER64
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER65
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER
Explorer: Visualize
• Visualization very useful in practice: e.g. helps to determine difficulty of the learning problem
• WEKA can visualize single attributes (1-d) and pairs of attributes (2-d)• To do: rotating 3-d visualizations (Xgobi-style)
• Color-coded class values• “Jitter” option to deal with nominal attributes
(and to detect “hidden” data points)• “Zoom-in” function
6604/11/23
SAN DIEGO SUPERCOMPUTER CENTER67
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER68
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER69
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER70
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER71
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER72
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER73
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER74
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER75
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER76
University of Waikato 04/11/23
SAN DIEGO SUPERCOMPUTER CENTER
Using Weka On Trestles
SAN DIEGO SUPERCOMPUTER CENTER
Using Weka on Trestles
• Shared Resources• Batch and Interactive• Use GUI and Command Line• Use GUI on login nodes to create command line• Use command line to run interactive or batch
jobs on production nodes
SAN DIEGO SUPERCOMPUTER CENTER
Weka Gui
• To launch Weka Gui on a:• Windows machine
• to run software on remote machine with GUI requires a secure shell with x forwarding enabled to establish a remote connection and an X Server to handle the local display.
– Suggested software putty and Xming» http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html» http://www.straightrunning.com/XmingNotes/
• Linux and MAC OS X support X Forwarding• Mac users need to run Applications > Utilities > Xterm• ssh –Y [email protected]
• Load weka module• Weka installation available at: /home/diag/opt/weka• At command prompt > weka
SAN DIEGO SUPERCOMPUTER CENTER
PBS Script
SAN DIEGO SUPERCOMPUTER CENTER
Output file
SAN DIEGO SUPERCOMPUTER CENTER
Hands On with Weka
SAN DIEGO SUPERCOMPUTER CENTER
The Weather Data Set.arff file
Weather.arff file • Available on Trestles at: /home/diag/opt/weka/data• On line: http://www.hakank.org/weka/• With Weka download
Data Set:
@relation PlayTennis
@attribute day numeric@attribute outlook {Sunny, Overcast, Rain} @attribute temperature {Hot, Mild, Cool} @attribute humidity {High, Normal}@attribute wind {Weak, Strong} @attribute playTennis {Yes, No}
@data 1,Sunny,Hot,High,Weak,No,? 2,Sunny,Hot,High,Strong,No,?3,Overcast,Hot,High,Weak,Yes,? 4,Rain,Mild,High,Weak,Yes,? 5,Rain,Cool,Normal,Weak,Yes,? 6,Rain,Cool,Normal,Strong,No,?7,Overcast,Cool,Normal,Strong,Yes,?8,Sunny,Mild,High,Weak,No,?.
SAN DIEGO SUPERCOMPUTER CENTER
The Problem
• Each instance describes the facts of the day and the action of the observed person (played or no play).
• The Data Set• 14 Instances• 6 attributes (day, outlook, temp, humidity, wind, play
tennis)
• Based on the given records we can assess which factors affected the person's decision about playing tennis.
SAN DIEGO SUPERCOMPUTER CENTER
The Question
• Use j48 decision tree learner to model for class attribute play tennis
• Make prediction for “play”.• Make predictions for the ‘temperature’ attribute. Do you
need to do any additional data preparation?
SAN DIEGO SUPERCOMPUTER CENTER
Result
SAN DIEGO SUPERCOMPUTER CENTER
References and Resources• References:
• WEKA website: http://www.cs.waikato.ac.nz/~ml/weka/index.html
• WEKA Tutorial:• Machine Learning with WEKA: A presentation demonstrating all graphical
user interfaces (GUI) in Weka. • A presentation which explains how to use Weka for exploratory data
mining. • WEKA Data Mining Book:
• Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)
• WEKA Wiki: http://weka.sourceforge.net/wiki/index.php/Main_Page