1
Data Mining Workbenches: an overview & comparison focusing on open-source packages
CS240B notes by C. Zaniolo
2
Comparing KDD/DM Toolsets
Many packages and very few in-depth comparisons
An evaluation by the USDA Forest Service comparing R, WEKA, Orange, and SAS®
Several user-satisfaction/popularity surveys:
  KDnuggets
  Rexer Analytics Survey (annual)
3
An Evaluation of CART Programs by the USDA Forest Service (USFS)
USFS uses classification and regression-tree (CART) technology to map USFS Forest Inventory and Analysis
(FIA) biomass, forest type, forest type groups, and National Forest vegetation.
The results of the study were reported by B. Ruefenacht, G. Liknes, A. J. Lister, H. Fisk, and D. Wendt, “Evaluation of Open Source Data Mining Software Packages”, Proc. Symposium on Forest Inventory and Analysis (FIA), October 2008, Park City, UT.
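As a reminder of what a classification tree does at each node, here is a minimal sketch of the Gini-impurity split search that CART-style trees use for classification. It is not taken from the USFS study or from any of the packages compared; the function names (gini, best_split) and the toy canopy-height data are hypothetical and exist only for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Find the best binary split 'value <= threshold' on one numeric feature.

    Returns (threshold, weighted_impurity). The threshold is None when no
    split improves on the impurity of the unsplit node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold can separate two equal values
        threshold = (low + high) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: canopy height in metres and a made-up forest-type label.
heights = [2.0, 3.5, 4.0, 12.0, 15.5, 18.0]
types = ["shrub", "shrub", "shrub", "conifer", "conifer", "conifer"]
threshold, impurity = best_split(heights, types)
print(threshold, impurity)
```

A full tree applies this search recursively to every feature and every resulting subset; the packages compared in these notes differ mainly in how efficiently and with which options they do so.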
4
R: (http://www.r-project.org)
Created at the University of Auckland, NZ, in 1993; released under the GNU General Public License (GPL) in 1995.
An extension of the S language (Bell Labs).
Twelve packages are supplied with the basic R distribution, each including many functions.
http://cran.r-project.org offers 1,364 additional packages extending the basic R functionality.
5
WEKA: www.cs.waikato.ac.nz/ml/weka/
Waikato Environment for Knowledge Analysis, by the University of Waikato, New Zealand, which supports the software with funds from the NZ government.
Started in 1993 and released in 1996. A GPL package.
WEKA is a collection of machine-learning algorithms implemented in Java, plus data preprocessing tools, visualization tools, and interface tools (R, SQL).
6
Orange: www.ailab.si/orange/
By the University of Ljubljana, Slovenia, in 2004, under GPL. Still evolving: frequent new releases.
Main routines and libraries are in C++, but Python is used to call the routines and access the libraries (www.ailab.si/orange/doc/ofb/).
Users can add their own machine-learning algorithms using both scripting and GUI environments.
Orange also has a GUI version called Orange Canvas, which allows for interactive machine-learning “visual programming”.
7
SAS® (Statistical Analysis System)
Developed by Jim Goodnight and North Carolina State University associates in the early 1970s. In 1976 the SAS Institute was founded to distribute and further develop the increasingly popular software.
SAS® currently has 10,658 employees and is the largest privately held software company, with annual revenue of $2.15 billion (in 2007).
SAS® is used in 109 countries and across different industries, with 44,000 customer sites worldwide.
SAS® is purchased by contacting a distributor directly: it can cost several thousand dollars depending on the options. The purchase includes the software, technical support, and licenses, which are renewed regularly, incurring more costs.
8
Evaluation Criteria
Cost
Usability:
  How easy is the interface to use and understand?
  Is a variety of models and options available?
  How easy to use is the software’s programming language?
  Does the software integrate easily with other programs?
Performance w.r.t. speed, stability, and accuracy
Critical mass: how widespread is the software?
Uniqueness of useful features & algorithms
Defensibility w.r.t. citations and academic repute
9
Usability
SAS®:
  The Enterprise Guide for SAS® has a user-friendly GUI system that allows for the building of graphical models.
  GUIs also exist for other SAS® modules, but unlike WEKA and Orange there is no universal GUI for SAS®.
  SAS® is primarily driven by its own programming language, so a new user will require some training.
R:
  Like SAS®, R is used by numerous industries and thus has a wide variety of models and options.
  R is driven by its own scripting language, which does require some training and/or experience.
  GUIs exist for specific functions only.
10
Usability (Cont.)
WEKA:
  WEKA has a comprehensive GUI with many models and options available.
  WEKA’s GUI is easy to use, but users need a good understanding of modeling techniques.
  Familiarity with Java is needed to extend WEKA and to integrate it with other software programs.
  WEKA can be extended and used within R.
Orange:
  Open-source data visualization and analysis for novices and experts. Data mining through visual programming or Python scripting.
  The Orange website (http://www.ailab.si/orange/) has good documentation on how to integrate Orange with Python.
  The number of models and options available in Orange lags behind not only SAS® and R but WEKA as well.
11
Performance notes
R is significantly faster than WEKA and Orange on classification trees.
Orange is the least stable, although new versions are released monthly.
WEKA is a stable program but does not work well with large datasets. The WEKA team recently introduced MOA to process massive data sets in a stream-like mode.
13
Most Popular Data Mining Software
The Rexer Analytics Survey (early 2007) asked about the tools used often and occasionally.
Clearly more popular than the rest were:
  SPSS or SPSS Clementine
  "Own Code"
  SAS or SAS Enterprise Miner
Followed by:
  R
  Weka
  C4.5 / C5.0
14
Critical Mass and Popularity
Top ten most used packages by the KDnuggets Survey (May 2007):
  SPSS / SPSS Clementine
  Salford Systems CART/MARS/TreeNet/RF
  Yale (now RapidMiner)
  SAS / SAS Enterprise Miner
  Angoss Knowledge Studio / Knowledge Seeker
  KXEN
  Weka
  R
  Microsoft SQL Server?
  MATLAB?
Note: Microsoft Excel omitted as it's not really "data mining" software, and I've merged the tools offered by a single vendor (SPSS and SAS)
15
Comments
Gregory Piatetsky-Shapiro, KDnuggets Editor: Votes from tool vendors were removed. Compared with the 2008 KDnuggets poll on data mining tools/software used, the big changes are growth in SPSS, RapidMiner, and R.
16
Popular Data Mining Software (cont.)
The Rexer Analytics Survey is taken every year, and the summary report can be obtained free.
2009 SURVEY HIGHLIGHTS:
  Open-source tools Weka and R made substantial movement up data miners’ tool rankings this year, and are now used by large numbers of both academic and for-profit data miners.
  SAS Enterprise Miner dropped in data miners’ tool rankings.
2010 SURVEY HIGHLIGHTS:
  R: After a steady rise across the past few years, R overtook other tools to become the tool used by more data miners (43%) than any other.
  STATISTICA has also been climbing in the rankings.
  STATISTICA, IBM SPSS Modeler, and R received the strongest satisfaction ratings in both 2010 and 2009.
18
Selected References
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. 2nd Edition, Morgan Kaufmann, 2005.
R. R. Bouckaert et al. WEKA Manual for Version 3.6.0, 2008.
J. Demsar, B. Zupan, and G. Leban. “Orange: From Experimental Machine Learning to Interactive Data Mining”, 2004. (http://www.ailab.si/orange)
R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, 2008.
19
About Weka
In comparison to R, WEKA is weaker in classical statistics but stronger in machine learning (data mining) algorithms.
WEKA has developed a set of extensions covering diverse areas, such as text mining, visualization, and bioinformatics.
WEKA 3.6 includes support for importing PMML (Predictive Model Markup Language) models. PMML is an XML-based standard for expressing statistical and data mining models.
WEKA can interface with many systems and formats: SQL, LibSVM, SVM-Light, ….
WEKA has 2 limitations:
  The Java implementation is somewhat slower than an equivalent in C/C++.
  Most of the algorithms require all the data to be stored in main memory, which restricts application to small or medium-sized datasets.
20
MOA: Massive Online Analysis
MOA supports bi-directional interaction with WEKA to scale up implementations of state-of-the-art algorithms to real-world dataset sizes in a streaming setting.
MOA: a software environment for testing algorithms and running experiments for online learning from evolving data streams.
A DSMS will then be required to deploy these algorithms on actual data streams; MOA is not a DSMS.
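To make the contrast with WEKA's in-memory limitation concrete, here is a minimal sketch of online learning from a stream: each instance is seen once, used first for testing and then for training, and then discarded, so memory use stays constant regardless of stream length. This is not MOA's API; the class OnlinePerceptron and the functions toy_stream and prequential_accuracy are hypothetical names, and the test-then-train ("prequential") scheme is assumed here as a common way to evaluate stream learners.

```python
import random


class OnlinePerceptron:
    """Linear classifier updated one instance at a time (constant memory)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        score = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1 if score > 0 else 0

    def learn(self, x, y):
        error = y - self.predict(x)  # -1, 0, or +1
        if error:
            step = self.learning_rate * error
            self.weights = [w + step * xi for w, xi in zip(self.weights, x)]
            self.bias += step


def toy_stream(n, seed=0):
    """Hypothetical synthetic stream: label is 1 when x0 + x1 > 0."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, (1 if x[0] + x[1] > 0 else 0)


def prequential_accuracy(model, stream):
    """Test-then-train: predict each instance before learning from it."""
    correct = total = 0
    for x, y in stream:
        correct += model.predict(x) == y
        model.learn(x, y)
        total += 1
    return correct / total if total else 0.0


accuracy = prequential_accuracy(OnlinePerceptron(2), toy_stream(5000))
print(f"prequential accuracy: {accuracy:.3f}")
```

Because the stream is a generator, no instance is retained after it has been processed; a batch learner would instead need all 5000 instances in memory before training starts.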
21
Downloads available under the GNU GPL license
Several data sets used:
  SEA Concepts Generator: artificial dataset with abrupt concept drift
  STAGGER Concepts Generator, by Schlimmer and Granger
  Rotating Hyperplane: used as testbed for CVFDT versus VFDT
  Random RBF Generator
  Waveform Generator
  Function Generator, introduced by Agrawal et al.
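To illustrate what "abrupt concept drift" means for a generator such as SEA, the sketch below produces a stream whose labeling rule changes suddenly between blocks. It is an approximation written for these notes, not MOA's implementation: the function name sea_stream is hypothetical, and the details (three attributes uniform in [0, 10], label 1 when the first two sum to at most a threshold, thresholds 8, 9, 7, 9.5, 10% label noise) follow the usual description of the SEA concepts and should be treated as assumptions.

```python
import random


def sea_stream(n_per_concept, thresholds=(8.0, 9.0, 7.0, 9.5), noise=0.1, seed=42):
    """Yield (x, y, theta) triples with an abrupt concept change per block.

    x has three attributes in [0, 10]; only the first two affect the label.
    y is 1 when x[0] + x[1] <= theta, flipped with probability `noise`.
    theta is included so the active concept can be inspected.
    """
    rng = random.Random(seed)
    for theta in thresholds:
        for _ in range(n_per_concept):
            x = [rng.uniform(0, 10) for _ in range(3)]
            y = 1 if x[0] + x[1] <= theta else 0
            if rng.random() < noise:
                y = 1 - y  # class noise
            yield x, y, theta


# Noise-free stream so that every label can be checked against its concept.
data = list(sea_stream(1000, noise=0.0))
print(len(data), data[999][2], data[1000][2])
```

The drift is abrupt because the threshold switches between two consecutive instances (here at positions 999 and 1000) rather than changing gradually, which is what makes the stream a useful test of how quickly a learner adapts.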
MOA currently supports classification and clustering methods.
The system is easily extensible and has a nice GUI.
Good documentation:
  Albert Bifet, G. Holmes, R. Kirkby & B. Pfahringer: DATA STREAM MINING: A Practical Approach. May 2011.
  Albert Bifet et al.: MOA: Massive Online Analysis, a Framework for Stream Classification and Clustering (2010)