
Fuzzy Logic in KNIME – Modules for Approximate Reasoning –

Michael R. Berthold 1 , Bernd Wiswedel 2, and Thomas R. Gabriel 2

1 Department of Computer and Information Science, University of Konstanz, Universitätsstr. 10, 78484 Konstanz, Germany
E-mail: [email protected]

2 KNIME.com AG, Technoparkstrasse 1, 8005 Zurich, Switzerland
E-mail: Bernd.Wiswedel@KNIME.com, Thomas.Gabriel@KNIME.com

Abstract

In this paper we describe the open source data analytics platform KNIME, focusing particularly on extensions and modules supporting fuzzy sets and fuzzy learning algorithms such as fuzzy clustering algorithms, rule induction methods, and interactive clustering tools. In addition we outline a number of experimental extensions, which are not yet part of the open source release, and present two illustrative examples from real world applications to demonstrate the power of the KNIME extensions.

Keywords: KNIME, Fuzzy C-Means, Fuzzy Rules, Neighborgrams.

1. Introduction

KNIME is a modular, open [a] platform for data integration, processing, analysis, and exploration [2]. The visual representation of the analysis steps enables the entire knowledge discovery process to be intuitively modeled and documented in a user-friendly and comprehensive fashion.

KNIME is increasingly used by researchers in various areas of data mining and machine learning to give a larger audience access to their algorithms. Due to the modular nature of KNIME, it is straightforward to add other data types such as sequences, molecules, documents, or images. However, the KNIME desktop release already offers standard types for fuzzy intervals and numbers, enabling the implementation of fuzzy learning algorithms as well.

A previous paper [2] has already described KNIME’s architecture and internals. A follow-up publication focused on improvements in version 2.0 [3]. However, for readers not yet familiar with KNIME, we provide a short overview of KNIME’s key concepts in the following section before we describe the integration of fuzzy concepts and learning algorithms in the remainder of this paper. To the best of our knowledge none of the other popular open source data analysis or workflow environments [14, 9, 17] include fuzzy types and learning algorithms. Many specialized open source fuzzy toolboxes exist but most are either purely in academic use or cannot be used stand-alone. Commercial tools, such as Matlab, often also offer fuzzy extensions. In this paper we focus on a complete, integrative platform which is available as open source.

[a] KNIME is downloadable from http://www.knime.org

International Journal of Computational Intelligence Systems, Vol. 6, Supplement 1 (2013), 34-45
Co-published by Atlantis Press and Taylor & Francis. Copyright: the authors
Received 7 December 2012. Accepted 30 March 2013.

We will first describe KNIME itself and provide some details concerning the underlying workflow engine. Afterwards we discuss the fuzzy extensions, in particular the underlying fuzzy types, before discussing the integrated algorithms. Before showing two real world applications of those modules we briefly describe ongoing work.

2. KNIME

KNIME is used to build workflows. These workflows consist of nodes that process data; the data are transported via connections between the nodes. A workflow usually starts with nodes that read in data from some data sources, which are usually text files or databases that can be queried by special nodes. Imported data is stored in an internal table-based format consisting of columns with a certain data type (integer, string, image, molecule, etc.) and an arbitrary number of rows conforming to the column specifications. These data tables are sent along the connections to other nodes, which first pre-process the data, e.g. handle missing values, filter columns or rows, partition the table into training and test data, etc., and then for the most part build predictive models with machine learning algorithms like decision trees, Naive Bayes classifiers or support vector machines. For inspecting the results of an analysis workflow several view nodes are available, which display the data or the trained models in various ways. Fig. 1 shows a small workflow and its nodes.

Fig. 1. A small KNIME workflow, which builds and evaluates a fuzzy rule set on the Iris data.

In contrast to pipelining tools such as Taverna [b], KNIME nodes first process the entire input table before the results are forwarded to successor nodes. The advantages are that each node stores its results permanently and thus workflow execution can easily be stopped at any node and resumed later on. Intermediate results can be inspected at any time, and new nodes can be inserted and may use already created data without preceding nodes having to be re-executed. The data tables are stored together with the workflow structure and the nodes’ settings. The small disadvantage of this concept is that preliminary results are not available quite as soon as if real pipelining were used (i.e. sending along and processing single rows as soon as they are created).
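The effect of storing whole output tables per node can be illustrated with a minimal Python sketch. The `Node` class below is a hypothetical stand-in written for this illustration and is not the KNIME `Node` class; it only assumes that each node keeps its output table once computed.

```python
class Node:
    """Hypothetical stand-in for a workflow node (not the KNIME API)."""

    def __init__(self, name, func, predecessors=()):
        self.name = name
        self.func = func                  # maps input tables to one output table
        self.predecessors = list(predecessors)
        self.result = None                # stored output table (list of rows)
        self.runs = 0                     # how often func was actually executed

    def execute(self):
        # Reuse the stored table if this node has already been executed.
        if self.result is None:
            inputs = [p.execute() for p in self.predecessors]
            self.result = self.func(*inputs)   # processes the whole table at once
            self.runs += 1
        return self.result


reader = Node("reader", lambda: [{"x": 1}, {"x": None}, {"x": 3}])
cleaner = Node("missing values",
               lambda t: [r for r in t if r["x"] is not None], [reader])
cleaner.execute()

# A node inserted later reuses the stored table; upstream nodes do not run again.
doubler = Node("double", lambda t: [{"x": 2 * r["x"]} for r in t], [cleaner])
print(doubler.execute())          # [{'x': 2}, {'x': 6}]
print(reader.runs, cleaner.runs)  # 1 1
```

The price of this design is visible in `execute`: a successor only sees data once its predecessor has finished the complete table.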

One of KNIME’s key features is hiliting. In its simple form, it allows the user to select and visually mark ("hilite") several rows in a data table. These same rows are also hilited in all the views that show the same data (or at least the hilited rows). This type of hiliting is simply accomplished by using the 1:1 correspondence among the tables’ unique row keys. However, there are several nodes that completely change the input table’s structure and yet there is still some relation between input and output rows. A nice example is the MoSS node that searches for frequent fragments in a set of molecules. The node’s input are the molecules, the output the discovered frequent fragments. Each of the fragments occurs in several molecules. Hiliting some of the fragments in the output table causes all molecules that contain these fragments to be hilited in the input table. Fig. 2 demonstrates this situation in a small workflow where a confusion matrix is linked back to the original data.
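Both forms of hiliting can be sketched in a few lines of Python. The `HiliteHandler` class and the fragment-to-molecule mapping below are hypothetical and only illustrate the idea of propagating row keys; they do not reproduce KNIME's actual hiliting classes.

```python
class HiliteHandler:
    """Hypothetical sketch of a hilite registry keyed by row keys."""

    def __init__(self):
        self.hilited = set()
        self.listeners = []       # callbacks notified about newly hilited keys

    def hilite(self, keys):
        keys = set(keys)
        self.hilited |= keys
        for listener in self.listeners:
            listener(keys)


# Handler shared by all views on the input table (1:1 row key correspondence).
input_handler = HiliteHandler()

# Handler for the output table of a node that changes the table structure.
output_handler = HiliteHandler()

# Assumed mapping kept by such a node: fragment key -> keys of containing molecules.
fragment_to_molecules = {
    "frag1": {"mol1", "mol3"},
    "frag2": {"mol2", "mol3"},
}


def translate(fragment_keys):
    """Translate hilited output rows into the corresponding input rows."""
    molecules = set()
    for key in fragment_keys:
        molecules |= fragment_to_molecules.get(key, set())
    input_handler.hilite(molecules)


output_handler.listeners.append(translate)
output_handler.hilite(["frag1"])
print(sorted(input_handler.hilited))  # ['mol1', 'mol3']
```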

One of the important design decisions was to ensure easy extensibility, so that other users can add functionality, usually in the form of new nodes (and sometimes also data types). This has already been done by several commercial vendors (Tripos, Schrödinger, Chemical Computing Group, ...) but also by other university groups and open source programmers. The usage of Eclipse as the core platform means that contributing nodes in the form of plugins is a very simple procedure. The official KNIME website offers several extension plugins, for example for reporting via BIRT [c], statistical analysis with R [d], or extended machine learning capabilities from Weka [e].

Figure 2: KNIME’s hiliting features demonstrated by the linkage between the confusion matrix and the evaluation data.

[b] http://www.taverna.org.uk

Since the initial release in mid 2006 the growing user base has voiced a number of suggestions and requests for improving KNIME’s usability and functionality. From the beginning KNIME has supported open standards for exchanging data and models. Early on, support for the Predictive Model Markup Language (PMML) [15] was added and most of the KNIME mining modules natively support PMML, including association analysis, clustering, regressions, neural network, and tree models. With the latest KNIME release, PMML support was enhanced to cover PMML 4.1. See [18] for more details.

Before discussing how fuzzy types and learning methods can be integrated into KNIME, let us first discuss the KNIME architecture in more detail.

3. KNIME Architecture

The KNIME architecture was designed with three main principles in mind.

• Visual, interactive framework: Data flows should be combined by a simple drag and drop operation from a variety of processing units. Customized applications can be modeled through individual data pipelines.

• Modularity: Processing units and data containers should not depend on each other in order to enable easy distribution of computation and allow for independent development of different algorithms. Data types are encapsulated, that is, no types are predefined, and new types can easily be added, bringing along type-specific renderers and comparators. New types can be declared compatible with existing types.

• Easy expandability: It should be easy to add new processing nodes or views and distribute them through a simple plugin mechanism without the need for complicated install/deinstall procedures.

In order to achieve this, a data analysis process consists of a pipeline of nodes, connected by edges that transport either data or models. Each node processes the arriving data and/or model(s) and produces results on its outputs when requested. Fig. 3 schematically illustrates this process.

Fig. 3. A schematic for the flow of data and models in a KNIME workflow.

[c] http://www.actuate.com
[d] http://www.r-project.org
[e] http://www.cs.waikato.ac.nz/ml/weka/


The type of processing ranges from basic data operations such as filtering or merging, through simple statistical functions such as the computation of mean, standard deviation or linear regression coefficients, to computation-intensive data modeling operators (clustering, decision trees, neural networks, to name just a few). In addition, most of the modeling nodes allow for an interactive exploration of their results through accompanying views. In the following we will briefly describe the underlying schemata of data, node, and workflow management and how the interactive views communicate.

3.1. Data Structures

All data flowing between nodes is wrapped within a class called DataTable, which holds meta-information concerning the type of its columns in addition to the actual data. The data can be accessed by iterating over instances of DataRow. Each row contains a unique identifier (or primary key) and a specific number of DataCell objects, which hold the actual data. The reason to avoid access by row ID or index is scalability, that is, the desire to be able to process large amounts of data and therefore not be forced to keep all of the rows in memory for fast random access. KNIME employs a powerful caching strategy, which moves parts of a data table to the hard drive if it becomes too large. Fig. 4 shows a UML diagram of the main underlying data structure.
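The consequence of iterator-only access can be sketched in Python. The `Table` and `Row` classes below are hypothetical analogues of the structure described above and not the KNIME classes; the point is only that a consumer such as `column_mean` never needs more than one row in memory.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Row:
    key: str                      # unique row identifier
    cells: Tuple[float, ...]      # the actual data of this row


class Table:
    """Hypothetical table offering column meta information and sequential access only."""

    def __init__(self, column_types: List[type],
                 row_source: Callable[[], Iterable[Row]]):
        self.column_types = column_types   # meta information about the columns
        self._row_source = row_source      # returns a fresh row iterable on each call

    def __iter__(self) -> Iterator[Row]:
        return iter(self._row_source())


def generate_rows():
    # Rows are produced lazily; they could equally be read back from disk.
    for i in range(5):
        yield Row(f"Row{i}", (float(i), float(i * i)))


def column_mean(table: Table, index: int) -> float:
    """Single pass over the table, holding one row at a time."""
    total, count = 0.0, 0
    for row in table:
        total += row.cells[index]
        count += 1
    return total / count


table = Table([float, float], generate_rows)
print(column_mean(table, 1))  # 6.0
```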

3.2. Nodes

Nodes in KNIME are the most general processing units and usually resemble one node in the visual workflow representation. The class Node wraps all functionality and makes use of user-defined implementations of a NodeModel, possibly a NodeDialog, and one or more NodeView instances if appropriate. Neither a dialog nor a view has to be implemented if no user settings or views are needed. This schema follows the well-known Model-View-Controller design pattern.

Fig. 4. A UML diagram of the data structure and the main classes it relies on.

In addition, each node has a number of Inport and Outport instances for the input and output connections, which can either transport data or models. Fig. 5 shows a UML diagram of this structure.

Fig. 5. A UML diagram of the Node and the main classes it relies on.

3.3. Workflow Management

A workflow in KNIME is essentially a graph connecting nodes or, more formally, a directed acyclic graph (DAG). The WorkflowManager allows new nodes to be inserted and directed edges (connections) between two nodes to be added. It also keeps track of the status of nodes (configured, executed, ...) and returns, on demand, a pool of executable nodes. This way the surrounding framework can freely distribute the workload among a couple of parallel threads or, optionally, even a distributed cluster of servers. Thanks to the underlying graph structure, the workflow manager is able to determine all nodes required to be executed along the paths leading to the node the user actually wants to execute.
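How the graph structure yields the set of nodes to execute can be sketched as a depth-first traversal over predecessors. The function and the example workflow below are hypothetical and not part of the WorkflowManager; the sketch also assumes that everything upstream of an already executed node has been executed as well.

```python
def nodes_to_execute(predecessors, executed, target):
    """Return the not-yet-executed nodes needed for `target`, in execution order."""
    order, seen = [], set()

    def visit(node):
        if node in seen or node in executed:
            return
        seen.add(node)
        for pred in predecessors.get(node, []):
            visit(pred)
        order.append(node)        # post-order: all predecessors come first

    visit(target)
    return order


# Hypothetical workflow: node name -> list of direct predecessors.
predecessors = {
    "reader": [],
    "partition": ["reader"],
    "learner": ["partition"],
    "predictor": ["learner", "partition"],
    "scorer": ["predictor"],
    "plot": ["reader"],
}

print(nodes_to_execute(predecessors, {"reader"}, "scorer"))
# ['partition', 'learner', 'predictor', 'scorer']
```

Nodes off the path to the target ("plot" in this example) are not scheduled, and nodes without a mutual dependency could be handed to different threads.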

3.4. Views and Interactive Brushing

Each Node can have an arbitrary number of views associated with it. Through receiving events from a HiLiteHandler (and sending events to it) it is possible to mark selected points in such a view to enable the visual brushing described earlier. Views can range from simple table views to more complex views on the underlying data (e.g. scatterplots, parallel coordinates) or the generated model (e.g. decision trees, rules).

3.5. (Fuzzy) Types in KNIME

KNIME features a modular and extensible type concept. As described earlier, tables in KNIME contain meta information about the types contained in each column. Fig. 6 shows this setup in more detail.

Fig. 6. A schematic showing how data tables can be accessed in KNIME.

This meta information essentially enumerates all possible types (subclasses of DataValue) that all cells in that column implement. Particular cell implementations (extending DataCell) can implement one or more of these values. IntCell, for instance, implements both IntValue and DoubleValue, as an integer can be represented as a double without losing any information. The reverse is obviously not true, so DoubleCell only implements DoubleValue. Fig. 7 shows this setup in more detail.

Fig. 7. A schematic showing how data types are organized in KNIME.

Inspecting the KNIME source code reveals, however, that DoubleCell does implement additional extensions of DataValue, namely FuzzyNumberValue and FuzzyIntervalValue (there are also a few other interfaces implemented, such as ComplexNumberValue, which we will not focus on here). Any double can obviously also represent a singleton fuzzy number [16] or an extreme fuzzy interval with singleton core and support (or a complex number with 0i), so these extensions allow normal doubles to be treated as fuzzy numbers or intervals as well. Of more interest, however, are the real implementations FuzzyIntervalCell and FuzzyNumberCell, which represent trapezoidal and triangular membership functions, respectively, over a real-valued domain [20].
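The relation between trapezoidal intervals, triangular numbers and plain doubles can be made concrete with a small Python sketch. The class and helper names are hypothetical and do not mirror the KNIME cell classes; a trapezoid is assumed to be given by its support [a, d] and core [b, c].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FuzzyInterval:
    """Trapezoidal membership function with support [a, d] and core [b, c]."""
    a: float
    b: float
    c: float
    d: float

    def membership(self, x: float) -> float:
        if self.b <= x <= self.c:
            return 1.0                                   # inside the core
        if x <= self.a or x >= self.d:
            return 0.0                                   # outside the support
        if x < self.b:
            return (x - self.a) / (self.b - self.a)      # rising flank
        return (self.d - x) / (self.d - self.c)          # falling flank


def fuzzy_number(a: float, peak: float, d: float) -> FuzzyInterval:
    """A triangular fuzzy number is a trapezoid whose core is a single point."""
    return FuzzyInterval(a, peak, peak, d)


def singleton(value: float) -> FuzzyInterval:
    """A plain double seen as a fuzzy interval with singleton core and support."""
    return FuzzyInterval(value, value, value, value)


print(FuzzyInterval(1.0, 2.0, 3.0, 5.0).membership(4.0))  # 0.5
print(fuzzy_number(0.0, 1.0, 2.0).membership(0.5))        # 0.5
print(singleton(3.0).membership(3.0))                     # 1.0
```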

Fig. 8 shows how this is represented in KNIME for a small fuzzy rule set learned on the Iris data [11]. The meta information about the table on the right is displayed at the top. When a table contains fuzzy intervals/numbers the headers of these columns represent the most common super-type (FuzzyIntervalCell in this case) and also some additional properties. An upper and lower bound can be given for some types (as is the case for the first four columns), while the nominal values are listed for others (as can be seen in the fifth column).


Fig. 8. An example of fuzzy intervals in KNIME. The underlying meta data is at the top, while the data as it is shown in a table can be seen underneath.

In the following we describe how a number of prominent fuzzy learning methods can be easily embedded into this general framework.

4. Fuzzy C-Means

The well-known fuzzy c-means algorithm [7] is contained in one learning module/node of KNIME. The configuration dialog of the node is shown in Fig. 9. It exposes the usual parameters of the standard implementation in addition to the setup of a noise cluster [10].

Fig. 9. The dialog of the fuzzy c-means clustering node, displaying the available options.

Fig. 10. The output of the fuzzy c-means clustering node, here using the bar renderer to display the degrees of membership.

Note the small button next to the "Number of clusters" field. This indicates that this setting can be easily controlled by a workflow variable [f]. It enables workflows to be set up that loop over different numbers of clusters, running e.g. a cross-validation and collecting the results over all cluster settings.

Fig. 10 shows the output of the clustering node for the well-known Iris data set, where the degree of membership is displayed for each pattern. Various rendering options are available for each column; here a bar chart was chosen.
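For readers who want to see where the degrees of membership come from, the following is a minimal pure-Python sketch of the textbook fuzzy c-means iteration. It is not the KNIME node's implementation: it omits the noise cluster, uses a deterministic start from evenly spaced data points (an assumption made here for reproducibility), and assumes at least as many points as clusters.

```python
import math


def fuzzy_c_means(points, n_clusters, fuzzifier=2.0, iterations=50):
    """Return (prototypes, memberships); memberships[k][i] is the degree of point k in cluster i."""
    step = max(1, len(points) // n_clusters)
    centers = [list(points[i * step]) for i in range(n_clusters)]
    exponent = 2.0 / (fuzzifier - 1.0)
    memberships = []
    for _ in range(iterations):
        # Membership update: inverse relative distances, each row sums to 1.
        memberships = []
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            if 0.0 in dists:
                # Point coincides with one or more prototypes: share membership among them.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                row = [h / sum(hits) for h in hits]
            else:
                row = [1.0 / sum((d_i / d_j) ** exponent for d_j in dists)
                       for d_i in dists]
            memberships.append(row)
        # Prototype update: mean of all points weighted by membership^fuzzifier.
        for i in range(n_clusters):
            weights = [row[i] ** fuzzifier for row in memberships]
            total = sum(weights)
            centers[i] = [sum(w * p[k] for w, p in zip(weights, points)) / total
                          for k in range(len(points[0]))]
    return centers, memberships


points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.0), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, memberships = fuzzy_c_means(points, n_clusters=2)
for point, row in zip(points, memberships):
    print(point, [round(u, 3) for u in row])
```

Each printed row corresponds to one line of the output table in Fig. 10: one degree of membership per cluster, summing to one.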

5. Fuzzy Rule Induction

For fuzzy rule induction the FRL algorithm [1, 12] was used as a basis. The algorithm constructs fuzzy classification rules and can use nominal as well as numerical attributes. For the latter, it automatically extracts fuzzy intervals for selected attributes. One of the convenient features of this algorithm is that it only uses a subset of the available attributes for each rule, resulting in so-called free fuzzy rules. The KNIME implementation follows the published algorithm closely, allowing various algorithmic options to be set as well as different fuzzy norms. After execution, the output is a model description in a KNIME internal format and a table holding the rules as fuzzy interval constraints on each attribute plus some additional statistics (number of covered patterns, spread, volume etc.). These KNIME representations can be used to further process the rule set but also for display purposes.

[f] Actually all parameters of a node can be controlled by workflow variables but this button makes it easier for typical variables, which are often controlled from the outside.

Fig. 11 shows an MDS projection of the 4-dimensional rules onto two dimensions. The color indicates the class of each rule, the size the number of covered patterns. More details can be found in [5].

6. Visual Fuzzy Clustering

Another interesting aspect of KNIME is its visualization capabilities. As mentioned above, views in KNIME support cross-view selection mechanisms (called hiliting) but views can also be more interactive. One such example is the set of nodes for visual fuzzy clustering. The nodes can actually perform such clustering in multiple descriptor spaces (parallel universes) in parallel [6].

Fig. 11. The fuzzy rules induced from the Iris data projected onto a two-dimensional space. Size represents coverage, color the class of the rule.

For the purpose of this paper, however, a side aspect of this work is more interesting. The methods described in [6] allow fuzzy clusters to be identified and revised interactively in these parallel universes. Fig. 12 shows a screenshot of the interactive view of this KNIME node, again for the Iris data. Each row shows a so-called Neighborgram for the data points of interest (usually a user-specified class). A single neighborgram represents an object’s neighborhood, which is defined by a similarity measure. It contains a fixed number of nearest neighbors to the centroid object, whereby good, i.e. discriminative, neighborgrams will have objects of the centroid’s class in the close vicinity.

Fig. 12. The view of the visual fuzzy clustering node. Clusters are presented and fine-tuned iteratively by the user.

KNIME also contains nodes for automatic clustering using these data structures. The neighborgrams are then constructed for all objects of interest, e.g. those belonging to the minority class, in all available universes. The learning algorithm derives cluster candidates from each neighborgram and ranks these based on their qualities (e.g. coverage or another quality measure). The model construction is carried out in a sequential covering-like manner, i.e. starting with all neighborgrams and their cluster candidates, taking the numerically best one, adding it as a cluster and proceeding with the remaining neighborgrams while ignoring the already covered objects. This simple procedure already produces a set of clusters, which potentially originate from diverse universes. Extensions to this algorithm reward clusters that group in different universes simultaneously and thus respect overlaps. Another interesting usage scenario of the neighborgram data structure is the possibility to display them and thus involve the user in the learning process. Especially the ability to visually compare the different neighborhoods of the same object has proven to be useful in molecular applications and for the categorization of 3D objects.
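The sequential covering step can be sketched as a greedy loop. The sketch below is a simplification made for illustration, not the published algorithm: each cluster candidate is reduced to the set of same-class objects it covers, coverage is the only quality measure, and `min_coverage` is a hypothetical stopping threshold.

```python
def sequential_covering(candidates, min_coverage=2):
    """Greedily pick cluster candidates by the number of still uncovered objects.

    candidates: dict mapping a centroid id to the set of object ids its
    cluster candidate covers (possibly coming from different universes).
    """
    uncovered = set().union(*candidates.values())
    remaining = dict(candidates)
    clusters = []
    while remaining:
        # Numerically best candidate; ties are broken by sorted centroid id.
        best = max(sorted(remaining), key=lambda c: len(remaining[c] & uncovered))
        members = remaining.pop(best) & uncovered
        if len(members) < min_coverage:
            break                      # no candidate covers enough new objects
        clusters.append((best, members))
        uncovered -= members           # ignore already covered objects from now on
    return clusters


candidates = {
    "a": {1, 2, 3, 4},
    "b": {3, 4, 5},
    "c": {5, 6, 7},
    "d": {7},
}
for centroid, members in sequential_covering(candidates):
    print(centroid, sorted(members))
# a [1, 2, 3, 4]
# c [5, 6, 7]
```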

7. Ongoing Work

Current development also includes a number of prototypes for other fuzzy-based analysis and/or visualization methods. It is worth mentioning two more visualization-based methods.

Performing multi-dimensional scaling (or most other projection methods from a higher dimensional space onto two dimensions) usually loses information pertaining to the uncertainty of the underlying fuzzy points / fuzzy sets. This can be seen in Fig. 11 above. It is not possible to infer from the picture whether the fuzzy sets overlap or how close their core/support regions are. An approach presented in [13] addresses this limitation by also showing estimates for the spread towards neighboring points in the projection. Fig. 13 shows an example for this type of visualization.

Fig. 13. A prototypical view on projected fuzzy points also displaying estimates for overlap/vicinity of neighboring points.

Another way of visualizing points in medium to high dimensional spaces is parallel coordinates. In [4] an extension of this type of visualization was presented, which also shows fuzzy points or rules. Fig. 14 shows two of the rules learned for the Iris data set.

Fig. 14. A visualization of fuzzy rules in parallel coordinates.

8. Other Extensions

In addition to native, built-in nodes, KNIME also allows existing tools to be wrapped easily. An external tool node allows command line tools to be launched, whereas integrations for Matlab, R, and other data analysis or visualization tools allow existing fuzzy learning methods such as ANFIS to be integrated as well.

However, a number of existing wrappers around libraries such as LibSVM or Christian Borgelt’s Association Rule and Itemset Mining library demonstrate that it is also feasible to integrate existing tool sets more tightly.

9. Applications

The fuzzy extensions for KNIME discussed here are not only of academic interest but enable users to use these tools easily in practice. In the following we will show two examples. The first one demonstrates the usefulness of the visual fuzzy clustering approach for the exploration of high throughput screening data and the second one focuses on a more complex molecular space modeling task around fuzzy c-means and how the resulting fuzzy partitioning of the space can be visually explored using the KNIME network processing modules.


9.1. Screening Data Analysis

An interesting example for the use of the visual fuzzy clustering methods presented above was reported in [19]. The Neighborgram-based clustering method was applied to a well-known data set from the National Cancer Institute, the DTP AIDS Antiviral Screen data set. The screen utilized a biological assay to measure protection of human CEM cells from HIV-1 infection. All compounds in the data set were tested for their protection of the CEM cell; those that did not provide at least 50% protection were labeled as confirmed inactive (CI). All others were tested in a second screening. Compounds that provided protection in this screening, too, were labeled as confirmed active (CA), the remaining ones as moderately active (CM). Those screening results and chemical structural data on compounds that are not protected by a confidentiality agreement can be accessed online [g]. 41,316 compounds are available, of which we have used 36,045. A total of 325 belong to class CA, 877 are of class CM and the remaining 34,843 are of class CI. Note that the class distribution for this data set is very unbalanced. There are about 100 times as many inactive compounds (CI) as there are active ones (CA), which is very common for this type of screening data analysis: although it is a relatively large data set, it has an unbalanced class distribution with the main focus on a minority class, the active compounds. The focus of analysis is on identifying internal structures in the set of active compounds that appeared to protect CEM cells from the HIV-1 infection.

In order to generate Neighborgrams for this dataset, a distance measure needs to be defined. We initially computed Fingerprint descriptors [8], which represent each compound through a 990-dimensional bit string. Each bit represents a (hashed) specific chemical substructure of interest. The distance metric used was a Tanimoto distance, which computes the number of bits that are different between two vectors normalized over the number of bits that are turned on in the union of the two vectors. This type of distance function is often used in cases like this, where the bit vectors used are only sparsely occupied with 1s.

Experiments with this (and other similar) data sets demonstrate well how interactive clustering in combination with Neighborgrams helps to inject domain knowledge in the clustering process and how Neighborgrams help to inspect promising cluster candidates quickly and visually. Fig. 15 shows one example of a cluster discovered during the exploration, grouping together parts of the chemical family of Azido Pyrimidines, probably one of the best-known classes of active compounds for HIV.
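The Tanimoto distance described above can be written down in a few lines. In this sketch a fingerprint is represented as the set of indices of its 1-bits, which is a representation chosen here because the vectors are sparse; the example fingerprints are made up.

```python
def tanimoto_distance(bits_a, bits_b):
    """Number of differing bits divided by the number of bits set in the union.

    bits_a, bits_b: sets holding the indices of the bits that are 1.
    """
    union = bits_a | bits_b
    if not union:
        return 0.0                    # two all-zero vectors are treated as identical
    differing = bits_a ^ bits_b       # bits set in exactly one of the two vectors
    return len(differing) / len(union)


# Two made-up sparse fingerprints over a 990-bit space.
a = {3, 17, 256, 801}
b = {3, 17, 512}
print(tanimoto_distance(a, b))  # 0.6
```

Bits that are 0 in both vectors do not enter the result, which is why this measure behaves well for sparsely occupied bit strings.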

This application demonstrates how important fuzzy clustering techniques are for real world applications. Attempting to partition this type of screening data into crisp clusters would be futile because the underlying data is fairly noisy. Instead, suggesting clusters to the user, letting them fine-tune the (fuzzy) boundaries, and then continuing the interactive clustering procedure allows the user to inject background knowledge into the clustering process on the fly.

Fig. 15. A fuzzy cluster of the NIH-Aids data centered around compound #646436. This cluster nicely covers part of one of the most well-known classes of active compounds: Azido Pyrimidines.

[g] http://dtp.nci.nih.gov/docs/aids/aidsdata.html

9.2. Molecular Space Modeling

Trying to get a first impression of a large molecular database is often a challenge because the contained compounds often do not belong to one group alone but share properties with more than one chemical group, which naturally lends itself to a modeling of the space using fuzzy techniques. The workflow depicted in Fig. 17 generates such an overview using a number of more complex sub workflows:

• on the left, the information is read from two files, one containing the structures, the other one containing additional information about each compound;

• the next metanode contains a sub workflow creating additional chemical descriptors;

• the internals of the next metanode are displayed in Fig. 18; it determines optimal settings for the parameters of the fuzzy C-means algorithm: the number of clusters and the fuzzifier;

• the fuzzy C-means node then uses these settings for the final clustering of the entire dataset;

• the last three metanodes create an overview of the clusters, sample the data down so that structures can be displayed meaningfully later, and create the actual network, which is then displayed by the final node.

The subworkflow shown in Fig. 18 is particularly interesting here because it illustrates the use of the KNIME looping concept. The loop start node on the left takes its input from a table with several settings for the number of clusters and the fuzzifier of the fuzzy C-means, which is applied to one portion of the overall data. The cluster assigner is then applied to the other partition of the data, and the result is evaluated for its quality (essentially measuring clustering quality indices based on within and across cluster distances). The loop end node collects the summary information of each run, and the sorter then picks the best iteration and returns its settings as variables to be fed into the fuzzy C-means in the overall workflow. Similar setups can also be used for model selection across different model classes; the KNIME example server holds a couple of examples for this as well.
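The loop described above can be approximated outside KNIME by a short script. The following is only a minimal sketch under stated assumptions, not the implementation behind the KNIME nodes: it uses a plain textbook fuzzy c-means, and it uses the Xie-Beni index (generalized to an arbitrary fuzzifier) as one possible quality measure based on within and across cluster distances, since the text does not name the index actually used. All function names and the synthetic data are hypothetical.

```python
import math
import random


def memberships(point, centers, m):
    """Fuzzy c-means membership degrees of one point to every center."""
    dists = [math.dist(point, c) for c in centers]
    hits = [1.0 if d == 0.0 else 0.0 for d in dists]
    if any(hits):  # the point coincides with one or more centers
        return [h / sum(hits) for h in hits]
    inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
    total = sum(inv)
    return [v / total for v in inv]


def fuzzy_c_means(data, c, m, iters=50, seed=0):
    """Textbook fuzzy c-means with a fixed iteration count; returns the centers."""
    centers = random.Random(seed).sample(data, c)
    dim = len(data[0])
    for _ in range(iters):
        u = [memberships(x, centers, m) for x in data]
        new_centers = []
        for k in range(c):
            w = [row[k] ** m for row in u]  # weights u_ik^m
            total = sum(w)
            new_centers.append(tuple(
                sum(wi * x[j] for wi, x in zip(w, data)) / total
                for j in range(dim)))
        centers = new_centers
    return centers


def xie_beni(data, centers, m):
    """Fuzzy within-cluster scatter over minimal center separation (lower is better)."""
    scatter = 0.0
    for x in data:
        u = memberships(x, centers, m)
        scatter += sum(u[k] ** m * math.dist(x, v) ** 2
                       for k, v in enumerate(centers))
    separation = min(math.dist(a, b) ** 2
                     for i, a in enumerate(centers) for b in centers[i + 1:])
    if separation == 0.0:  # coincident centers: treat as the worst possible result
        return math.inf
    return scatter / (len(data) * separation)


def select_parameters(train, validation, settings):
    """Try every (clusters, fuzzifier) pair and return the best (score, c, m)."""
    results = []
    for c, m in settings:                          # loop start: one setting per run
        centers = fuzzy_c_means(train, c, m)       # cluster one partition
        score = xie_beni(validation, centers, m)   # assign and score the other partition
        results.append((score, c, m))              # loop end: collect the summaries
    results.sort()                                 # sorter: best run first
    return results[0]


if __name__ == "__main__":
    rng = random.Random(42)
    blobs = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]  # synthetic stand-in for descriptors
    data = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
            for cx, cy in blobs for _ in range(40)]
    rng.shuffle(data)
    train, validation = data[:60], data[60:]
    settings = [(c, m) for c in (2, 3, 4, 5) for m in (1.5, 2.0, 3.0)]
    score, c, m = select_parameters(train, validation, settings)
    print(f"best setting: clusters={c}, fuzzifier={m}, index={score:.4f}")
```

The returned pair plays the role of the variables handed to the final fuzzy C-means node. Scoring on a held-out partition rather than on the training partition mirrors the split between the clustering node and the cluster assigner in the subworkflow.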

The metanode on the bottom (“Create Cluster Overview”) extracts the most common substructure of all molecules belonging to a cluster; the resulting table is shown in Fig. 16. For a chemist, those representations quickly reveal the main composition of the underlying database. However, such crisp reductions do not reveal any further insights.

Fig. 16. The most common substructures of the five clusters.

We do not go into much detail about the internals of the subsequent metanodes, as they mainly focus on building a network with cluster centers and molecules as nodes and on introducing edges between them, weighted by the corresponding degree of membership. One resulting view is shown in Fig. 19. One can quickly see the main constituents of the clusters, illustrated by a couple of representative molecular structures with a high degree of membership only to that one cluster (we filtered out the majority of the compounds strongly belonging to one class for the sake of readability). Compounds that are more ambiguous are positioned in between two, or in some cases three, clusters. These are also molecular structures that can be assigned less clearly to a specific chemical group. As in many of the other examples, the true power of such visualizations lies in their interactivity. The KNIME network visualization extension also allows users to highlight points and to zoom in to focus on details of the network.
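The network construction described above can be sketched in a few lines. This is an illustrative approximation under stated assumptions, not the content of the KNIME metanodes: the thresholds `strong`, `per_cluster`, and `min_weight`, the function names, and the molecule identifiers are all hypothetical, and the membership degrees are made up for the example.

```python
from collections import Counter


def thin_out(memberships, strong=0.9, per_cluster=2):
    """Keep all ambiguous molecules but only a few strong representatives per cluster."""
    strong_by_cluster = {}
    for mol, degrees in memberships.items():
        top = max(degrees)
        if top >= strong:
            strong_by_cluster.setdefault(degrees.index(top), []).append((top, mol))
    keep = set()
    for members in strong_by_cluster.values():
        # Highest membership first, then cut to the requested number.
        keep.update(mol for _, mol in sorted(members, reverse=True)[:per_cluster])
    return {mol: degrees for mol, degrees in memberships.items()
            if max(degrees) < strong or mol in keep}


def build_network(memberships, min_weight=0.1):
    """Graph with cluster and molecule nodes and membership-weighted edges."""
    if not memberships:
        return [], []
    n_clusters = len(next(iter(memberships.values())))
    nodes = [f"cluster_{k}" for k in range(n_clusters)] + list(memberships)
    edges = [(mol, f"cluster_{k}", w)
             for mol, degrees in memberships.items()
             for k, w in enumerate(degrees) if w >= min_weight]
    return nodes, edges


def ambiguous(edges):
    """Molecules attached to more than one cluster node."""
    degree = Counter(mol for mol, _, _ in edges)
    return sorted(mol for mol, n in degree.items() if n > 1)


if __name__ == "__main__":
    # Hypothetical membership degrees of six molecules to three clusters.
    example = {
        "m1": [0.95, 0.03, 0.02],
        "m2": [0.92, 0.05, 0.03],
        "m3": [0.97, 0.02, 0.01],
        "m4": [0.05, 0.90, 0.05],
        "m5": [0.48, 0.47, 0.05],
        "m6": [0.35, 0.33, 0.32],
    }
    nodes, edges = build_network(thin_out(example))
    print(len(nodes), "nodes,", len(edges), "edges")
    print("between clusters:", ambiguous(edges))
```

In this toy input, `m5` and `m6` end up connected to two and three cluster nodes respectively, which corresponds to the compounds drawn between clusters in the network view. A layout algorithm that pulls nodes together along heavier edges would then place them accordingly.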

Figure 17: A workflow for the modeling and visualization of a molecular space.

10. Conclusions

We have described fuzzy extensions in KNIME and illustrated how classic fuzzy learning methods are easily integrated. We also illustrated how some of the techniques described here can be used in real world applications such as the visual clustering of high throughput screening data and the modeling of molecular spaces.

KNIME offers a solid basis for more fuzzy learning and visualization methods, and we look forward to collaborating with the fuzzy community to further extend this area of tool coverage in KNIME.

Acknowledgments

We thank the other members of the KNIME Team and the very active KNIME Community!

Figure 18: The subworkflow iterating over several settings of the fuzzy c-means algorithm to identify the optimal number of clusters and the value of the fuzzifier.

Figure 19: The final network as displayed by KNIME. One can nicely see how a couple of molecular structures fall clearly within one cluster. Others, however, belong to more than one cluster and therefore have substantial connections to more than one cluster node.

References

1. M.R. Berthold. “Mixed Fuzzy Rule Formation,” In International Journal of Approximate Reasoning (IJAR), 32, 67–84, Elsevier, 2003.

2. M.R. Berthold, N. Cebron, F. Dill, T.R. Gabriel, T. Kotter, T. Meinl, P. Ohl, C. Sieb, K. Thiel, and B. Wiswedel. “KNIME: The Konstanz Information Miner,” In Studies in Classification, Data Analysis, and Knowledge Organization (GfKL 2007). Springer, 2007.

3. M.R. Berthold, N. Cebron, F. Dill, T.R. Gabriel, T. Kotter, T. Meinl, P. Ohl, K. Thiel, and B. Wiswedel. “KNIME: The Konstanz Information Miner. Version 2.0 and Beyond,” In SIGKDD Explorations. ACM Press, 11(1), 2009.

4. M.R. Berthold and L.O. Hall. “Visualizing Fuzzy Points in Parallel Coordinates,” In IEEE Transactions on Fuzzy Systems, 11(3), 369–374, 2003.

5. M.R. Berthold and R. Holve. “Visualizing High Dimensional Fuzzy Rules,” In Proceedings of NAFIPS, 64–68, IEEE Press, 2000.

6. M.R. Berthold, B. Wiswedel, and D.E. Patterson. “Interactive Exploration of Fuzzy Clusters Using Neighborgrams,” In Fuzzy Sets and Systems, 149(1), 21–37, Elsevier, 2005.

7. J.C. Bezdek. “Pattern Recognition with Fuzzy Objective Function Algorithms,” Plenum Press, New York, 1981.

8. R.D. Clark. “Relative and absolute diversity analysis of combinatorial libraries,” In Combinatorial Library Design and Evaluation, Marcel Dekker, New York, 337–362, 2001.

9. T. Curk, J. Demsar, Q. Xu, G. Leban, U. Petrovic, I. Bratko, G. Shaulsky, and B. Zupan. “Microarray data mining with visual programming,” In Bioinformatics, 21(3), 396–408, 2005.

10. R.N. Dave. “Characterization and detection of noise in clustering,” In Pattern Recognition Letters, 12, 657–664, 1991.

11. R.A. Fisher. “The use of multiple measurements in taxonomic problems,” In Annals of Eugenics, 7(2), 179–188, 1936.

12. Th.R. Gabriel and M.R. Berthold. “Influence of fuzzy norms and other heuristics on ‘Mixed Fuzzy Rule Formation’,” In International Journal of Approximate Reasoning (IJAR), 35, 195–202, Elsevier, 2004.

13. Th.R. Gabriel, K. Thiel, and M.R. Berthold. “Rule Visualization based on Multi-Dimensional Scaling,” In IEEE International Conference on Fuzzy Systems, 2006.

14. M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I.H. Witten. “The WEKA Data Mining Software: An Update,” In SIGKDD Explorations, 11(1), 2009.

15. A. Guazzelli, W. Lin, and T. Jena. “PMML in Action: Unleashing the Power of Open Standards for Data Mining and Predictive Analytics.”

16. M. Hanss. “Applied Fuzzy Arithmetic, An Introduction with Engineering Applications,” Springer, 2005.

17. I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler. “YALE: Rapid Prototyping for Complex Data Mining Tasks,” In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.

18. D. Morent, K. Stathatos, W.-C. Lin, and M.R. Berthold. “Comprehensive PMML Preprocessing in KNIME,” In Proceedings of the PMML Workshop, KDD, 2011.

19. B. Wiswedel, D.E. Patterson, and M.R. Berthold. “Interactive Exploration of Fuzzy Clusters,” In: J.V. de Oliveira and W. Pedrycz (eds) Advances in Fuzzy Clustering and its Applications. John Wiley and Sons, 123–136, 2007.

20. L.A. Zadeh. “Fuzzy sets,” In Information and Control, 8(3), 338–353, 1965.
