Computational intelligence for data understanding


Transcript of Computational intelligence for data understanding

Page 1: Computational intelligence  for data understanding

Computational intelligence for data understanding

Włodzisław Duch

Department of Informatics, Nicolaus Copernicus University, Toruń, Poland

Google: W. Duch

Best Summer Course’08

Page 2: Computational intelligence  for data understanding

Plan

What is this tutorial about?

• How to discover knowledge in data;
• how to create comprehensible models of data;
• how to evaluate new data;
• how to understand what CI methods do.

1. AI, CI & Data Mining.
2. Forms of useful knowledge.
3. Integration of different methods in GhostMiner.
4. Exploration & Visualization.
5. Rule-based data analysis.
6. Neurofuzzy models.
7. Neural models, understanding what they do.
8. Similarity-based models, prototype rules.
9. Case studies.
10. DM future: k-separability and meta-learning.
11. From data to expert systems.

Page 3: Computational intelligence  for data understanding

AI, CI & DM

Artificial Intelligence: symbolic models of knowledge.
• Higher-level cognition: reasoning, problem solving, planning, heuristic search for solutions.
• Machine learning, inductive, rule-based methods.
• Technology: expert systems.

Computational Intelligence, Soft Computing: methods inspired by many sources:
• biology – evolutionary, immune, neural computing
• statistics, pattern recognition
• probability – Bayesian networks
• logic – fuzzy, rough ...
Perception, object recognition.

Data Mining, Knowledge Discovery in Databases:
• discovery of interesting patterns, rules, knowledge;
• building predictive data models.

Page 4: Computational intelligence  for data understanding

CI definition

Computational Intelligence. An International Journal (1984) + 10 other journals with “Computational Intelligence”,

D. Poole, A. Mackworth & R. Goebel, Computational Intelligence - A Logical Approach. (OUP 1998), GOFAI book, logic and reasoning.

CI should:
• be problem-oriented, not method-oriented;
• cover all that the CI community is doing now, and is likely to do in the future;
• include AI – they also think they are CI ...

CI: science of solving (effectively) non-algorithmizable problems.

Problem-oriented definition, firmly anchored in computer science/engineering. AI: focused on problems requiring higher-level cognition; the rest of CI is more focused on problems related to perception/action/control.

Page 5: Computational intelligence  for data understanding

What can we learn?

A good part of CI is about learning. What can we learn?

Neural networks are universal approximators and evolutionary algorithms solve global optimization problems – so everything can be learned? Not quite ...

Duda, Hart & Stork, Ch. 9, No Free Lunch + Ugly Duckling Theorems:

• Uniformly averaged over all target functions, the expected error for all learning algorithms [predictions by economists] is the same.
• Averaged over all target functions, no learning algorithm yields generalization error that is superior to any other.
• There is no problem-independent or “best” set of features.

“Experience with a broad range of techniques is the best insurance for solving arbitrary new classification problems.”

Page 6: Computational intelligence  for data understanding

What is there to learn?

Brains ... what is in the EEG? What happens in the brain?

Industry: what happens?

Genetics, proteins ...

Page 7: Computational intelligence  for data understanding

Forms of useful knowledge

AI/Machine Learning camp: Neural nets are black boxes. Unacceptable! Symbolic rules forever.

But ... knowledge accessible to humans is in:

• symbols, • similarity to prototypes (intuition), • images, visual representations.

What type of explanation is satisfactory? Interesting question for cognitive scientists but ...

in different fields answers are different!

Page 8: Computational intelligence  for data understanding

Forms of knowledge

3 types of explanation presented here:

• exemplar-based: prototypes and similarity;
• logic-based: symbols and rules;
• visualization-based: maps, diagrams, relations ...

• Humans remember examples of each category and refer to such examples – as similarity-based or nearest-neighbors methods do.

• Humans create prototypes out of many examples – same as Gaussian classifiers, RBF networks, neurofuzzy systems.

• Logical rules are the highest form of summarization of knowledge, require good linguistic variables.

Page 9: Computational intelligence  for data understanding

GhostMiner Philosophy

• There is no free lunch – provide different type of tools for knowledge discovery. Decision tree, neural, neurofuzzy, similarity-based, SVM, committees.

• Provide tools for visualization of data.

• Support the process of knowledge discovery/model building and evaluating, organizing it into projects.

• Many other interesting DM packages of this sort exist: Weka, Yale, Orange, Knime ... 168 packages on the-data-mine.com list!

GhostMiner, data mining tools from our lab + Fujitsu: http://www.fqs.pl/ghostminer/

• Separate the process of model building (hackers) and knowledge discovery, from model use (lamers) =>

GhostMiner Developer & GhostMiner Analyzer

Page 10: Computational intelligence  for data understanding

Wine data example

Chemical analysis of wine from grapes grown in the same region in Italy, but derived from three different cultivars.

Task: recognize the source of the wine sample. 13 quantities measured, all features are continuous:

• alcohol content
• malic acid content
• ash content
• alkalinity of ash
• magnesium content
• total phenols content
• flavanoids content
• nonanthocyanins phenols content
• proanthocyanins phenols content
• color intensity
• hue
• OD280/D315 of diluted wines
• proline.

Page 11: Computational intelligence  for data understanding

Exploration and visualization

General info about the data

Page 12: Computational intelligence  for data understanding

Exploration: data

Inspect the data

Page 13: Computational intelligence  for data understanding

Exploration: data statistics

Distribution of feature values

Proline has very large values, most methods will benefit from data standardization before further processing.

Page 14: Computational intelligence  for data understanding

Exploration: data standardized

Standardized data: unit standard deviation; about 2/3 of all data should fall within [mean − std, mean + std].

Other options: normalize to [-1,+1], or normalize rejecting p% of extreme values.
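A minimal sketch of these preprocessing options (assuming NumPy; the array X with a lognormal "proline-like" column is a hypothetical stand-in for the Wine features):

```python
import numpy as np

def standardize(X):
    """Shift each feature to zero mean and scale to unit standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0                  # guard against constant features
    return (X - mean) / std

def scale_to_range(X, lo=-1.0, hi=1.0):
    """Alternative: linearly rescale each feature to [lo, hi]."""
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    span = np.where(xmax > xmin, xmax - xmin, 1.0)
    return lo + (hi - lo) * (X - xmin) / span

# Proline values in the Wine data are in the hundreds; after standardization
# roughly 2/3 of them fall in [-1, +1].
X = np.random.lognormal(mean=6.5, sigma=0.3, size=(178, 1))  # hypothetical proline-like column
Z = standardize(X)
print(np.mean(np.abs(Z) <= 1.0))         # roughly 0.7 here, close to the 2/3 quoted above
```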

Page 15: Computational intelligence  for data understanding

Exploration: 1D histograms

Distribution of feature values in classes

Some features are more useful than the others.

Page 16: Computational intelligence  for data understanding

Exploration: 1D/3D histograms

Distribution of feature values in classes, 3D

Page 17: Computational intelligence  for data understanding

Exploration: 2D projections

Projections on selected 2D feature pairs

Page 18: Computational intelligence  for data understanding

Visualize data

Hard to imagine relations in more than 3D. Use parallel coordinates and other methods.

Linear methods: PCA, FDA, PP ... use input combinations.

SOM mappings: popular for visualization, but rather inaccurate, there is no measure of distortions.

Measure of topographical distortions: map all X_i points from R^n to x_i points in R^m, m < n, and ask: how well are the distances R_ij = D(X_i, X_j) reproduced by the distances r_ij = d(x_i, x_j)?

Use m = 2 for visualization, use higher m for dimensionality reduction.

Page 19: Computational intelligence  for data understanding

Sequences of the Globin family

226 protein sequences of the Globin family; similarity matrix S(proteini,proteinj) shows high similarity values (dark spots) within subgroups, MDS shows cluster structure of the data (from Klock & Buhmann 1997); vector rep. of proteins is not easy.

Page 20: Computational intelligence  for data understanding

Visualize data: MDS

Multidimensional scaling: invented in psychometry by Torgerson (1952), re-invented by Sammon (1969) and myself (1994) …

Minimize measure of topographical distortions moving the x coordinates.

Three measures of topographical distortion, summed over all pairs i > j:

S_1(x) = Σ_{i>j} (R_ij − r_ij)²                              MDS: large distances dominate
S_2(x) = [ 1 / Σ_{i>j} R_ij ] Σ_{i>j} (R_ij − r_ij)² / R_ij   Sammon: intermediate distances
S_3(x) = Σ_{i>j} (R_ij − r_ij)² / R_ij²                       MDS local: local structure as important as large scale
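A small sketch of how these three distortion measures could be computed for a given mapping (NumPy only; the random 5-D data and the trivial "keep two coordinates" mapping are hypothetical):

```python
import numpy as np

def pairwise(X):
    """Euclidean distance matrix for the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def stress_measures(R, r):
    """S1 (MDS), S2 (Sammon), S3 (local MDS) for target distances R and
    mapped distances r; only the upper triangle i > j is used."""
    iu = np.triu_indices_from(R, k=1)
    R, r = R[iu], r[iu]
    s1 = np.sum((R - r) ** 2)
    s2 = np.sum((R - r) ** 2 / R) / np.sum(R)
    s3 = np.sum((R - r) ** 2 / R ** 2)
    return s1, s2, s3

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))      # hypothetical high-dimensional data
x = X[:, :2]                      # a crude "mapping": keep the first two coordinates
print(stress_measures(pairwise(X), pairwise(x)))
```

Minimizing any of these measures by moving the x_i coordinates gives the corresponding MDS variant.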

Page 21: Computational intelligence  for data understanding

Visualize data: Wine

3 clusters are clearly distinguished, 2D is fine.

The green outlier can be identified easily.

Page 22: Computational intelligence  for data understanding

Decision trees

Simplest things should be done first: use a decision tree to find logical rules.

Test single attribute, find good point to split the data, separating vectors from different classes. DT advantages: fast, simple, easy to understand, easy to program, many good algorithms.

Tree for 3 kinds of iris flowers; petal and sepal leaves measured in cm.
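A hedged sketch of this step, assuming scikit-learn is available; export_text prints the thresholds found for the iris petal/sepal measurements:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a small univariate tree on the iris data (petal/sepal lengths and widths in cm)
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Print the tree as threshold tests, one per node - a direct source of logical rules
print(export_text(tree, feature_names=list(iris.feature_names)))
```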

Page 23: Computational intelligence  for data understanding

Decision borders

Univariate trees: test the value of a single attribute, x < a, or for nominal features select a subset of values.

Multivariate trees: test on combinations of attributes, W·X < a.

Result: feature space is divided into large hyperrectangular areas with decision borders perpendicular to axes.

Page 24: Computational intelligence  for data understanding

Splitting criteria

Most popular: information gain, used in C4.5 and other trees.

CART trees use the Gini index of node purity (Rényi quadratic entropy):

Which attribute is better?

Which should be at the top of the tree?

Look at the entropy reduction, or information gain index:

E(S) = − P_t log₂ P_t − P_f log₂ P_f

G(S, A) = E(S) − (|S₁|/|S|) E(S₁) − (|S₂|/|S|) E(S₂)

Gini(node) = 1 − Σ_{i=1}^{C} P_i²
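A short sketch of these splitting criteria (NumPy only; the tiny binary attributes a1, a2 are hypothetical):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy E(S) = -sum_i P_i log2 P_i of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini impurity 1 - sum_i P_i^2 (quadratic-entropy based index)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(labels, split_mask):
    """G(S, A) = E(S) - |S1|/|S| E(S1) - |S2|/|S| E(S2) for a boolean split
    (assumes both sides of the split are non-empty)."""
    s1, s2 = labels[split_mask], labels[~split_mask]
    w1, w2 = len(s1) / len(labels), len(s2) / len(labels)
    return entropy(labels) - w1 * entropy(s1) - w2 * entropy(s2)

# Hypothetical example: which of two binary attributes should sit at the top of the tree?
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
a1 = np.array([0, 0, 0, 1, 1, 1, 1, 1]) == 1
a2 = np.array([0, 1, 0, 1, 0, 1, 0, 1]) == 1
print(information_gain(y, a1), information_gain(y, a2))  # a1 is clearly more informative
```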

Page 25: Computational intelligence  for data understanding

Non-Bayesian selection

Bayesian MAP selection: choose the maximum a posteriori P(C|X).

Problem: for binary features non-optimal decisions are taken!

             A=0       A=1
P(C, A1):
   C0        0.0100    0.4900      P(C0) = 0.5
   C1        0.0900    0.4100      P(C1) = 0.5
P(C, A2):
   C0        0.0300    0.4700
   C1        0.1300    0.3700

P(C|X) = P(C,X)/P(X)

MAP is here equivalent to a majority classifier (MC): given A=x, choose maxC P(C,A=x)

MC(A1)=0.58, S+=0.98, S-=0.18, AUC=0.58, MI= 0.058MC(A2)=0.60, S+=0.94, S-=0.26, AUC=0.60, MI= 0.057

MC(A1)<MC(A2), AUC(A1)<AUC(A2), but MI(A1)>MI(A2) !
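A sketch reproducing the MC accuracies and mutual information values quoted above from the joint probability tables (NumPy only):

```python
import numpy as np

def mc_accuracy(joint):
    """Majority-classifier accuracy: for each value of A pick the class with
    the largest joint probability P(C, A=a) and sum those entries."""
    return joint.max(axis=0).sum()

def mutual_information(joint):
    """MI(C; A) = sum_{c,a} P(c,a) log2 [ P(c,a) / (P(c) P(a)) ]."""
    pc = joint.sum(axis=1, keepdims=True)
    pa = joint.sum(axis=0, keepdims=True)
    return np.sum(joint * np.log2(joint / (pc * pa)))

# Joint distributions P(C, A) from the table above; rows = C0, C1; columns = A=0, A=1.
A1 = np.array([[0.01, 0.49],
               [0.09, 0.41]])
A2 = np.array([[0.03, 0.47],
               [0.13, 0.37]])
for name, j in (("A1", A1), ("A2", A2)):
    print(name, round(mc_accuracy(j), 2), round(mutual_information(j), 3))
# A1: accuracy 0.58, MI ~ 0.058;  A2: accuracy 0.60, MI ~ 0.057 - the rankings disagree.
```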

Page 26: Computational intelligence  for data understanding

SSV decision tree

Separability Split Value tree: based on the separability criterion.

SSV criterion: separate as many pairs of vectors from different classes as possible; minimize the number of separated pairs from the same class.

Define subsets of the data D using a binary test f(X, s) to split the data into left and right subsets, D = LS ∪ RS:

LS(s, f, D) = { X ∈ D : f(X, s) = True }
RS(s, f, D) = D − LS(s, f, D)

SSV(s) = 2 Σ_{c∈C} |LS(s, f, D) ∩ D_c| · |RS(s, f, D) ∩ (D − D_c)|
         − Σ_{c∈C} min( |LS(s, f, D) ∩ D_c|, |RS(s, f, D) ∩ D_c| )
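A minimal sketch of this criterion for a single-feature threshold test (NumPy; the toy 1-D data are hypothetical):

```python
import numpy as np

def ssv(values, labels, threshold):
    """Separability Split Value for the test f(X, s): value < threshold.
    Reward pairs from different classes that end up on different sides,
    penalize the smaller part of each class that gets split off."""
    left = values < threshold
    right = ~left
    sep_pairs = 0
    same_class_penalty = 0
    for c in np.unique(labels):
        in_c = labels == c
        sep_pairs += np.sum(left & in_c) * np.sum(right & ~in_c)
        same_class_penalty += min(np.sum(left & in_c), np.sum(right & in_c))
    return 2 * sep_pairs - same_class_penalty

# Hypothetical 1-D example: the split at 0.5 separates the two classes perfectly
# and scores higher than the split at 0.25.
v = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
y = np.array([0, 0, 0, 1, 1, 1])
print(ssv(v, y, 0.5), ssv(v, y, 0.25))
```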

Page 27: Computational intelligence  for data understanding

SSV – complex tree

Trees may always learn to achieve 100% accuracy.

Very few vectors are left in the leaves – splits are not reliable and will overfit the data!

Page 28: Computational intelligence  for data understanding

SSV – simplest tree

Pruning finds the nodes that should be removed to increase generalization – accuracy on unseen data.

Trees with 7 nodes left: 15 errors/178 vectors.

Page 29: Computational intelligence  for data understanding

SSV – logical rules

Trees may be converted to logical rules. The simplest tree leads to 4 logical rules:

1. if proline > 719 and flavanoids > 2.3 then class 1
2. if proline < 719 and OD280 > 2.115 then class 2
3. if proline > 719 and flavanoids < 2.3 then class 3
4. if proline < 719 and OD280 < 2.115 then class 3

How accurate are such rules? Not 15/178 errors, or 91.5% accuracy!

Run 10-fold CV and average the results: 85±10%? Run it 10 times and average: 85±10%±2%? Run again ...
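One way to obtain such mean ± std estimates, sketched with scikit-learn's repeated stratified cross-validation (the decision tree and the Wine data here are just stand-ins for whatever model is being evaluated):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# Mean accuracy over the 10x10 folds, its std over folds, and the std of the
# 10 repetition means (the "+/- 2%" level of the estimate itself).
rep_means = scores.reshape(10, 10).mean(axis=1)
print(scores.mean(), scores.std(), rep_means.std())
```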

Page 30: Computational intelligence  for data understanding

SSV – optimal trees/rules

Optimal: estimate how well rules will generalize. Use stratified crossvalidation for training; use beam search for better results.

1. if OD280/D315 > 2.505 and proline > 726.5 then class 1
2. if OD280/D315 < 2.505 and hue > 0.875 and malic-acid < 2.82 then class 2
3. if OD280/D315 > 2.505 and proline < 726.5 then class 2
4. if OD280/D315 < 2.505 and hue > 0.875 and malic-acid > 2.82 then class 3
5. if OD280/D315 < 2.505 and hue < 0.875 then class 3

Note 6/178 errors, or 96.6% accuracy! Run 10-fold CV: results are 85±10%? Run it 10 times!

Page 31: Computational intelligence  for data understanding

Logical rules

Crisp logic rules: for continuous x use linguistic variables (predicate functions).

s_k(x) ≡ True[X_k ≤ x ≤ X'_k], for example:
small(x) = True{x | x < 1}
medium(x) = True{x | x ∈ [1, 2]}
large(x) = True{x | x > 2}

Linguistic variables are used in crisp (propositional, Boolean) logic rules:

IF small-height(X) AND has-hat(X) AND has-beard(X) THEN (X is a Brownie) ELSE IF ... ELSE ...

Page 32: Computational intelligence  for data understanding

Crisp logic decisionsCrisp logic decisions

Crisp logic is based on rectangular membership functions:

True/False values jump from 0 to 1.

Step functions are used for partitioning of the feature space.

Very simple hyper-rectangular decision borders.

Expressive power of crisp logical rules is very limited!

Similarity cannot be captured by rules.

Page 33: Computational intelligence  for data understanding

Logical rules - advantages

Logical rules, if simple enough, are preferable.

• Rules may expose limitations of black box solutions.

• Only relevant features are used in rules.

• Rules may sometimes be more accurate than NN and other CI methods.

• Overfitting is easy to control, rules usually have small number of parameters.

• Rules forever !? A logical rule about logical rules is:

IF the number of rules is relatively small AND the accuracy is sufficiently high THEN rules may be an optimal choice.

Page 34: Computational intelligence  for data understanding

Logical rules - limitations

Logical rules are preferred but ...

• Only one class is predicted p(Ci|X,M) = 0 or 1; such black-and-white picture may be inappropriate in many applications.

• A discontinuous cost function allows only non-gradient optimization methods, which are more expensive.

• Sets of rules are unstable: small change in the dataset leads to a large change in structure of sets of rules.

• Reliable crisp rules may reject some cases as unclassified.

• Interpretation of crisp rules may be misleading.

• Fuzzy rules remove some limitations, but are not so comprehensible.

Page 35: Computational intelligence  for data understanding

Fuzzy inputs vs. fuzzy rules

Crisp rule R_a(x) = {x ≥ a} applied to an uncertain input with uniform input uncertainty U(x; Δx) = 1 in [x − Δx, x + Δx] and zero outside is true to the degree given by a semi-linear function S(x; Δx):

For example, a triangular U(x) leads to a sigmoidal S(x) function. For more input conditions rules are true to the degree described by soft trapezoidal functions, the difference of two sigmoidal functions.

Figure: input uncertainty and the probability that the R_a(x) rule is true. For other input uncertainties similar relations hold!

Crisp rules + input uncertainty ⇒ fuzzy rules for crisp inputs = MLP!

Page 36: Computational intelligence  for data understanding

From rules to probabilities

Data has been measured with unknown error. Assume a Gaussian distribution:

x → G_x = G(y; x, s_x)

x – fuzzy number with Gaussian membership function.

A set of logical rules R is used for fuzzy input vectors: Monte Carlo simulations for arbitrary system => p(Ci|X)

Analytical evaluation p(C|X) is based on cumulant function:

∫_{−∞}^{a} G(y; x, s_x) dy = (1/2) [ 1 + erf( (a − x) / (s_x √2) ) ] ≈ σ( (a − x)/T ),   T = s_x √2 / 2.4

With this T the error function is identical to the logistic function to within 0.02.

Page 37: Computational intelligence  for data understanding

Rules - choices

Simplicity vs. accuracy (but not too accurate!). Confidence vs. rejection rate (but not too much!).

Confusion matrix of true vs. predicted classes, with a reject option r:

              predicted +    predicted −    rejected
   true +       p_{++}         p_{+−}         p_{+r}
   true −       p_{−+}         p_{−−}         p_{−r}

Accuracy (overall): A(M) = p_{++} + p_{−−}
Error rate: L(M) = p_{+−} + p_{−+}
Rejection rate: R(M) = p_{+r} + p_{−r} = 1 − L(M) − A(M)
Sensitivity: S_+(M) = p_{+|+} = p_{++}/p_+
Specificity: S_−(M) = p_{−|−} = p_{−−}/p_−

p_{++} is a hit; p_{−+} a false alarm; p_{+−} a miss.

Page 38: Computational intelligence  for data understanding

Rules – error functions

The overall accuracy is equal to a combination of sensitivity and specificity weighted by the a priori probabilities:

A(M) = p_+ S_+(M) + p_− S_−(M)

Optimization of rules for the C_+ class; accuracy-rejection tradeoff: large γ means no errors but a high rejection rate.

E(M; γ) = γ L(M) − A(M) = γ (p_{+−} + p_{−+}) − (p_{++} + p_{−−})
min_M E(M; γ)  ⇔  min_M { (1 + γ) L(M) + R(M) }

Optimization with different costs of errors α:

min_M E(M; α) = min_M { p_{+−} + α p_{−+} }
              = min_M { p_+ [1 − S_+(M)] − p_{+r}(M) + α ( p_− [1 − S_−(M)] − p_{−r}(M) ) }

Page 39: Computational intelligence  for data understanding

ROC curves

ROC curves display S_+ vs. (1 − S_−) for different models (classifiers) or different confidence thresholds.

Ideal classifier: below some threshold S_+ = 1 (all positive cases recognized) for 1 − S_− = 0 (no false alarms).

Useless classifier (blue): same number of true positives as false alarms for any threshold.

Reasonable classifier (red): no errors until some threshold that allows for recognition of 0.5 positive cases, no errors if 1 − S_− > 0.6; slowly rising errors in between.

Good measure of quality: high AUC, Area Under ROC Curve.

AUC = 0.5 is random guessing, AUC = 1 is perfect prediction.
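A brief sketch of computing an ROC curve and its AUC with scikit-learn (the Gaussian scores here are synthetic):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
# Hypothetical confidence scores: positives tend to score higher than negatives.
y_true = np.r_[np.ones(100), np.zeros(100)]
y_score = np.r_[rng.normal(1.0, 1.0, 100), rng.normal(0.0, 1.0, 100)]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # (1 - specificity, sensitivity)
print("AUC =", roc_auc_score(y_true, y_score))      # 0.5 = guessing, 1.0 = perfect
```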

Page 40: Computational intelligence  for data understanding

Gaussian fuzzification of crisp rules

Very important case: Gaussian input uncertainty.

Rule R_a(x) = {x ≥ a} is fulfilled by G_x with probability:

p(R_a = True | G_x) = ∫_a^∞ G(y; x, s_x) dy ≈ σ(x − a)

The error function is approximated by the logistic function; assuming the error distribution σ(x)(1 − σ(x)), for s² = 1.7 it approximates a Gaussian to within 3.5%.

Rule R_ab(x) = {b > x ≥ a} is fulfilled by G_x with probability:

p(R_ab = True | G_x) = ∫_a^b G(y; x, s_x) dy ≈ σ(x − a) − σ(x − b)
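A sketch checking this approximation numerically: the Gaussian-integral probability of the rule x ≥ a versus a logistic function with T = s_x √2 / 2.4 (SciPy's erf; the scale factor is the one quoted on the earlier slide):

```python
import numpy as np
from scipy.special import erf

def p_rule_gauss(x, a, s):
    """p(R_a = True | G_x) = integral_a^inf G(y; x, s) dy for the rule x >= a."""
    return 0.5 * (1.0 - erf((a - x) / (s * np.sqrt(2.0))))

def p_rule_logistic(x, a, s):
    """Logistic approximation sigma((x - a)/T) with T = s*sqrt(2)/2.4."""
    T = s * np.sqrt(2.0) / 2.4
    return 1.0 / (1.0 + np.exp(-(x - a) / T))

x = np.linspace(-4, 4, 401)
diff = np.abs(p_rule_gauss(x, a=0.0, s=1.0) - p_rule_logistic(x, a=0.0, s=1.0))
print(diff.max())   # small maximum difference, below the 0.02 level quoted above
```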

Page 41: Computational intelligence  for data understanding

Soft trapezoids and NN

The difference between two sigmoids makes a soft trapezoidal membership function.

Conclusion: fuzzy logic with soft trapezoidal membership functions σ(x − a) − σ(x − b) is equivalent to crisp logic + Gaussian uncertainty of the inputs.

Page 42: Computational intelligence  for data understanding

Optimization of rules

Fuzzy: large receptive fields, rough estimations. G_x – uncertainty of inputs, small receptive fields.

Minimization of the number of errors – difficult, non-gradient, but now Monte Carlo or analytical p(C|X;M).

E({X}; R, s_x) = (1/2) Σ_X Σ_i [ p(C_i | X; M) − δ(C_i, C(X)) ]²

• Gradient optimization works for large number of parameters.

• Parameters sx are known for some features, use them as optimization parameters for others!

• Probabilities instead of 0/1 rule outcomes.
• Vectors that were not classified by crisp rules now have non-zero probabilities.

Page 43: Computational intelligence  for data understanding

Mushrooms

The Mushroom Guide: no simple rule for mushrooms; no rule like ‘leaflets three, let it be’ for Poisonous Oak and Ivy.

8124 cases, 51.8% are edible, the rest non-edible. 22 symbolic attributes, up to 12 values each, equivalent to 118 logical features, or 2^118 ≈ 3·10^35 possible input vectors.

Odor: almond, anise, creosote, fishy, foul, musty, none, pungent, spicy.
Spore print color: black, brown, buff, chocolate, green, orange, purple, white, yellow.

Safe rule for edible mushrooms: odor = (almond ∨ anise ∨ none) ∧ spore-print-color = ¬ green

48 errors, 99.41% correct

This is why animals have such a good sense of smell! What does it tell us about odor receptors?

Page 44: Computational intelligence  for data understanding

Mushrooms rules

To eat or not to eat, this is the question! Not any more ...

A mushroom is poisonous if:
R1) odor = ¬(almond ∨ anise ∨ none); 120 errors, 98.52%
R2) spore-print-color = green; 48 errors, 99.41%
R3) odor = none ∧ stalk-surface-below-ring = scaly ∧ stalk-color-above-ring = ¬ brown; 8 errors, 99.90%
R4) habitat = leaves ∧ cap-color = white; no errors!

R1 + R2 are quite stable, found even with 10% of data;

R3 and R4 may be replaced by other rules, ex:

R'3) gill-size = narrow ∧ stalk-surface-above-ring = (silky ∨ scaly)
R'4) gill-size = narrow ∧ population = clustered

Only 5 of 22 attributes used! Simplest possible rules? 100% in CV tests – the structure of this data is completely clear.

Page 45: Computational intelligence  for data understanding

Recurrence of breast cancer

Institute of Oncology, University Medical Center, Ljubljana.

286 cases, 201 no (70.3%), 85 recurrence cases (29.7%)

9 symbolic features: age (9 bins), tumor-size (12 bins), nodes involved (13 bins), degree-malignant (1, 2, 3), area, radiation, menopause, node-caps. Example record: no-recurrence, 40-49, premeno, 25-29, 0-2, ?, 2, left, right_low, yes.

Many systems tried, 65-78% accuracy reported. Single rule:

IF (nodes-involved ∉ [0,2]) ∧ (degree-malignant = 3) THEN recurrence ELSE no-recurrence

77% accuracy, only trivial knowledge in the data: highly malignant cancer involving many nodes is likely to strike back.

Page 46: Computational intelligence  for data understanding

Neurofuzzy system

Feature Space Mapping (FSM) neurofuzzy system. Neural adaptation, estimation of probability density distribution (PDF) using a single hidden layer network (RBF-like), with nodes realizing separable functions:

G(X; P) = ∏_{i=1}^{N} G(X_i; P_i)

Fuzzy: the crisp yes/no membership χ(x) is replaced by a degree of membership μ(x). Triangular, trapezoidal, Gaussian or other membership functions.

Membership functions in many dimensions:

Page 47: Computational intelligence  for data understanding

FSM

Rectangular functions: simple rules are created, many nearly equivalent descriptions of this data exist.

If proline > 929.5 then class 1 (48 cases, 45 correct + 2 recovered by other rules).

If color < 3.79285 then class 2 (63 cases, 60 correct)

Interesting rules, but overall accuracy is only 88±9%

Initialize using clusterization or decision trees. Triangular & Gaussian functions for fuzzy rules. Rectangular functions for crisp rules.

Between 9-14 rules with triangular membership functions are created; accuracy in 10xCV tests about 96±4.5%

Similar results obtained with Gaussian functions.

Page 48: Computational intelligence  for data understanding

Prototype-based rules

C-rules (crisp rules) are a special case of F-rules (fuzzy rules). F-rules are a special case of P-rules (prototype rules). P-rules have the form:

IF P = arg min_R D(X, R) THEN Class(X) = Class(P)

D(X,R) is a dissimilarity (distance) function, determining decision borders around prototype P.

P-rules are easy to interpret!

IF X = You are most similar to P = Superman THEN You are in the Super-league.

IF X = You are most similar to P = Weakling THEN You are in the Failed-league.

“Similar” may involve different features or D(X,P).

Page 49: Computational intelligence  for data understanding

P-rules

Euclidean distance leads to Gaussian fuzzy membership functions + product as the T-norm.

Manhattan distance => μ(X; P) = exp{−|X − P|}. Various distance functions lead to different membership functions.

Ex. data-dependent distance functions, for symbolic data:

D(X, P)² = Σ_i W_i d(X_i, P_i)²,   μ(X; P) = e^{−D(X,P)} = ∏_i e^{−W_i d(X_i, P_i)}

D_VDM(X, Y)² = Σ_j Σ_i | p(C_i | X_j) − p(C_i | Y_j) |²

D_PDF(X, Y)² = Σ_j Σ_i | p(X_j | C_i) − p(Y_j | C_i) |²
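A hedged sketch of the Value Difference Metric for symbolic features, with p(C | x_j) estimated from a tiny hypothetical dataset (the two-letter "DNA-like" records are purely illustrative):

```python
import numpy as np
from collections import defaultdict

def vdm_tables(X, y):
    """Estimate p(C | feature j has value v) from symbolic training data."""
    classes = sorted(set(y))
    tables = []
    for j in range(len(X[0])):
        counts = defaultdict(lambda: np.zeros(len(classes)))
        for row, c in zip(X, y):
            counts[row[j]][classes.index(c)] += 1
        tables.append({v: cnt / cnt.sum() for v, cnt in counts.items()})
    return tables

def vdm_distance(a, b, tables, q=2):
    """D_VDM(X, Y)^q = sum_j sum_i |p(C_i | X_j) - p(C_i | Y_j)|^q."""
    d = 0.0
    for j, table in enumerate(tables):
        d += np.sum(np.abs(table[a[j]] - table[b[j]]) ** q)
    return d ** (1.0 / q)

# Tiny hypothetical symbolic dataset (e.g. nucleotide letters), two classes.
X = [("a", "c"), ("a", "t"), ("g", "c"), ("g", "t"), ("t", "g")]
y = ["+", "+", "-", "-", "-"]
tables = vdm_tables(X, y)
print(vdm_distance(("a", "c"), ("g", "c"), tables))
```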

Page 50: Computational intelligence  for data understanding

Promoters

DNA strings, 57 nucleotides, 53 + and 53 − samples:
tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgcgggcttgtcgt

Euclidean distance, symbolic s =a, c, t, g replaced by x=1, 2, 3, 4

PDF distance, symbolic s=a, c, t, g replaced by p(s|+)

Page 51: Computational intelligence  for data understanding

P-rules

New distance functions from information theory => interesting membership functions (MF).

MF => new distance function, with local D(X,R) for each cluster.

Crisp logic rules: use the Chebyshev distance (L_∞ norm):

D_Ch(X, P) = ||X − P||_∞ = max_i W_i |X_i − P_i|

D_Ch(X, P) = const => rectangular contours.

Chebyshev distance with threshold θ_P:

IF D_Ch(X, P) ≤ θ_P THEN C(X) = C(P)

is equivalent to the conjunctive crisp rule

IF X_1 ∈ [P_1 − θ_P/W_1, P_1 + θ_P/W_1] ∧ … ∧ X_N ∈ [P_N − θ_P/W_N, P_N + θ_P/W_N] THEN C(X) = C(P)
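A quick numerical check of this equivalence (the prototype P, the weights W and the threshold are hypothetical):

```python
import numpy as np

def chebyshev_rule(X, P, W, theta):
    """IF max_i W_i |X_i - P_i| <= theta THEN class of P."""
    return np.max(W * np.abs(X - P)) <= theta

def interval_rule(X, P, W, theta):
    """Equivalent conjunction: X_i in [P_i - theta/W_i, P_i + theta/W_i] for all i."""
    return np.all((X >= P - theta / W) & (X <= P + theta / W))

rng = np.random.default_rng(1)
P = np.array([0.0, 1.0, -1.0])          # hypothetical prototype
W = np.array([1.0, 2.0, 0.5])           # hypothetical feature weights
for _ in range(1000):
    X = rng.uniform(-3, 3, size=3)
    assert chebyshev_rule(X, P, W, 1.0) == interval_rule(X, P, W, 1.0)
print("Chebyshev prototype rule and the conjunctive interval rule always agree.")
```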

Page 52: Computational intelligence  for data understanding

Decision borders

Euclidean distance from 3 prototypes, one per class.

Minkowski (α = 20) distance from 3 prototypes.

D(P,X)=const and decision borders D(P,X)=D(Q,X).

Page 53: Computational intelligence  for data understanding

P-rules for Wine

Manhattan distance: 6 prototypes kept, 4 errors, f2 removed.

Chebyshev distance: 15 prototypes kept, 5 errors, f2, f8, f10 removed.

Euclidean distance: 11 prototypes kept, 7 errors.

Many other solutions.

Page 54: Computational intelligence  for data understanding

Scatterograms for hypothyroid

Shows images of training vectors mapped by the neural network; for more than 2 classes either linear projections, or several 2D scatterograms, or parallel coordinates.

Good for:

• analysis of the learning process;

• comparison of network solutions;

• stability of the network;

• analysis of the effects of regularization;

• evaluation of confidence by perturbation of the query vector.

...

Details: W. Duch, IJCNN 2003

Page 55: Computational intelligence  for data understanding

Neural networks

• MLP – Multilayer Perceptrons, the most popular NN models. Use soft hyperplanes for discrimination. Results are difficult to interpret, complex decision borders. Prediction, approximation: infinite number of classes.

• RBF – Radial Basis Functions.

RBF with Gaussian functions are equivalent to fuzzy systems with Gaussian membership functions, but …

No feature selection => complex rules.

Other radial functions => not separable!

Use separable functions, not radial => FSM.

• Many methods to convert MLP NN to logical rules.

Page 56: Computational intelligence  for data understanding

What do NNs really do?

• Common opinion: NNs are black boxes. NNs provide complex mappings that may involve various kinks and discontinuities, but NNs have the power!

• Solution 1 (common): extract rules approximating NN mappings.

• Solution 2 (new): visualize the neural mapping.

RBF network for fuzzy XOR, using 4 Gaussian nodes:

rows for dispersion σ = 1/7, 1 and 7

left column: scatterogram of the hidden node activity in 4D.

middle columns: parallel coordinate view

right column: output view (2D)

Page 57: Computational intelligence  for data understanding

Wine example

• MLP with 2 hidden nodes, SCG training, regularization coefficient = 0.5

•After 3 iterations: output, parallel, hidden.

After convergence + with noise var=0.05 added

Page 58: Computational intelligence  for data understanding

Rules from MLPs

Why is it difficult? Multi-layer perceptron (MLP) networks stack many perceptron units, performing threshold logic:

M-of-N rule: IF (M conditions of N are true) THEN ...

Problem: for N inputs the number of subsets is 2^N. Exponentially growing number of possible conjunctive rules.

Page 59: Computational intelligence  for data understanding

MLP2LN

Converts MLP neural networks into a network performing logical operations (LN).

Figure: MLP2LN network. Input layer → linguistic units (windows, filters) → rule units (threshold logic) → aggregation (better features) → output (one node per class).

Page 60: Computational intelligence  for data understanding

MLP2LN training

Constructive algorithm: add as many nodes as needed.

Optimize the cost function: minimize errors + enforce zero connections + leave only +1 and −1 weights; this makes interpretation easy.

E(W) = (1/2) Σ_p Σ_i [ F_i(X^p; W) − t_i^p ]²                       (errors)
      + (λ_1/2) Σ_{i,j} W_ij²                                        (zero connections)
      + (λ_2/2) Σ_{i,j} W_ij² (W_ij − 1)² (W_ij + 1)²                (weights forced to ±1)
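A sketch of the extra MLP2LN penalty and its gradient, to be added to the usual error gradient (the λ values below are arbitrary illustrations, not the values used in the original experiments):

```python
import numpy as np

def mlp2ln_penalty(W, lam1=1e-4, lam2=1e-2):
    """Regularizer pushing weights to 0 and the survivors to +/-1:
    lam1/2 * sum W^2 + lam2/2 * sum W^2 (W-1)^2 (W+1)^2."""
    return 0.5 * lam1 * np.sum(W ** 2) + 0.5 * lam2 * np.sum(W ** 2 * (W - 1) ** 2 * (W + 1) ** 2)

def mlp2ln_penalty_grad(W, lam1=1e-4, lam2=1e-2):
    """Gradient of the penalty: lam1*W + lam2*W*(W^2-1)*(3W^2-1)."""
    return lam1 * W + lam2 * W * (W ** 2 - 1) * (3 * W ** 2 - 1)

W = np.array([-1.2, -0.05, 0.4, 0.95])
print(mlp2ln_penalty(W))
# Gradient descent on this term drives -1.2 toward -1, 0.95 toward +1,
# and the small/medium weights -0.05 and 0.4 toward 0.
print(mlp2ln_penalty_grad(W))
```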

Page 61: Computational intelligence  for data understanding

L-units

Create linguistic variables.

L(x) = S_1 σ(W_1 x + b) + S_2 σ(W_2 x + b')

Numerical representation for R-nodes: V_sk = (…) for s_k = low, V_sk = (…) for s_k = normal.

L-units: 2 thresholds as adaptive parameters;

logistic σ(x), or tanh(x) ∈ [−1, +1]. Soft trapezoidal functions change into rectangular filters (Parzen windows).

4 types, depending on signs Si.

Product of bi-central functions is logical rule, used by IncNet NN.

Page 62: Computational intelligence  for data understanding

Iris example

Network after training:

iris setosa:     q=1 (0,0,0; 0,0,0; +1,0,0; +1,0,0)
iris versicolor: q=2 (0,0,0; 0,0,0; 0,+1,0; 0,+1,0)
iris virginica:  q=1 (0,0,0; 0,0,0; 0,0,+1; 0,0,+1)

Rules:

If (x3 = s ∧ x4 = s) then setosa
If (x3 = m ∧ x4 = m) then versicolor
If (x3 = l ∧ x4 = l) then virginica

3 errors only (98%).

Page 63: Computational intelligence  for data understanding

Learning dynamics

Decision regions shown every 200 training epochs in x3, x4 coordinates; borders are optimally placed with wide margins.

Page 64: Computational intelligence  for data understanding

Thyroid screening

Garavan Institute, Sydney, Australia

15 binary, 6 continuous

Training: 93+191+3488 Validate: 73+177+3178

Determine important clinical factors. Calculate the probability of each diagnosis.

Figure: network for thyroid screening. Clinical findings (age, sex, ..., TSH, T4U, T3, TT4, TBG) → hidden units → final diagnoses (normal, hyperthyroid, hypothyroid).

Page 65: Computational intelligence  for data understanding

Thyroid – some results

Accuracy of diagnoses obtained with several systems – rules are accurate.

Method Rules/Features Training % Test %

MLP2LN optimized 4/6 99.9 99.36

CART/SSV Decision Trees 3/5 99.8 99.33

Best Backprop MLP -/21 100 98.5

Naïve Bayes -/- 97.0 96.1

k-nearest neighbors -/- - 93.8

Page 66: Computational intelligence  for data understanding

Thyroid – output visualization

2D – plot a scatterogram of the vectors transformed by the network.

3D – display it inside the cube, use perspective.

ND – make linear projection of network outputs on the polygon vertices

Page 67: Computational intelligence  for data understanding

Data mining packages

• No free lunch => provide different type of tools for knowledge discovery: decision tree, neural, neurofuzzy, similarity-based, SVM, committees, tools for visualization of data.

• Support the process of knowledge discovery/model building and evaluating, organizing it into projects.

• Many other interesting DM packages of this sort exist: Weka, Yale, Orange, Knime ... 168 packages on the-data-mine.com list!

• We are building Intemi, completely new tools.

GhostMiner, data mining tools from our lab + Fujitsu: http://www.fqspl.com.pl/ghostminer/

• Separate the process of model building (hackers) and knowledge discovery, from model use (lamers) => GM Developer & Analyzer

Page 68: Computational intelligence  for data understanding

Are we really so good?

Surprise!

Almost nothing can be learned using such tools!

Page 69: Computational intelligence  for data understanding

SVM for parity

Parity with a growing number of dimensions: 5x10CV results, Gaussian kernel SVM and MLP with optimized parameters (C, σ, # neurons < 20).

Q1: How to characterize complexity of Boolean functions? Non-separability is not sufficient. Q2: What is the simplest model for a given type of data? Q3: How to learn such model in an automatic way?

Page 70: Computational intelligence  for data understanding

Can Data Mining packages help?

Hundreds of components ... transforming, visualizing ...

RM/Yale 3.3: type – # components
Data preprocessing – 74
Experiment operations – 35
Learning methods – 114
Metaoptimization schemes – 17
Postprocessing – 5
Performance validation – 14
Visualization, presentation, plugin extensions ...

Visual “knowledge flow” to link components, or script languages (XML) to define complex experiments.

Page 71: Computational intelligence  for data understanding

Why solid foundations are needed

Hundreds of components ... already 351,359,400 combinations (74·35·114·17·5·14)! Our treasure box is full! We can publish forever!

Is this really what we need? No, we would like to …

press the button and wait for the truth!

Computer power is with us, meta-learning should find all interesting data models = sequences of transformations/procedures.

Many considerations: optimal cost solutions, various costs of using feature subsets; models that are simple & easy to understand; various representation of knowledge: crisp, fuzzy or prototype rules, visualization, confidence in predictions ...

Page 72: Computational intelligence  for data understanding

Similarity-based framework

(Dis)similarity:
• more general than feature-based description,
• no need for vector spaces (structured objects),
• more general than the fuzzy approach (F-rules are reduced to P-rules),
• includes nearest neighbor algorithms, MLPs, RBFs, separable function networks, SVMs, kernel methods and many others!

Similarity-Based Methods (SBMs) are organized in a framework: p(Ci|X;M) posterior classification probability or y(X;M) approximators,models M are parameterized in increasingly sophisticated way.

A systematic search (greedy, beam, evolutionary) in the space of all SBM models is used to select optimal combination of parameters and procedures, opening different types of optimization channels, trying to discover appropriate bias for a given problem.

Results: several candidate models are created; even a very limited version gives best results in 7 out of 12 Statlog problems.

Page 73: Computational intelligence  for data understanding

SBM framework components

• Pre-processing: objects O => features X, or (dis)similarities D(O, O').
• Calculation of similarity between features d(x_i, y_i) and objects D(X, Y).
• Reference (or prototype) vector R selection/creation/optimization.
• Weighted influence of reference vectors G(D(R_i, X)), i = 1..k.
• Functions/procedures to estimate p(C|X; M) or y(X; M).
• Cost functions E[D_T; M] and model selection/validation procedures.
• Optimization procedures for the whole model M_a.
• Search control procedures to create more complex models M_{a+1}.
• Creation of ensembles of (local, competent) models.

• M = {X(O), d(.,.), D(.,.), k, G(D), {R}, {p_i(R)}, E[.], K(.), S(.,.)}, where:
• S(C_i, C_j) is a matrix evaluating similarity of the classes;

a vector of observed probabilities pi(X) instead of hard labels.

The kNN model p(Ci|X;kNN) = p(Ci|X;k,D(.),{DT}); the RBF model: p(Ci|X;RBF) = p(Ci|X;D(.),G(D),{R}), MLP, SVM and many other models may all be “re-discovered”.

Page 74: Computational intelligence  for data understanding

Meta-learning in SBM scheme

Start from kNN, k=1, all data & features, Euclidean distance, end with a model that is a novel combination of procedures and parameterizations.

k-NN: 67.5/76.6%
+ d(x,y) Canberra: 89.9/90.7%
+ s_i = (0,0,1,0,1,1): 71.6/64.4%
+ feature selection: 67.5/76.6%
+ k optimization: 67.5/76.6%
+ d(x,y) + s_i = (1,0,1,0.6,0.9,1), Canberra: 74.6/72.9%
+ d(x,y) + selection, Canberra: 89.9/90.7%

Page 75: Computational intelligence  for data understanding

Heterogeneous systems

Problems requiring different scales (multiresolution).

2-class problems, two situations:

C1 inside the sphere, C2 outside. MLP: at least N+1 hyperplanes, O(N²) parameters. RBF: 1 Gaussian, O(N) parameters.

C1 in the corner defined by the (1,1,...,1) hyperplane, C2 outside. MLP: 1 hyperplane, O(N) parameters. RBF: many Gaussians, O(N²) parameters, poor approximation.

Combination: needs both hyperplane and hypersphere!

Logical rule: IF x1>0 & x2>0 THEN C1 Else C2

is not represented properly by either MLP or RBF!

Different types of functions in one model, first step beyond inspirations from single neurons => heterogeneous models.

Page 76: Computational intelligence  for data understanding

Heterogeneous everything

Homogeneous systems: one type of “building blocks”, same type of decision borders, e.g. neural networks, SVMs, decision trees, kNNs.

Committees combine many models together, but lead to complex models that are difficult to understand.

Ockham's razor: simpler systems are better. Discovering the simplest class structures, the inductive bias of the data, requires Heterogeneous Adaptive Systems (HAS).

HAS examples:
NN with different types of neuron transfer functions.
k-NN with different distance functions for each prototype.
Decision trees with different types of test criteria.

1. Start from large networks, use regularization to prune.
2. Construct a network adding nodes selected from a candidate pool.
3. Use very flexible functions, force them to specialize.

Page 77: Computational intelligence  for data understanding

Taxonomy of NN activation functions

Page 78: Computational intelligence  for data understanding

Taxonomy of NN output functions

Perceptron: implements the logical rule x > θ for x with Gaussian uncertainty.

Page 79: Computational intelligence  for data understanding

Taxonomy - TF

Page 80: Computational intelligence  for data understanding

HAS decision trees

Decision trees select the best feature/threshold value for univariate and multivariate trees:

Decision borders: hyperplanes. Introducing tests based on the L_α Minkowski metric:

univariate test: X_i < θ_k,   multivariate test: W·X = Σ_i W_i X_i < θ_k,

distance-based test: ||X − R||_α = ( Σ_i |X_i − R_i|^α )^{1/α} < θ_R.

For L_2, spherical decision borders are produced.

For L_∞, rectangular borders are produced.

Many choices, for example Fisher Linear Discrimination decision trees. For large databases, first clusterize the data to get candidate references R.

Page 81: Computational intelligence  for data understanding

SSV HAS DT example

SSV HAS tree in GhostMiner 3.0, Wisconsin breast cancer (UCI): 699 cases, 9 features (cell parameters, 1..10). Classes: benign 458 (65.5%) & malignant 241 (34.5%).

Single rule gives the simplest known description of this data:

IF ||X − R_303|| < 20.27 THEN malignant ELSE benign

coming up most often in 10xCV; 97.4% accuracy; good prototype for the malignant case! Gives simple thresholds, that's what MDs like the most!

Best 10CV results are around 97.5±1.8% (Naïve Bayes + kernel, or SVM); SSV without distances: 96.4±2.1%; C4.5 gives 94.7±2.0%.

Several simple rules of similar accuracy but different specificity or sensitivity may be created using HAS DT. Need to select or weight features and select good prototypes.

Page 82: Computational intelligence  for data understanding

More meta-learning

Meta-learning: learning how to learn, replacing experts who search for the best models by making a lot of experiments. The search space of models is too large to explore exhaustively; design the system architecture to support knowledge-based search.

• Abstract view, uniform I/O, uniform results management.
• Directed acyclic graphs (DAG) of boxes representing scheme placeholders and particular models, interconnected through I/O.
• Configuration level for meta-schemes, expanded at the runtime level.
An exercise in software engineering for data mining!

Page 83: Computational intelligence  for data understanding

Intemi, Intelligent Miner

Meta-schemes: templates with placeholders

• May be nested; the role decided by the input/output types.
• Machine learning generators based on meta-schemes.
• Granulation level allows creating novel methods.
• Complexity control: Length + log(time).
• A unified meta-parameters description, defining the range of sensible values and the type of the parameter changes.

Page 84: Computational intelligence  for data understanding

How much can we learn?

Linearly separable or almost separable problems are relatively simple – deform planes or add dimensions to make the data separable.

How to define “slightly non-separable”? There is only separable and the vast realm of the rest.

Page 85: Computational intelligence  for data understanding

Neurons learning complex logic

Boolean functions are difficult to learn: n bits but 2^n nodes => combinatorial complexity; similarity is not useful, since for parity all neighbors are from the wrong class. MLP networks have difficulty learning functions that are highly non-separable.

Projection on W = (1,1,...,1) gives clusters with 0, 1, 2 ... n bits; the solution requires abstract imagination + easy categorization.

Ex. of 2-4D parity problems.

Neural logic can solve it without counting; find a good point of view.
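A small sketch illustrating this projection: all n-bit vectors projected on W = (1,...,1) fall into n+1 pure clusters whose classes alternate (NumPy only):

```python
import numpy as np
from itertools import product

n = 4
bits = np.array(list(product([0, 1], repeat=n)))     # all 2^n binary vectors
parity = bits.sum(axis=1) % 2                        # class label: odd/even number of 1s
y = bits @ np.ones(n)                                # projection on W = (1, 1, ..., 1)

for k in range(n + 1):
    labels = np.unique(parity[y == k]).tolist()
    print(f"projected value {k}: {int(np.sum(y == k))} vectors, classes {labels}")
# Each projected cluster contains a single class and the classes alternate along
# the line, so the problem is (n+1)-separable even though it is far from
# linearly separable.
```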

Page 86: Computational intelligence  for data understanding

Easy and difficult problems

Linear separation: a good goal if simple topological deformation of decision borders is sufficient. Linear separation of such data is possible in higher dimensional spaces; this is frequently the case in pattern recognition problems. RBF/MLP networks with one hidden layer solve such problems.

Difficult problems: disjoint clusters, complex logic.Continuous deformation is not sufficient; networks with localized functions need exponentially large number of nodes.

Boolean functions: for n bits there are K = 2^n binary vectors that can be represented as vertices of an n-dimensional hypercube. Each Boolean function is identified by K bits. BoolF(B_i) = 0 or 1 for i = 1..K leads to 2^K Boolean functions.

Ex: n=2 functions, vectors {00, 01, 10, 11}, Boolean functions {0000, 0001 ... 1111}, e.g. 0001 = AND, 0110 = XOR; each function is identified by a number from 0 to 15 = 2^K − 1.

Page 87: Computational intelligence  for data understanding

Boolean functions

n=2: 16 functions, 12 separable, 4 not separable.
n=3: 256 functions, 104 separable (41%), 152 not separable.
n=4: 64K = 65536 functions, only 1880 separable (3%).
n=5: 4G functions, but << 1% separable ... bad news!

Existing methods may learn some non-separable functions, but most functions cannot be learned !

Example: n-bit parity problem; many papers in top journals.No off-the-shelf systems are able to solve such problems.

For all parity problems SVM is below base rate! Such problems are solved only by special neural architectures or special classifiers – if the type of function is known.

But parity is still trivial ... solved by y = cos( ω Σ_{i=1}^{n} b_i ).

Page 88: Computational intelligence  for data understanding

Abstract imagination

Transformation of data to a space where clustering is easy. Intuitive answers, as propositional rules may be difficult to formulate.

Here fuzzy XOR (stimuli color/size/shape are not identical) has been transformed by two groups of neurons that react to similarity.

Network-based intuition: they know the answer, but cannot say it ... If the image of the data forms non-separable clusters in the inner (hidden) space, network outputs will often be wrong.

Page 89: Computational intelligence  for data understanding

3-bit parity in 2D and 3D

Output is mixed, errors are at the base level (50%), but in the hidden space ...

Conclusion: separability in the hidden space is perhaps too much to desire ... inspection of clusters is sufficient for perfect classification; add second Gaussian layer to capture this activity; train second RBF on the data (stacking), reducing number of clusters.

Page 90: Computational intelligence  for data understanding

Goal of learning

If simple topological deformation of decision borders is sufficient, linear separation is possible in higher dimensional spaces, “flattening” non-linear decision borders; this is frequently the case in pattern recognition problems. RBF/MLP networks with one hidden layer solve the problem.

For complex logic this is not sufficient; networks with localized functions need exponentially large number of nodes.

Such situations arise in AI reasoning problems, real perception, object recognition, text analysis, bioinformatics ...

Linear separation is too difficult, set an easier goal. Linear separation: projection on 2 half-lines in the kernel space: line y=WX, with y<0 for class – and y>0 for class +.

Simplest extension: separation into k-intervals. For parity: find direction W with minimum # of intervals, y=W.X

Page 91: Computational intelligence  for data understanding

3D case

3-bit functions: X = [b1 b2 b3], from [0,0,0] to [1,1,1].

f(b1, b2, b3) and ¬f(b1, b2, b3) are symmetric (color change).

8 cube vertices, 2^8 = 256 Boolean functions. 0 to 8 red vertices: 1, 8, 28, 56, 70, 56, 28, 8, 1 functions.

For an arbitrary direction W the index projection W·X gives:
k=1 in 2 cases, all 8 vectors in 1 cluster (all black or all white)
k=2 in 14 cases, 8 vectors in 2 clusters (linearly separable)
k=3 in 42 cases, clusters B R B or W R W
k=4 in 70 cases, clusters R W R W or W R W R
Symmetrically, k=5-8 for 70, 42, 14, 2 cases. Most logical functions have 4 or 5-separable projections.

Learning = find the best projection for each function. The number of k = 1 to 4-separable functions is: 2, 102, 126 and 26; thus 230 of the 256 functions may be learned using 3-separability.

Page 92: Computational intelligence  for data understanding

4D case

4-bit functions: X = [b1 b2 b3 b4], from [0,0,0,0] to [1,1,1,1].

16 cube vertices, 2^16 = 65536 = 64K functions.

Random initialization of a single perceptron has 39.2% chance of creating 8 or 9 clusters for the 4-bit data.

Learning optimal directions W finds:
k=1 in 2 cases, all 16 vectors in 1 cluster (all black or all white)
k=2 in 2.9% of cases (or 1880), 16 vectors in 2 clusters (linearly separable)
k=3 in 22% of all cases, clusters B R B or W R W
k=4 in 45% of all cases, clusters R W R W or W R W R
k=5 in 29% of all cases.

Hypothesis: for n-bits highest k=n+1 ?

For 5 bits there are 32 vertices and already 2^32 = 4G ≈ 4.3·10^9 functions. Most are 5-separable, less than 1% are linearly separable!

Page 93: Computational intelligence  for data understanding

Biological justification

• Cortical columns may learn to respond to stimuli with complex logic, resonating in different ways.
• The second column will learn without problems that such different reactions have the same meaning: inputs x_i and training targets y_j are the same => Hebbian learning W_ij ~ x_i y_j => identical weights.

• Effect: same line y=W.X projection, but inhibition turns off one perceptron when the other is active.

• Simplest solution: oscillators based on a combination of two neurons, σ(W·X − b) − σ(W·X − b'), give localized projections!

• We have used them in MLP2LN architecture for extraction of logical rules from data.

• Note: k-sep. learning is not a multistep output neuron, targets are not known, same class vectors may appear in different intervals!

• We need to learn how to find intervals and how to assign them to classes; new algorithms are needed to learn it!

Page 94: Computational intelligence  for data understanding

Network solution

Can one learn the simplest model for an arbitrary Boolean function? 2-separable (linearly separable) problems are easy; non-separable problems may be broken into k-separable ones, k > 2.

Blue: sigmoidal neurons with threshold, brown – linear neurons.

Diagram: inputs X1–X4 feed a linear node computing y = W·X; sigmoidal window units σ(y + θ_i), combined with +1/−1 weights, implement the intervals.

Neural architecture for k=4 intervals, or 4-separable problems.

Page 95: Computational intelligence  for data understanding

QPC Projection Pursuit

What is needed to learn data with complex logic?
• cluster non-local areas in the X space, use W·X;
• capture local clusters after transformation, use G(W·X).

What will solve it? Projected clusters!

1. A class of constructive neural network solution with G(W.X) functions with special training algorithms.

2. Maximize the leave-one-out error after projection: take localized function G, count in a soft way cases from the same class as X.

Q(W) = Σ_X Σ_{X'≠X} H(C_X, C_X') G(W·X − W·X')

Classification: H(C_X, C_X') = 2δ(C_X, C_X') − 1 ∈ {−1, +1}; regression: H ~ |Y_X − Y_X'|.

Projection may be done directly to 1D or 2D for visualization, or to higher D for dimensionality reduction, if W has D columns.

Page 96: Computational intelligence  for data understanding

Parity n=9

Simple gradient learning; the QPC index is shown below.

Page 97: Computational intelligence  for data understanding

Learning hard functions

Training almost perfect for parity, with linear growth in the number of vectors for k-sep. solution created by the constructive neural algorithm.

Page 98: Computational intelligence  for data understanding

Real data

Simple data – similar results, but much simpler models.

Page 99: Computational intelligence  for data understanding

Transformation-based framework

Find the simplest model suitable for given data, transforming non-separable data into a form that is easy to handle: simpler models generalize better and aid interpretation.

Compose transformations (neural layers), for example:

• Matching pursuit network for signal decomposition, QPC index.
• PCA network, with each node computing a principal component.
• LDA nets, each node computes an LDA direction (including FDA).
• ICA network, nodes computing independent components.
• KL, or Kullback-Leibler network with orthogonal or non-orthogonal components; maximization of mutual information is a special case.
• χ² and other statistical tests for dependency to aggregate features.
• Factor analysis network, computing common and unique factors.

Evolving Transformation Systems (Goldfarb 1990-2006), giving unified paradigm for inductive learning and structural representations.

Page 100: Computational intelligence  for data understanding

Linear separability

SVM visualization of Leukemia microarray data.

Page 101: Computational intelligence  for data understanding

Approximate separability

SVM visualization of Heart dataset, overlapping clusters, information in the data is insufficient for perfect classification.

Page 102: Computational intelligence  for data understanding

Interval transformation

Parity data: k-separability is much easier to achieve than full linear separability.

Page 103: Computational intelligence  for data understanding

Rules

QPC visualization of Monks dataset, two logical rules are needed.

Page 104: Computational intelligence  for data understanding

Complex distribution

QPC visualization of concentric rings in 2D with strong noise in another 2D; nearest neighbor or combinations of ellipsoidal densities.

Page 105: Computational intelligence  for data understanding

Summary

• Challenging data cannot be handled with existing DM tools.
• The similarity-based framework enables meta-learning as search in the model space; heterogeneous systems add fine granularity.
• No off-the-shelf classifiers are able to learn difficult Boolean functions.
• Visualization of hidden neurons' activity shows that frequently perfect but non-separable solutions are found despite base-rate outputs.
• Linear separability is not the best goal of learning; other targets that allow for easy handling of final non-linearities should be defined.
• k-separability defines complexity classes for non-separable data.
• Transformation-based learning shows the need for a component-based approach to DM, discovery of the simplest models.
• Most known and new learning methods result from such a framework; neurocognitive inspirations extend it even further.

Many interesting new ideas arise from this line of thinking.

Page 106: Computational intelligence  for data understanding

Psychometry

Use CI to find knowledge, create an Expert System.

MMPI (Minnesota Multiphasic Personality Inventory) psychometric test.Printed forms are scanned or computerized version of the test is used.

• Raw data: 550 questions, e.g.: “I am getting tired quickly”: Yes – Don't know – No.

• Results are combined into 10 clinical scales and 4 validity scales using fixed coefficients.

• Each scale measures tendencies towards hypochondria, schizophrenia, psychopathic deviations, depression, hysteria, paranoia etc.

Page 107: Computational intelligence  for data understanding

Psychometry: goal

• There is no simple correlation between single values and the final diagnosis.

• Results are displayed in form of a histogram, called ‘a psychogram’. Interpretation depends on the experience and skill of an expert, takes into account correlations between peaks.

Goal: an expert system providing evaluation and interpretation of MMPI tests at an expert level.

Problem: experts agree only about 70% of the time; alternative diagnosis and personality changes over time are important.

Page 108: Computational intelligence  for data understanding

Psychometric data

1600 cases for women, the same number for men.

27 classes: norm, psychopathic, schizophrenia, paranoia, neurosis, mania, simulation, alcoholism, drug addiction, criminal tendencies, abnormal behavior due to ...

Extraction of logical rules: 14 scales = features.

Define linguistic variables and use FSM, MLP2LN, SSV - giving about 2-3 rules/class.

Page 109: Computational intelligence  for data understanding

Psychometric results

10-fold CV accuracy for FSM is 82-85%, for C4.5 79-84%. Input uncertainty ±G_x around 1.5% (best ROC) improves FSM results to 90-92%.

Method Data N. rules Accuracy +Gx%

C 4.5 ♀ 55 93.0 93.7

♂ 61 92.5 93.1

FSM ♀ 69 95.4 97.6

♂ 98 95.9 96.9

Page 110: Computational intelligence  for data understanding

Psychometric Expert

Probabilities for different classes. For greater uncertainties more classes are predicted.

Fitting the rules to the conditions: typically 3-5 conditions per rule; Gaussian distributions around measured values that fall into the rule interval are shown in green.

Verbal interpretation of each case, rule and scale dependent.

Page 111: Computational intelligence  for data understanding

Visualization

Probability of classes versus input uncertainty.

Detailed input probabilities around the measured values vs. change in a single scale; changes over time define the ‘patient's trajectory’.

Interactive multidimensional scaling: zooming on the new case to inspect its similarity to other cases.

Page 112: Computational intelligence  for data understanding

Summary

Computational intelligence methods: neural, decision trees, similarity-based and others, help to understand the data. Understanding data is achieved by rules, prototypes, visualization.

Small is beautiful => simple is the best!
Simplest possible, but not simpler – regularization of models;
accurate but not too accurate – handling of uncertainty;
high confidence, but not paranoid – rejecting some cases.

• Challenges:

hierarchical systems – higher-order rules, missing information; discovery of theories rather than data models; reasoning in complex domains/objects; integration with image/signal analysis; applications in data mining, bioinformatics, signal processing, vision ...

Page 113: Computational intelligence  for data understanding

The End

Papers describing in details some of the ideas presented here may be accessed through my home page:

Google: W Duch or

http://www.is.umk.pl/~duch

We are slowly addressing the challenges. The methods used here (+ many more) are included in the Ghostminer, data mining software developed by my group, in collaboration with FQS, Fujitsu Kyushu Systems

http://www.fqs.pl/ghostminer/