
DATA MINING


USING

NEURAL NETWORKS

By:

Miss. Mukta Arankalle

BE II (Comp)

Roll No.: 201

CERTIFICATE

This is to certify that

Miss. Mukta Arankalle

Roll no. 201

BE II

Has completed the necessary seminar work and prepared the bona fide report on

DATA MINING USING NEURAL NETWORKS

In a satisfactory manner, as partial fulfillment of the requirements for the degree of

B.E (Computer)

Of

University of Pune

In the academic year 2002-2003

Date:

Place:

Prof. G. P. Potdar

Prof. Dr. C V K Rao

Internal Guide Seminar coordinator

H.O.D

DEPARTMENT OF COMPUTER ENGINEERING

PUNE INSTITUTE OF COMPUTER TECHNOLOGY

PUNE - 43

ACKNOWLEDGEMENTS

I would like to extend my sincere gratitude to Prof. G. P. Potdar (H.O.D., I.T.), P.I.C.T., for his encouragement and guidance.

I would like to thank Mr. Piyush Menon, (B.E. Comp) A.I.T., for his valuable help.

I would also like to thank Prof. Dr. C. V. K. Rao, (H.O.D. Computer Dept) and Prof. R. B. Ingle, our internal guide.

- Mukta Arankalle, BE (Comp), PICT.

INDEX

1. Data Mining
   1.1 Introduction
   1.2 What is Data Mining?
   1.3 Knowledge Discovery in Database
   1.4 Other Related Areas
   1.5 Data Mining Techniques

2. Neural Networks
   2.1 Introduction
   2.2 Structure and Function of a Single Neuron
   2.3 A Neural Net
   2.4 Training the Neural Net

3. Neural Networks Based Data Mining
   3.1 Introduction
   3.2 Suitability of Neural Networks for Data Mining
   3.3 Challenges Involved
   3.4 Advantages
   3.5 Extraction Methods
   3.6 The TREPAN Algorithm

4. Conclusion

References

1. DATA MINING

1.1 Introduction

The past two decades have seen a dramatic increase in the amount of information, or data, being stored in electronic format. This accumulation of data has taken place at an explosive rate. It has been estimated that the amount of information in the world doubles every 20 months, and the size and number of databases are increasing even faster. The increased use of electronic data gathering devices such as point-of-sale terminals and remote sensing devices has contributed to this explosion of available data. Effectively utilizing these massive volumes of data is becoming a major challenge for all enterprises.

Data storage became easier as large amounts of computing power became available at low cost; the falling cost of processing power and storage made data cheap to collect and keep. New machine learning methods for knowledge representation, based on logic programming and related techniques, were also introduced in addition to traditional statistical analysis of data. These new methods tend to be computationally intensive, and hence create a demand for more processing power.

It was recognized that information is at the heart of business operations and that decision-makers could make use of the data stored to gain valuable insight into the business. Database Management systems gave access to the data stored but this was only a small part of what could be gained from the data. Traditional on-line transaction processing systems, OLTPs, are good at putting data into databases quickly, safely and efficiently but are not good at delivering meaningful analysis in return. Analyzing data can provide further knowledge about a business by going beyond the data explicitly stored to derive knowledge about the business. This is where Data Mining has obvious benefits for any enterprise.

1.2 What is Data Mining?

1.2.1 Definition

Researchers William J Frawley, Gregory Piatetsky-Shapiro and Christopher J Matheus have defined Data Mining as:

Data mining is the search for relationships and global patterns that exist in large databases but are `hidden' among the vast amount of data, such as a relationship between patient data and their medical diagnosis. These relationships represent valuable knowledge about the database and the objects in the database and, if the database is a faithful mirror, of the real world registered by the database.

The analogy with the mining process is described as:

Data mining refers to "using a variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in the areas such as decision support, prediction, forecasting and estimation. The data is often voluminous, but as it stands of low value as no direct use can be made of it; it is the hidden information in the data that is useful", Clementine User Guide, a data mining toolkit.

1.2.2 Explanation

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviours, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analysis offered by data mining moves beyond the analysis of past events provided by the retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

The data mining process consists of three basic stages: exploration, model building and pattern definition. Fig. 1.1 shows a simple data mining structure.

Fig. 1.1 Data Mining Structure (data mining branches into discovery, covering conditional logic, affinities and associations, and trends and variations; predictive modeling, covering outcome prediction and forecasting; and forensic analysis, covering deviation detection and link analysis)

Basically data mining is concerned with the analysis of data and the use of software techniques for finding patterns and regularities in sets of data. It is the computer which is responsible for finding the patterns by identifying the underlying rules and features in the data. The idea is that it is possible to strike gold in unexpected places as the data mining software extracts patterns not previously discernable or so obvious that no-one has noticed them before.

Data mining analysis tends to work from the data up and the best techniques are those developed with an orientation towards large volumes of data, making use of as much of the collected data as possible to arrive at reliable conclusions and decisions. The analysis process starts with a set of data, uses a methodology to develop an optimal representation of the structure of the data during which time knowledge is acquired. Once knowledge has been acquired this can be extended to larger sets of data working on the assumption that the larger data set has a structure similar to the sample data. Again this is analogous to a mining operation where large amounts of low grade materials are sifted through in order to find something of value.

1.2.3 Example

A home finance loan actually has an average life span of only 7 to 10 years, due to prepayment. Prepayment means that the loan is paid off early, rather than at the end of, say, 25 years. People prepay loans when they refinance or when they sell their home. The financial return that a home-finance institution derives from a loan depends on its life span. Therefore it is necessary for financial institutions to be able to predict the life spans of their loans. Rule discovery techniques are used to accurately predict the aggregate number of loan payments in a given quarter (or in a year), as a function of prevailing interest rates, borrower characteristics, and account data. This information can be used to fine-tune loan parameters such as interest rates, points and fees, in order to maximize profits.

1.3 Knowledge Discovery in Database (KDD)

1.3.1 KDD and Data Mining

Knowledge Discovery in Databases (KDD) was formalized in 1989, with reference to the general concept of being broad and high level in the pursuit of seeking knowledge from data. The term data mining was coined soon afterwards; this high-level application technique is used to present and analyze data for decision-makers.

Data mining is only one of the many steps involved in knowledge discovery in databases. The KDD process tends to be highly iterative and interactive. Data mining analysis tends to work up from the data, and the best techniques are developed with an orientation towards large volumes of data, making use of as much data as possible to arrive at reliable conclusions and decisions. The analysis process starts with a set of data and uses a methodology to develop an optimal representation of the structure of the data, during which knowledge is acquired. Once knowledge is acquired, this can be extended to larger sets of data on the assumption that the larger data set has a structure similar to the sample data set.

Fayyad distinguishes between KDD and data mining by giving the following definitions:

Knowledge discovery in databases is the process of identifying a valid, potentially useful and ultimately understandable structure in data.

Data mining is a step in the KDD process concerned with the algorithmic means by which patterns or structures are enumerated from the data under acceptable computational efficiency limits.

The structures that are the outcome of the data mining process must meet certain conditions so that these can be considered as knowledge. These conditions are: validity, understandability, utility, novelty and interestingness.

1.3.2 Stages of KDD

The stages of KDD, starting with the raw data and finishing with the extracted knowledge, are given below.

Fig. 1.2 Stages of KDD (data passes through selection to give target data, preprocessing to give preprocessed data, transformation to give transformed data, data mining to give patterns, and interpretation and evaluation to give knowledge)

Selection: This stage is concerned with selecting or segmenting the data that are relevant to some criteria. E.g.: for credit card customer profiling, we extract the type of transactions for each type of customer, and we may not be interested in the details of the shop where the transaction takes place.

Preprocessing: Preprocessing is the data cleaning stage where unnecessary information is removed. E.g.: it is unnecessary to note the sex of a patient when studying pregnancy. This stage reconfigures the data to ensure a consistent format, as there is a possibility of inconsistent formats.

Transformation: The data is not merely transferred across, but transformed in order to be suitable for the task of data mining. In this stage, the data is made usable and navigable.

Data Mining: This stage is concerned with the extraction of patterns from the data.

Interpretation and Evaluation: The patterns obtained in the data mining stage are converted into knowledge, which in turn, is used to support decision-making.
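
The stages above can be pictured as a small pipeline in which the output of each stage feeds the next. The following Python sketch is purely illustrative; the records, field names and the simple pattern-counting step stand in for whatever selection criteria, cleaning rules and mining algorithm a real application would use.

# Illustrative KDD pipeline: selection -> preprocessing -> transformation
# -> data mining -> interpretation. All records and rules are made up.

raw_records = [
    {"customer": "A", "item": "bread", "amount": 30, "shop": "S1"},
    {"customer": "A", "item": "milk",  "amount": 20, "shop": "S2"},
    {"customer": "B", "item": "milk",  "amount": 25, "shop": "S1"},
]

def select(records):
    # Selection: keep only the fields relevant to the analysis.
    return [{"customer": r["customer"], "item": r["item"], "amount": r["amount"]}
            for r in records]

def preprocess(records):
    # Preprocessing: remove records with missing or inconsistent values.
    return [r for r in records if r["amount"] is not None and r["amount"] > 0]

def transform(records):
    # Transformation: reshape into one basket of items per customer.
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return baskets

def mine(baskets):
    # Data mining: count how often each item appears across baskets.
    counts = {}
    for items in baskets.values():
        for item in items:
            counts[item] = counts.get(item, 0) + 1
    return counts

patterns = mine(transform(preprocess(select(raw_records))))
print(patterns)   # interpretation and evaluation act on these patterns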

1.4 Other Related Areas

Data Mining has drawn on a number of other fields, some of which are listed below.

1.4.1 Statistics

Statistics is a theory-rich approach to data analysis, which generates results that can be overwhelming and difficult to interpret. Notwithstanding this, statistics is one of the foundations on which data mining technology is built. Statistical analysis systems are used by analysts to detect unusual patterns and to explain patterns using statistical models. Statistics has an important role to play, and data mining will not replace such analyses; rather, statistics can be applied to more directed analyses based on the results of data mining.

1.4.2 Machine Learning

Machine learning is the automation of a learning process, and learning is tantamount to the construction of rules based on observations. This is a broad field, which includes not only learning from examples, but also reinforcement learning, learning with a teacher, etc. A learning algorithm takes the data set and its accompanying information as input and returns a statement, e.g. a concept representing the results of learning, as output.

1.4.3 Inductive Learning

Induction is the inference of information from data, and inductive learning is the model building process where the environment, i.e. the database, is analyzed with a view to finding patterns. Similar objects are grouped in classes and rules are formulated whereby it is possible to predict the class of unseen objects. This process of classification identifies classes such that each class has a unique pattern of values, which forms the class description. The environment is dynamic in nature, hence the model must be adaptive, i.e. it should be able to learn. Inductive learning, where the system infers knowledge itself from observing its environment, has two main strategies: supervised learning and unsupervised learning.

1.4.4 Supervised Learning

This is learning from examples where a teacher helps the system construct a model by defining classes and supplying examples of each class.

1.4.5 Unsupervised Learning

This is learning from observation and discovery.

1.4.6 Mathematical Programming

Most of the major data mining tasks can be equivalently formulated as problems in mathematical programming for which efficient algorithms are available. It provides a new insight into the problems of data mining.

1.5 Data Mining Techniques

Researchers identify two fundamental goals of data mining: prediction and description. Prediction makes use of existing variables in the database in order to predict unknown or future values of interest, and description focuses on finding patterns describing the data and their subsequent presentation for user interpretation. The relative emphasis on prediction and description differs with the underlying application and technique.

There are several data mining techniques fulfilling these objectives. Some of these are associations, classifications, sequential patterns and clustering.

Another approach to the study of data mining techniques is to classify them as user-guided (verification-driven) data mining and discovery-driven (automatic discovery of rules) data mining. Most data mining techniques have elements of both models.

1.5.1 Data Mining Models

Verification Model:

The verification model takes a hypothesis from the user and tests its validity against the data. The emphasis is with the user, who is responsible for formulating the hypothesis and issuing the query on the data to affirm or negate it.

In a marketing division, for example, with a limited budget for a mailing campaign to launch a new product it is important to identify the section of the population most likely to buy the new product. The user formulates a hypothesis to identify potential customers and the characteristics they share. Historical data about customer purchase and demographic information can then be queried to reveal comparable purchases and the characteristics shared by those purchasers. The whole operation can be repeated by successive refinements of hypothesis until the required limit is reached.

Discovery Model:

The discovery model differs in its emphasis in that it is the system automatically discovering important information hidden in the data. The data is sifted in search of frequently occurring patterns, trends and generalizations about the data without intervention or guidance from the user. The discovery or data mining tools aim to reveal a large number of facts about the data in as short a time as possible.

An example of such a model is a supermarket database, which is mined to discover the particular groups of customers to target for a mailing campaign. The data is searched with no hypothesis in mind other than for the system to group the customers according to the common characteristics found.

1.5.2 Discovery Driven Tasks

The typical discovery driven tasks are:

Association rules:

An association rule is an expression of the form X => Y, where X and Y are sets of items. The intuitive meaning of such a rule is that a transaction of the database that contains X tends to also contain Y. Given a database, the goal is to discover all the rules that have support and confidence greater than or equal to the minimum support and confidence, respectively.

Let L = {l1, l2, ..., lm} be a set of literals, called items. Let D, the database, be a set of transactions, where each transaction T is a set of items. T supports an item x if x is in T. T is said to support a subset of items X if T supports each item x in X. X => Y holds with confidence c if c% of the transactions in D that support X also support Y. The rule X => Y has support s in the transaction set D if s% of the transactions in D support X U Y. Support measures how often X and Y occur together as a percentage of the total transactions. Confidence measures how much a particular item is dependent on another. Patterns with a combination of intermediate values of confidence and support provide the user with interesting and previously unknown information.
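
As a small illustration of these measures, the sketch below computes the support and confidence of a rule X => Y over a toy transaction database; the items and transactions are invented for the example.

# Support and confidence of a rule X => Y over a toy transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def support(itemset, db):
    # Fraction of transactions that contain every item in `itemset`.
    return sum(itemset <= t for t in db) / len(db)

def confidence(x, y, db):
    # Of the transactions supporting X, the fraction that also support Y.
    return support(x | y, db) / support(x, db)

x, y = {"bread"}, {"milk"}
print(support(x | y, transactions))    # support of the rule X => Y (0.5)
print(confidence(x, y, transactions))  # confidence of X => Y (about 0.67)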

Clustering:

Clustering is a method of grouping data into different groups, so that the data in each group share similar trends and patterns. The algorithm tends to automatically partition the data space into a set of regions or clusters, to which the examples in the tables are assigned, either deterministically or probability-wise. The goal of the process is to identify all sets of similar examples in the data, in some optimal fashion.

Clustering according to similarity is a concept which appears in many disciplines. If a measure of similarity is available, then there are a number of techniques for forming clusters. Another approach is to build set functions that measure some particular property of groups. This latter approach achieves what is known as optimal partitioning.
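
As an illustration of clustering according to similarity, the following sketch implements a very small k-means procedure using Euclidean distance as the similarity measure; the points, the choice of k and the iteration count are arbitrary.

import random

# Minimal k-means sketch: group points around k centres by Euclidean distance.
def kmeans(points, k, iterations=20):
    centers = random.sample(points, k)
    for _ in range(iterations):
        # Assign each point to its nearest centre.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            d = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centers]
            clusters[d.index(min(d))].append((x, y))
        # Recompute each centre as the mean of its cluster.
        for i, c in enumerate(clusters):
            if c:
                centers[i] = (sum(p[0] for p in c) / len(c),
                              sum(p[1] for p in c) / len(c))
    return centers, clusters

data = [(1, 1), (1.2, 0.8), (5, 5), (5.1, 4.9), (9, 1)]
centers, clusters = kmeans(data, k=2)
print(centers)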

Classification Rules:

Classification involves finding rules that partition the data into disjoint groups. The input for classification is the training data set, whose class labels are already known. Classification analyses the training data set, constructs a model based on the class label, and aims to assign a class label to future unlabelled records. Since the class field is known, this type of classification is known as supervised learning.

There are several classification discovery models. They are: the decision tree, neural networks, genetic algorithms and some statistical models.

Frequent Episodes:

Frequent episodes are sequences of events that occur frequently and close to each other; they are extracted from time sequences. How close events have to be to count as an episode is domain dependent; this is given by the user as input, and the output is a set of prediction rules for the time sequences.

Deviation Detection:

Deviation detection identifies outlying points in a particular data set and explains whether they are due to noise or other impurities present in the data or to other causes. It is usually applied together with database segmentation, and is often the source of true discovery, since the outliers express deviation from some previously known expectation and norm. By calculating the values of measures on the current data and comparing them with previous data as well as with normative data, the deviations can be obtained.

1.5.3 Data Mining Methods

Various data mining methods are:

Neural Networks

Genetic Algorithms

Rough Sets Techniques

Support Vector Machines

Cluster Analysis

Induction

OLAP

Data Visualization

2. NEURAL NETWORKS

2.1 Introduction

Anyone can see that the human brain is superior to a digital computer at many tasks. A good example is the processing of visual information: a one-year-old baby is much better and faster at recognizing objects, faces, and so on than even the most advanced AI system running on the fastest supercomputer. The brain has many other features that would be desirable in artificial systems.

This is the real motivation for studying neural computation. It is an alternative paradigm to the usual one (based on a programmed instruction sequence), which was introduced by von Neumann and has been used as the basis of almost all machine computation to date. It is inspired by the knowledge from neuroscience, though it does not try to be biologically realistic in detail.

Neural networks are an approach to computing that involves developing mathematical structures with the ability to learn. The methods are the result of academic investigations to model nervous system learning. Neural networks have the remarkable ability to derive meaning from complicated or imprecise data and can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. A trained neural network can be thought of as an "expert" in the category of information it has been given to analyze. This expert can then be used to provide projections given new situations of interest and answer "what if" questions.

Neural networks use a set of processing elements (or nodes) analogous to neurons in the brain. These processing elements are interconnected in a network that can then identify patterns in data once it is exposed to the data, i.e. the network learns from experience just as people do. This distinguishes neural networks from traditional computing programs, which simply follow instructions in a fixed sequential order.

2.2 Structure and Function of a Single Neuron

McCulloch and Pitts (in 1943) proposed a simple model of a neuron as a binary threshold unit. Specifically, the model neuron computes a weighted sum of its inputs from other units, and outputs a one or a zero according to whether this sum is above or below a certain threshold.

The figure below shows the structure of a typical artificial neuron:


Fig. 2.1 Structure of a Single Neuron (inputs with weights wi1, wi2, wi3 feed a linear combiner, whose result is compared against a threshold by the activation function to produce the output)

Explanation:

The neuron has a set of connections, also called synapses (links), that connect it to inputs, outputs, or other neurons. A linear combiner is a function that takes all the inputs and produces a single value; a simple way of doing this is to add together the weighted inputs, so the linear combiner produces (wi1 * x1 + wi2 * x2 + ... + win * xn).

The Activation Function is a non-linear function, which takes any input from minus infinity to plus infinity and squeezes it into the -1 to 1 or into 0 to 1 interval.

This simple model of a neuron makes the following assumptions:

1. The position on the neuron (node) of the incoming synapse (connection) is irrelevant.

2. Each node has a single output value, distributed to other nodes via outgoing links, irrespective of their positions.

3. All inputs come in at the same time or remain activated at the same level long enough for computation to occur. (An alternative is to postulate the existence of buffers to store weighted inputs inside nodes).

The threshold behaviour is expressed using the Heaviside (unit step) function, as shown below:

ni(t+1) = Θ( Σj wij nj(t) - µi )

Here ni is either 1 or 0, and represents the state of neuron i as firing or not firing respectively. Time t is taken as discrete, with one time unit elapsing per processing step. Θ(x) is the unit step function, or Heaviside function:

Θ(x) = 1, if x >= 0
     = 0, otherwise.

The weight wij represents the strength of the synapse connecting neuron j to neuron i. It can be positive or negative, corresponding to an excitatory or inhibitory synapse respectively. It is zero if there is no synapse between i and j. The cell-specific parameter µi is the threshold value for unit i; the weighted sum of inputs must reach or exceed the threshold for the neuron to fire.

A simple generalization of the above equation which will consider the activation function is:

ni = g( Σj wij nj - µi )

The number ni is now continuous valued and is called state or activation of unit i. The function g(x) is the activation function.

Rather than writing the time t and t+1 explicitly, we now simply give a rule for updating ni whenever that occurs. Units are often updated asynchronously, in random order, at random times.
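
The computation performed by such a unit can be sketched in a few lines of Python. The weights, inputs and threshold below are invented for illustration; the step function gives the original McCulloch-Pitts behaviour, while the sigmoid stands in for a typical continuous activation function g(x).

import math

# Sketch of a single McCulloch-Pitts style unit: weighted sum of inputs
# compared against a threshold (weights, inputs and threshold are made up).

def step(x):
    # Heaviside / unit step function.
    return 1 if x >= 0 else 0

def sigmoid(x):
    # A common continuous activation function g(x).
    return 1.0 / (1.0 + math.exp(-x))

weights   = [0.5, -1.0, 2.0]   # w_ij: strengths of incoming synapses
inputs    = [1, 0, 1]          # n_j: states of the presynaptic units
threshold = 1.0                # mu_i: the unit's threshold

net = sum(w * x for w, x in zip(weights, inputs)) - threshold

print(step(net))     # binary firing / not firing (McCulloch-Pitts)
print(sigmoid(net))  # continuous-valued activation n_i = g(net)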

2.3 A Neural Net

A single neuron is insufficient for many practical problems, and networks with a large number of nodes are frequently used. The way the nodes are connected determines how computation proceeds, and constitutes an important early design decision by a neural network developer.

2.3.1 Fully Connected Networks

In this architecture, every node is connected to every node, and these connections may be either excitatory (positive weights), inhibitory (negative weights), or irrelevant (almost zero weights).

Fig 2.2 Fully Connected Asymmetric Network

Fig 2.3 Fully Connected Symmetric Network

In a fully connected asymmetric network, the connection from one node to another may carry a different weight than the connection from the second node to the first. In a symmetric network, the weight that connects one node to another is equal to its symmetric reverse.

Hidden nodes are the nodes, whose interaction with the external environment is indirect.

2.3.2 Layered Networks

These are networks in which nodes are partitioned into subsets called layers, with no connections that lead from layer j to layer k if j>k.

A single input arrives at and is distributed to other nodes by each node of the input layer or layer 0; no other computation occurs at nodes in layer 0, and there are no intra-layer connections among nodes in this layer. Connections with arbitrary weights, may exist from any node in layer i to any node in layer j for j >= i; intra-layer connections may exist.

Fig 2.4 A Layered Network (layer 0 is the input layer, layers 1 and 2 are hidden layers, and layer 3 is the output layer)

2.3.3 Acyclic Networks

This is a subclass of layered networks in which there are no intra-layer connections, as shown in the fig. 2.5. A connection may exist between any node in layer i and any node in layer j for i < j, but a connection is not allowed for i = j. Networks that are not acyclic are referred to as recurrent networks.

Fig 2.5 An Acyclic Network (layer 0 is the input layer, layers 1 and 2 are hidden layers, and layer 3 is the output layer)

2.3.4 Feedforward Networks

Fig 2.6 A Feedforward 3-2-3-2 Network (layer 0 is the input layer, layers 1 and 2 are hidden layers, and layer 3 is the output layer)

This is a subclass of acyclic networks in which a connection is allowed from a node in layer i only to nodes in layer i+1 as shown in fig. 2.6. These networks are succinctly described by a sequence of numbers indicating the number of nodes in each layer.

These networks, generally with no more than 4 such layers, are among the most common neural nets in use. Conceptually, nodes in successively higher layers abstract successively higher-level features from preceding layers.
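
A forward pass through such a network is a chain of weighted sums and activation functions, one per layer. The sketch below builds a 3-2-3-2 network with random weights and sigmoid units purely to illustrate the layer-by-layer computation; it is not a trained model.

import numpy as np

# Forward pass through a 3-2-3-2 feedforward network (random weights,
# sigmoid units); purely illustrative of the layer-by-layer computation.

rng = np.random.default_rng(0)
layer_sizes = [3, 2, 3, 2]                    # nodes per layer, input to output
weights = [rng.standard_normal((m, n))        # weight matrix between layers
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases  = [rng.standard_normal(n) for n in layer_sizes[1:]]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x):
    a = np.asarray(x, dtype=float)
    for W, b in zip(weights, biases):
        a = sigmoid(a @ W + b)                # each layer feeds only the next
    return a

print(forward([0.2, 0.7, 0.1]))               # activations of the 2 output nodes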

2.3.5 Modular Neural Networks

Most problems are solved using neural networks whose architecture consists of several modules, with sparse interconnections between modules. Modularity allows the neural network developer to solve smaller tasks separately using small (neural network) modules and then combine these modules in a logical manner. Modules can be organized in several different ways, some of which are: hierarchical organization, successive refinement and input modularity.

2.4 Training the Neural Net

In order to fit a particular artificial neural network (ANN) to a particular problem, it must be trained to generate a correct response for a given set of inputs. Unsupervised training may be used when a clear link between data sets and target output values does not exist. Supervised training involves providing an ANN with specified input and output values, and allowing it to iteratively reach a solution.

2.4.1 Perceptron Learning Rule

This was the first learning scheme in neural computing. The weights are changed by an amount proportional to the difference between the desired output and the actual output. If W is the weight vector and ΔWi is the change in the ith weight, then ΔWi is proportional to the error times the input. A learning rate parameter decides the magnitude of the change; if the learning rate is high, the change in the weight is bigger at every step.

ΔWi = η (D - Y) Ii

where η is the learning rate, D is the desired output, and Y is the actual output.

If the classes are visualized geometrically in n-dimensional space, then the perceptron generates descriptions of the classes in terms of a set of hyperplanes that separate these classes. When the classes are actually not linearly separable, then the perceptron (single layer) is not effective in properly classifying such cases.
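
A minimal sketch of this learning rule is shown below, applied to the linearly separable AND problem; the learning rate, the number of passes and the use of a separate bias term are illustrative choices.

# Perceptron learning rule Delta_Wi = eta * (D - Y) * Ii on a toy
# linearly separable problem (logical AND); eta and epochs are arbitrary.

data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w = [0.0, 0.0]
bias = 0.0
eta = 0.1

def predict(x):
    s = sum(wi * xi for wi, xi in zip(w, x)) + bias
    return 1 if s >= 0 else 0

for _ in range(20):                       # repeat until the weights settle
    for x, d in data:
        y = predict(x)
        for i in range(len(w)):
            w[i] += eta * (d - y) * x[i]  # weight change: error times input
        bias += eta * (d - y)

print(w, bias, [predict(x) for x, _ in data])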

2.4.2 Training in Multi-Layer Perceptron

The MLP overcomes the above shortcoming of the single layer perceptron. The idea is to carry out the computation layer-wise, moving in the forward direction. Similarly, the weight adjustment can be done layer-wise, by moving in a backward direction. For the nodes in the output layer, it is easy to compute the error, as we know the actual outcome and the desired result. For the nodes in the hidden layers, since we do not know the desired result, we propagate the error computed in the last layer backward. This process gives a change in the weight for the edges layer-wise. This standard method used in training MLPs is called the back propagation algorithm.

Formally, the training steps consist of:

Forward Pass: The outputs and the error at the output units are calculated.

Backward Pass: The output unit error is used to alter weights on the output units. Then the error, at the hidden nodes is calculated, and weights on hidden nodes are altered using these values.

For each training example, a forward pass and a backward pass are performed. This is repeated over and over again, until the error is at an acceptably low level.
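
The following sketch shows these two passes for a small multi-layer perceptron learning the XOR function with sigmoid units. The architecture (2 inputs, 4 hidden units, 1 output), learning rate and number of passes are arbitrary illustrative choices, and convergence depends on the random initialization.

import numpy as np

# Minimal backpropagation sketch: a 2-4-1 MLP learning XOR.
rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.standard_normal((2, 4)), np.zeros(4)
W2, b2 = rng.standard_normal((4, 1)), np.zeros(1)
eta = 0.5

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(10000):
    # Forward pass: compute hidden and output activations.
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)
    # Backward pass: output-unit error, then error propagated to hidden units.
    dY = (Y - D) * Y * (1 - Y)
    dH = (dY @ W2.T) * H * (1 - H)
    # Adjust weights layer-wise in the direction that reduces the error.
    W2 -= eta * H.T @ dY;  b2 -= eta * dY.sum(axis=0)
    W1 -= eta * X.T @ dH;  b1 -= eta * dH.sum(axis=0)

print(Y.round(2))   # outputs should move toward [0, 1, 1, 0]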

2.4.3 Training RBF networks

The RBF design involves deciding on the centres and the sharpness (standard deviations) of the Gaussians. Generally, the centres and SDs (standard deviations) are decided by examining the vectors in the training data. RBF networks are then trained in a similar way to an MLP; the output layer weights are trained using the delta rule.
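
A minimal sketch of this scheme is given below: the Gaussian centres are taken directly from the training vectors, a single shared standard deviation is assumed, and the output weights are adjusted with the delta rule. The data and parameter values are invented.

import numpy as np

# Sketch of an RBF network: Gaussian hidden units with fixed centres and
# widths, output weights trained with the delta rule. Data are made up.

X = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
D = np.sin(X)                               # toy target values

centers = X.copy()                          # centres chosen from training vectors
sigma = 0.5                                 # shared "sharpness" of the Gaussians

def rbf_layer(x):
    # Activation of each Gaussian unit for input x.
    return np.exp(-((x - centers) ** 2) / (2 * sigma ** 2))

w = np.zeros(len(centers))
eta = 0.1
for _ in range(2000):                       # delta rule on the output weights
    for x, d in zip(X, D):
        phi = rbf_layer(x)
        y = w @ phi
        w += eta * (d - y) * phi

print([round(w @ rbf_layer(x), 2) for x in X])  # approximates sin(x) at the data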

2.4.4 Competitive Learning

Competitive learning, or winner-takes-all learning, is regarded as the basis of a number of unsupervised learning strategies. It consists of k units with weight vectors wk of equal dimension to the input data. During the learning process, the unit whose weight vector is closest to the input vector x is adapted in such a way that the weight vector becomes even closer to the input vector after adaptation. The unit with the closest vector is termed the winner of the selection process. This learning strategy is generally implemented by gradually reducing the difference between the weight vector and the input vector; the actual amount of reduction at each learning step is governed by the learning rate, η. During the learning process, each weight vector converges towards the mean of the subset of input data for which its unit wins.
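
The sketch below shows this winner-takes-all update on two artificial clusters of points; the number of units, the learning rate and the number of passes are arbitrary.

import numpy as np

# Winner-takes-all competitive learning: the closest weight vector is moved
# toward each input. Data, number of units and learning rate are made up.

rng = np.random.default_rng(2)
data = np.vstack([rng.normal([0, 0], 0.2, (50, 2)),
                  rng.normal([3, 3], 0.2, (50, 2))])
W = rng.standard_normal((2, 2))      # k = 2 units, weight vectors wk
eta = 0.05

for _ in range(20):
    for x in rng.permutation(data):
        winner = np.argmin(np.linalg.norm(W - x, axis=1))
        W[winner] += eta * (x - W[winner])   # move only the winner toward x

print(W.round(2))   # weight vectors move toward the two cluster means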

2.4.5 Kohonen's SOM

The self-organising map (SOM) is a neural network model developed by Teuvo Kohonen during 1979-82. SOM is one of the most widely used unsupervised NN models and employs competitive learning steps. It consists of a layer of input units, each of which is fully connected to a set of output units. These output units are arranged in some topology (the most common choice is a 2-D grid). The input units, after receiving an input pattern X, propagate it unchanged onto the output units. Each output unit k is assigned a weight vector wk. During the learning step, the unit c with the highest activity level with respect to a randomly selected input pattern X is adapted in such a way that it exhibits an even higher activity level at a future presentation of X. This is accomplished by competitive learning.

The similarity metric is chosen to be the Euclidean distance. During the learning steps of SOM, a set of units around the winner is tuned towards the currently presented input pattern, enabling a spatial arrangement of the input patterns such that similar inputs are mapped onto regions close to each other in the grid of output units. Thus, the training process of SOM results in a topological organization of the input patterns. SOM takes a high-dimensional input and clusters it, while still retaining some topological ordering of the output. After training, an input will cause some of the output units in a particular area to become active. Such clustering (and dimensionality reduction) is very useful as a preprocessing stage, whether for further neural network processing or for other techniques.

Fig. 2.7 SOM Architecture (a high-dimensional input X is mapped, through the weight vectors wk, onto a 2-D array of output units)
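
The following sketch shows one simple version of the SOM update: the winner is found by Euclidean distance, and it and its grid neighbours are pulled toward the input, with the pull decaying with grid distance. The grid size, learning rate and neighbourhood radius are illustrative choices, and a full SOM would also shrink the learning rate and radius over time.

import numpy as np

# Minimal SOM sketch: a 4x4 grid of output units, each with a weight vector
# of the input dimension. Grid size, rates and radius are illustrative.

rng = np.random.default_rng(3)
grid, dim = 4, 3
W = rng.random((grid, grid, dim))         # weight vector wk for each output unit
coords = np.dstack(np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij"))

data = rng.random((200, dim))             # toy high-dimensional inputs X
eta, radius = 0.3, 1.5

for x in data:
    # Winner: unit whose weight vector is closest (Euclidean) to the input.
    d = np.linalg.norm(W - x, axis=2)
    wi, wj = np.unravel_index(np.argmin(d), d.shape)
    # Pull the winner and its grid neighbours toward the input pattern.
    grid_dist = np.linalg.norm(coords - np.array([wi, wj]), axis=2)
    h = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))[..., None]
    W += eta * h * (x - W)

print(W.shape)   # trained map: similar inputs activate nearby grid units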

3. NEURAL NETWORK BASED DATA MINING

3.1 Introduction

There is no general theory that specifies the type of neural network, number of layers, number of nodes (at various layers), or learning algorithm for a given problem. As such, today's network builder must experiment with a large number of neural networks before converging upon an appropriate one for the problem at hand.

3.2 Suitability of Neural Networks for Data Mining

For some problems, neural-networks provide a more suitable inductive bias for data mining than competing algorithms.

Inductive Bias: Given a fixed set of training examples, there are infinitely many models that could account for the data, and every learning algorithm has an inductive bias that determines which model it is likely to return. There are two aspects to the inductive bias of an algorithm: the restricted hypothesis space bias and the preference bias. The hypothesis space bias refers to the constraints that the algorithm places on the hypotheses it is able to construct. For example, the hypothesis space of a perceptron is limited to linear discriminant functions. The preference bias of an algorithm is the preference ordering that it places on the models within its hypothesis space. For example, most algorithms try to fit a simple hypothesis to a given training set and then progressively explore more complex hypotheses until they find an acceptable fit.

In some cases, neural networks have a more restricted hypothesis space bias than other learning algorithms. For example, sequential and temporal prediction tasks represent a class of problems for which neural networks provide a more appropriate hypothesis space. Recurrent networks, which are applied to most of these problems, are able to maintain state information from one time step to the next. This means that recurrent networks use their hidden units to learn derived information relevant to the task at hand, and they can make use of this derived information at one instant to help make a prediction for the next instant.

Although neural networks have an appropriate inductive bias for a wide class of problems, they are not commonly used for data mining tasks. There are two reasons: trained neural networks are usually not comprehensible and many neural network learning methods are slow, making them impractical for very large data sets.

3.3 Challenges Involved

The hypothesis represented by a trained neural network is defined by:

(a) The topology of the network

(b) The transfer functions used for hidden and output units and

(c) The real-valued parameters associated with the network connections (i.e., the weights) and the units (i.e., the biases of sigmoid units).

Such hypotheses are difficult to comprehend for several reasons. First, typical networks have hundreds or thousands of real-valued parameters. These parameters encode the relationships between the input features and the target values. Although single-parameter encodings are usually not hard to understand, the sheer number of parameters in a typical network can make the task of understanding them quite difficult. Second, in multi-layer networks, these parameters may represent non-linear, non-monotonic relationships between the input features and the target values. Thus, it is usually not possible to determine, in isolation, the effect of a given feature on the target value, because this effect may be mediated by the values of other features.

These non-linear, non-monotonic relationships are represented by hidden units, which combine the inputs of multiple features, thus allowing the model to take advantage of dependencies among the features. Hidden units can be thought of, as representing higher-level, derived features. Understanding of hidden units is usually difficult because they learn distributed representations. In a distributed representation, the individual units do not correspond to well understood features in a problem domain. Instead, features, which are meaningful in the context of the problem domain, are often encoded by patterns of activation across many hidden units. Similarly, each hidden unit may play a part in representing many derived features.

Consider the issue of the learning time required for neural networks. The process of learning in most neural network methods involves using some gradient-based optimization method to adjust the network's parameters. Such optimization iteratively executes two basic steps: calculating the gradient of the error function (with respect to the network's adjustable parameters) and adjusting the network's parameters in the direction suggested by the gradient. Learning can be quite slow in such methods, because the optimization may involve a large number of small steps, and the cost of calculating the gradient at each step may be quite high.

3.4 Advantages

One appealing aspect of many neural-network learning methods is that they are online algorithms, meaning that they update their hypothesis after every example is presented. Because they update their parameters frequently, online neural-network learning algorithms can converge much faster than batch algorithms. This is especially the case for large data sets; often a reasonably good solution can be found in one pass through a large training set. For this reason, we argue that the training-time performance of neural-network algorithms may often be acceptable for data mining tasks, especially given the availability of high-performance desktop computers.

3.5 Extraction Methods

One approach to understanding the hypothesis represented by a trained neural network is to translate the hypotheses into a more comprehensible language. Various strategies using this approach have been investigated under the rubric of rule extraction.

Some keywords:

Representation Language: the language used by the extraction method to describe the neural network's learned model. The languages that have been used include conjunctive inference rules, fuzzy rules, m-of-n rules, decision trees and finite state automata.

Extraction Strategy: the strategy used by the extraction method to map the model represented by the trained neural network into a model in the new representation language; specifically, how the method explores the candidate descriptions and what level of description it uses to characterize the neural network. That is, do the rules extracted by the method describe the behaviour of the network as a whole (global methods), the behaviour of individual units in the network (local methods), or something in between these two cases?

Network Requirements: The architectural and training requirements that the extraction method imposes on neural networks. In other words, the range of networks to which the method is applicable.

3.5.1 Rule Extraction Task

Consider the following example:

Fig. 3.1 A Simple Network (a one-layer network with output unit y, weights of 6, 4, 4, 0 and -4 on the Boolean inputs x1 to x5, and bias θ = -9)

Fig 3.1 illustrates the task of rule extraction with a simple network. This one-layer network has five Boolean inputs and one Boolean output. The extracted symbolic rules specify conditions on the input features that, when satisfied, guarantee a particular output state.

The output unit uses a threshold function to compute its activation:

ay = 1, if ( Σi wi ai + θ ) > 0
   = 0, otherwise

The extracted rules are:

Y ← x1 ∧ x2 ∧ x3
Y ← x1 ∧ x2 ∧ ¬x5
Y ← x1 ∧ x3 ∧ ¬x5

Whenever a neural network is used for a classification problem, there is always an implicit decision procedure that is used to decide which class is predicted by the given network. In the simple example above, the decision procedure is simply to predict y = true when the activation of the output unit is 1, and y = false when it is 0.
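
The correspondence between the network and the extracted rules can be checked directly by enumerating all 32 Boolean input vectors, using the weights and the bias (θ = -9) shown in Fig. 3.1.

from itertools import product

# Check the rules extracted from the Fig. 3.1 network by enumerating all
# 2^5 Boolean inputs. Weights and bias are those shown in the figure.

weights = [6, 4, 4, 0, -4]
bias = -9

def network(a):
    # Threshold output unit: 1 if the weighted sum plus bias exceeds 0.
    return 1 if sum(w * ai for w, ai in zip(weights, a)) + bias > 0 else 0

def rules(a):
    x1, x2, x3, x4, x5 = a
    return 1 if ((x1 and x2 and x3) or
                 (x1 and x2 and not x5) or
                 (x1 and x3 and not x5)) else 0

print(all(network(a) == rules(a) for a in product([0, 1], repeat=5)))  # True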

In general, an extracted rule gives a set of conditions under which the network, coupled with its decision procedure, predicts a given class.

One of the dimensions along which rule-extraction methods can be characterized is their level of description. One approach is to extract a set of global rules that characterize the output classes directly in terms of the inputs. An alternative approach is to extract a set of local rules by decomposing a multilayer network into a collection of single-layer networks.

3.6 The TREPAN Algorithm

3.6.1 Introduction

The Trepan algorithm is used for extracting comprehensible, symbolic representations from trained neural networks. The algorithm uses queries to induce a decision tree that approximates the concept represented by a given network. Experiments demonstrate that Trepan is able to produce decision trees that maintain a high level of fidelity to their respective networks while being comprehensible and accurate. Unlike previous work in this area, the algorithm is general in its applicability and scales well to large networks and problems with high-dimensional input spaces.

3.6.2 Extracting Decision Trees

Our approach views the task of extracting a comprehensible concept description from a trained network as an inductive learning problem. In this learning task, the target concept is the function represented by the network, and the concept description produced by our learning algorithm is a decision tree that approximates the network. However, unlike most inductive learning problems, we have available an oracle that is able to answer queries during the learning process. Since the target function is simply the concept represented by the network, the oracle uses the network to answer queries. The advantage of learning with queries, as opposed to ordinary training examples, is that they can be used to garner information precisely where it is needed during the learning process.

Membership Queries and The Oracle: The role of the oracle is to determine the class (as predicted by the network) of each instance that is presented as a query. Queries to the oracle, however, do not have to be complete instances, but instead can specify constraints on the values that the features can take. In the latter case, the oracle generates a complete instance by randomly selecting values for each feature, while ensuring that the constraints are satisfied. In order to generate these random values, Trepan uses the training data to model each feature's marginal distribution. Trepan uses frequency counts to model the distributions of discrete-valued features, and a kernel density estimation method (Silverman, 1986) to model continuous features. The oracle is used for three different purposes: (i) to determine the class labels for the network's training examples; (ii) to select splits for each of the tree's internal nodes; (iii) and to determine if a node covers instances of only one class.

Tree Expansion. Unlike most decision-tree algorithms, which grow trees in a depth-first manner, Trepan grows trees using a best-first expansion.

Split Types. The role of internal nodes in a decision tree is to partition the input space in order to increase the separation of instances of different classes. This algorithm forms trees that use m-of-n expressions for their splits. An m-of-n expression is a Boolean expression that is specified by an integer threshold, m, and a set of n Boolean conditions. An m-of-n expression is satisfied when at least m of its n conditions are satisfied. For example, suppose we have three Boolean features, a, b, and c; the m-of-n expression 2-of-{a, ¬b, c} is logically equivalent to (a ∧ ¬b) ∨ (a ∧ c) ∨ (¬b ∧ c).
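
A small helper makes the semantics of m-of-n expressions concrete; the check below confirms that the 2-of-{a, ¬b, c} example above is equivalent to its expanded disjunctive form.

from itertools import product

# m-of-n expressions: satisfied when at least m of the n conditions hold.
def m_of_n(m, conditions):
    return sum(bool(c) for c in conditions) >= m

def split(a, b, c):
    # The 2-of-{a, not b, c} split from the text.
    return m_of_n(2, [a, not b, c])

def expanded(a, b, c):
    # Logically equivalent disjunction-of-conjunctions form.
    return (a and not b) or (a and c) or (not b and c)

print(all(split(*v) == expanded(*v)
          for v in product([False, True], repeat=3)))  # True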

Split Selection. Split selection involves deciding how to partition the input space at a given internal node in the tree. A limitation of conventional tree-induction algorithms is that the amount of training data used to select splits decreases with the depth of the tree. Thus splits near the bottom of a tree are often poorly chosen because these decisions are based on few training examples. In contrast, because Trepan has an oracle available, it is able to use as many instances as desired to select each split.

Stopping Criteria. Trepan uses two separate criteria to decide when to stop growing an extracted decision tree. First, a given node becomes a leaf in the tree if, with high probability, the node covers only instances of a single class. To make this decision, Trepan determines the proportion of examples that fall into the most common class at a given node, and then calculates a confidence interval around this proportion.

Trepan also accepts a parameter that specifies a limit on the number of internal nodes in an extracted tree. This parameter can be used to control the comprehensibility of extracted trees, since in some domains, it may require very large trees to describe networks to a high level of fidelity.

3.6.3 Algorithm

Input: Oracle(), training set S, feature set F, min_sample

initialize the root of the tree, R, as a leaf node

/* get a sample of instances */
use S to construct a model M_R of the distribution of instances covered by node R
q := max(0, min_sample - |S|)
query_instances_R := a set of q instances generated using model M_R

/* use the network to label all instances */
for each example x ∈ (S ∪ query_instances_R)
    class label for x := Oracle(x)

/* do a best-first expansion of the tree */
initialize queue with tuple (R, S, query_instances_R, {})
while queue not empty and global stopping criteria not satisfied
    /* make the node at the head of the queue into an internal node */
    remove (node N, S_N, query_instances_N, constraints_N) from head of queue
    use F, S_N, and query_instances_N to construct a splitting test T at node N
    /* make children nodes */
    for each outcome t of test T
        make C, a new child node of N
        constraints_C := constraints_N ∪ {T = t}
        /* get a sample of instances for node C */
        S_C := members of S_N with outcome t on test T
        construct a model M_C of the distribution of instances covered by node C
        q := max(0, min_sample - |S_C|)
        query_instances_C := a set of q instances generated using model M_C and constraints_C
        for each example x ∈ query_instances_C
            class label for x := Oracle(x)
        /* make node C a leaf node for now */
        use S_C and query_instances_C to determine a class label for C
        /* determine if node C should be expanded */
        if local stopping criteria not satisfied then
            put (C, S_C, query_instances_C, constraints_C) into queue

Return: tree with root R

4. CONCLUSION

The advent of data mining is only the latest step in the extension of quantitative, "scientific" methods to business. It empowers every non-statistician (that is, 99.9% of us) to study, understand and improve our operational and decision-making processes, whether in science, business or society. For the first time, thanks to the increased power of computers, new methods replace the skill of the statistical artisan with massive computational methods, obtaining equal or better results in far less time and without the need for any specialised knowledge.

Data mining is probably the most useful way to take advantage of the massive processing power available on many desktop computers, and is definitely among the most promising and exciting research fields in advanced informatics.

Neural Networks algorithms are among the most popular data mining and machine learning techniques used today. As computers become faster, the neural net methodology is replacing many traditional tools in the field of knowledge discovery and some related fields.

A significant limitation of neural networks is that their concept representations are usually not amenable to human understanding. We have presented an algorithm that is able to produce comprehensible descriptions of trained networks by extracting decision trees that accurately describe the networks' concept representations. We believe that our algorithm, which takes advantage of the fact that a trained network can be queried, represents a promising advance towards the goal of general methods for understanding the solutions encoded by trained networks.

One of the principal strengths of Trepan is its generality. In contrast to most rule extraction algorithms, it makes few assumptions about the architecture of a given network, and it does not require a special training method for the network. Moreover, Trepan is able to handle tasks that involve both discrete-valued and continuous-valued features.

REFERENCES

[1] IEEE Transactions on Neural Networks; Data Mining in a Soft Computing Framework: A Survey, Authors: Sushmita Mitra, Sankar K. Pal and Pabitra Mitra. (January 2002, Vol. 13, No. 1)

[2] Using Neural Networks for Data Mining: Mark W. Craven, Jude W. Shavlik

[3] Data Mining Techniques: Arjun K. Pujari

[4] Introduction to the theory of Neural Computation: John Hertz, Anders Krogh, Richard G. Palmer

[5] Elements of Artificial Neural Networks: Kishan Mehrotra, Chilukuri K. Mohan, Sanjay Ranka.

[6] Artificial Neural Networks: Galgotia Publication

[7] Neural Networks based Data Mining and Knowledge Discovery in Inventory Applications: Kanti Bansal, Sanjeev Vadhavkar, Amar Gupta

[8] Data Mining, An Introduction: Ruth Dilly, Parallel Computer Centre, Queens University Belfast: http://www.pcc.qub.ac.uk/tec/courses/datamining/stu_notes/dm_book_1.html

[9] Introduction to Backpropagation Neural Networks: http://cortex.snowseed.com/index.html

[10] Data Mining Techniques: Electronic textbook, StatSoft: http://www.statsoftinc.com/textbook/stdatmin.html#neural
