Artificial Intelligence


Unit-1 Overview and Search Technique

Introduction to Artificial Intelligence

Definition of AI: What is AI? Artificial Intelligence is concerned with the design of intelligence in an artificial device. The term was coined by McCarthy in 1956. There are two ideas in the definition: 1. Intelligence 2. Artificial device. What is intelligence? Is it that which characterizes humans? Or is there an absolute standard of judgment? Accordingly there are two possibilities: a system with intelligence is expected to behave as intelligently as a human, or a system with intelligence is expected to behave in the best possible manner. Secondly, what type of behavior are we talking about? Are we looking at the thought process or reasoning ability of the system? Or are we only interested in the final manifestations of the system in terms of its actions? Given this scenario, different interpretations have been used by different researchers as defining the scope and view of Artificial Intelligence.

1. One view is that artificial intelligence is about designing systems that are as intelligent as humans. This view involves trying to understand human thought and an effort to build machines that emulate the human thought process. This view is the cognitive science approach to AI.
2. The second approach is best embodied by the concept of the Turing Test. Turing held that in the future computers could be programmed to acquire abilities rivaling human intelligence. As part of his argument Turing put forward the idea of an 'imitation game', in which a human being and a computer would be interrogated under conditions where the interrogator would not know which was which, the communication being entirely by textual messages. Turing argued that if the interrogator could not distinguish them by questioning, then it would be unreasonable not to call the computer intelligent. Turing's 'imitation game' is now usually called 'the Turing test' for intelligence.
3. Logic and laws of thought deals with studies of ideal or rational thought processes and inference. The emphasis in this case is on the inference mechanism and its properties. That is, how the system arrives at a conclusion, or the reasoning behind its selection of actions, is very important in this point of view. The soundness and completeness of the inference mechanisms are important here.
4. The fourth view of AI is that it is the study of rational agents. This view deals with building machines that act rationally. The focus is on how the system acts and performs, and not so much on the reasoning process. A rational agent is one that acts rationally, that is, in the best possible manner.

Problem Solving: Strong AI aims to build machines that can truly reason and solve problems. These machines should be self-aware and their overall intellectual ability needs to be indistinguishable from that of a human being. Excessive optimism in the 1950s and 1960s concerning strong AI has given way to an appreciation of the extreme difficulty of the problem. Strong AI maintains that suitably programmed machines are capable of cognitive mental states.
Weak AI: deals with the creation of some form of computer-based artificial intelligence that cannot truly reason and solve problems, but can act as if it were intelligent. Weak AI holds that suitably programmed machines can simulate human cognition.
Applied AI: aims to produce commercially viable "smart" systems such as, for example, a security system that is able to recognize the faces of people who are permitted to enter a particular building. Applied AI has already enjoyed considerable success.
Cognitive AI: computers are used to test theories about how the human mind works, for example theories about how we recognize faces and other objects, or about how we solve abstract problems.

Best First Search: Best-first search is a way of combining the advantages of both depth-first and breadth-first search into a single method. One way of combining the two is to follow a single path at a time, but switch paths whenever some competing path looks more promising than the current one does.

At each step of the best-first-search process, we select the most promising of the nodes we have generated so far. This is done by applying an appropriate heuristic to each of them. We then expand the chosen node by using the rules to generate its successors. If one of them is a solution, we can quit. If not, all those new nodes are added to the set of nodes generated so far. Again the most promising node is selected and the process continues. The figure shows the beginning of a best-first search procedure. Initially, there is only one node, so it will be expanded. Doing so generates three new nodes. The heuristic function, which, in this example, is an estimate of the cost of getting to a solution from a given node, is applied to each of these new nodes. Since node D is the most promising, it is expanded next, producing two successor nodes, E and F. The heuristic function is then applied to them. Now another path, that going through node B, looks more promising, so it is pursued, generating nodes G and H. But again, when these new nodes are evaluated they look less promising than another path, so attention is returned to the path through D to E. E is then expanded, yielding nodes I and J. At the next step, J will be expanded, since it is the most promising. This process can continue until a solution is found.

[Figure: steps 1 through 5 of a best-first search, expanding nodes A, D, B and E in turn.]
Fig: A Best-first Search

The actual operation of the algorithm is very simple. It proceeds in steps, expanding one node at each step, until it generates a node that corresponds to a goal state. At each step, it picks the most promising of the nodes that have so far been generated but not expanded. It generates the successors of the chosen node, applies the heuristic function to them, and adds them to the list of open nodes, after checking to see if any of them have been generated before. By doing this check, we can guarantee that each node only appears once in the graph, although many nodes may point to it as a successor. Then the next step begins. The process can be summarized as follows.

Algorithm: Best-First Search
1. Start with OPEN containing just the initial state.
2. Until a goal is found or there are no nodes left on OPEN do:
(a) Pick the best node on OPEN.
(b) Generate its successors.
(c) For each successor do:
i. If it has not been generated before, evaluate it, add it to OPEN, and record its parent.
ii. If it has been generated before, change the parent if this new path is better than the previous one. In that case, update the cost of getting to this node and to any successors that this node may already have.
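As a rough illustration (not from the original notes), here is a minimal Python sketch of the OPEN-list procedure above. The graph and the heuristic values are assumptions that only mimic the A-to-J example in the figure, and step (c)ii (re-parenting a node when a better path is found) is omitted for brevity.

import heapq

def best_first_search(start, goal, successors, h):
    # Greedy best-first search: always expand the OPEN node with the lowest h value.
    open_list = [(h(start), start)]          # priority queue ordered by heuristic value
    parents = {start: None}                  # parent links; also marks nodes as generated
    while open_list:
        _, node = heapq.heappop(open_list)   # pick the best node on OPEN
        if node == goal:                     # goal reached: rebuild the path via parent links
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return list(reversed(path))
        for succ in successors(node):        # generate the successors of the chosen node
            if succ not in parents:          # skip nodes that have been generated before
                parents[succ] = node
                heapq.heappush(open_list, (h(succ), succ))
    return None                              # OPEN exhausted: no solution

# Hypothetical graph and heuristic values loosely matching the A..J example
graph = {'A': ['B', 'C', 'D'], 'B': ['G', 'H'], 'D': ['E', 'F'], 'E': ['I', 'J'],
         'C': [], 'F': [], 'G': [], 'H': [], 'I': [], 'J': []}
h_vals = {'A': 6, 'B': 3, 'C': 5, 'D': 1, 'E': 4, 'F': 6, 'G': 6, 'H': 5, 'I': 2, 'J': 1}
print(best_first_search('A', 'J', lambda n: graph[n], lambda n: h_vals[n]))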

A* Search: The best-first search algorithm that was just presented is a simplification of an algorithm called A*, which was first presented by Hart et al. [1968; 1972]. This algorithm uses the same f, g, and h functions, as well as the lists OPEN and CLOSED. The form of the heuristic estimation function for A* is

f*(n) = g*(n) + h*(n)

where the two components g*(n) and h*(n) are estimates of the cost (or distance) from the start node to node n and the cost from node n to a goal node, respectively. Nodes on the open list are nodes that have been generated but not yet expanded, while nodes on the closed list are nodes that have been expanded and whose children are, therefore, available to the search program. The A* algorithm proceeds as follows.

Algorithm: A*
1. Place the starting node s on open.
2. If open is empty, stop and return failure.
3. Remove from open the node n that has the smallest value of f*(n). If the node is a goal node, return success and stop. Otherwise,
4. Expand n, generating all of its successors n', and place n on closed. For every successor n', if n' is not already on open or closed, attach a back-pointer to n, compute f*(n') and place it on open.
5. Each n' that is already on open or closed should be attached to back-pointers which reflect the lowest g*(n') path. If n' was on closed and its pointer was changed, remove it and place it on open.
6. Return to step 2.

It has been shown that the A* algorithm is both complete and admissible. Thus, A* will always find an optimal path if one exists. The efficiency of an A* algorithm depends on how closely h* approximates h and the cost of computing f*.
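A minimal Python sketch of the A* loop above, again against a small assumed weighted graph; g and h are kept in dictionaries, and a node is taken off closed again when a cheaper path to it is found, which is the reopening described in step 5.

import heapq

def a_star(start, goal, successors, h):
    # A*: expand the open node with the smallest f*(n) = g*(n) + h*(n).
    g = {start: 0}                              # cheapest known cost from the start node
    parent = {start: None}
    open_heap = [(h(start), start)]             # ordered by f*
    closed = set()
    while open_heap:
        _, n = heapq.heappop(open_heap)
        if n == goal:                           # goal node removed from open: success
            path = []
            while n is not None:
                path.append(n)
                n = parent[n]
            return list(reversed(path)), g[goal]
        if n in closed:
            continue                            # stale queue entry, already expanded
        closed.add(n)
        for succ, cost in successors(n):        # expand n
            new_g = g[n] + cost
            if succ not in g or new_g < g[succ]:
                g[succ] = new_g                 # keep the back-pointer with the lowest g*
                parent[succ] = n
                closed.discard(succ)            # reopen the node if it was on closed
                heapq.heappush(open_heap, (new_g + h(succ), succ))
    return None, float('inf')                   # open is empty: failure

# Hypothetical weighted graph; h_est is an (admissible) guess of the remaining cost
edges = {'S': [('A', 1), ('B', 4)], 'A': [('B', 2), ('G', 5)], 'B': [('G', 1)], 'G': []}
h_est = {'S': 3, 'A': 2, 'B': 1, 'G': 0}
print(a_star('S', 'G', lambda n: edges[n], lambda n: h_est[n]))   # (['S', 'A', 'B', 'G'], 4)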

AO* algorithm:
1. Place the starting node s on open.
2. Using the search tree constructed thus far, compute the most promising solution tree To.
3. Select a node n that is both on open and a part of To. Remove n from open and place it on closed.
4. If n is a terminal goal node, label n as solved. If the solution of n results in any of n's ancestors being solved, label all the ancestors as solved. If the start node s is solved, exit with success, where To is the solution tree. Remove from open all nodes with a solved ancestor.
5. If n is not a solvable node (operators cannot be applied), label n as unsolvable. If the start node is labeled as unsolvable, exit with failure. If any of n's ancestors become unsolvable because n is, label them unsolvable as well. Remove from open all nodes with unsolvable ancestors.
6. Otherwise, expand node n, generating all of its successors. For each such successor node that contains more than one subproblem, generate their successors to give individual subproblems. Attach to each newly generated node a back pointer to its predecessor. Compute the cost estimate h* for each newly generated node and place all such nodes that do not yet have descendents on open. Next, recompute the values of h* at n and each ancestor of n.
7. Return to step 2.

Hill Climbing Search: Hill climbing searches get their name from the way the nodes are selected for expansion. At each point in the search path, a successor node that appears to lead most quickly to the top of the hill (the goal) is selected for exploration. This method requires that some information be available with which to evaluate and order the most promising choices. Hill climbing is like depth-first searching where the most promising child is selected for expansion. When the children have been generated, alternative choices are evaluated using some type of heuristic function. The path that appears most promising is then chosen and no further reference to the parent or other children is retained. This process continues from node to node with previously expanded nodes being discarded.

Hill climbing can produce substantial savings over blind searches when an informative, reliable function is available to guide the search to a global goal. It suffers from some serious drawbacks when this is not the case. Potential problem types, named after certain terrestrial anomalies, are the foothill, ridge, and plateau traps. The foothill trap results when local maxima or peaks are found. In this case the children all have less promising goal distances than the parent node. The search is essentially trapped at the local node with no indication of goal direction. The only way to remedy this problem is to try moving in some arbitrary direction for a few generations in the hope that the real goal direction will become evident, or to backtrack to an ancestor node and try a secondary path choice. A second potential problem occurs when several adjoining nodes have higher values than surrounding nodes. This is the equivalent of a ridge. The search may also encounter a plateau type of structure, that is, an area in which all neighboring nodes have the same values. Once again, one of the methods noted above must be tried to escape the trap.
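A minimal sketch of simple hill climbing over a toy one-dimensional state space (an assumption made only for the example). It keeps the best child and discards everything else, so it stops as soon as no child looks better, which is exactly the foothill/plateau behaviour described above.

import random

def hill_climb(start, neighbors, value):
    # Greedy hill climbing: move to the best-looking child, never look back.
    current = start
    while True:
        candidates = neighbors(current)
        if not candidates:
            return current
        best = max(candidates, key=value)
        if value(best) <= value(current):   # foothill or plateau: no child is better
            return current                  # (a random restart or arbitrary move could escape the trap)
        current = best                      # previously expanded nodes are discarded

# Toy example: maximise f(x) = -(x - 7)^2 over the integers by moving one step at a time
f = lambda x: -(x - 7) ** 2
print(hill_climb(random.randint(0, 20), lambda x: [x - 1, x + 1], f))   # 7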

Breadth-First Search: Breadth-first searches are performed by exploring all nodes at a given depth before proceeding to the next level. This means that all immediate children of a node are explored before any of the children's children are considered. Breadth-first tree search is illustrated in the figure. It has the obvious advantage of always finding a minimal path length solution where one exists. However, a great many nodes may need to be explored before a solution is found, especially if the tree is very full. It uses a queue structure to hold all generated but still unexplored nodes. The breadth-first algorithm proceeds as follows.

BREADTH-FIRST SEARCH
1. Place the starting node s on the queue.
2. If the queue is empty, return failure and stop.
3. If the first element on the queue is a goal node g, return success and stop. Otherwise,
4. Remove and expand the first element from the queue and place all the children at the end of the queue in any order.
5. Return to step 2.

[Figure: breadth-first search of a tree, expanding level by level from the Start node.]
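A queue-based sketch of the algorithm above in Python; the tree, the node names and the goal test are assumed purely for illustration.

from collections import deque

def breadth_first_search(start, goal, children):
    # Breadth-first search; returns a path with the fewest edges if one exists.
    queue = deque([start])                 # step 1: place the starting node on the queue
    parent = {start: None}
    while queue:                           # step 2: fail when the queue is empty
        node = queue.popleft()
        if node == goal:                   # step 3: first element is a goal node
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return list(reversed(path))
        for child in children(node):       # step 4: expand and append the children
            if child not in parent:
                parent[child] = node
                queue.append(child)
    return None

# Hypothetical tree used only to exercise the function
tree = {'Start': ['A', 'B'], 'A': ['C', 'D'], 'B': ['E'], 'C': [], 'D': [], 'E': ['Goal'], 'Goal': []}
print(breadth_first_search('Start', 'Goal', lambda n: tree.get(n, [])))   # ['Start', 'B', 'E', 'Goal']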

Mini-max search: The mini-max search procedure is a depth-limited search procedure. The idea is to start at the current position and use the plausible-move generator to generate the set of possible successor positions. Now we can apply the static evaluation function to those positions and simply choose the best one.

The starting position is exactly as good for us as the position generated by the best move we can make next. Here we assume that the static evaluation function returns large values to indicate good situations for us, so our goal is to maximize the value of the static evaluation function of the next board position. An example of this operation is shown in fig 1. It assumes a static evaluation function that returns values ranging from -10 to 10, with 10 indicating a win for us, -10 a win for the opponent, and 0 an even match. Since our goal is to maximize the value of the heuristic function, we choose to move to B. Backing B's value up to A, we can conclude that A's value is 8, since we know we can move to a position with a value of 8.

[Figure: a one-ply search tree in which A expands to B (8), C (3) and D (-2), and a two-ply tree in which the leaf values are backed up through the opponent's minimizing level.]

Fig 1: One-ply search and two-ply search

But since we know that the static evaluation function is not completely accurate, we would like to carry the search farther

ahead than one ply. This could be very important, for example, in a chess game in which we are in the middle of a piece exchange. After our move, the situation would appear to be very good, but if we look one move ahead, we will see that one of our pieces also gets captured and so the situation is not as good as it seemed. Once the values from the second ply are backed up, it becomes clear that the correct move for us to make at the first level, given the information we have available, is C, since there is nothing the opponent can do from there to produce a value worse than -2. This process can be repeated for as many ply as time allows, and the more accurate evaluations that are produced can be used to choose the correct move at the top level. The alternation of maximizing and minimizing at alternate ply when evaluations are being pushed back up corresponds to the opposing strategies of the two players and gives this method the name minimax. Having described informally the operation of the minimax procedure, we now describe it precisely. It is a straightforward recursive procedure that relies on two auxiliary procedures that are specific to the game being played:
1. MOVEGEN(position, player) - The plausible-move generator, which returns a list of nodes representing the moves that can be made by player in position. We call the two players PLAYER-ONE and PLAYER-TWO; in a chess program, we might use the names BLACK and WHITE instead.
2. STATIC(position, player) - The static evaluation function, which returns a number representing the goodness of position from the standpoint of player.
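A depth-limited minimax sketch in Python. MOVEGEN and STATIC are passed in as plain functions; the explicit game tree and leaf values below are illustrative stand-ins for the two-ply example of fig 1, and the backed-up choice comes out as move C with value -2, as in the text.

def minimax(position, depth, maximizing, movegen, static):
    # Depth-limited minimax: maximise on our plies, minimise on the opponent's.
    moves = movegen(position, maximizing)
    if depth == 0 or not moves:
        return static(position), None            # static value from our standpoint
    best_move = None
    if maximizing:
        best = float('-inf')
        for m in moves:
            v, _ = minimax(m, depth - 1, False, movegen, static)
            if v > best:
                best, best_move = v, m
    else:
        best = float('inf')
        for m in moves:
            v, _ = minimax(m, depth - 1, True, movegen, static)
            if v < best:
                best, best_move = v, m
    return best, best_move

# Tiny two-ply example over an explicit tree (leaf values in the -10..10 range are illustrative)
tree = {'A': ['B', 'C', 'D'], 'B': ['E', 'F'], 'C': ['G', 'H'], 'D': ['I', 'J']}
leaf = {'E': 9, 'F': -6, 'G': 0, 'H': -2, 'I': -4, 'J': 3}
movegen = lambda pos, player: tree.get(pos, [])
static = lambda pos: leaf.get(pos, 0)
print(minimax('A', 2, True, movegen, static))   # (-2, 'C')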


Heuristic function: A heuristic function, or simply a heuristic, is a function that ranks alternatives in various search algorithms at each branching step based on the available information (heuristically) in order to make a decision about which branch to follow during a search.

Shortest paths: For example, for shortest path problems, a heuristic is a function h(n), defined on the nodes of a search tree, which serves as an estimate of the cost of the cheapest path from that node to the goal node. Heuristics are used by informed search algorithms such as greedy best-first search and A* to choose the best node to explore. Greedy best-first search will choose the node that has the lowest value for the heuristic function. A* search will expand nodes that have the lowest value for g(n) + h(n), where g(n) is the (exact) cost of the path from the initial state to the current node. If h(n) is admissible, that is, if h(n) never overestimates the cost of reaching the goal, then A* will always find an optimal solution. The classical problem involving heuristics is the n-puzzle. Commonly used heuristics for this problem include counting the number of misplaced tiles and finding the sum of the Manhattan distances between each block and its position in the goal configuration. Note that both are admissible.
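The two n-puzzle heuristics just mentioned, sketched in Python for the 8-puzzle; the tuple encoding of a board state (0 standing for the blank) is an assumption made for the example.

GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)

def misplaced_tiles(state):
    # Number of tiles not in their goal position (the blank is not counted).
    return sum(1 for i, t in enumerate(state) if t != 0 and t != GOAL[i])

def manhattan(state):
    # Sum of the Manhattan distances of each tile from its goal position.
    total = 0
    for i, t in enumerate(state):
        if t == 0:
            continue
        goal_i = GOAL.index(t)
        total += abs(i // 3 - goal_i // 3) + abs(i % 3 - goal_i % 3)
    return total

s = (1, 2, 3, 4, 0, 6, 7, 5, 8)
print(misplaced_tiles(s), manhattan(s))   # 2 2 for this state; both heuristics are admissible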


Effect of heuristics on computational performance: In any searching problem where there are b choices at each node and a depth of d at the goal node, a naive searching algorithm would have to potentially search around b^d nodes before finding a solution. Heuristics improve the efficiency of search algorithms by reducing the effective branching factor from b to a lower constant b', using a cutoff mechanism. The branching factor can be used for defining a partial order on heuristics, such that h1(n) < h2(n) if h1(n) has a lower branching factor than h2(n) for a given node n of the search tree. Heuristics giving lower branching factors at every node in the search tree are preferred for the resolution of a particular problem, as they are more computationally efficient.

Finding heuristics: The problem of finding an admissible heuristic with a low branching factor for common search tasks has been extensively researched in the artificial intelligence community. Several common techniques are used:

Solution costs of sub-problems often serve as useful estimates of the overall solution cost. These are always admissible. For example, a heuristic for a 10-puzzle might be the cost of moving tiles 1-5 into their correct places. A common idea is to use a pattern database that stores the exact solution cost of every subproblem instance.

The solution of a relaxed problem often serves as a useful admissible estimate of the original. For example, Manhattan distance comes from a relaxed version of the n-puzzle, because we assume we can move each tile to its position independently of moving the other tiles. Given a set of admissible heuristic functions h1(n), h2(n), ..., hi(n), the function h(n) = max{h1(n), h2(n), ..., hi(n)} is an admissible heuristic that dominates all of them.

Using these techniques, a program called ABSOLVER was written (1993) by A.E. Prieditis for automatically generating heuristics for a given problem. ABSOLVER generated a new heuristic for the 8-puzzle better than any pre-existing heuristic and found the first useful heuristic for solving the Rubik's Cube.

Consistency and Admissibility: If a heuristic function never over-estimates the cost of reaching the goal, then it is called an admissible heuristic function. If h(n) is consistent, then the values of h(n) for the nodes along any path to the goal node are non-decreasing.

Alpha-beta pruning: Alpha-beta pruning is a search algorithm which seeks to reduce the number of nodes that are evaluated by the minimax algorithm in its search tree. It is an adversarial search algorithm used commonly for machine playing of two-player games (tic-tac-toe, chess, Go, etc.). It stops completely evaluating a move when at least one possibility has been found that proves the move to be worse than a previously examined move. Such moves need not be evaluated further. Alpha-beta pruning is a sound optimization in that it does not change the score of the result of the algorithm it optimizes.

History: Allen Newell and Herbert Simon, who used what John McCarthy calls an "approximation"[1] in 1958, wrote that alpha-beta "appears to have been reinvented a number of times".[2] Arthur Samuel had an early version, and Richards, Hart, Levine and/or Edwards found alpha-beta independently in the United States.[3] McCarthy proposed similar ideas during the Dartmouth Conference in 1956 and suggested it to a group of his students including Alan Kotok at MIT in 1961.[4] Alexander Brudno independently discovered the alpha-beta algorithm, publishing his results in 1963.[5] Donald Knuth and Ronald W. Moore refined the algorithm in 1975[6][7] and it continued to be advanced.

Improvements over naive minimax: In an illustration of alpha-beta pruning, the grayed-out subtrees need not be explored (when moves are evaluated from left to right), since we know the group of subtrees as a whole yields the value of an equivalent subtree or worse, and as such cannot influence the final result. The max and min levels represent the turn of the player and the adversary, respectively.


The benefit of alpha-beta pruning lies in the fact that branches of the search tree can be eliminated. The search time can in this way be limited to the 'more promising' subtree, and a deeper search can be performed in the same time. Like its predecessor, it belongs to the branch and bound class of algorithms. The optimization reduces the effective depth to slightly more than half that of simple minimax if the nodes are evaluated in an optimal or near-optimal order (best choice for the side on move ordered first at each node). With an (average or constant) branching factor of b, and a search depth of d plies, the maximum number of leaf node positions evaluated (when the move ordering is pessimal) is O(b*b*...*b) = O(b^d), the same as a simple minimax search. If the move ordering for the search is optimal (meaning the best moves are always searched first), the number of leaf node positions evaluated is about O(b*1*b*1*...*b) for odd depth and O(b*1*b*1*...*1) for even depth, or O(b^(d/2)) = O(sqrt(b^d)). In the latter case, where the ply of a search is even, the effective branching factor is reduced to its square root, or, equivalently, the search can go twice as deep with the same amount of computation. The explanation of b*1*b*1*... is that all the first player's moves must be studied to find the best one, but for each, only the best second player's move is needed to refute all but the first (and best) first player move: alpha-beta ensures no other second player moves need be considered. If b = 40 (as in chess), and the search depth is 12 plies, the ratio between optimal and pessimal sorting is a factor of nearly 40^6, or about 4 billion times. Normally during alpha-beta, the subtrees are temporarily dominated by either a first player advantage (when many first player moves are good, and at each search depth the first move checked by the first player is adequate, but all second player responses are required to try and find a refutation), or vice versa. This advantage can switch sides many times during the search if the move ordering is incorrect, each time leading to inefficiency. As the number of positions searched decreases exponentially each move nearer the current position, it is worth spending considerable effort on sorting early moves. An improved sort at any depth will exponentially reduce the total number of positions searched, but sorting all positions at depths near the root node is relatively cheap, as there are so few of them. In

practice, the move ordering is often determined by the results of earlier, smaller searches, such as through iterative deepening. The algorithm maintains two values, alpha and beta, which represent the minimum score that the maximizing player is assured of and the maximum score that the minimizing player is assured of, respectively. Initially alpha is negative infinity and beta is positive infinity. As the recursion progresses the "window" becomes smaller. When beta becomes less than alpha, it means that the current position cannot be the result of best play by both players and hence need not be explored further. Additionally, this algorithm can be trivially modified to return an entire principal variation in addition to the score. Some more aggressive algorithms such as MTD(f) do not easily permit such a modification.

Pseudo code:
function alphabeta(node, depth, alpha, beta, Player)
    if depth = 0 or node is a terminal node
        return the heuristic value of node
    if Player = MaxPlayer
        for each child of node
            alpha := max(alpha, alphabeta(child, depth-1, alpha, beta, not(Player)))
            if beta <= alpha
                break                     (* beta cut-off *)
        return alpha
    else
        for each child of node
            beta := min(beta, alphabeta(child, depth-1, alpha, beta, not(Player)))
            if beta <= alpha
                break                     (* alpha cut-off *)
        return beta

(* Initial call *)
alphabeta(origin, depth, -infinity, +infinity, MaxPlayer)

Heuristic improvements: Alpha-beta search can be made even faster by considering only a narrow search window (generally determined by guesswork based on experience). This is known as aspiration search. In the extreme case, the search is performed with alpha and beta equal; a technique known as zero-window search, null-window search, or scout search. This is particularly useful for win/loss searches near the end of a game where the extra depth gained from the narrow window and a simple win/loss evaluation function may lead to a conclusive result. If an aspiration search fails, it is straightforward to detect whether it failed high (the high edge of the window was too low) or low (the lower edge of the window was too high). This gives information about what window values might be useful in a re-search of the position.

Constraint Satisfaction: Constraint satisfaction is the process of finding a solution to a set of constraints that impose conditions that the variables must satisfy. A solution is therefore a vector of variables that satisfies all constraints. The techniques used in constraint satisfaction depend on the kind of constraints being considered. Often used are constraints on a finite domain, to the point that constraint satisfaction problems are typically identified with problems based on constraints on a finite domain. Such problems are usually solved via search, in particular a form of backtracking or local search. Constraint propagation is another method used on such problems; most such methods are incomplete in general, that is, they may solve the problem or prove it unsatisfiable, but not always. Constraint propagation methods are also used in conjunction with search to make a given problem simpler to solve. Other kinds of constraints considered are on real or rational numbers; solving problems on these constraints is done via variable elimination or the simplex algorithm. Constraint satisfaction originated in the field of artificial intelligence in the 1970s (see for example (Laurière 1978)). During the 1980s and 1990s, embeddings of constraints into programming languages were developed. Languages often used for constraint programming are Prolog and C++.


Constraint satisfaction problem: Constraints enumerate the possible values a set of variables may take. Informally, a finite domain is a finite set of arbitrary elements. A constraint satisfaction problem on such a domain contains a set of variables whose values can only be taken from the domain, and a set of constraints, each constraint specifying the allowed values for a group of variables. A solution to this problem is an evaluation of the variables that satisfies all constraints. In other words, a solution is a way of assigning a value to each variable in such a way that all constraints are satisfied by these values. In practice, constraints are often expressed in compact form, rather than enumerating all values of the variables that would satisfy the constraint. One of the most used constraints is the one establishing that the values of the affected variables must be all different. Problems that can be expressed as constraint satisfaction problems are the eight queens puzzle, the Sudoku solving problem, the Boolean satisfiability problem, scheduling problems and various problems on graphs such as the graph coloring problem. While usually not included in the above definition of a constraint satisfaction problem, arithmetic equations and inequalities bound the values of the variables they contain and can therefore be considered a form of constraints. Their domain

is the set of numbers (either integer, rational, or real), which is infinite: therefore, the relations of these constraints may be infinite as well; for example, X = Y + 1 has an infinite number of pairs of satisfying values. Arithmetic equations and inequalities are often not considered within the definition of a "constraint satisfaction problem", which is limited to finite domains. They are however used often in constraint programming.

Solving: Constraint satisfaction problems on finite domains are typically solved using a form of search. The most used techniques are variants of backtracking, constraint propagation, and local search. These techniques are also used on problems with nonlinear constraints. Variable elimination and the simplex algorithm are used for solving linear and polynomial equations and inequalities, and problems containing variables with infinite domain. These are typically solved as optimization problems in which the optimized function is the number of violated constraints.

Complexity: Solving a constraint satisfaction problem on a finite domain is an NP-complete problem. Research has shown a number of tractable subcases, some limiting the allowed constraint relations, some requiring the scopes of constraints to form a tree, possibly in a reformulated version of the problem. Research has also established relationships of the constraint satisfaction


problem with problems in other areas such as finite model theory.
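A minimal backtracking solver of the kind described under Solving above, applied to a small assumed map-colouring instance; each constraint is modelled as a function that accepts any partial assignment it cannot yet falsify.

def backtrack(assignment, variables, domains, constraints):
    # Basic backtracking search for a finite-domain constraint satisfaction problem.
    if len(assignment) == len(variables):
        return assignment                           # every variable assigned: solution found
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        assignment[var] = value
        if all(check(assignment) for check in constraints):
            result = backtrack(assignment, variables, domains, constraints)
            if result is not None:
                return result
        del assignment[var]                         # undo the assignment and try the next value
    return None

# Illustrative map-colouring instance: adjacent regions must get different colours
variables = ['WA', 'NT', 'SA', 'Q']
domains = {v: ['red', 'green', 'blue'] for v in variables}
adjacent = [('WA', 'NT'), ('WA', 'SA'), ('NT', 'SA'), ('NT', 'Q'), ('SA', 'Q')]

def differ(a, b):
    # Satisfied (or not yet decidable) unless both regions are assigned the same colour.
    return lambda asg: a not in asg or b not in asg or asg[a] != asg[b]

constraints = [differ(a, b) for a, b in adjacent]
print(backtrack({}, variables, domains, constraints))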

Constraint programming: Constraint programming is the use of constraints as a programming language to encode and solve problems. This is often done by embedding constraints into a programming language, which is called the host language. Constraint programming originated from a formalization of equalities of terms in Prolog II, leading to a general framework for embedding constraints into a logic programming language. The most common host languages are Prolog, C++, and Java, but other languages have been used as well.

Constraint logic programming: A constraint logic program is a logic program that contains constraints in the bodies of clauses. As an example, the clause A(X) :- X>0, B(X) is a clause containing the constraint X>0 in the body. Constraints can also be present in the goal. The constraints in the goal and in the clauses used to prove the goal are accumulated into a set called the constraint store. This set contains the constraints the interpreter has assumed satisfiable in order to proceed in the evaluation. As a result, if this set is detected unsatisfiable, the interpreter backtracks. Equations of terms, as used in logic programming, are considered a particular form of constraints which can be simplified using unification. As a result, the constraint store can be considered an extension of the concept of substitution that is used in regular logic programming. The most common kinds of constraints used in constraint logic programming are constraints over integer/rational/real numbers and constraints over finite domains. Concurrent constraint logic programming languages have also been developed. They significantly differ from non-concurrent constraint logic programming in that they are aimed at programming concurrent processes that may not terminate. Constraint handling rules can be seen as a form of concurrent constraint logic programming, but are also sometimes used within a non-concurrent constraint logic programming language. They allow constraints to be rewritten or new ones to be inferred based on the truth of conditions.

Constraint satisfaction toolkits: Constraint satisfaction toolkits are software libraries for imperative programming languages that are used to encode and solve a constraint satisfaction problem.

Cassowary constraint solver, an open source project for constraint satisfaction (accessible from C, Java, Python and other languages).
Comet, a commercial programming language and toolkit.


Gecode, an open source portable toolkit written in C++, developed as a production-quality and highly efficient implementation of a complete theoretical background.
JaCoP, an open source Java constraint solver.
Koalog, a commercial Java-based constraint solver.
logilab-constraint, an open source constraint solver written in pure Python with constraint propagation algorithms.
MINION, an open-source constraint solver written in C++, with a small language for the purpose of specifying models/problems.
ZDC, an open source program developed in the Computer-Aided Constraint Satisfaction Project for modeling and solving constraint satisfaction problems.

Other constraint programming languages: Constraint toolkits are a way of embedding constraints into an imperative programming language. However, they are only used as external libraries for encoding and solving problems. An approach in which constraints are integrated into an imperative programming language is taken in the Kaleidoscope programming language. Constraints have also been embedded into functional programming languages.

Evaluation function: An evaluation function, also known as a heuristic evaluation function or static evaluation function, is a function used by game-playing programs to estimate the value or goodness of a position in the minimax and related algorithms. The evaluation function is typically designed to prioritize speed over accuracy; the function looks only at the current position and does not explore possible moves.

In chess: One popular strategy for constructing evaluation functions is as a weighted sum of various factors that are thought to influence the value of a position. For instance, an evaluation function for chess might take the form

c1 * material + c2 * mobility + c3 * king safety + c4 * center control + ...

Chess beginners, as well as the simplest of chess programs, evaluate the position taking only "material" into account, i.e. they assign a numerical score for each piece (with pieces of opposite color having scores of opposite sign) and sum up the score over all the pieces on the board. On the whole, computer evaluation functions of even advanced programs tend to be more materialistic than human evaluations. This is compensated for by the increased speed of evaluation, which allows more plies to be examined. As a result, some chess programs may rely too much on tactics at the expense of strategy.
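A sketch of the simplest, material-only case of the weighted-sum evaluation above; the piece letters and the rough piece values are conventional assumptions rather than anything fixed by the text.

# Positive scores favour White; upper-case letters are White pieces, lower-case are Black.
PIECE_VALUE = {'P': 1, 'N': 3, 'B': 3, 'R': 5, 'Q': 9}   # the king is never captured, so it is omitted

def material_eval(board):
    # board: any iterable of piece letters currently on the board.
    score = 0
    for piece in board:
        value = PIECE_VALUE.get(piece.upper(), 0)
        score += value if piece.isupper() else -value
    return score

# White has an extra pawn and knight, Black an extra rook: (1 + 3 + 9) - (5 + 9) = -1
print(material_eval(['P', 'N', 'Q', 'r', 'q']))   # -1, slightly better for Black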

Game tree: A game tree is a directed graph whose nodes are positions in a game and whose edges are moves. The complete game tree for a game is the game tree starting at the initial position and containing all possible moves from each position.


The first two ply of the game tree for tic-tac-toe. The diagram shows the first two levels, or ply, in the game tree for tic-tac-toe. We consider all the rotations and reflections of positions as being equivalent, so the first player has three choices of move: in the center, at the edge, or in the corner. The second player has two choices for the reply if the first player played in the center, otherwise five choices. And so on. The number of leaf nodes in the complete game tree is the number of possible different ways the game can be played. For example, the game tree for tic-tac-toe has 26,830 leaf nodes. Game trees are important in artificial intelligence because one way to pick the best move in a game is to search the game tree using the minimax algorithm or its variants. The game tree for tic-tac-toe is easily searchable, but the complete game trees for larger games like chess are much too large to search. Instead, a chess-playing program searches a partial game tree: typically as many ply from the current position as it can search in the time

available. Except for the case of "pathological" game trees [1] (which seem to be quite rare in practice), increasing the search depth (i.e., the number of ply searched) generally improves the chance of picking the best move. Two-person games can also be represented as and-or trees. For the first player to win a game there must exist a winning move for all moves of the second player. This is represented in the and-or tree by using disjunction to represent the first player's alternative moves and using conjunction to represent all of the second player's moves.

Solving Game Trees:

An arbitrary game tree that has been fully colored. With a complete game tree, it is possible to "solve" the game, that is to say, find a sequence of moves that either the first or


second player can follow that will guarantee either a win or tie. The algorithm can be described recursively as follows.
1. Color the final ply of the game tree so that all wins for player 1 are colored one way, all wins for player 2 are colored another way, and all ties are colored a third way.
2. Look at the next ply up. If there exists a node colored opposite to the current player, color this node for that player as well. If all immediately lower nodes are colored for the same player, color this node for the same player as well. Otherwise, color this node a tie.
3. Repeat for each ply, moving upwards, until all nodes are colored. The color of the root node will determine the nature of the game.
The diagram shows a game tree for an arbitrary game, colored using the above algorithm. It is usually possible to solve a game (in this technical sense of "solve") using only a subset of the game tree, since in many games a move need not be analyzed if there is another move that is better for the same player (for example, alpha-beta pruning can be used in many deterministic games). Any subtree that can be used to solve the game is known as a decision tree, and the sizes of decision trees of various shapes are used as measures of game complexity.

Game of chance: A game of chance is a game whose outcome is strongly influenced by some randomizing device, and upon which contestants may or may not wager money or anything of monetary value. Common devices used include dice, spinning tops, playing cards, roulette wheels or numbered balls drawn from a container. Any game of chance that involves anything of monetary value is gambling. Gambling is known in nearly all human societies, even though many have passed laws restricting it. Early people used the knucklebones of sheep as dice. Some people develop a psychological addiction to gambling, and will risk even food and shelter to continue. Some games of chance may also involve a certain degree of skill. This is especially true where the player or players have decisions to make based upon previous or incomplete knowledge, such as in poker and blackjack. In other games, like roulette and baccarat, the player may only choose the amount of the bet and the thing he/she wants to bet on; the rest is up to chance, and therefore these games are still considered games of chance with a small amount of skill required [1]. The distinction between 'chance' and 'skill' is relevant because in some countries chance games are illegal or at least regulated, whereas skill games are not.


Unit-2 Knowledge Representation

Introduction to Knowledge Representation (KR): We argue that the notion can best be understood in terms of five distinct roles it plays, each crucial to the task at hand:
A knowledge representation (KR) is most fundamentally a surrogate, a substitute for the thing itself, used to enable an entity to determine consequences by thinking rather than acting, i.e., by reasoning about the world rather than taking action in it.
It is a set of ontological commitments, i.e., an answer to the question: in what terms should I think about the world?
It is a fragmentary theory of intelligent reasoning, expressed in terms of three components: (i) the representation's fundamental conception of intelligent reasoning; (ii) the set of inferences the representation sanctions; and (iii) the set of inferences it recommends.
It is a medium for pragmatically efficient computation, i.e., the computational environment in which thinking is accomplished. One contribution to this pragmatic efficiency is supplied by the guidance a representation provides for organizing information so as to facilitate making the recommended inferences.
It is a medium of human expression, i.e., a language in which we say things about the world.

Knowledge representation is needed for library classification and for processing concepts in an information system. In the field of artificial intelligence, problem solving can be simplified by an appropriate choice of knowledge representation. Representing the knowledge in one way may make the solution simple, while an unfortunate choice of representation may make the solution difficult or obscure; the analogy is to making computations in Hindu-Arabic numerals or in Roman numerals; long division is simpler in one and harder in the other. Likewise, there is no representation that can serve all purposes or make every problem equally approachable.

Properties for Knowledge Representation Systems: The following properties should be possessed by a knowledge representation system.
Representational Adequacy - the ability to represent the required knowledge;
Inferential Adequacy - the ability to manipulate the knowledge represented to produce new knowledge corresponding to that inferred from the original;
Inferential Efficiency - the ability to direct the inferential mechanisms into the most productive directions by storing appropriate guides;

Acquisition Efficiency - the ability to acquire new knowledge using automatic methods wherever possible rather than reliance on human intervention.

Predicate Logic: Propositional logic combines atoms. An atom contains no propositional connectives and has no internal structure (today_is_wet, john_likes_apples). Predicates allow us to talk about objects:
Properties: is_wet(today)
Relations: likes(john, apples)
In predicate logic each atom is true or false. Predicate logic is more expressive than propositional logic; examples are first-order logic and higher-order logic.

First Order Logic (used in this course; see Lecture 6 on representation in FOL):
Constants are objects: john, apples
Predicates are properties and relations: likes(john, apples)
Functions transform objects: likes(john, fruit_of(apple_tree))
Variables represent any object: likes(X, apples)
Quantifiers qualify the values of variables:
True for all objects (Universal): ∀X. likes(X, apples)
Exists at least one object (Existential): ∃X. likes(X, apples)

Example FOL sentence: "Every rose has a thorn" becomes: for all X, if (X is a rose) then there exists Y such that (X has Y) and (Y is a thorn).

Higher Order Logic:
More expressive than first order; functions and predicates are also objects.
Described by predicates: binary(addition)
Transformed by functions: differentiate(square)
Can quantify over both, e.g. define red functions as those having zero at 17.
Much harder to reason with.

Forward Chaining: In forward chaining the rules are examined one after the other in a certain order. The order might be the sequence in which the rules were entered into the rule set or some other sequence as specified by the user. As each rule is examined, the expert system attempts to evaluate whether the condition is true or false.

Rule evaluation: When the condition is true, the rule is fired and the next rule is examined. When the condition is false, the rule is not fired and the next rule is examined. It is possible that a rule cannot be evaluated as true or false. Perhaps the condition includes one or more variables with unknown values. In that case the rule condition is unknown. When a rule condition is unknown, the rule is not fired and the next rule is examined.

The iterative reasoning process: The process of examining one rule after the other continues until a complete pass has been made through the entire rule set. More than one pass usually is necessary in order to assign a value to the goal variable. Perhaps the information needed to evaluate one rule is produced by another rule that is examined subsequently. After the second rule is fired, the first rule can be evaluated on the next pass. The passes continue as long as it is possible to fire rules. When no more rules can be fired, the reasoning process ceases.

Example of forward reasoning: Letters are used for the conditions and actions to keep the illustration simple. In rule 1, for example, if condition A exists then action B is taken. Condition A might be THIS.YEAR.SALES > LAST.YEAR.SALES.
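A minimal sketch of the forward-chaining passes described above; rules are (condition set, action) pairs and single letters stand for conditions and actions, as in the example.

def forward_chain(rules, facts):
    # Repeated passes over the rule set: fire every rule whose condition is satisfied.
    facts = set(facts)
    fired = True
    while fired:                                 # keep passing until no more rules can fire
        fired = False
        for condition, action in rules:
            if condition <= facts and action not in facts:
                facts.add(action)                # the rule fires: its action becomes a new fact
                fired = True
    return facts

# Hypothetical rule set in "if ... then ..." form (conditions are sets of facts)
rules = [({'A'}, 'B'),          # rule 1: if A then B
         ({'B', 'C'}, 'D'),     # rule 2: if B and C then D
         ({'D'}, 'GOAL')]       # rule 3: if D then GOAL
print(forward_chain(rules, {'A', 'C'}))   # {'A', 'B', 'C', 'D', 'GOAL'}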

Backward chaining: Backward chaining is an inference method used in automated theorem provers, proof assistants and other artificial intelligence applications. It is one of the two most commonly used methods of reasoning with inference rules and logical implications (the other is forward chaining). Backward chaining is implemented in logic programming by SLD resolution. Both are based on the modus ponens inference rule. Backward chaining starts with a list of goals (or a hypothesis) and works backwards from the consequent to the antecedent to see if there is data available that will support any of these consequents. An inference engine using backward chaining would search the inference rules until it finds one which has a consequent (Then clause) that matches a desired goal. If the antecedent (If clause) of that rule is not known to be true, then it is added to the list of goals (in order for one's goal to be confirmed one must also provide data that confirms this new rule). For example, suppose that the goal is to conclude the color of my pet Fritz, given that he croaks and eats flies, and that the rule base contains the following four rules:
1. If X croaks and eats flies Then X is a frog
2. If X chirps and sings Then X is a canary
3. If X is a frog Then X is green
4. If X is a canary Then X is yellow

This rule base would be searched and the third and fourth rules would be selected, because their consequents (Then Fritz is green, Then Fritz is yellow) match the goal (to determine Fritz's color). It is not yet known that Fritz is a frog, so both the antecedents (If Fritz is a frog, If Fritz is a canary) are added to the goal list. The rule base is again searched and this time the first two rules are selected, because their consequents (Then X is a frog, Then X is a canary) match the new goals that were just added to the list. The antecedent (If Fritz croaks and eats flies) is known to be true and therefore it can be concluded that Fritz is a frog, and not a canary. The goal of determining Fritz's color is now achieved (Fritz is green if he is a frog, and yellow if he is a canary, but he is a frog since he croaks and eats flies; therefore, Fritz is green). Note that the goals always match the affirmed versions of the consequents of implications (and not the negated versions as in modus tollens), and even then, their antecedents are then considered as the new goals (and not the conclusions as in affirming the consequent), which ultimately must match known facts (usually defined as consequents whose antecedents are always true); thus, the inference rule which is used is modus ponens. Because the list of goals determines which rules are selected and used, this method is called goal-driven, in contrast to data-driven forward-chaining inference. The backward chaining approach is often employed by expert systems.
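A propositional sketch of backward chaining on the Fritz rule base above; variable binding is left out (X is taken to be Fritz throughout), so this only illustrates the goal-driven control flow, not full SLD resolution.

def backward_chain(goal, rules, facts):
    # Prove `goal` either directly from the facts, or by finding a rule whose
    # consequent matches it and recursively proving all of that rule's antecedents.
    if goal in facts:
        return True
    for antecedents, consequent in rules:
        if consequent == goal and all(backward_chain(a, rules, facts) for a in antecedents):
            return True
    return False

rules = [
    (['Fritz croaks', 'Fritz eats flies'], 'Fritz is a frog'),
    (['Fritz chirps', 'Fritz sings'], 'Fritz is a canary'),
    (['Fritz is a frog'], 'Fritz is green'),
    (['Fritz is a canary'], 'Fritz is yellow'),
]
facts = {'Fritz croaks', 'Fritz eats flies'}
print(backward_chain('Fritz is green', rules, facts))    # True
print(backward_chain('Fritz is yellow', rules, facts))   # False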

Conceptual Dependency formalism: Conceptual dependency (CD) is a theory of natural language processing which mainly deals with the representation of the semantics of a language. The main motivations for the development of CD as a knowledge representation technique are given below:


To construct computer programs that can understand natural language.
To make inferences from the statements and also to identify conditions in which two sentences can have similar meaning.
To provide facilities for the system to take part in dialogues and answer questions.
To provide a means of representation which is language independent.
Knowledge is represented in CD by elements called conceptual structures. The basis of CD representation is that for two sentences which have identical meaning there must be only one representation, and implicitly packed information must be explicitly stated. In order that knowledge is represented in CD form, certain primitive actions have been developed. The table below lists the primitive CD actions. Apart from the primitive CD actions, one has to make use of the following six categories of types of objects.
1. PPs (picture producers): Only physical objects are picture producers.
2. ACTs: Actions are done by an actor to an object. The table gives the major ACTs.

Table: Primitive CD forms

CD primitive action - Explanation
1. ATRANS - transfer of an abstract relationship (e.g. give)
2. PTRANS - transfer of the physical location of an object (e.g. go)
3. PROPEL - application of a physical force to an object (e.g. throw)
4. MOVE - movement of a body part of an animal by the animal
5. GRASP - grasping of an object by an actor (e.g. hold)
6. INGEST - taking of an object by an animal to the inside of that animal (e.g. drink, eat)
7. EXPEL - expulsion of an object from inside the body of an animal to the world (e.g. spit)
8. MTRANS - transfer of mental information between animals or within an animal (e.g. tell)
9. MBUILD - construction of new information from old information (e.g. decide)
10. SPEAK - action of producing sound (e.g. say)

3. LOCs: Locations - every action takes place at some location, which serves as source and destination.
4. Ts: Times - an action can take place at a particular location at a given specified time. The time can be represented on an absolute scale or a relative scale.
5. AAs: Action aiders - these serve as modifiers of actions; the ACT PROPEL, for example, has a speed factor associated with it, which is an action aider.
6. PAs: Picture aiders - these serve as aiders of picture producers. Every object that serves as a PP needs certain characteristics by which it is defined; PAs serve PPs by defining those characteristics.
There are certain rules by which the conceptual categories of types of objects discussed can be combined. CD models provide the following advantages for representing knowledge. The ACT primitives help in representing wide knowledge in a succinct way. To


illustrate this, consider the following verbs, all of which correspond to a transfer of mental information: see, learn, hear, inform, remember. In CD representation all of these are represented using the single ACT primitive MTRANS; they are not represented individually. Similarly, different verbs that indicate various activities are clubbed under unique ACT primitives, thereby reducing the number of inference rules.

The main goal of CD representation is to make explicit what is implicit. That is why every statement that is made has not only the actors and objects but also the time and location, source and destination.

The following set of conceptual tenses makes the usage of CD more precise:
O - object case relationship
R - recipient case relationship
P - past
F - future
T - transition
Ts - start transition
Tf - finished transition
K - continuing
? - interrogative
/ - negative
Nil - present
Delta - timeless
C - conditional

CD brought forward the notion of language independence, because all ACTs are language-independent primitives.

Semantic Nets: The main idea behind semantic nets is that the meaning of a concept comes from the ways in which it is connected to other concepts. In a semantic net, information is represented as a set of nodes connected to each other by a set of labeled arcs, which represent relationships among the nodes. A fragment of a typical semantic net is shown in the figure.

[Figure: a semantic network with nodes Mammal, Person, Nose, Pee-Wee-Reese and Brooklyn-Dodgers, linked by isa, has-part, instance, uniform-color (Blue) and team arcs.]

Fig: A semantic network

This network contains an example of both the isa and instance relations, as well as some other, more domain-specific relations like team and uniform-color. In this network, we could use inheritance to derive the additional relation Has-part(Pee-Wee-Reese, Nose).
1. Intersection Search. One of the early ways that semantic nets were used was to find relationships among objects by spreading activation out from each of two nodes and seeing where the activation met. This process is called intersection search. Using this process, it is possible to use the network of the figure to answer questions such as: what is the connection between the Brooklyn Dodgers and Blue? A small sketch of net lookup with inheritance follows.
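A toy dictionary-based semantic net with instance/isa inheritance, using the node and relation names from the figure; the lookup function is an illustrative assumption, not a standard API.

# (node, relation) -> value pairs standing in for the labeled arcs of the net
net = {
    ('Person', 'isa'): 'Mammal',
    ('Person', 'has-part'): 'Nose',
    ('Pee-Wee-Reese', 'instance'): 'Person',
    ('Pee-Wee-Reese', 'team'): 'Brooklyn-Dodgers',
    ('Pee-Wee-Reese', 'uniform-color'): 'Blue',
}

def get_value(node, relation):
    # Look for the relation on the node itself, then climb instance/isa links (inheritance).
    while node is not None:
        if (node, relation) in net:
            return net[(node, relation)]
        node = net.get((node, 'instance')) or net.get((node, 'isa'))
    return None

print(get_value('Pee-Wee-Reese', 'has-part'))   # Nose, inherited from Person
print(get_value('Pee-Wee-Reese', 'team'))       # Brooklyn-Dodgers, stored directly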

2. Representing Nonbinary Predicates. Semantic nets are a natural way to represent relationships that would appear as ground instances of binary predicates in predicate logic. For example, some of the arcs from the figure could be represented in logic as
Isa(Person, Mammal)
Instance(Pee-Wee-Reese, Person)
Team(Pee-Wee-Reese, Brooklyn-Dodgers)
Uniform-color(Pee-Wee-Reese, Blue)
But knowledge expressed by predicates of other arities can also be expressed in semantic nets. We have already seen that many unary predicates can be thought of as binary predicates using general-purpose relations such as Isa and Instance. So, for example, Man(Marcus) could be rewritten as Instance(Marcus, Man), thereby making it easy to represent in a semantic net.
3. Partitioned Semantic Nets. Suppose we want to represent simple quantified expressions in semantic nets. One way to do this is to partition the semantic net into a hierarchical set of spaces, each of which corresponds to the scope of one or more variables. To see how this works, consider first the simple net shown in the figure. This net corresponds to the statement: The dog bit the mail carrier.

The nodes Dogs, Bite, and Mail-Carrier represent the classes of dogs, biting, and mail carriers, respectively, while the nodes d, b, and m represent a particular dog, a particular biting, and a particular mail carrier. This fact can easily be represented by a single net with no partitioning. But now suppose that we want to represent the fact: Every dog has bitten a mail carrier. Or, in logic:
∀x: dog(x) → ∃y: Mail-carrier(y) ∧ bite(x, y)

To represent this fact, it is necessary to encode the scope of the universally quantified variable x.

[Figure: a partitioned semantic net with class nodes Dogs, Bite and Mail-carrier, instance nodes d, b and m, and isa, assailant and victim arcs.]

Fig: Using partitioned semantic nets

Frame: A frame is a collection of attributes (usually called slots) and associated values (and possibly constraints on values) that describes some entity in the world. Sometimes a frame

describes an entity in some absolute sense; sometimes it represents the entity from a particular point of view. A single frame taken alone is rarely useful. Instead, we build frame systems out of collections of frames that are connected to each other. Set theory provides a good basis for understanding frame systems. Although not all frame systems are defined this way, we do so here. In this view, each frame represents either a class (a set) or an instance (an element of a class). To see how this works, consider the frame system shown in the figure. In this example, the frames Person, Adult-Male, ML-Baseball-Player, Fielder, and ML-Baseball-Team are all classes. The frames Pee-Wee-Reese and Brooklyn-Dodgers are instances.

Person
  isa: Mammal
  cardinality: 6,000,000,000
  *handed: Right

Adult-Male
  isa: Person
  cardinality: 2,000,000,000
  *height: 5-10

ML-Baseball-Player
  isa: Adult-Male
  cardinality: 624
  *height: 6-1
  *bats: equal to handed
  *batting-average: .252
  *team:
  *uniform-color:

Fielder
  isa: ML-Baseball-Player
  cardinality: 376
  *batting-average: .262

Pee-Wee-Reese
  instance: Fielder
  height: 5-10
  bats: Right
  batting-average: .309
  team: Brooklyn-Dodgers
  uniform-color: Blue

ML-Baseball-Team
  isa: Team
  cardinality: 26
  *team-size: 24
  *manager:

Brooklyn-Dodgers
  instance: ML-Baseball-Team
  team-size: 24
  manager: Leo-Durocher
  players: (Pee-Wee-Reese, ...)

fig. A simplified frame system
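To make the slot-and-filler idea concrete, here is a minimal, hypothetical Python sketch of part of the frame system above. The dictionary layout and the get_slot helper are assumptions made for illustration; starred slots stand for inheritable defaults, as in the figure, and frames not listed (such as Mammal and Team) are simply omitted.

    # A minimal frame system as Python dictionaries. A slot lookup first checks the
    # frame itself, then follows instance/isa links upward, so defaults such as
    # *handed are inherited unless a more specific frame overrides them.
    frames = {
        "Person":             {"isa": "Mammal", "*handed": "Right"},
        "Adult-Male":         {"isa": "Person", "*height": "5-10"},
        "ML-Baseball-Player": {"isa": "Adult-Male", "*height": "6-1",
                               "*batting-average": 0.252},
        "Fielder":            {"isa": "ML-Baseball-Player", "*batting-average": 0.262},
        "Pee-Wee-Reese":      {"instance": "Fielder", "height": "5-10",
                               "batting-average": 0.309, "team": "Brooklyn-Dodgers",
                               "uniform-color": "Blue"},
    }

    def get_slot(frame, slot):
        """Return the value of `slot`, inheriting a default (*slot) along instance/isa links."""
        while frame in frames:
            data = frames[frame]
            if slot in data:
                return data[slot]
            if "*" + slot in data:            # class-level default value
                return data["*" + slot]
            frame = data.get("instance") or data.get("isa")
        return None

    print(get_slot("Pee-Wee-Reese", "batting-average"))  # 0.309 (own slot)
    print(get_slot("Pee-Wee-Reese", "handed"))           # 'Right' (inherited from Person)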

Wff. Not all strings can represent propositions of the predicate logic. Those which produce a proposition when their symbols are interpreted must follow the rules given below, and they are called wffs (well-formed formulas) of the first-order predicate logic.

Rules for constructing Wffs

A predicate name followed by a list of variables, such as P(x, y), where P is a predicate name and x and y are variables, is called an atomic formula. Wffs are constructed using the following rules:

1. True and False are wffs.

2. Each propositional constant (i.e. a specific proposition) and each propositional variable (i.e. a variable representing propositions) is a wff.
3. Each atomic formula (i.e. a specific predicate with variables) is a wff.
4. If A and B are wffs, then so are ¬A, (A ∧ B), (A ∨ B), (A → B), and (A ↔ B).
5. If x is a variable (representing objects of the universe of discourse) and A is a wff, then so are ∀x A and ∃x A.

For example, "The capital of Virginia is Richmond." is a specific proposition; hence it is a wff by Rule 2. Let B be a predicate name representing "being blue" and let x be a variable. Then B(x) is an atomic formula meaning "x is blue"; thus it is a wff by Rule 3 above. By applying Rule 5 to B(x), ∀x B(x) is a wff, and so is ∃x B(x). Then by applying Rule 4 to them, ∀x B(x) → ∃x B(x) is seen to be a wff. Similarly, if R is a predicate name representing "being round", then R(x) is an atomic formula and hence a wff. By applying Rule 4 to B(x) and R(x), the wff B(x) ∧ R(x) is obtained. In this manner, larger and more complex wffs can be constructed following the rules given above. Note, however, that strings that cannot be constructed by using those rules are not wffs. For example, ∀x B(x)R(x) and B(∃x) are NOT wffs, nor are B(R(x)) and B(∃x R(x)). One way to check whether or not an expression is a wff is to try to state it in English. If you can translate it into a

correct English sentence, then it is a wff.

More examples: To express the fact that Tom is taller than John, we can use the atomic formula taller(Tom, John), which is a wff. This wff can also be part of some compound statement, such as taller(Tom, John) ∧ ¬taller(John, Tom), which is also a wff. If x is a variable representing people in the world, then taller(x, Tom), ∃x taller(x, Tom), ∀x taller(x, Tom), and ∀x ∃y taller(x, y) are all wffs, among others. However, taller(∃x, John) and taller(Tom Mary, Jim), for example, are NOT wffs.
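The rules above are naturally checked by a small recursive procedure. The sketch below is illustrative only: formulas are encoded as nested Python tuples (an assumed representation, not part of the text), and is_wff applies Rules 1-5.

    # Formulas are nested tuples: ("P", "x", "y") is an atomic formula P(x, y);
    # connectives and quantifiers wrap sub-formulas.
    CONNECTIVES_1 = {"not"}
    CONNECTIVES_2 = {"and", "or", "implies", "iff"}
    QUANTIFIERS   = {"forall", "exists"}

    def is_wff(f):
        if f in ("True", "False"):                      # Rule 1
            return True
        if not isinstance(f, tuple) or not f:
            return False
        head = f[0]
        if head in CONNECTIVES_1:                       # Rule 4: ¬A
            return len(f) == 2 and is_wff(f[1])
        if head in CONNECTIVES_2:                       # Rule 4: A∧B, A∨B, A→B, A↔B
            return len(f) == 3 and is_wff(f[1]) and is_wff(f[2])
        if head in QUANTIFIERS:                         # Rule 5: ∀x A, ∃x A
            return len(f) == 3 and isinstance(f[1], str) and is_wff(f[2])
        # Otherwise treat it as an atomic formula: predicate name plus terms (Rule 3).
        return all(isinstance(arg, str) for arg in f[1:])

    print(is_wff(("implies", ("forall", "x", ("B", "x")),
                             ("exists", "x", ("B", "x")))))   # True: ∀x B(x) → ∃x B(x)
    print(is_wff(("B", ("R", "x"))))                           # False: a predicate applied to a formula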


Unit-3 Handling Uncertainty and Learning

Fuzzy Logic: In the techniques discussed so far, we have not modified the mathematical underpinnings provided by set theory and logic. We have instead augmented those ideas with additional constructs provided by probability theory. Here we take a different approach and briefly consider what happens if we make

fundamental changes to our idea of set membership and corresponding changes to our definitions of logical operations. The motivation for fuzzy sets is provided by the need to represent propositions such as:

John is very tall.
Mary is slightly ill.
Sue and Linda are close friends.
Exceptions to the rule are nearly impossible.
Most Frenchmen are not very tall.

While traditional set theory defines set membership as a Boolean predicate, fuzzy set theory allows us to represent set membership as a possibility distribution. Once set membership has been redefined in this way, it is possible to define a reasoning system based on techniques for combining distributions. Such reasoners have been applied to control systems for devices as diverse as trains and washing machines.
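A fuzzy set is usually implemented as a membership function returning a degree between 0 and 1. The Python sketch below is illustrative; the particular thresholds chosen for "tall" and the min/max/complement operators are common conventions assumed here, not values taken from the text.

    def tall(height_cm):
        """A possible membership function for the fuzzy set 'tall':
        0 below 160 cm, 1 above 190 cm, linear in between (illustrative values)."""
        if height_cm <= 160:
            return 0.0
        if height_cm >= 190:
            return 1.0
        return (height_cm - 160) / 30.0

    # Common fuzzy-logic operators on membership degrees.
    def fuzzy_and(a, b): return min(a, b)
    def fuzzy_or(a, b):  return max(a, b)
    def fuzzy_not(a):    return 1.0 - a

    # "John is very tall": 'very' is often modelled by squaring the membership degree.
    john = 183
    print(round(tall(john), 2))        # about 0.77 -> John is fairly tall
    print(round(tall(john) ** 2, 2))   # about 0.59 -> degree to which John is *very* tall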

Dempster-Shafer Theory: This theory was developed by Dempster (1968) and extended by Shafer (1976). The approach considers sets of propositions and assigns to each of them an interval

[Belief, Plausibility]

in which the degree of belief must lie. Belief (usually denoted Bel) measures the strength of the evidence in favor of a set of propositions. It ranges from 0 (indicating no evidence) to 1 (denoting certainty). A belief function, Bel, corresponding to a specific mass assignment m, is defined for a set A as the sum of the masses committed by m to every subset of A. That is, Bel(A) is a measure of the total support or belief committed to the set A, and sets a minimum value for its likelihood. It is defined in terms of all belief assigned to A as well as to all proper subsets of A. Thus,

Bel(A) = Σ m(B), where the sum is taken over all subsets B of A.

For example, if U contains the mutually exclusive subsets A, B, C, and D, then

Bel({A,C,D}) = m({A,C,D}) + m({A,C}) + m({A,D}) + m({C,D}) + m({A}) + m({C}) + m({D})

In Dempster-Shafer theory, a belief interval can also be defined for a subset A. It is represented as the subinterval [Bel(A), Pl(A)] of [0, 1]. Bel(A) is also called the support of A, and Pl(A) = 1 - Bel(¬A) is the plausibility of A. We define Bel(∅) = 0 to signify that no belief should be assigned to the empty set, and Bel(U) = 1 to show that the truth is contained within U. The subsets A of U with m(A) > 0 are called the focal elements of the support function Bel.

Since Bel(A) only partially describes the beliefs about proposition A, it is useful to also have a measure of the extent to which one believes in ¬A, that is, the doubt regarding A. For this, we define the doubt of A as D(A) = Bel(¬A). From this definition it will be seen that the upper bound of the belief interval noted above, Pl(A), can be expressed as

Pl(A) = 1 - D(A) = 1 - Bel(¬A)

Pl(A) represents an upper belief limit on the proposition A. The belief interval [Bel(A), Pl(A)] is also sometimes referred to as the confidence in A, while the quantity Pl(A) - Bel(A) is referred to as the uncertainty in A. It can be shown that

Pl(∅) = 0, Pl(U) = 1
Pl(A) ≥ Bel(A) for all A
Bel(A) + Bel(¬A) ≤ 1, Pl(A) + Pl(¬A) ≥ 1
For A ⊆ B: Bel(A) ≤ Bel(B) and Pl(A) ≤ Pl(B)

As an example of the above concepts, recall once again the problem of identifying which of the terrorist organizations A, B, C, or D could have been responsible for the attack. The possible subsets of U in this case form a lattice of sixteen subsets (see fig.).

Fig: Lattice of the sixteen subsets of the universe U = {A, B, C, D}, ranging from the full set {A, B, C, D} at the top, through the three-element and two-element subsets, down to the singletons {A}, {B}, {C}, {D} and the empty set ∅.

Assume one piece of evidence supports the belief that groups A and C were responsible to a degree m1({A, C}) = 0.6, and another source of evidence disproves the belief that C was involved (and therefore supports the belief that the three organizations A, B, and D were responsible), that is, m2({A, B, D}) = 0.7. To obtain the pooled evidence, we compute the following quantities:

m1 ⊕ m2({A}) = (0.6)(0.7) = 0.42
m1 ⊕ m2({A, C}) = (0.6)(0.3) = 0.18
m1 ⊕ m2({A, B, D}) = (0.4)(0.7) = 0.28
m1 ⊕ m2(U) = (0.4)(0.3) = 0.12
m1 ⊕ m2 = 0 for all other subsets of U

Bel1({A, C}) = m({A, C}) + m({A}) + m({C})
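The pooled masses above can be reproduced with a small implementation of Dempster's rule of combination. The sketch is illustrative (the combine function and the data layout are assumptions), but it prints exactly the 0.42, 0.18, 0.28 and 0.12 figures computed in the example.

    from itertools import product

    def combine(m1, m2):
        """Dempster's rule of combination for two basic probability assignments.
        Each assignment maps a frozenset of hypotheses to a mass; the universe U
        carries any mass not committed elsewhere."""
        combined = {}
        conflict = 0.0
        for (a, ma), (b, mb) in product(m1.items(), m2.items()):
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb
        # Normalise by the non-conflicting mass (here the conflict is 0).
        return {s: m / (1.0 - conflict) for s, m in combined.items()}

    U = frozenset("ABCD")
    m1 = {frozenset("AC"): 0.6, U: 0.4}     # evidence for {A, C}
    m2 = {frozenset("ABD"): 0.7, U: 0.3}    # evidence against C, i.e. for {A, B, D}

    for subset, mass in combine(m1, m2).items():
        print(sorted(subset), round(mass, 2))
    # {'A'}: 0.42   {'A','C'}: 0.18   {'A','B','D'}: 0.28   U: 0.12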

Bayes Theorem: An important goal for many problem-solving systems is to collect evidence as the system goes along and to modify its behavior on the basis of that evidence. To model this behavior, we need a statistical theory of evidence. Bayesian statistics is such a theory. The fundamental notion of Bayesian statistics is that of conditional probability:

P(H | E)

Read this expression as the probability of hypothesis H given that we have observed evidence E. To compute this, we need to take into account the prior probability of H and the extent to which E provides evidence of H. To do this, we need to define a universe that contains an exhaustive, mutually exclusive set of hypotheses Hi among which we are trying to discriminate. Then let

P(Hi | E) = the probability that hypothesis Hi is true given evidence E
P(E | Hi) = the probability that we will observe evidence E given that hypothesis Hi is true

P(Hi) = the a priori probability that hypothesis Hi is true in the absence of any specific evidence; these probabilities are called prior probabilities, or priors
k = the number of possible hypotheses

Bayes' theorem then states that

P(Hi | E) = P(E | Hi) · P(Hi) / Σ_{n=1..k} P(E | Hn) · P(Hn)

Specifically, when we say P(A | B), we are describing the conditional probability of A given that the only evidence we have is B. If there is also other relevant evidence, then it too must be considered. Suppose, for example, that we are solving a medical diagnosis problem. Consider the following assertions:

S: patient has spots
M: patient has measles
F: patient has high fever

Without any additional evidence, the presence of spots serves as evidence in favor of measles. It also serves as evidence of fever, since measles would cause fever. But since spots and fever are not independent events, we cannot just sum their effects; instead, we need to represent explicitly the conditional probability that arises from their conjunction. In general, given a

prior body of evidence e and some new observation E, we need to compute

P(H | E, e) = P(H | E) · P(e | E, H) / P(e | E)

Unfortunately, in an arbitrarily complex world, the size of the set of joint probabilities that we require in order to compute this function grows as 2^n if there are n different propositions being considered. This makes using Bayes' theorem intractable for several reasons:

The knowledge acquisition problem is insurmountable; too many probabilities have to be provided.
The space that would be required to store all the probabilities is too large.
The time required to compute the probabilities is too large.

Despite these problems, though, Bayesian statistics provides an attractive basis for an uncertain reasoning system. As a result, several mechanisms for exploiting its power while at the same time making it tractable have been developed.
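As a concrete illustration of the theorem, the sketch below computes posteriors from priors and likelihoods for the measles/spots example. The numerical probabilities are made up purely for illustration and are not taken from the text.

    def bayes(priors, likelihoods):
        """Posterior P(Hi | E) from priors P(Hi) and likelihoods P(E | Hi), using
        P(Hi | E) = P(E | Hi) P(Hi) / sum_n P(E | Hn) P(Hn)."""
        evidence = sum(likelihoods[h] * priors[h] for h in priors)
        return {h: likelihoods[h] * priors[h] / evidence for h in priors}

    # Illustrative (made-up) numbers for the measles example:
    priors      = {"measles": 0.01, "no-measles": 0.99}   # P(H)
    likelihoods = {"measles": 0.80, "no-measles": 0.05}   # P(spots | H)

    print(bayes(priors, likelihoods))
    # P(measles | spots) is roughly 0.14: observing spots raises the probability
    # of measles from 1% to about 14%.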

Learning: One of the most often heard criticisms of AI is that machines cannot be called intelligent until they are able to learn to do new things and to adapt to new situations, rather than simply doing as they are told to do. There can be little question

that the ability to adapt to new surroundings and to solve new problems is an important characteristic of intelligent entities. A learning system must interpret its inputs in such a way that its performance gradually improves. Learning denotes changes in the system that are adaptive in the sense that they enable the system to do the same task, or tasks drawn from the same population, more efficiently and more effectively the next time. Learning covers a wide range of phenomena.

1. At one end of the spectrum is skill refinement. People get better at many tasks simply by practicing. The more you ride a bicycle or play tennis, the better you get.
2. At the other end of this spectrum lies knowledge acquisition. Knowledge is generally acquired through experience.
3. Many AI programs are able to improve their performance substantially through rote-learning techniques.
4. Another way we learn is through taking advice from others. Advice taking is similar to rote learning, but high-level advice may not be in a form simple enough for a program to use directly in problem solving.

5. People also learn through their own problem-solving experience. After solving a complex problem, we

remember the structure of the problem and the methods we used to solve it. The next time we see the problem, we can solve it more efficiently. Moreover, we can generalize from our experience to solve related problems more easily: the program remembers its experiences and generalizes from them. In large problem spaces, however, efficiency gains are critical. Learning can mean the difference between solving a problem rapidly and not solving it at all. In addition, programs that learn through problem-solving experience may be able to come up with qualitatively better solutions in the future.

6. Another form of learning that does involve stimuli from the outside is learning from examples. Learning from examples usually involves a teacher who helps us classify things by correcting us when we are wrong. Sometimes, however, a program can discover things without the aid of a teacher. Learning is itself a problem-solving process.

Learning Model: Learning can be accomplished using a number of different methods. For example, we can learn by memorizing facts, by being told, or by studying examples like problem solutions. Learning requires that new knowledge structures be created from some form of input stimulus. This new knowledge must then be assimilated into a knowledge base and be tested in some way for its utility. Testing means that the knowledge should be used in the performance of

some task from which meaningful feedback can be obtained, where the feedback provides some measure of the accuracy and usefulness of the newly acquired knowledge. The learning model is depicted in the figure below, where the environment has been included as part of the overall learner system. The environment may be regarded either as a form of nature which produces random stimuli, or as a more organized training source, such as a teacher, which provides carefully selected training examples for the learner component.

Fig. Learning Model (the environment or teacher supplies stimuli and training examples to the learner component; the learner builds and modifies the knowledge base; the performance component uses the knowledge base to carry out tasks and produce responses; and the critic/evaluator assesses those responses to provide feedback to the learner)

The actual form of environment used will depend on the particular learning paradigm. In any case, some representation language must be assumed for communication between the environment and the learner. The language may be the same representation scheme as that used in the knowledge base (such as a form of predicate calculus). When they are chosen to be the same, we say the single representation trick is being used. This usually results in a simpler implementation, since it is not necessary to transform between two or more different representations. Inputs to the learner component may be physical stimuli of some type or descriptive, symbolic training examples. The information conveyed to the learner component is used to create and modify knowledge structures in the knowledge base. When given a task, the performance component produces a response describing its actions in performing the task. The critic module then evaluates this response relative to an optimal response. The cycle described above may be repeated a number of times until the performance of the system has reached some acceptable level, until a known learning goal has been reached, or until changes cease to occur in the knowledge base after some chosen number of training examples have been observed.
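This cycle can be caricatured in a few lines of code. The sketch below is only illustrative (the environment, the single-threshold knowledge base, and the 0.95 acceptance level are all assumptions made for the example): a teacher supplies labelled examples, the learner nudges the knowledge base after each mistake, and the critic stops the cycle once performance is acceptable.

    import random

    def environment():
        """Teacher: produce one training example (x, label); label is 'big' iff x >= 50."""
        x = random.randint(0, 100)
        return x, ("big" if x >= 50 else "small")

    knowledge_base = {"threshold": 0}          # initial (poor) background knowledge

    def perform(x):                            # performance component
        return "big" if x >= knowledge_base["threshold"] else "small"

    def critic(sample):                        # critic/evaluator: fraction correct
        return sum(perform(x) == label for x, label in sample) / len(sample)

    random.seed(0)
    while True:
        x, label = environment()               # stimulus / training example
        if perform(x) != label:                # learner adjusts the knowledge base
            knowledge_base["threshold"] += 1 if label == "small" else -1
        if critic([environment() for _ in range(200)]) > 0.95:   # acceptable performance
            break

    print(knowledge_base)                      # the threshold has converged near 50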

There are several important factors which influence a system's ability to learn, in addition to the form of representation used. They include the types of training provided, the form and extent of any initial background knowledge, the type of feedback provided, and the learning algorithms used (see fig.).

fig. Factors affecting learning performance (the training scenario, background knowledge, feedback, representation scheme, and learning algorithms together determine the resultant performance)

Finally, the learning algorithms themselves determine to a large extent how successful a learning system will be. The algorithms control the search to find and build the knowledge structures. We then expect that the algorithms that extract much of the

useful information from training examples and take advantage of any background knowledge outperform those that do not.

Supervised learning: Supervised learning is the machine-learning task of inferring a function from supervised training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which is called a classifier (if the output is discrete) or a regression function (if the output is continuous). The inferred function should predict the correct output value for any valid input object. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way (its inductive bias). (Compare with unsupervised learning, below.) The parallel task in human and animal psychology is often referred to as concept learning.

Overview

In order to solve a given problem of supervised learning, one has to perform the following steps:

1. Determine the type of training examples. Before doing anything else, the engineer should decide what kind of data is to be used as an example. For instance, this might be a single handwritten character, an entire handwritten word, or an entire line of handwriting.

2. Gather a training set. The training set needs to be representative of the real-world use of the function. Thus, a set of input objects is gathered and corresponding outputs are also gathered, either from human experts or from measurements.
3. Determine the input feature representation of the learned function. The accuracy of the learned function depends strongly on how the input object is represented. Typically, the input object is transformed into a feature vector, which contains a number of features that are descriptive of the object. The number of features should not be too large, because of the curse of dimensionality, but should contain enough information to accurately predict the output.
4. Determine the structure of the learned function and corresponding learning algorithm. For example, the engineer may choose to use support vector machines or decision trees.
5. Complete the design. Run the learning algorithm on the gathered training set. Some supervised learning algorithms require the user to determine certain control parameters. These parameters may be adjusted by optimizing performance on a subset (called a validation set) of the training set, or via cross-validation.
6. Evaluate the accuracy of the learned function. After parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set.
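A toy end-to-end illustration of these steps, using a 1-nearest-neighbour learner written in plain Python, follows; the data and the choice of learner are assumptions made only for the example.

    # Steps 2-6 in miniature: gather labelled examples, choose a feature
    # representation, learn a function (here 1-nearest-neighbour), and evaluate
    # it on examples held out from training.  The data are made up.
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def nearest_neighbour_classifier(training):
        def classify(x):
            return min(training, key=lambda ex: distance(ex[0], x))[1]
        return classify

    # (feature vector, label): e.g. (height_cm, weight_kg) -> species
    data = [((20, 4), "cat"), ((24, 5), "cat"), ((60, 25), "dog"),
            ((55, 22), "dog"), ((22, 4), "cat"), ((65, 30), "dog")]

    train, test = data[:4], data[4:]                  # training set + separate test set
    classify = nearest_neighbour_classifier(train)    # choose and run the learner

    accuracy = sum(classify(x) == y for x, y in test) / len(test)   # evaluate
    print(accuracy)                                   # 1.0 on this toy data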


Factors to consider

Factors to consider when choosing and applying a learning algorithm include the following:

1. Heterogeneity of the data. If the feature vectors include features of many different kinds (discrete, discrete ordered, counts, continuous values), some algorithms are easier to apply than others. Many algorithms, including Support Vector Machines, linear regression, logistic regression, neural networks, and nearest-neighbor methods, require that the input features be numerical and scaled to similar ranges (e.g., to the [-1, 1] interval). Methods that employ a distance function, such as nearest-neighbor methods and support vector machines with Gaussian kernels, are particularly sensitive to this. An advantage of decision trees is that they easily handle heterogeneous data.

2. Redundancy in the data. If the input features contain redundant information (e.g., highly correlated features), some learning algorithms (e.g., linear regression, logistic regression, and distance-based methods) will perform poorly because of numerical instabilities. These problems can often be solved by imposing some form of regularization.

3. Presence of interactions and non-linearities. If each of the features makes an independent contribution to the output, then algorithms based on linear functions (e.g., linear regression, logistic regression, Support Vector Machines, naive Bayes) and distance functions (e.g., nearest-neighbor methods, support vector machines with Gaussian kernels) generally perform well. However, if there are complex interactions among features, then algorithms such as decision trees and neural networks work better, because they are specifically designed to discover these interactions. Linear methods can also be applied, but the engineer must manually specify the interactions when using them.

How supervised learning algorithms work

Given a set of training examples of the form {(x1, y1), ..., (xN, yN)}, a learning algorithm seeks a function g: X → Y, where X is the input space and Y is the output space. The function g is an element of some space of possible functions G, usually called the hypothesis space. It is sometimes convenient to represent g using a scoring function f: X × Y → R such that g is defined as returning the y value that gives the highest score: g(x) = arg max_y f(x, y). Let F denote the space of scoring functions. Although G and F can be any space of functions, many learning algorithms are probabilistic models where g takes the form of a conditional probability model g(x) = P(y | x), or f takes the form of a joint probability model f(x, y) = P(x, y). For example, naive Bayes and linear discriminant analysis are joint probability models, whereas logistic regression is a conditional probability model. There are two basic approaches to choosing f or g: empirical risk minimization and structural risk minimization[3]. Empirical risk minimization seeks the function that best fits the training data.

Structural risk minimization includes a penalty function that controls the bias/variance tradeoff. In both cases, it is assumed that the training set consists of a sample of independent and identically distributed pairs (xi, yi). In order to measure how well a function fits the training data, a loss function L is defined. For a training example (xi, yi), the loss of predicting the value ŷ is L(yi, ŷ). The risk R(g) of function g is defined as the expected loss of g. This can be estimated from the training data as

R_emp(g) = (1/N) Σi L(yi, g(xi))

Generalizations of supervised learning

There are several ways in which the standard supervised learning problem can be generalized:

Semi-supervised learning: In this setting, the desired output values are provided only for a subset of the training data. The remaining data is unlabeled.

Active learning: Instead of assuming that all of the training examples are given at the start, active learning algorithms interactively collect new examples, typically by making queries to a human user. Often, the queries are based on unlabeled data, which is a scenario that combines semi-supervised learning with active learning.
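The following sketch illustrates empirical risk minimisation over a deliberately tiny hypothesis space; the data, the candidate slopes, and the squared-error loss are all assumptions made for the example.

    # Among a small hypothesis space G of candidate functions, pick the one with
    # the lowest average loss on the training set.
    training = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]   # made-up (x, y) pairs

    def loss(y_true, y_pred):                  # squared-error loss L(y, y_hat)
        return (y_true - y_pred) ** 2

    def empirical_risk(g):                     # R_emp(g) = (1/N) * sum of L(yi, g(xi))
        return sum(loss(y, g(x)) for x, y in training) / len(training)

    # Hypothesis space: lines y = w * x for a handful of slopes w.
    G = {w: (lambda x, w=w: w * x) for w in (0.5, 1.0, 1.5, 2.0, 2.5)}

    best_w = min(G, key=lambda w: empirical_risk(G[w]))
    print(best_w, round(empirical_risk(G[best_w]), 3))   # w = 2.0 fits this data best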


Structured prediction: When the desired output value is a complex object, such as a parse tree or a labeled graph, then standard methods must be extended.

Learning to rank: When the input is a set of objects and the desired output is a ranking of those objects, then again the standard methods must be extended.

Unsupervised Learning: What if a neural network is given no feedback for its outputs, not even a real-valued reinforcement? Can the network learn anything useful? The unintuitive answer is yes.

                has hair?   has scales?   has feathers?   flies?   lives in water?   lays eggs?
    Dog             1            0              0            0            0               0
    Cat             1            0              0            0            0               0
    Bat             1            0              0            1            0               0
    Whale           1            0              0            0            1               0
    Canary          0            0              1            1            0               1
    Robin           0            0              1            1            0               1
    Ostrich         0            0              1            1            0               1
    Snake           0            1              0            0            0               1
    Lizard          0            1              0            0            0               1
    Alligator       0            1              0            0            1               1

Fig. Data for unsupervised learning

This form of learning is called unsupervised learning because no teacher is required. Given a set of input data, the network is allowed to play with it to try to discover regularities and relationships between the different parts of the input. Learning is often made possible through some notion of which features in the input set are important. But often we do not know in advance which features are important, and asking a learning system to deal with raw input data can be computationally expensive. Unsupervised learning can be used as a feature discovery module that precedes supervised learning. Consider the data in the figure: the group of ten animals, each described by its own set of features, breaks down naturally into three groups: mammals, reptiles, and birds. We would like to build a network that can learn which group a particular animal belongs to, and to generalize so that it can identify animals it has not yet seen. We could easily accomplish this with a six-input, three-output back-propagation network: we would simply present the network with an input, observe its output, and update its weights based on the errors it makes. Without a teacher, however, the error cannot be computed, so we must seek other methods. Our first problem is to ensure that only one of the three output units becomes active for any given input. One solution to this problem is to let the network settle, find the output unit with the

highest level of activation, and set that unit to 1 and all other output units to 0. In other words, the output unit with the highest activation is the only one we consider to be active. A more neural-like solution is to have the output units fight among themselves for control of an input vector.
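A simple winner-take-all scheme of this kind can be sketched directly on the animal data. The code below is illustrative only (the number of clusters, the learning rate, and the number of epochs are assumptions): each of three output units keeps a weight vector, the unit closest to an input wins, and the winner's weights are nudged toward that input.

    import random

    # Feature vectors from the table above:
    # (has hair, has scales, has feathers, flies, lives in water, lays eggs)
    animals = {
        "Dog": (1,0,0,0,0,0), "Cat": (1,0,0,0,0,0), "Bat": (1,0,0,1,0,0),
        "Whale": (1,0,0,0,1,0), "Canary": (0,0,1,1,0,1), "Robin": (0,0,1,1,0,1),
        "Ostrich": (0,0,1,1,0,1), "Snake": (0,1,0,0,0,1), "Lizard": (0,1,0,0,0,1),
        "Alligator": (0,1,0,0,1,1),
    }

    random.seed(1)
    K, LR, EPOCHS = 3, 0.3, 50
    # Three output units, each with a random weight vector over the six features.
    weights = [[random.random() for _ in range(6)] for _ in range(K)]

    def winner(x):
        """The unit whose weight vector is closest to x wins the competition."""
        return min(range(K), key=lambda k: sum((w - xi) ** 2 for w, xi in zip(weights[k], x)))

    for _ in range(EPOCHS):
        for x in animals.values():
            k = winner(x)                                   # winner-take-all
            weights[k] = [w + LR * (xi - w) for w, xi in zip(weights[k], x)]

    for name, x in animals.items():
        print(name, "-> cluster", winner(x))
    # With luck the three clusters correspond to the mammal, bird, and reptile groups;
    # a unit that never wins (a "dead unit") would leave only two usable clusters.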

Learning by Induction: What is "learning by induction"? Simply put, it is learning by watching. You watch what others do, then you do that. Below is a more formal explanation of inductive vs. deductive logic: In logic, we often refer to the two broad methods of reasoning as the deductive and inductive approaches. Deductive reasoning works from the more general to the more specific. Sometimes this is informally called a "top-down" approach. We might begin with thinking up a theory about our topic of interest. We then narrow that down into more specific hypotheses that we can test. We narrow down even further when we collect observations to address the hypotheses. This ultimately leads us to be able to test the hypotheses with specific data -- a confirmation (or not) of our original theories. Inductive reasoning works the other way, moving from specific observations to broader generalizations and theories. Informally, we sometimes call this a "bottom-up" approach. In inductive reasoning, we begin with specific observations and measures, begin to detect patterns and regularities, formulate some tentative hypotheses that we can explore, and finally end up


developing some general conclusions or theories. (Thanks to William M.K. Trochim for these definitions.) To translate this into an approach to learning a skill, deductive learning is someone TELLING you what to do, while inductive learning is someone SHOWING you what to do. Remember the saying "a picture is worth a thousand words"? That means that, in a given amount of time, a person can be SHOWN a thousand times more information than they could be TOLD. I can access a picture or pattern much more quickly than the equivalent description of that picture or pattern in words. Athletes often practice "visualization" before they undertake an action. But in order to visualize something, you need to have a picture in your head to visualize. How do you get those pictures in your head? By WATCHING. Who do you watch? Professionals. This is the key. Pay attention here. When you want to learn a skill:

WATCH PROFESSIONALS DO IT BEFORE YOU DO IT. DO NOT DO IT YOURSELF FIRST. Going out and doing a sport without having seen AND STUDIED professionals doing that sport is THE NUMBER ONE MISTAKE people make. They force themselves to play, their brain says "what do we do now?", another part of the brain looks for examples (pictures) of what to do, and, finding none, says "just do anything". So they try to generate behavior to

accomplish something within the rules of the sport. If they "keep score" and try to "win" and avoid "losing", the negative impact is multiplied tenfold. Yet this is EXACTLY what most people do and what most ARE TOLD to do! "Interested in tennis? Grab a racquet, join a league, get out there and have fun!" Then what happens? They have no training, they try to do what it takes to "win", and to do so, they manufacture awful strokes just TO BE ABLE to play (remember, they joined a league, so they have to keep score and win!). These awful strokes get ingrained by repetition, they produce terrible results, and they are very difficult to unlearn, so progress, despite lessons (mostly in the useless form of words), is slow or non-existent. Then they quit. When you finally pick up a racquet and go out to play, and your brain says "what do we do now?", your head will be filled with pictures of professionals perfectly doing what you are trying to do. You will not know how to do it incorrectly, because you have never seen it done incorrectly. You will try to do what they do, and you will almost immediately proceed to an advanced intermediate level. You will be a beginner for a short period of time, if at all, and improvement will be a matter of adding to and refining what you are doing, not stripping down and unlearning bad patterns. And since you are not keeping score, you focus purely on technique. If you hit one into the net, just pull another ball out of your pocket and do it again. No big deal, no drama, no guilt. Just hit another. When you feel you can hit all of your shots somewhat professionally, maybe you can actually play someone and keep score. You will love the

positive feedback of beating players who have been playing much longer than you have. You will wonder how they could have played for so long and still "play like that". Don't they know it's done "this way"? What professional does it "that way"? Don't they watch tennis on TV? Who does that? I just started and I know that's wrong. All these thoughts will make you feel like a genius. So how does all of this relate to chess? Simply put, play over the games of professional players and see how they play before you play anybody. Try to imitate them instead of trying to reinvent the wheel. Play over the games of lots of different players and then decide which one or two you like. The ones you like are the ones where you say, after playing over one of their games, "I would love to play a game like that!" Then just concentrate on those one or two players. Study and play the openings they play. Get books where they comment on their own games. Maybe they will say what they were thinking during the game. Try to play like them. During your games, think "What would he do in this position?" Personally, I like Morphy for his rapid development and attacks, Lachine for his creativeness in all positions, and Spassky for his ability to play all types of positions and create attacks in calm positions.

Learning Decision Tree: Decision tree learning, used in data mining and machine learning, uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target

value. More descriptive names for such tree models are classification trees or regression trees. In these tree structures, leaves represent classifications and branches represent conjunctions of features that lead to those classifications. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data but not decisions; rather, the resulting classification tree can be an input for decision making.

General

Decision tree learning is a common method used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. Each interior node corresponds to one of the input variables; there are edges to children for each of the possible values of that input variable. Each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf. A tree can be "learned" by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when the subset at a node all has the same value of the target variable, or when splitting no longer adds value to the predictions. Data come in records of the form

(x, Y) = (x1, x2, x3, ..., xk, Y)

The dependent variable, Y, is the target variable that we are trying to understand, classify, or generalize. The vector x is composed of the input variables x1, x2, x3, etc., that are used for that task.

Types:
Classification tree analysis is when the predicted outcome is the class to which the data belongs.
Regression tree analysis is when the predicted outcome can be considered a real number (e.g. the price of a house, or a patient's length of stay in a hospital).
Classification And Regression Tree (CART) analysis is used to refer to both of the above procedures.
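Recursive partitioning itself is short to sketch. The example below is illustrative only: the toy records and the naive choice of splitting attribute are assumptions, and a real learner would choose each split by information gain or a similar criterion.

    from collections import Counter

    # Split the records on one input variable at a time until every subset has a
    # single value of the target variable Y.  Records are (input dict, target); the
    # data are made up.
    records = [
        ({"outlook": "sunny", "windy": False}, "play"),
        ({"outlook": "sunny", "windy": True},  "stay"),
        ({"outlook": "rainy", "windy": True},  "stay"),
        ({"outlook": "rainy", "windy": False}, "stay"),
    ]

    def build_tree(rows, attributes):
        labels = [y for _, y in rows]
        if len(set(labels)) == 1 or not attributes:   # pure subset (or nothing left): leaf
            return Counter(labels).most_common(1)[0][0]
        attr = attributes[0]                          # naive choice of splitting attribute
        rest = attributes[1:]
        branches = {}
        for value in {x[attr] for x, _ in rows}:      # one child per attribute value
            subset = [(x, y) for x, y in rows if x[attr] == value]
            branches[value] = build_tree(subset, rest)
        return (attr, branches)

    tree = build_tree(records, ["outlook", "windy"])
    print(tree)
    # roughly: ('outlook', {'sunny': ('windy', {False: 'play', True: 'stay'}), 'rainy': 'stay'})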