Page 1:

Mediarobotics III: Collective Intelligence

a gentle introduction to data analysis for media arts practices

topics

I   applied statistics
II  applied machine learning (for collective intelligence)
III text processing
IV  genetic programming
V   neural networks

Script, version 1.4 fall 2010

Page 2:

BIG's proposal for the AUDI Urban Future Award 2010 – full of (imagined) machine learning…
http://www.archdaily.com/77103/bigs-proposal-for-the-audi-urban-future-award/

Page 3:

Machine Learning Introduction:

Overview - imagine this simple problem:

You have a microphone in a room and want to use the data from the microphone to find out whether there are people in the room or not. For simplicity's sake, let's assume you have only loudness (amplitude) and no frequency data. Here is a simulated data stream from the microphone:

[Figure: simulated amplitude-over-time stream from the microphone]

Machine learning can 'read' such a data stream for potential 'meaning'.

Page 4:

Machine Learning Introduction:

You might try to check for the most obvious feature first: peak values.

[Figure: amplitude-over-time plot; a pronounced peak is annotated "something happened here…", while two smaller bumps are annotated "but what about here and here…"]

Page 5:

Machine Learning Introduction:

Generalizing:

How do you differentiate the (two) states you seek (presence vs. non-presence) given the particularities of the input data?

Here, this translates into a classification problem that can be addressed by several methods.

Machine learning seeks to use existing data to guess the meaning of new data. Here, this would mean classifying several examples of data related to 'people in the room' and then letting the system search for similar patterns in new data, which would then again be labeled 'people in the room'. Ideally, such a system will check its assumptions periodically and change its rules.
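As a toy sketch of that loop (the amplitude windows, labels and the nearest-neighbour rule here are hypothetical illustrations, not the course's method), one can summarize each time window by a single feature, label a few example windows by hand, and classify new windows by their nearest labeled example:

def variance(window):
    # One loudness feature per time window of amplitude samples
    m = sum(window) / float(len(window))
    return sum((x - m) ** 2 for x in window) / len(window)

# Hand-labeled example windows (hypothetical microphone readings)
training = [([0.01, 0.02, 0.01, 0.02], 'empty'),
            ([0.02, 0.01, 0.02, 0.01], 'empty'),
            ([0.10, 0.60, 0.30, 0.50], 'people'),
            ([0.40, 0.20, 0.70, 0.30], 'people')]

def classify(window):
    # Label a new window by its nearest labeled example (1-nearest-neighbour)
    v = variance(window)
    return min(training, key=lambda ex: abs(variance(ex[0]) - v))[1]

print classify([0.05, 0.45, 0.25, 0.35])   # -> 'people'

Re-labeling windows and re-running this loop as new data arrives corresponds to the system periodically checking its assumptions.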

Page 6:

Machine learning:

Machine Learning (ML) attempts to build theoretical and practical frameworks for synthetic systems that improve with experience. In particular, ML attempts to define relationships between types of tasks, desired performance and the necessary experience.

ML grew out of the field of Artificial Intelligence (AI), but is less concerned with symbolic logic and more interested in interaction with the 'real' world, where a fixed algorithm might not be available. A focus of ML research is the production of models and patterns from data, as in improving performance over time based on input from sensors or queries of databases. For this reason, ML is closely related to data mining, inductive reasoning and pattern recognition.

Page 7:

Supervised Learning:

ML, as opposed to AI, bases its operations on data. This data, be it from sensors, animals or people, is used to train a synthetic system, often with a reward function, to act (or reason) like the source that produced the data. The diligent collection and expert-guided classification of data becomes central to the learning process. Because of this training with pre-classified data, this kind of machine learning is often referred to as supervised learning. After exposure to labeled training data, a computer is typically confronted with new, unlabeled data and responds to it based on the experience gained from the labeled data. It is often necessary to correct the computer when it reacts erroneously to the new data. By rewarding the system for correct responses one can improve the learning process on some types of problems. Systems that use reward functions (as opposed to labeled data) are referred to as reinforcement learning systems.

Typical problem domains of supervised learning are: classification and regression

Page 8:

Unsupervised Learning:

Altering behavior through training and reward constitutes learning in the synthetic system and is, in many ways, similar to the way human beings learn, although the details are very different. Importantly, humans have a vastly more complex - and sometimes altruistic - conception of reward than any machine. Also, human beings are able to learn to learn.

In ML, the ability to learn in unstructured data environments is called unsupervised learning. Here, computers typically learn to cluster data into different groups or patterns depending on the kinds of features that can be determined. There is, however, little knowledge or understanding on the part of the computer of what these features mean.

Page 9:

Details of implementing ML depend on the choice of method. The choice of method in turn depends on the application domain and the data collected from it. Statistical methods, neural networks of various topologies, probabilistic reasoning, fuzzy logic and case-based reasoning, often in combination, are common tools.

ML has produced fundamental statistical-computational theories of learning processes and has designed learning algorithms that are routinely used in commercial systems, from speech recognition to computer vision.

Current research trends in ML, as discussed by Tom Mitchell in "The Discipline of Machine Learning" (listed below), include synergies between ML and human learning, where social and cultural constraints carry agency in addition to the biological substructures of learning, and the question of never-ending learning that continuously and indefinitely improves performance and maybe begins to question its very premise over time.

Page 10:

Additional Introductory Texts:

Mitchell, T.: The Discipline of Machine Learning. Technical Report CMU-ML-06-108, Carnegie Mellon University, July 2006.

Bishop, C.: A New Framework for Machine Learning. In: Computational Intelligence: Research Frontiers, pp. 1-24, Springer-Verlag, Berlin, 2008.

Alpaydin, E.: Introduction to Machine Learning (Second Edition). MIT Press, 2010.

Page 11:

Methods of ML:

Unsupervised Learning -

Data Clustering: discovering and visualizing groups of related items

Self-Organizing Maps: neural nets with neighbourhood functions that produce low-dimensional views of high-dimensional data

Supervised Learning -

Neural nets: connectionist networks that store information in their nodes and weighted node connections.

Bayesian nets/filtering: probabilistic graphical models of the dependencies among random variables

Support vector machines: a set of classification methods that produce hyperplanes maximally separating data clusters

Page 12:

Methods of ML:

Unsupervised Learning -

Data Clustering:

Hierarchical Clustering:

This method builds up a hierarchy of groups by continuously merging the two most similar groups. Each group starts as a single item. In each iteration the method calculates the distance between every pair of groups, and the closest pair is merged into a new group.
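A minimal sketch of this merge loop, assuming numeric feature vectors and plain Euclidean distance (the version in the O'Reilly text uses a Pearson-based distance and keeps the full tree for drawing the dendrogram):

def average(cluster):
    # Component-wise mean of all vectors in a cluster
    n = float(len(cluster))
    return [sum(vals) / n for vals in zip(*cluster)]

def euclidean(v1, v2):
    return sum((a - b) ** 2 for a, b in zip(v1, v2)) ** 0.5

def hcluster(rows, distance=euclidean):
    # Every item starts as its own single-member group
    clusters = [[row] for row in rows]
    merges = []
    while len(clusters) > 1:
        # Find the pair of groups whose averages are closest
        lowest, pair = None, (0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(average(clusters[i]), average(clusters[j]))
                if lowest is None or d < lowest:
                    lowest, pair = d, (i, j)
        i, j = pair
        # Merge the closest pair into a new group
        merges.append((clusters[i], clusters[j], lowest))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges   # the merge history encodes the hierarchy (dendrogram)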

Page 13:

Methods of ML:

Unsupervised Learning -

Data Clustering:

Hierarchical Clustering:

After the clustering has been achieved, one typically visualizes the result as a tree-like structure, a dendrogram, as it retains the nodes and the node relationships:

This is computationally expensive (aka slow) because the relationship between every pair of items must be calculated and recalculated as the items are merged.

Page 14:

Methods of ML:

Unsupervised Learning -

Data Clustering:

K-means Clustering:

This method starts with k randomly placed centroids (the assumed centers of the clusters) and assigns each item to the nearest centroid. After that, the centroids are moved to the average location of all the items assigned to them, and the process is repeated until the centroids stop moving.

[Figure: items assigned to the nearest of two centroids, c1 and c2]

Page 15:

Methods of ML:

Unsupervised Learning -

Data Clustering:

K-means Clustering:


Page 16:

K-means Clustering:

import random

# 'pearson' is the pairwise distance function defined elsewhere in the
# text's clusters.py module
def kcluster(rows,distance=pearson,k=4):
  # Determine the minimum and maximum values for each point
  ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows]))
          for i in range(len(rows[0]))]

  # Create k randomly placed centroids
  clusters=[[random.random()*(ranges[i][1]-ranges[i][0])+ranges[i][0]
             for i in range(len(rows[0]))] for j in range(k)]

  lastmatches=None
  for t in range(100):
    print 'Iteration %d' % t
    bestmatches=[[] for i in range(k)]

    # Find which centroid is the closest for each row
    for j in range(len(rows)):
      row=rows[j]
      bestmatch=0
      for i in range(k):
        d=distance(clusters[i],row)
        if d<distance(clusters[bestmatch],row): bestmatch=i
      bestmatches[bestmatch].append(j)

    # If the results are the same as last time, this is complete
    if bestmatches==lastmatches: break
    lastmatches=bestmatches

    # Move the centroids to the average of their members
    for i in range(k):
      avgs=[0.0]*len(rows[0])
      if len(bestmatches[i])>0:
        for rowid in bestmatches[i]:
          for m in range(len(rows[rowid])):
            avgs[m]+=rows[rowid][m]
        for j in range(len(avgs)):
          avgs[j]/=len(bestmatches[i])
        clusters[i]=avgs

  return bestmatches

Page 17:

Methods of ML:

Unsupervised Learning -

Data Clustering:

2 Dimensional Scaling:

This method takes the differences between pairs of items and charts the items in accordance with the distances between them. Typically, the Pearson correlation is used to calculate the target distance between each pair of items, while the current distance in the two-dimensional chart is the Euclidean distance, computed as the square root of the sum of squared coordinate differences.

Page 18:

Methods of ML:

Unsupervised Learning -

Data Clustering:

2 Dimensional Scaling:

For each pair of items, the target distance is compared with the current distance and the error (difference) is calculated. On each iteration of the algorithm, each item is nudged a bit, in proportion to the error between it and the other items. Each node is moved according to the combination of all the other nodes pushing and pulling on it. With every iteration the overall discrepancy between the current and target distances decreases. The algorithm terminates when this discrepancy no longer changes.

[Figure: pairwise node distances in the target state vs. the current state, and the resulting change applied to each node]
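A minimal sketch of that push/pull loop (assuming a precomputed matrix of target distances, e.g. from Pearson scores; the text's scaledown function, used on the next page, follows the same pattern):

import random, math

def scale2d(targetdist, iterations=1000, rate=0.01):
    n = len(targetdist)
    # Start all items at random 2D positions
    loc = [[random.random(), random.random()] for i in range(n)]
    for t in range(iterations):
        grad = [[0.0, 0.0] for i in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j: continue
                # Current distance in the 2D chart
                d = math.sqrt(sum((loc[i][k] - loc[j][k]) ** 2
                              for k in range(2))) or 1e-9
                # Proportional error against the target distance
                err = (d - targetdist[i][j]) / d
                # Accumulate the push/pull of node j on node i
                for k in range(2):
                    grad[i][k] += (loc[i][k] - loc[j][k]) * err
        # Nudge every node a bit against its accumulated error
        for i in range(n):
            for k in range(2):
                loc[i][k] -= rate * grad[i][k]
    return loc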

Page 19:

blognames,words,data=clusters.readfile('blogdata.txt')
coords=clusters.scaledown(data)
clusters.draw2d(coords,blognames,jpeg='blogs2d.jpg')

Page 20:

Methods of ML:

Supervised Learning -

Neural Networks – see special class notes for details

Application: getting URLs from search engine text

[Figure: neural network topology and connections for relating URLs to typed search terms]

[Figure: trained network showing the relationship between two words and the possible URLs]

Page 21:

Methods of ML:

Supervised Learning -

Neural Networks – see special class notes for details

Application: getting URLs from search engine text

The power of the back propagation neural network resides in the fact that it can make reasonable guesses about results for queries it has never seen before, based on their similarity to other queries (in other words: it can generalize).

The example in the O’Reilly text has another interesting practical feature: one can load a subset of the data into the network and train on it alone (as opposed to using all the data all the time). Furthermore, the code example derives the appropriate number of hidden nodes from the user input data.

Page 22:

Topics in ML:

Optimization

Optimization techniques are typically used in problems that have many possible solutions across many variables, and that have outcomes that can change greatly depending on the combinations of these variables.

Optimization finds the best solution to a problem by trying many different solutions and scoring them to determine their quality. Optimization is typically used in cases where there are too many possible solutions to try them all.

There are many methods available for Optimization, including some that are common to machine learning (such as neural nets and genetic algorithms).

ST5-4W-03 antenna design optimized by a genetic algorithm.

http://ti.arc.nasa.gov/projects/esg/research/antenna.htm

Page 23:

Topics in ML:

Cost Function:

The cost function is the key to solving any problem using optimization, and it’s usually the most difficult thing to determine. The goal of any optimization algorithm is to find a set of inputs that minimizes the cost function, so the cost function has to return a value that represents how bad a solution is.

There is no particular scale for badness; the only requirement is that the function returns larger values for worse solutions. Often it is difficult to determine what makes a solution good or bad across many variables.

Any time one is faced with finding the best solution to a complicated problem, one needs to decide which factors are important. After choosing variables that represent those factors and impose costs, one needs to determine how to combine them into a single number.

Cost functions assume that all things that matter (to the evaluation) can be represented numerically. This is not always the case, particularly in cultural uses of information. Caveat.
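As a toy illustration (the factors and weights here are hypothetical stand-ins, not an example from the text), a cost function reduces a candidate solution to one 'badness' number:

def cost(solution):
    # solution is a list of numeric choices; lower cost = better
    price  = sum(solution)                    # factor 1: total price
    spread = max(solution) - min(solution)    # factor 2: imbalance between choices
    # Combine the weighted factors into a single number
    return 1.0 * price + 2.0 * spread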

Page 24:

Topics in ML:

Random Search

Random searching isn’t a very good optimization method, but it makes it easy to understand exactly what all the algorithms are trying to do, and it also serves as a baseline so one can see if the other algorithms are performing well. However, randomly trying different solutions is very inefficient because it does not take advantage of the good solutions that have already been discovered.
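A minimal sketch of random search, reusing the toy cost function above; domain lists the allowed (low, high) range of each variable:

import random

def randomoptimize(domain, costf, guesses=1000):
    best, bestcost = None, None
    for t in range(guesses):
        # Try a completely random solution...
        sol = [random.randint(lo, hi) for (lo, hi) in domain]
        # ...and keep it only if it beats everything seen so far
        c = costf(sol)
        if bestcost is None or c < bestcost:
            best, bestcost = sol, c
    return best

# e.g. randomoptimize([(0, 9)] * 4, cost)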

Hill Climbing

An alternative to random searching is called hill climbing. Hill climbing starts with a random solution and looks at the set of neighboring solutions for those that are better (that is, have a lower cost).

[Figure: schematic of the hill climbing approach]
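A sketch of that loop over the same kind of integer domain as the random search above (reusing its random import and cost function); here a neighbour differs by one step in a single variable:

def hillclimb(domain, costf):
    # Start from a random solution
    sol = [random.randint(lo, hi) for (lo, hi) in domain]
    while True:
        # Build all neighbours one step away in one variable
        neighbours = []
        for i in range(len(domain)):
            if sol[i] > domain[i][0]:
                neighbours.append(sol[:i] + [sol[i] - 1] + sol[i + 1:])
            if sol[i] < domain[i][1]:
                neighbours.append(sol[:i] + [sol[i] + 1] + sol[i + 1:])
        # Move to the best neighbour; stop when none improves (a minimum)
        best = min(neighbours, key=costf)
        if costf(best) >= costf(sol):
            return sol
        sol = best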

Page 25:

Topics in ML:

Hill Climbing, continued

There is one major drawback to hill climbing: moving down the slope will not necessarily lead to the best solution overall. The final solution may be a local minimum, a solution better than those around it but not the best overall. The best overall is called the global minimum, which is what optimization algorithms are ultimately supposed to find.

One approach to this dilemma is called random restart hill climbing where the hill climbing algorithm is run several times with random starting points in the hope that one of them will be close to the global minimum.

[Figure: schematic of local vs. global minima]

Page 26:

Topics in ML:

Simulated Annealing

Simulated annealing is an optimization method inspired by physics (thermodynamics in particular). Annealing is the process of heating up an alloy and then cooling it down slowly. Because the atoms are first made to jump around a lot and then gradually settle down, they can find a low-energy configuration.

The algorithmic version of annealing begins with a random solution to the problem. It uses a variable representing the temperature, which starts very high and gradually gets lower. In each iteration, one of the numbers in the solution is randomly chosen and changed in a certain direction. At the onset of the algorithm, the temperature variable has a strong influence on the overall behaviour. Over time, as the temperature 'cools' down, its influence is reduced.

Page 27:

Topics in ML:

Simulated Annealing, continued

As opposed to the hill climbing approach where the new solution must always be better (lower) than the current one, simulated annealing allows a new solution with higher cost to be included with a certain probability. This reduces the risk of falling into a local minimum trap. Sometimes it is necessary to move to a (temporarily) worse solution in order to move closer to the overall best solution. Here is an example of how this can be formulated:

p = e^(-(highcost - lowcost) / temperature)

which is a simplified form of the Boltzmann distribution:

p = e^(-(E_{i+1} - E_i) / (kT))

Since the temperature (the willingness to accept a worse solution) starts very high, the exponent will initially be close to 0, so the probability will be almost 1. As the temperature decreases, the difference between the high cost and the low cost becomes more important: a bigger difference leads to a lower probability, so the algorithm will favor only slightly worse solutions over much worse ones.

Page 28:

Topics in ML:

Simulated Annealing, continued

In order to apply the Simulated Annealing method to a specific problem, one must specify the following parameters:

- the energy (goal) function E()
- the candidate generator procedure neighbour()
- the acceptance probability function P()
- the annealing schedule temp()

These choices can have a significant impact on the method's effectiveness. Unfortunately, there are no choices of these parameters that will be good for all problems. Experimenting with variations of the general approach, including variations of the temperature decay (the annealing schedule) as well as the option to restart the process, is required.
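Putting the four pieces together, a minimal sketch (with the hypothetical integer domain and cost function from the earlier optimization sketches; the starting temperature, cooling rate and step size are arbitrary choices one would tune):

import random, math

def annealingoptimize(domain, costf, T=10000.0, cool=0.95, step=1):
    # Random starting solution; E() is simply the cost function costf
    sol = [random.randint(lo, hi) for (lo, hi) in domain]
    while T > 0.1:
        # neighbour(): change one randomly chosen variable by one step
        i = random.randint(0, len(domain) - 1)
        new = sol[:]
        new[i] = max(domain[i][0], min(domain[i][1],
                     new[i] + random.choice([-step, step])))

        ea, eb = costf(sol), costf(new)
        # P(): always accept improvements; accept worse solutions with
        # probability e^(-(eb - ea) / T), which shrinks as T cools
        if eb < ea or random.random() < math.exp(-(eb - ea) / T):
            sol = new

        # temp(): geometric annealing schedule
        T = T * cool
    return sol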

Check this visualization to see an example:
http://www.heatonresearch.com/articles/64/page1.html

Page 29:

Topics in ML:

Newest developments 2010: Never Ending Learning (NELL)

Goal: To build a never-ending machine learning system that acquires the ability to extract structured information from unstructured web pages.

The inputs to NELL include (1) an initial ontology defining hundreds of categories and relations that NELL is expected to read about, and (2) 10 to 15 seed examples of each category and relation.

Given these inputs, plus a collection of 500 million web pages, NELL runs 24 hours per day, continuously, to perform two ongoing tasks:

- Extract new instances of categories and relations.
- Learn to read better than yesterday.

NELL uses a variety of methods to extract beliefs from the web. These are retrained using the growing knowledge base as a self-supervised collection of training examples.

Further reading: http://rtw.ml.cmu.edu/rtw/publications

Page 30:

Additional notes

The following notes are supplements to the text "Programming Collective Intelligence" (O'Reilly).

Chapter 2 discusses ways of finding similarities between data collected from people. This is a hard problem, because people's preferences are often difficult to determine and even more difficult to compare.

In order to make the comparison of preferences amenable to computation, one has to map them onto numeric values – in some meaningful way.

Machine learning often uses the idea of a similarity score to compare values for degrees of 'sameness'.

Page 31:

But first some comments on pythonic data types that are important for the programming examples we will be using:

Tuple:

The tuple is a sequence datatype (as are strings, unicode strings, lists and buffers). A tuple consists of a number of values separated by commas. On output, tuples are always enclosed in parentheses so that nested tuples are interpreted correctly; on input they may be written with or without surrounding parentheses, although parentheses are often necessary anyway (if the tuple is part of a larger expression).

>>> t = 12345, 54321, 'hello!'
>>> t[0]
12345
>>> t
(12345, 54321, 'hello!')
>>> # Tuples may be nested:
... u = t, (1, 2, 3, 4, 5)
>>> u
((12345, 54321, 'hello!'), (1, 2, 3, 4, 5))

Page 32:

List:

Python knows a number of compound data types, used to group together other values. The most versatile is the list, which can be written as a list of comma-separated values (items) between square brackets. List items need not all have the same type. Unlike strings, the elements inside a list can change. Here are some examples of working with the list type:

>>> a = [66.25, 333, 333, 1, 1234.5]
>>> print a.count(333), a.count(66.25), a.count('x')
2 1 0
>>> a.insert(2, -1)
>>> a.append(333)
>>> a
[66.25, 333, -1, 333, 1, 1234.5, 333]
>>> a.index(333)
1
>>> a.remove(333)
>>> a
[66.25, -1, 333, 1, 1234.5, 333]
>>> a.reverse()
>>> a
[333, 1234.5, 1, 333, -1, 66.25]
>>> a.sort()
>>> a
[-1, 1, 66.25, 333, 333, 1234.5]
>>> a.pop(0)
-1
>>> a
[1, 66.25, 333, 333, 1234.5]

Page 33:

List comprehensions:

List comprehensions provide a concise way to create lists. Each list comprehension consists of an expression followed by a for clause, then zero or more for or if clauses. The result will be a list resulting from evaluating the expression in the context of the for and if clauses which follow it. If the expression would evaluate to a tuple, it must be parenthesized.

>>> vec = [2, 4, 6]
>>> [3*x for x in vec]
[6, 12, 18]
>>> [3*x for x in vec if x > 3]
[12, 18]
>>> [3*x for x in vec if x < 2]
[]
>>> [[x, x**2] for x in vec]
[[2, 4], [4, 16], [6, 36]]

Page 34:

Set:

A set is an unordered collection with no duplicate elements. Basic uses include membership testing and eliminating duplicate entries. Set objects also support mathematical operations like union, intersection, difference, and symmetric difference.

>>> basket = ['apple', 'orange', 'apple', 'pear', 'orange', 'banana']
>>> fruit = set(basket)            # create a set without duplicates
>>> fruit
set(['orange', 'pear', 'apple', 'banana'])
>>> 'orange' in fruit              # fast membership testing
True
>>> 'crabgrass' in fruit
False

Page 35:

Dictionary:

Dictionaries are Python's version of "associative memories" or "associative arrays". Unlike sequences, which are indexed by a range of numbers, dictionaries are indexed by keys, which can be any immutable type; strings and numbers can always be keys. Tuples can be used as keys if they contain only strings, numbers, or tuples.

A dictionary can be interpreted as an unordered set of key: value pairs, with the requirement that the keys are unique (within one dictionary). A pair of braces creates an empty dictionary: {}. Placing a comma-separated list of key:value pairs within the braces adds initial key:value pairs to the dictionary; this is also the way dictionaries are written on output.

The main operations on a dictionary are storing a value with some key and extracting the value given the key. Here is an example:

>>> tel = {'jack': 4098, 'sape': 4139}
>>> tel['guido'] = 4127
>>> del tel['sape']
>>> tel['irv'] = 4127
>>> tel
{'guido': 4127, 'irv': 4127, 'jack': 4098}
>>> tel.keys()
['guido', 'irv', 'jack']
>>> tel.has_key('guido')
True

Page 36:

Unsupervised Learning -

Data Clustering: discovering and visualizing groups of related items.

In order to start, one usually prepares the data by building a data structure containing representative or desirable features.

Here is an example of blogs and frequency of word usage:

[Table: blogs and their per-word usage counts; Collective Intelligence, p. 30]

By clustering blogs based on word frequencies, it might be possible to determine if there are groups of blogs that frequently write about similar subjects or write in similar styles.

Page 37:

Unsupervised Learning -

Data Clustering:

Word Counts:

Just count the occurrences of each distinct word. The counting is easy, but one has to pre-process the text stream (remove white space, HTML tags, etc.). Typically one parses the stream into parts using regular expressions. Regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters. Python has several powerful regular expression functions that can be combined to get individual words from a stream of text:

import re

def getwords(html):
  # Remove all the HTML tags
  txt=re.compile(r'<[^>]+>').sub('',html)

  # Split words by all non-alpha characters
  words=re.compile(r'[^A-Z^a-z]+').split(txt)

  # Convert to lowercase
  return [word.lower() for word in words if word!='']

Page 38:

Unsupervised Learning -

Data Clustering:

Word Counts:

import feedparser

# Returns title and dictionary of word counts for an RSS feed
def getwordcounts(url):
  # Parse the feed
  d=feedparser.parse(url)
  wc={}

  # Loop over all the entries
  for e in d.entries:
    if 'summary' in e: summary=e.summary
    else: summary=e.description

    # Extract a list of words
    words=getwords(e.title+' '+summary)
    for word in words:
      wc.setdefault(word,0)
      wc[word]+=1

  return d.feed.title,wc

Page 39:

Similarity scores

Euclidean Distance

This measure is simply the Euclidean geometric distance between two data points, based on appropriately chosen coordinate values. For example, the Euclidean distance between two points A(xa, ya) and B(xb, yb) can be expressed as:

d = sqrt((xa - xb)^2 + (ya - yb)^2)

This metric delivers a result that is smaller for higher degrees of similarity. Dividing one by it inverts this (higher similarity -> larger value):

1 / sqrt((xa - xb)^2 + (ya - yb)^2)

A small change prevents numeric singularities (division by zero) and scales the results to the range (0, 1]:

1 / (1 + sqrt((xa - xb)^2 + (ya - yb)^2))

Page 40:

Similarity scores

Euclidean Distance (page 11 of the text has a function that implements this):

from math import sqrt

def sim_distance(prefs,person1,person2):
  # Get the list of shared_items
  si={}
  for item in prefs[person1]:
    if item in prefs[person2]: si[item]=1

  # If they have no ratings in common, return 0
  if len(si)==0: return 0

  # Add up the squares of all the differences
  sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2)
                      for item in prefs[person1] if item in prefs[person2]])

  return 1/(1+sum_of_squares)

Page 41:

Similarity scores

Pearson Correlation Score

The linear Pearson correlation score indicates how well two sets of data fit on a straight line. Pearson correlation values range from -1 to 1, where 1 means a perfect positive correlation (identical rankings), 0 means no correlation between the data sets, and -1 a perfect negative correlation. As opposed to the Euclidean distance, the Pearson correlation corrects for offsets (biases).

One common formula for the Pearson Correlation is:

r = (Σ xi·yi - (Σ xi)(Σ yi) / N) / d

where

d = sqrt( (Σ xi² - (Σ xi)² / N) · (Σ yi² - (Σ yi)² / N) )

Page 42:

Similarity scores

Pearson Correlation (page 13 of the text has a function that implements this):

from math import sqrt

def sim_pearson(prefs,p1,p2):
  # Get the list of mutually rated items
  si={}
  for item in prefs[p1]:
    if item in prefs[p2]: si[item]=1

  # Find the number of elements
  n=len(si)

  # If there are no ratings in common, return -1 for pearson
  if n==0: return -1   # not 0...

  # Add up all the preferences
  sum1=sum([prefs[p1][it] for it in si])
  sum2=sum([prefs[p2][it] for it in si])

  # Sum up the squares
  sum1Sq=sum([pow(prefs[p1][it],2) for it in si])
  sum2Sq=sum([pow(prefs[p2][it],2) for it in si])

  # Sum up the products
  pSum=sum([prefs[p1][it]*prefs[p2][it] for it in si])

  # Calculate the Pearson score
  num=pSum-(sum1*sum2/n)
  den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))
  if den==0: return 0

  r=num/den
  return r

Page 43:

From Similarity to Recommendation

As mentioned above, recommendations are HARD. Try making a good recommendation yourself and you will see why.

However, one can approximate a recommendation in a number of ways. The text shows one possibility on page 15: multiplying the similarity between two people with the score one of them gave an item:

Recommendation_AB = similarity(A, B) * score(X, A)

Page 16 of the text has a Python implementation of this recommendation approximator.

This general method can also be applied to find similarities between, say, products, as suggested on page 18 of the text.
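A minimal sketch of that weighted combination, following the shape of the text's getRecommendations function (prefs is the nested {person: {item: score}} dictionary used by the similarity functions above):

def getrecommendations(prefs, person, similarity=sim_pearson):
    totals, simsums = {}, {}
    for other in prefs:
        if other == person: continue
        sim = similarity(prefs, person, other)
        # Ignore people with zero or negative similarity
        if sim <= 0: continue
        for item in prefs[other]:
            # Only score items this person has not rated yet
            if item not in prefs[person]:
                # similarity(A,B) * score(X,A), accumulated over all others
                totals.setdefault(item, 0)
                totals[item] += prefs[other][item] * sim
                simsums.setdefault(item, 0)
                simsums[item] += sim
    # Divide by the summed similarities so heavily rated items don't dominate
    rankings = [(total / simsums[item], item) for item, total in totals.items()]
    rankings.sort()
    rankings.reverse()
    return rankings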