MACHINE LEARNING WORKING TOWARDS A BETTER FUTURE

Ataulla Fareed Pasha, System Engineer Analyst, Dell EMC, ataulla.pasha@dell.com


2016 EMC Proven Professional Knowledge Sharing

Table of Contents

Introduction
What is Machine Learning? Is it the same as Artificial Intelligence?
Regression
    Residual sum of squares
    Determining which regression model to use
Classifiers
    Linear Classifier
    Evaluating a classifier and classifier error
    Types of Errors
    Calculating the accuracy when more than 2 classes exist
    Bias of Machine learning
    Probabilities
Document Retrieval and Clustering
    Flaw in comparing similar word counts
    Normalizing Vector
    Term frequency and Inverse document frequency (TF and IDF)
    Supervised Learning
    Unsupervised Learning
Deep Learning
    Image recognition
Conclusion
Glossary
References

Disclaimer: The views, processes or methodologies published in this article are those of the authors.

They do not necessarily reflect Dell EMC’s views, processes or methodologies.

Dell, EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries.


Introduction

"Digital technologies and the data they capture present an opportunity for organizations of all sizes to

drive business innovation and new levels of productivity," said Helena Schwenk, research manager, IDC

Europe Big Data and Analytics.

Many organizations are embracing advanced analytics as they start to realize the vast opportunities it brings. Advanced analytics can save lives, detect fraud, filter spam, and bring in more profits. Even a small percentage of improvement in a large organization has a huge impact.

In 2014, advanced analytics was the fastest growing analytics segment (12.4%) as organizations started to move from diagnostic analytics to predictive and prescriptive analytics. This growth was driven by increased interest in big data, predictive analytics, and machine learning.

Advanced analytics goes well beyond basic analytics, which only tells us whether there were losses or profits. Advanced analytics works through the available data to identify root causes and predict outcomes and behaviors. Various methods are used to perform advanced analytics, such as optimization, machine learning, and text, speech, and image analytics.

This article explores machine learning (ML), some of the most widely used ML techniques, and their practical uses.

What if you are unsure of the selling price of your house? What if you are not sure which movie to watch or which book to read in your leisure time? Do you want to identify a person from a blurred image? You see so many restaurants in town that serve your favorite food and have no idea which to choose. To make the decision-making process simple and convenient, we have machine learning!

What is Machine Learning? Is it the same as Artificial Intelligence?

The goal of Artificial Intelligence (AI) is to create a machine that can think like a human. To do this, a machine needs the ability to learn, apply reason, use abstract thinking, and so on. Machine learning, however, is about writing software that learns from past experience. Machine learning is more closely related to data mining and statistics than to AI.

Tom Mitchell of Carnegie Mellon University defines machine learning as: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

There are many machine learning techniques out there. We will go in-depth into some of the most common: regression, classifiers, clustering, and deep learning.


Regression

Suppose we have a house for sale and want to know its best selling price. How do we go about it? We don't want to underestimate the property's value, nor do we want to set a price at which people will not find it worth buying. We could make guesses that might be right or wrong. Fortunately, machine learning techniques can help us here.

To predict the house's value, let's start by going through a list of houses that were sold over some time span in a given geographical area. Many features contribute to the total cost of a house: square footage, number of rooms, number of bathrooms, number of bedrooms, number of kitchens, and so on.

Let's plot the houses sold in the past on a graph, with the size of the house on the X axis and the cost on the Y axis. Each blue dot on the graph is a house that was sold.

Figure 1

We will now represent the prediction as a line as shown in Figure 2.

Figure 2


This prediction can be represented as a function:

f(x) = W0 + W1(x)

where W0 is the intercept (the point at which the prediction line crosses the Y axis) and W1 is the slope, or regression coefficient: how much impact varying the square footage of the house has on its sale price.

The prediction line (green) represents the sale price predicted by the system based on the square footage of the house.

The graph in Figure 2 shows a prediction line, but how do we know which of all the possible lines is the best? Refer to Figure 3, which shows various prediction lines.

Figure 3

To determine the best prediction line, we will use the Residual Sum of Squares (RSS).

Residual sum of squares

Let's consider the prediction line from the graph in Figure 2. Now we plot a line from the actual cost of each house to the price predicted by our model. The line drawn (orange) represents the miss in our prediction, i.e. how far it is from the actual cost of the house.

Figure 4


RSS = (actual price of house 1 - [W0 + W1(sq. ft. of house 1)])² + (actual price of house 2 - [W0 + W1(sq. ft. of house 2)])² + (actual price of house 3 - [W0 + W1(sq. ft. of house 3)])² + ... + (actual price of house n - [W0 + W1(sq. ft. of house n)])²

RSS is the sum of the squared misses between the predicted and actual cost of each house. By computing the residual sum of squares for all the candidate prediction lines, we can choose the one with the least RSS as the one that makes the least prediction error.
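A minimal sketch of this pick-the-lowest-RSS idea, where the house data and the candidate lines are hypothetical, not from the article:

```python
# Hypothetical (square feet, sale price) pairs for houses already sold.
houses = [(1000, 200000), (1500, 270000), (2000, 330000), (2500, 410000)]

def rss(w0, w1, data):
    """Residual sum of squares for the line f(x) = w0 + w1 * x."""
    return sum((price - (w0 + w1 * sqft)) ** 2 for sqft, price in data)

# Try a few candidate (intercept, slope) lines and keep the lowest-RSS one.
candidates = [(50000, 100), (20000, 150), (0, 170)]
best_line = min(candidates, key=lambda w: rss(w[0], w[1], houses))
```

In practice the best W0 and W1 are found analytically or by optimization rather than by trying a fixed list, but the selection criterion is the same.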

Hence, we now have our prediction model ready by using the linear function.

Figure 5

But what if we get a better prediction using a quadratic function instead of linear function?

Figure 6

Where f(x) = W0 + W1(x) + W2(x)²

By using a quadratic function, our model seems to make fewer errors than the linear model. What if we use an even more complex representation, such as a 13th-order polynomial? It would lead to a prediction curve that matches the observed house sale prices even more closely.


Figure 7

But per this model, the predicted price of the house we want to sell is very low. We can guess that our sale price should be much higher than the value predicted by the 13th-order polynomial. So which model do we use to get the correct prediction? This is the beauty of machine learning: the system itself needs to decide which model is accurate and which is not.

Determining which regression model to use

Let's consider a set of houses that we know were sold and the prices at which they were sold. From this set we will remove a few houses and use the remainder to determine which regression model is best suited.

H1, H2, H3, ..., H25 (houses whose sale prices we know)

H5, H6, H7, H9, and H11 are removed from the set, leaving H1, H2, H3, H4, H8, H10, H12, H13, ..., H25 (call this the training set).

H5, H6, H7, H9, and H11, whose prices we also know, form the test set.

We determine the best regression model using the training set, then use the houses in the test set to see how far the predicted prices are from the actual sale prices. As explained earlier, on the training set we use the RSS method to find the lowest-error fit for each candidate: the linear model, the quadratic model, and more complex models like the 13th-order polynomial. Next, using the test set, we determine which model gives the more accurate predictions for houses it has not seen. This tells us which model is the best fit in our case.

Now, let's plot a graph of the error made in predicting house prices against the complexity of the model.


Figure 8

We see that training error falls as the order of the function increases: higher-order functions represent the training house prices more closely. Now we take the houses in the test set and check how close their predicted prices are to their actual sale prices.

Figure 9

We see that the prediction error on the test houses fell as the complexity of the function increased, until the error slowly started to rise again. This shows that a higher-order polynomial does not always predict sale prices better; beyond a point the model overfits the training data. Which function to use has to be determined from the available data in our machine learning code.

This explanation of predicting house sale prices via machine learning was based only on the square footage of the house; more parameters could influence the cost, such as the number of bedrooms, number of bathrooms, type of furnishing, and so on. As more features are added, the model complexity starts to increase.

This gives an idea of how machine learning code helps us make house price predictions. The same approach applies not only to house sale prices but to other fields, such as predicting stock prices (based on recent price history, news events, etc.) or predicting the number of retweets a person can get (based on the total number of followers, followers of followers, hashtag popularity, past retweets, etc.).


Classifiers

We saw how machine learning helps predict house prices. In other areas of life, regression may not be suitable for making predictions. Classifiers are another powerful method used to make predictions in fields like restaurant reviews, medical diagnosis, and spam filtering, and are perhaps among the most commonly used techniques in machine learning.

For example, when a patient undergoes a health checkup, the doctor uses symptoms, test reports, and past records as inputs to diagnose whether the patient is healthy or suffers from a specific health condition. Given the same inputs, machine learning can be used to make the same personalized health predictions.

Figure 10

Spam filtering is another field where machine learning is used extensively. Today's spam filters do a tremendously good job of identifying and removing spam from our inboxes. In earlier days, spam filters were not so effective. Spammers kept getting smarter and used many techniques to beat detection, such as switching mail IDs or using numbers instead of letters. Thanks to machine learning, spam filters now filter spam much more effectively by analyzing the content of the email, email IDs, IP addresses, and many other parameters. Life is more peaceful with the spam in the spam folder.

To understand classifiers better, we will use this method on a restaurant review system to help us pick a good place for some tasty pancakes.

Going by the reviews for a given restaurant, we notice a mixed set: some negative, some positive, and some neutral. Here are some possible reviews for a given restaurant:

"I don't think I have ever eaten better pancakes anywhere else. Service is quick"

"My wife tried the strawberry cake and it was pretty forgettable"

"Pancakes were great but ambience was dull"

Since we are only interested in having pancakes, we can ignore reviews of other food items. Going by the sample reviews, two were positive about pancakes. This helps us identify this particular restaurant as a good place to visit, provided we only care about the pancakes.


Thus, our machine learning algorithm should be able to tell us whether or not to go to a given restaurant by means of classification. We can lay it out as:

Figure 11

Linear Classifier

How does this work? How will our system recognize a positive and a negative review? We list a set of words that would mark a review as good (great, awesome, good, wonderful, etc.) and a set of words that would mark it as bad (pathetic, bad, awful, etc.). We take a given review, break the sentence into words, and look for positive and negative words. We count the total positive words and total negative words; if the positive words outnumber the negative words, we classify the review as positive.
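The word-counting rule above can be sketched as follows; the word lists are illustrative, not an exhaustive lexicon:

```python
# Tiny illustrative word lists for the counting classifier.
positive_words = {"great", "awesome", "good", "wonderful", "quick"}
negative_words = {"pathetic", "bad", "awful", "dull", "terrible"}

def classify(review):
    """Label a review positive if it has more positive than negative words."""
    words = review.lower().replace(".", " ").replace(",", " ").split()
    pos = sum(w in positive_words for w in words)
    neg = sum(w in negative_words for w in words)
    return "positive" if pos > neg else "negative"
```

A review such as "Service is quick" would come out positive, while one containing more negative than positive words would come out negative.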

Figure 12

If we consider only the pancake experience, we know it is positive on average. This type of classification is fine, but in reality different words carry different weights; for example, "great" is stronger than "good" and "awesome" is stronger than "fine". We therefore give these words weights and calculate the overall weight of the sentence.

Let's consider the five words in our scope of analysis, each of which has been given a certain weight.


Let's say a review reads:

"Pancakes were great, the food was awesome, but the service was terrible"

Total score = 1.5 + 1.7 - 2.1 = 1.1

Overall, this is a good review.

Hence, by knowing the weight of each word, we know whether the review is good or bad; this is linear classification.
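The weighted scoring can be sketched as follows; the weights for "great", "awesome", and "terrible" come from the example above, while the others are illustrative:

```python
# Per-word weights; unknown words contribute nothing to the score.
weights = {"great": 1.5, "awesome": 1.7, "terrible": -2.1,
           "good": 1.0, "awful": -3.3}

def sentence_score(review):
    """Sum the weights of known words; unweighted words count as 0."""
    return sum(weights.get(w, 0.0) for w in review.lower().split())

total = sentence_score("pancakes were great the food was awesome "
                       "but the service was terrible")   # 1.5 + 1.7 - 2.1
verdict = "good review" if total > 0 else "bad review"
```

A positive total classifies the review as good; a negative total classifies it as bad.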

Evaluating a classifier and classifier error

We have developed our machine learning program using the classifier model and want to understand how accurate it is. If it is inaccurate, we want the program to adjust and modify its system.

We have a given set of sentences (sentence 1, sentence 2, ..., sentence N), and we know whether the review in each sentence is positive or negative. From this data set we remove some sentences and call them the test set; the remaining sentences form the training set.

Figure 13

Just as in the regression model, we have split our available data into two sets, a training set and a test set. The training set is used to build our classifier, and the test set is used to evaluate it. We also know whether each sentence's review is positive or negative; this is called the actual label. We hide the actual labels of the test set. Using the training set, our model learns to classify sentences as positive or negative, building up weights for the words, for example good = 1.0, awesome = 1.7, bad = -1.0, awful = -3.3, and so on.

Now we run the test set sentences through our classifier and check what the model predicts, comparing each prediction with the actual label. If the prediction matches the actual label, the review was predicted correctly; if not, it is a wrong prediction. We count the total number of sentences reviewed correctly and incorrectly. From this we can tell the


error rate of our classifier:

Error rate = (number of incorrect predictions) / (total number of predictions)

Similarly, accuracy is calculated as:

Accuracy = (number of correct predictions) / (total number of predictions)
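These two quantities can be computed directly from the predicted and actual labels; the labels below are toy data for illustration:

```python
# Toy predicted vs. actual labels for five test sentences.
predicted = ["pos", "neg", "pos", "pos", "neg"]
actual    = ["pos", "neg", "neg", "pos", "neg"]

# Accuracy = correct / total; error rate is its complement.
correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = correct / len(actual)
error_rate = 1 - accuracy
```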

Types of Errors

In our restaurant review case we have either a positive or a negative prediction (two expected classes). With random guessing over two classes, we could achieve 50% accuracy, so our machine learning classifier should exceed at least 50% accuracy. If there are more than two classes, then for K classes random guessing is correct 1/K of the time, and our classifier must do better than 1/K to be considered a reliable system.

Let's look at the types of errors we could make and represent them as a matrix called a confusion matrix.

Figure 14

Suppose our system deals with only two classes (positive or negative). If we predict positive and the actual label is positive, we have a True Positive; if we predict negative and the actual label is negative, we have a True Negative. If our system predicts negative but the actual label is positive, we have a False Negative (FN); and if it predicts positive but the actual label is negative, we have a False Positive (FP).

Let's apply these to two applications, spam filters and medical diagnosis, and see the impact of the errors.


In the case of spam filters (treating legitimate email as the positive class), a false negative means an email that was not spam was classified as spam. The cost of this kind of miss is annoyance: the sender gets no response because the email went straight to the receiver's spam folder. If an email is predicted as good but is actually spam, the miss may turn out to be expensive for an organization, e.g. its security could be compromised.

In medical diagnosis, a false negative means our system failed to detect a disease. A false positive means it detected something that was not there, leading to unnecessary treatment that is costly to the patient. Both FN and FP carry a heavy price; which is more severe depends on the disease that went undetected, the side effects of the unnecessary treatment, and so forth. Hence, different types of applications have different degrees of tolerance for FN and FP.

Calculating the accuracy when more than 2 classes exist

Let's say we have 100 test cases and want to detect whether each person is healthy, has a cold, or has a more serious case of flu. Suppose that of the 100 cases, 70 are healthy, 20 have a cold, and 10 have flu, and that 60 were correctly classified as healthy, 12 as having a cold, and 8 as having flu. The resulting confusion matrix (rows are actual classes, columns are predicted classes) looks like this:

                   Predicted healthy   Predicted cold   Predicted flu
Actual healthy            60                 8                2
Actual cold                4                12                4
Actual flu                 0                 2                8

In row 1, 60 of the 70 healthy cases were correctly predicted as healthy; 8 were wrongly diagnosed as a cold and 2 as flu. In row 2, 8 of the 20 cold cases were misdiagnosed: 4 were mistaken for flu and the remaining 4 were labeled healthy. In row 3, of the 10 flu cases, 2 were wrongly diagnosed as a cold.

Here our accuracy is (total cases correctly diagnosed) / (total number of cases) = 80/100 = 80%
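The same accuracy calculation works for any number of classes: sum the diagonal of the confusion matrix (the correct diagnoses) and divide by the total. A sketch using the numbers above:

```python
# Confusion matrix from the example: rows = actual, columns = predicted,
# in the order healthy, cold, flu.
labels = ["healthy", "cold", "flu"]
confusion = [
    [60, 8, 2],   # 70 actually healthy
    [4, 12, 4],   # 20 actually cold
    [0, 2, 8],    # 10 actually flu
]

total_cases = sum(sum(row) for row in confusion)
correct_cases = sum(confusion[i][i] for i in range(len(labels)))
accuracy = correct_cases / total_cases   # 80 / 100
```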


Bias of Machine learning

We know that if we have less data to build our classifier, the error it makes is high; as the amount of data increases, the error decreases. Plotted on a graph, it would look as shown in Figure 14.

Figure 14

The gap between the classifier's limiting error and zero error is the bias. Even if the amount of data were infinite, our model would still make errors in its predictions; this is the bias of the classifier.

This classifier was built by analyzing single words, so reviews like "Pancakes were good" and "Pancakes were not good" would both be classed as positive, since the classifier only looked at the word "good". If the classifier used pairs of words, it would know there is a difference between "good" and "not good": assign a weight of +1 for "good" and -1.5 for "not good" and build the classifier. If we now plot the error of this classifier against the amount of data, it looks like the graph in Figure 15. Even with two-word analysis there is still some bias, but having more data helps, and the more complex model makes less error.

Figure 15


Probabilities

When we make a prediction, how confident we are in that prediction is quite important. For example, a review like "pancakes were awesome and everything else is great" is definitely positive, whereas with "pancakes were great but service was just ok" the confidence is lower.

We represent this confidence as a probability:

P(Y|X) = a value between 0 and 1

where X is the input statement and Y is the output label. If confidence is high we might have P(Y|X) = 0.99; if confidence is low, P(Y|X) = 0.5. We can use these probabilities to influence our judgment in deciding whether or not to go to a given restaurant.
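One common way to turn a classifier's linear score into a confidence P(Y|X) between 0 and 1 is the logistic (sigmoid) function. The article does not name this function, so treat it as an illustrative assumption:

```python
import math

def confidence(score):
    """Map a linear classifier score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))
```

A strongly positive review with a high score gets a confidence near 1, while a borderline review with a score near zero gets a confidence near 0.5.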


Document Retrieval and Clustering

Suppose a person reads an article online and likes it; the reader would then like to read more similar articles. Say we get a list of suggested articles and notice that these articles are to the reader's liking as well. How did we get recommendations for articles the reader might like? This is yet another field where machine learning has made its mark.

Some years later, suppose the reader has developed his writing skills and written a wonderful article that gets published online, but forgot to tag it with the category it belongs to, such as sports, politics, or economics. Not to worry: a machine learning algorithm can tag the article to the appropriate group based on its content.

How does this happen? How can a machine that only understands 0s and 1s understand complex topics like economics, history, or politics? Let's look at yet another machine learning model.

Consider the statement "I love to walk in evening. Evening walk helps me burn those excess calories", taken from one of many articles covering a wide range of topics. Any article is made up of words; for our program to be intelligent enough to give suggestions, it needs to break the entire article down into words and count the number of times each word is repeated. A given article can then be represented by a word count vector.

Figure 16

A word count vector keeps a count of each word that appears in the article. Here, the words "evening" and "walk" appear twice, while words like "helps", "me", "love", "I", and "calories" appear once. Words like "cat", "group", "gym", and "dumbbells" are vocabulary words that do not appear in our article.
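Building this word count vector is a one-liner with a counter; a sketch using the example sentence:

```python
from collections import Counter

text = "I love to walk in evening. Evening walk helps me burn those excess calories"
tokens = text.lower().replace(".", " ").split()
word_counts = Counter(tokens)
# "evening" and "walk" each appear twice; every other word in the sentence
# appears once, and any vocabulary word not present has an implicit count of 0.
```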

Now that we have broken an article down into a bag of words using a word count vector, we can do the same for all the other articles in our collection. The example statement shows the article belongs to the health and fitness category. We now compare our word count vector with those of the other articles; similar articles should have non-zero counts for keywords like calories, walks, and running (words most likely to be used in a health and fitness article). If we compare the word count vectors of two similar articles, it looks more like this:


Figure 17

If we multiply the word count vectors element by element and sum the products,

Figure 18

5×3 + 3×2 + 1×1 = 15 + 6 + 1 = 22

we get a non-zero value, which signifies that the articles are similar.

Let's compare the word count vector of the health and fitness article with that of an article on economics.

Figure 19

Just as before, we multiply the word count vectors and sum the products. The total is 0, indicating that the articles are not similar. Dissimilar topics give a zero sum of products of their word count vectors, while similar articles give a non-zero sum.
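The similarity comparison can be sketched as follows; the word counts are illustrative stand-ins for the figures:

```python
# Word count vectors as dictionaries; missing words have count 0.
fitness_1 = {"calories": 5, "walk": 3, "evening": 1}
fitness_2 = {"calories": 3, "walk": 2, "evening": 1}
economics = {"inflation": 4, "market": 2}

def similarity(a, b):
    """Multiply matching word counts and sum them; 0 means no word overlap."""
    return sum(count * b.get(word, 0) for word, count in a.items())
```

Here similarity(fitness_1, fitness_2) gives 5×3 + 3×2 + 1×1 = 22, while similarity(fitness_1, economics) gives 0 because the articles share no words.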

Flaw in comparing similar word counts

With the approach above, similar topics tend to give a non-zero result when we compare word count vectors. However, any article contains words like "the", "this", and "are" that are common to all articles, similar or not. If these very common words dominate the comparison, even dissimilar articles will appear similar, since we still get non-zero results. We therefore need to lower the impact of commonly occurring words and increase the impact of the important words that define the type of article.

We have


Common words, like "this", "that", "the", "us", "are", "hence", and "so"

Rare words, like "Tiger Woods", "Michael Schumacher", and "Messi"

Important words, like "football", "basketball", and "golf"

We need to ignore the common words and strike a balance between the important words and the rare words; both are required to define the article or document type. Someone reading an article on Michael Schumacher may still like articles on Messi, and someone reading an article on football may like articles on other sports. So we need a balance of both rare and important words to reach a conclusion.

Normalizing Vector

Now let’s say we read two articles in the sports category. We can compare their word count vectors.

Figure 20

The sum of their products is 5×3 + 3×2 + 1×1 = 15 + 6 + 1 = 22

Now suppose we have a document (Doc 3) that is nothing but Document 1 written twice, and another (Doc 4) that is nothing but Document 2 written twice. Comparing their word count vectors gives

Figure 21

Sum of products of the word count vectors = 10×6 + 6×4 + 2×2 = 60 + 24 + 4 = 88

However, just because Doc 3 and Doc 4 yield a higher sum of products (88 vs. 22), it does not mean they are more interesting than Doc 1 and Doc 2; both pairs should show the same degree of similarity or grade of liking. To fix this, instead of summing the raw products, we normalize the vectors.

Normalizing length of Doc 1’s word count vector = √(1² + 5² + 3² + 1²) = √36 = 6

Normalizing length of Doc 2’s word count vector = √(1² + 3² + 1² + 2² + 1²) = √16 = 4


We divide each word count by this normalizing length, giving the documents an equal footing regardless of their length.

Figure 22

Sum of products of the normalized word counts = 5/6 × 3/4 + 3/6 × 2/4 + 1/6 × 1/4 ≈ 0.917

Similarly, we will calculate a normalized vector to Doc 3 and Doc 4.

Normalizing length of Doc 3’s word count vector = √(2² + 10² + 6² + 2²) = √144 = 12

Normalizing length of Doc 4’s word count vector = √(2² + 6² + 2² + 4² + 2²) = √64 = 8

Comparing Doc 3 and Doc 4 normalized vector count,

Figure 23

Sum of products of the normalized word counts = 10/12 × 6/8 + 6/12 × 4/8 + 2/12 × 2/8 ≈ 0.917

This way, irrespective of document length, the normalized word count vectors give the same similarity score for a document as for a document of twice the size with the same content.
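The normalization just described can be sketched as follows, using the Doc 1/Doc 2 vectors from the worked example. The list of shared-word count pairs is read off the same figures.

```python
import math

def norm(counts):
    """Euclidean length of a word count vector."""
    return math.sqrt(sum(c * c for c in counts))

doc1_all = [1, 5, 3, 1]            # full word count vector of Doc 1, norm = 6
doc2_all = [1, 3, 1, 2, 1]         # full word count vector of Doc 2, norm = 4
shared = [(5, 3), (3, 2), (1, 1)]  # counts of the words the two docs share

# Divide each count by its document's length before multiplying.
score = sum((a / norm(doc1_all)) * (b / norm(doc2_all)) for a, b in shared)

# Doubling both documents doubles every count but leaves the score unchanged.
doubled = sum((2 * a / norm([2 * c for c in doc1_all]))
              * (2 * b / norm([2 * c for c in doc2_all])) for a, b in shared)
```

Here `score` and `doubled` both come out to 22/24 ≈ 0.917, matching the text's point that length no longer matters.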

Term frequency and Inverse document frequency (TF and IDF)

A given article is broken down into words and each word is counted for the number of times it appears.

We create the word count vector; this is the term frequency (TF).

Figure 24

To neutralize the impact of commonly used words like “the”, “that”, “this”, etc. across all the documents, we will use a base-2 logarithm and call the result the inverse document frequency (IDF).

Let’s say we have 128 documents in total and the word “the” appears in all the documents; then,


IDF of the word “the” = log2(128 / (1 + 128)) = log2(≈ 1) ≈ 0

Let’s say the word “Eminem” appears in 3 documents out of 128 document collection.

IDF of word “Eminem” = log2 (128/ (1+3)) = log2 (32) = 5

Now we have the TF word count vector and the IDF value for each word.

Figure 25

“The” appeared 1000 times across all 128 documents; “Eminem” appeared 5 times in 3 documents.

Multiplying TF by IDF, we get

Figure 26

In this way, common words are down-weighted and rare words are up-weighted. So now we know how to represent a document (bag of words) and how to find similar articles (using TF, IDF, and word count vector comparison). To suggest another article, we go through all the available articles or documents, compare each with our test article, and compute the similarity until we find the most similar one. The most similar article can be suggested as output, or we can suggest a set of similar articles.
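The TF and IDF arithmetic above (“the” vs. “Eminem” in a 128-document collection) can be sketched directly from the formula in the text:

```python
import math

def idf(total_docs, docs_with_word):
    """Inverse document frequency: log2(N / (1 + number of docs with the word))."""
    return math.log2(total_docs / (1 + docs_with_word))

N = 128
doc_freq = {"the": 128, "Eminem": 3}  # how many of the 128 docs contain each word
tf = {"the": 1000, "Eminem": 5}       # raw term frequencies from the example

tf_idf = {word: tf[word] * idf(N, doc_freq[word]) for word in tf}
# "the" ends up near 0 (down-weighted); "Eminem" becomes 5 * 5 = 25 (up-weighted)
```

Note that log2(128/129) is approximately, not exactly, zero; the text rounds it to 0.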


Clustering

Suppose an article has been written; how do we place it in the category it fits best? A sports article needs to be tagged to the sports category, an entertainment article to the entertainment category, and so on. As humans we can easily recognize these, but how would a machine perform the same task? Grouping documents into specific categories is the process of clustering. By clustering, we can achieve faster search results and surface other relevant content from documents in the same cluster.

There are two ways to achieve this clustering of documents: supervised learning and unsupervised learning.

Supervised Learning

We have a set of articles that have been labeled into their respective categories.

Figure 27

Now a new article needs to be categorized into one of the above-mentioned categories. Our program compares the new article’s TF×IDF vector with those of all the labeled documents to find the closest match, and labels the new article with the category of that closest matching article.
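This nearest-match labeling can be sketched as follows. The category vectors and the new article’s vector are hypothetical TF×IDF values invented for illustration, not taken from the text.

```python
def similarity(vec_a, vec_b):
    """Sum of element-wise products of two TF x IDF vectors."""
    return sum(a * b for a, b in zip(vec_a, vec_b))

# Hypothetical labeled TF x IDF vectors over a shared vocabulary.
labeled = {
    "sports":        [4.0, 2.5, 0.0, 0.0],
    "entertainment": [0.0, 0.2, 3.1, 1.8],
}

new_article = [3.2, 1.9, 0.1, 0.0]  # unlabeled vector to categorize
best_label = max(labeled, key=lambda cat: similarity(labeled[cat], new_article))
print(best_label)  # "sports" -- label the new article like its closest match
```

In practice the comparison would run over every labeled document, not just one representative per category.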

What happens when we don’t have this set of labels? We will deal with that in unsupervised learning.

Unsupervised Learning

In cases where the data are unlabeled, we want our program to group the data into clusters. Note: here we don’t know in advance which data point belongs to which cluster.


Figure 28

Here, clustering can be done using a very popular algorithm: K-means clustering. K stands for the number of clusters to be created. Given the data set, if we want to group the data points into two clusters we begin by choosing two cluster centroids at random locations, as shown in the picture below.

Figure 29 and 30

One of the cluster centroids is labeled the Red centroid and the other the Blue centroid. Next, all the data points closest to the Red centroid are labeled red and all the data points closest to the Blue centroid are labeled blue (Figure 30). Using all the red data points, the mean (average) location is calculated; similarly, the mean location is calculated for the blue data points.


Next, the Red and Blue cluster centroids are moved to the calculated mean locations, as shown in Figure 31.

Figure 31 and 32

Once the centroids are moved, we make all the data points closest to the new Red centroid part of the red cluster and all the data points closest to the new Blue centroid part of the blue cluster (Figure 32). With the new cluster memberships, the mean location is again calculated for each cluster, the centroids are moved to these new means, and so on, as shown in Figure 33.

Figure 33 and 34

This cycle of calculating means and moving centroids continues until the newly calculated mean locations are the same as in the previous iteration. At that point the centroids of the two clusters stop moving, and the available data points have been separated into two clusters.
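The assign/recompute loop described above can be sketched in a few lines; the 2-D points below are made-up toy data standing in for the figures, and starting from randomly chosen data points is one common initialization choice.

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Assign each point to its nearest centroid, move each centroid to the
    mean of its points, and repeat until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # start from k random data points
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # label each point by nearest centroid
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:         # means stopped moving: converged
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated blobs of toy data.
points = [(1, 1), (1.5, 2), (2, 1.2), (8, 8), (8.5, 9), (9, 8.2)]
centroids, clusters = kmeans(points, k=2)
```

With data this cleanly separated the loop settles on the two blobs within a few iterations, mirroring the red/blue example in the text.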

We have talked about clustering via supervised and unsupervised learning methods. Besides helping in

document retrieval, there are other fields where clustering can be useful.


For example, if you have a website with hundreds of thousands of images, you can group these images into clusters of similar content, e.g. groups for ocean pictures, animal pictures, nature pictures, etc. This way, search results on the website can list similar pictures from the same group.

Another example where clustering could be useful is in monitoring the seizure activity of patients. Just by monitoring the seizures of different patients, we can group patients with similar seizure recordings and help treat them more effectively.


Deep Learning

When an image like the one shown below is given to us (humans), we can quickly identify its contents and say that it’s a dog, or more precisely, a golden retriever.

Image as seen by a human eye; the same image as data (0’s and 1’s)

Figure 35 and 36

To us, identifying an image seems a trivial task; it is not so for machines, even though they can do calculations much faster than humans possibly could. To a machine, this picture is just built of 0’s and 1’s. A traditional program is a set of instructions that strictly tells the computer what to do, and a computer running the program simply follows the code; this does not help a machine learn. When we teach a child that the above image is a dog, the child learns it quickly: the next time an image of a dog is shown, the child recollects the learning and identifies it as a dog. It is not as simple when it comes to machines.

We humans are gifted with a supercomputer-like brain that has evolved over millions of years. Our brains are composed of billions of neurons that are richly connected, and these neurons can hold information. Neurons communicate with other neurons by sending electrical or chemical signals via synapses.

Figure 37


Neurons form the core of the brain and spinal cord and give us consciousness.

By understanding how our brains work, programmers have come up with a similar style of programming called neural networks. Just like neurons, we have nodes; these nodes trigger other nodes, and so on, until we reach the desired output.

Figure 38

Input nodes trigger other nodes. Each node influences other nodes in different ways, represented by weights (the thickness of the lines). When the output nodes do not produce the desired output, these weights are adjusted until the desired output is obtained. Figure 38 depicts a simple single-layer neural network.

Each node triggers one or more nodes, and a given node has the ability to accept or reject an input. Let’s say we have a simple single-layer neural network as shown below.

Figure 39


We will use colors to represent the value a node takes based on the input it receives. Node A (green) and Node B (orange) trigger Node C with respective weights W0 and W1. Assume weight W1 has more impact than weight W0; then Node C takes on a value closer to Node B’s, a lighter orange. Node C in turn triggers output Node D, which comes out purple. However, purple is not the desired output; it should be blue. By subtracting the obtained output from the desired output, we determine that the error is light blue.

Figure 40

After calculating the error, the weights are adjusted to compensate and the output is calculated again. This time output Node D is closer to blue, but still not exactly blue as we predicted: it has a light purple value. Again the difference between desired and obtained output is calculated, and the weights are adjusted to compensate. This continues until Node D is blue, as desired. In neural networks, this process rarely finishes in one or two iterations; normally it runs again and again, often millions of times, until the desired output is obtained.
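The adjust-and-recompute cycle can be sketched with a single output node. The input values, starting weights, target, and learning rate below are made-up illustration numbers, and the update rule shown (error times input, scaled by a learning rate) is one common way to nudge the weights; the text does not prescribe a specific rule.

```python
inputs = [0.5, 0.9]    # values arriving from Node A and Node B (made-up)
weights = [0.1, 0.8]   # W0 and W1; W1 starts with more impact
target = 0.3           # the output we want Node D to produce ("blue")
rate = 0.1             # how strongly each error nudges the weights

for step in range(10000):
    output = sum(x * w for x, w in zip(inputs, weights))
    error = target - output          # desired output minus obtained output
    if abs(error) < 1e-6:            # close enough: stop iterating
        break
    for i in range(len(weights)):    # adjust each weight to shrink the error
        weights[i] += rate * error * inputs[i]
```

Each pass shrinks the error a little, just as the purple output drifts toward blue over many iterations in the example above.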

As explained earlier for our machine learning models, the available data are split into a training set used to build the neural network, with the remaining data used to evaluate it. Here, we need huge amounts of data for accurate results, and since even simple outputs require millions of iterations, fast processing is imperative. While neural networks have existed for more than 50 years, interest in them has surged over the past 10 years thanks to better computing and GPU developments.

When machines need to recognize simple patterns, tools like regression should be enough. But as task complexity increases (more and more parameters), neural networks tend to be more effective. As the number of layers between the input and output nodes increases, the neural network becomes a deep neural network, and training it is deep learning.


Image recognition

Earlier we talked about recognizing an image; let’s look more closely at how deep learning helps with image recognition. Consider the image shown in Figure 42.

Figure 42

In traditional programming, if we could write an eye detector, a nose detector, or a mouth detector, then running these detectors on an image would help identify whether the image shows a face. In reality, however, there are no ready-made nose, eye, or mouth detectors that we use in detecting a face.

To solve this problem of identifying an image, we use feature extractor code. Many hand-designed feature detectors have been written to detect specific interest points; these detectors capture certain statistical properties of the image. One such hand-designed feature detector is SIFT (Scale-Invariant Feature Transform).

Figure 43

A given image is run through the SIFT algorithm, which looks for key points in the image where SIFT features are found. A vector is created to represent where the key points/SIFT features were detected; this can be viewed as similar to a bag-of-words vector. With the SIFT features identified, we can run the vector through a simple classifier to detect whether the image is a face.

To explain the SIFT algorithm pictorially, let’s examine the set of images below that we need to detect within a test image. Here is the picture we have to search in order to identify the above images. After running them through SIFT feature detection, the result is:


Figures 44–49

The big rectangle marks the detected image, and the small squares mark the features detected within it, also showing the orientation in which each feature was detected. After detecting the SIFT features we can track images, detect images, and identify objects whenever needed.

Along with SIFT, there are SPIN image, Textons, RIFT, GLOH, and HOG feature detectors that are used across various other computer vision tasks.

Image recognition via deep learning is also being incorporated in other fields, e.g. online shopping. When selecting a dress for purchase, we are often shown images of similar-looking dresses that might also be of interest. Big online retailers that sell nothing tangible make profits largely through how they present, organize, and display data. Another field where image recognition helps is in reading handwritten script and converting it to digital form: many smartphones can interpret what is written with a stylus and convert it to digitized text. Using deep neural networks, there are apps that can read signboards in a foreign language, converting images of text we cannot read or speak into a language we are more comfortable with.

Conclusion

We have looked at some of the machine learning algorithms available. There are many other algorithms making an impact, enabling machines to do tasks that cannot be done via traditional coding. The machine learning algorithms explained here can make predictions that basic analytics cannot, and they work better as the amount of data fed to them increases. By keeping a validation system, machine learning code can learn from its errors, comparing its output to the expected output and readjusting itself until the outputs obtained match expectations. As the amount of data grows at geometric rates, we will have to deal with large amounts of unlabeled data. When we upload an image online it will be classified accordingly


by machine learning code that parses the image and identifies what it could be showing, comparing it with what it learned previously from already available images. For example, hundreds of pictures of birds are available online, and when we upload a new image of a bird the search engine can find similar-looking images and identify that the new image shows a bird (deep learning).

Classifying data into similar clusters helps improve search engine results. If we search for a keyword in any popular search engine today, we not only get the output we require but also anything similar to it in other fields. For example, a search for the keyword ‘storage’ results in search engines showing not only computer storage drives but also other storage-related results that could be what the user is looking for, e.g. furniture storage, container storage, food storage, etc. We never specified exactly what we were looking for with the bare keyword ‘storage’; machine learning techniques like clustering and classification did the work.

Imagine a future where doctors are unavailable for some reason to diagnose our conditions; instead, our phones or other handheld portable devices take our symptoms and help identify whether we are suffering from a medical condition. The idea of a machine telling us our medical condition sounds a bit unrealistic now. That is a common reaction, but with more and more data and better machine learning models, prediction errors are being reduced drastically. Competitions like Kaggle’s second annual Data Science Bowl for 2016, “Transforming How We Diagnose Heart Disease”, are a shining example of what is helping us achieve machine-based diagnosis.

It’s safe to say, thanks to machine learning we have a better future ahead.


Glossary

AI - Artificial intelligence

FN - False Negative

FP - False Positive

IDF - Inverse Document Frequency

ML - Machine Learning

Sqr - Square root

TF - Term Frequency


References

Machine Learning a practical introduction

http://www.infoworld.com/article/3010401/big-data/machine-learning-a-practical-introduction.html

Practical Machine Learning Problems

http://machinelearningmastery.com/practical-machine-learning-problems/

Machine Learning

http://www.sas.com/en_us/insights/analytics/machine-learning.html

SIFT

http://aishack.in/tutorials/sift-scale-invariant-feature-transform-introduction/

From feature descriptors to deep learning: 20 years of computer vision

http://www.computervisionblog.com/2015/01/from-feature-descriptors-to-deep.html

Deep Learning

https://en.wikipedia.org/wiki/Deep_learning

Advancing Business with Advanced Analytics

https://www.gartner.com/doc/3090420?ref=SiteSearch&sthkw=machine%20learning&fnl=search&srcId=1-3478922254

Deep Learning

http://neuralnetworksanddeeplearning.com/chap6.html

Machine learning foundation

https://www.coursera.org

Kaggle's second annual data science bowl

https://www.kaggle.com/c/second-annual-data-science-bowl


Dell EMC believes the information in this publication is accurate as of its publication date. The

information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” DELL EMC MAKES NO

REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS

PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS

FOR A PARTICULAR PURPOSE.

Use, copying and distribution of any Dell EMC software described in this publication requires an

applicable software license.

Dell, EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries.