MACHINE LEARNING: WORKING TOWARDS A BETTER FUTURE
Ataulla Fareed Pasha, System Engineer Analyst, Dell EMC, ataulla.pasha@dell.com
Table of Contents

Introduction
What is Machine Learning? Is it the same as Artificial Intelligence?
Regression
    Residual sum of squares
    Determining which regression model to use
Classifiers
    Linear Classifier
    Evaluating a classifier and classifier error
    Types of Errors
    Calculating the accuracy when more than 2 classes exist
    Bias of Machine learning
    Probabilities
Document Retrieval and Clustering
    Flaw in comparing similar word counts
    Normalizing Vector
    Term frequency and Inverse document frequency (TF and IDF)
    Supervised Learning
    Unsupervised Learning
Deep Learning
    Image recognition
Conclusion
Glossary
References
Disclaimer: The views, processes or methodologies published in this article are those of the authors.
They do not necessarily reflect Dell EMC’s views, processes or methodologies.
Dell, EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries.
Introduction
"Digital technologies and the data they capture present an opportunity for organizations of all sizes to
drive business innovation and new levels of productivity," said Helena Schwenk, research manager, IDC
Europe Big Data and Analytics.
Many organizations are embracing advanced analytics as they start to realize the vast opportunities it brings. Advanced analytics can save lives, detect fraud, filter spam, and bring in more profits. Even a small percentage improvement in a large organization has a huge impact.
In 2014, advanced analytics was the fastest-growing analytics segment (12.4%) as organizations started to move from diagnostic analytics to predictive and prescriptive analytics. This growth was driven by increased interest in big data, predictive analytics, and machine learning.
Advanced analytics goes well beyond basic analytics, which only tells us whether there were losses or profits. Advanced analytics goes through the available data to identify root causes and predict outcomes and behaviors. Various methods are used to perform advanced analytics, such as optimization, machine learning, and text, speech, and image analytics.
In particular, this article explores machine learning (ML), some widely used ML techniques, and their practical uses.
What if you are unsure of the selling price of your house? What if you are not sure which movie to watch or which book to read in your leisure time?
Do you want to identify a person from a blurred image? You see so many restaurants in town that serve your favorite food and have no idea which to choose. To make the decision-making process simple and convenient, we have Machine Learning!
What is Machine Learning? Is it the same as Artificial Intelligence?
The goal of Artificial Intelligence (AI) is to create a machine that can think like a human. To do this, a machine needs the ability to learn, apply reason, use abstract thinking, etc. Machine learning, however, is writing software that is focused on learning from past experiences. Machine learning is more closely related to data mining and statistics than to AI.
Tom Mitchell of Carnegie Mellon University defines machine learning as: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
There are many machine learning techniques out there. We will go in-depth with some of the most common ones: regression, classifiers, clustering, and deep learning.
Regression
Suppose we have a house for sale and want to know the best selling price for this house.
How do we go about it? We don't want to underestimate the property's value, nor do we want to sell it at a price where people will not find it worth buying. What do we do? We could make predictions that might be right or wrong. Fortunately, we have machine learning techniques that can help us here.
To predict the house's value, let's first go through a list of houses that were sold over some time span in a given geographic location. Many features contribute to the total cost of a house, like the number of rooms, square footage, number of bathrooms, number of bedrooms, number of kitchens, etc.
Let's plot the houses sold in the past on a graph, with the size of the house on the X axis and the cost on the Y axis. Each blue dot on the graph is a house that was sold.
Figure 1
We will now represent the prediction as a line as shown in Figure 2.
Figure 2
This prediction can be represented as a function:
f(x) = W0 + W1·x
where W0 is the intercept (the point where the prediction line crosses the Y axis) and W1 is the slope, or regression coefficient: how much impact varying the square footage of the house has on its sales price.
The prediction line (green) represents the sales price predicted by the system based on the square footage of the house.
The graph in Figure 2 shows a prediction line, but how do we know which of all the possible lines is the best? Refer to Figure 3, which shows various prediction lines.
Figure 3
To determine the best prediction line, we use the residual sum of squares (RSS).
Residual sum of squares
Let's consider the prediction line from the graph in Figure 2. Now we plot a line from the actual cost of each house to the price predicted by our model.
The lines drawn (orange) represent the misses in our prediction relative to the actual cost of each house.
Figure 4
RSS(W0, W1) = (actual cost of house 1 − [W0 + W1·(sq. ft. of house 1)])²
            + (actual cost of house 2 − [W0 + W1·(sq. ft. of house 2)])²
            + … + (actual cost of house n − [W0 + W1·(sq. ft. of house n)])²
RSS is the sum of the squared misses in the predicted cost of each house.
By computing the residual sum of squares for all the candidate prediction lines, we can choose the one with the least RSS as the one that makes the least error in prediction.
Hence, we now have our prediction model, built using a linear function.
Figure 5
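To make this concrete, here is a minimal sketch, not from the article, of computing RSS and letting NumPy pick the W0 and W1 with the least RSS; the house data is made up for illustration.

```python
# A minimal sketch of fitting f(x) = W0 + W1*x by least RSS (made-up data).
import numpy as np

# Hypothetical training data: square footage and sale price of sold houses.
sqft  = np.array([1000, 1500, 1800, 2200, 2600], dtype=float)
price = np.array([1.8e5, 2.6e5, 3.0e5, 3.9e5, 4.4e5])

def rss(w0, w1, x, y):
    """Residual sum of squares: squared misses of f(x) = w0 + w1*x, summed."""
    return np.sum((y - (w0 + w1 * x)) ** 2)

# np.polyfit solves for the (w1, w0) pair that minimizes the RSS.
w1, w0 = np.polyfit(sqft, price, deg=1)
print(f"f(x) = {w0:.0f} + {w1:.1f}*x, RSS = {rss(w0, w1, sqft, price):.3g}")

# Predict the sale price of a 2,000 sq. ft. house with the fitted line.
print(f"Predicted price for 2000 sq. ft.: ${w0 + w1 * 2000:,.0f}")
```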
But what if we get a better prediction using a quadratic function instead of a linear function?
Figure 6
Where f(x) = W0 + W1·x + W2·x²
Using a quadratic function, our model seems to make fewer errors than the linear model. Suppose we go further and use an even more complex representation, such as a 13th-order polynomial: it would produce a prediction curve that matches the observed house sales prices even more closely.
Figure 7
But per this model, the predicted price of the house we want to sell is very low. We can guess that our sale price should be much higher than the value predicted by the 13th-order polynomial. Hence, we arrive at the question of which model gives the correct prediction.
This is the beauty of machine learning: our system needs to decide which model is accurate and which is not.
Determining which regression model to use
Let's consider a set of houses that we know were sold and the prices at which they were sold. From this set, we remove a few houses and use the remaining houses to determine which regression model is best suited.
H1, H2, H3, … H25 (houses whose sales prices we know)
H5, H6, H7, H9, H11 are removed from the above set.
We are left with H1, H2, H3, H4, H8, H10, H12, H13, … H25 (call this the training set).
H5, H6, H7, H9, H11, whose prices we also know, form the test set.
We determine the best regression model using the training set and then use the houses in the test set to measure how far the predicted prices are from the actual sold prices.
As explained earlier, on the training set we use the RSS method to find the model with the lowest error, trying the linear model, the quadratic model, and more complex models like the 13th-order polynomial. Next, using the test set, we determine which model gives the most accurate predictions for houses it has not seen. This tells us which model is the best fit in our case.
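Below is a minimal sketch of this train/test procedure, assuming NumPy; the house data, noise level, and the list of degrees tried are all invented for illustration.

```python
# A minimal sketch: fit several polynomial models on a training set, then
# pick the one with the lowest error on the held-out test set.
import numpy as np

rng = np.random.default_rng(0)
sqft = rng.uniform(800, 3000, size=25)                   # 25 houses, H1..H25
price = 5e4 + 150 * sqft + rng.normal(0, 2e4, size=25)   # noisy "true" prices
x = sqft / 1000.0        # rescale to keep high-degree fits numerically sane

test_idx = np.array([4, 5, 6, 8, 10])        # H5, H6, H7, H9, H11 (0-based)
train_idx = np.setdiff1d(np.arange(25), test_idx)

best_deg, best_err = None, np.inf
for deg in (1, 2, 5, 13):                # linear, quadratic, ..., 13th order
    coeffs = np.polyfit(x[train_idx], price[train_idx], deg)
    test_rss = np.sum((price[test_idx] - np.polyval(coeffs, x[test_idx])) ** 2)
    print(f"degree {deg:2d}: test RSS = {test_rss:.3g}")
    if test_rss < best_err:
        best_deg, best_err = deg, test_rss

print("Best model by test error: degree", best_deg)
```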
Now, let's plot a graph of the prediction error against the complexity of the model.
Figure 8
We see that the training error keeps decreasing as the order of the function increases.
Now we use the set of houses we removed (the test set) and check how close their predicted prices are to their actual sale prices.
Figure 9
We see that the prediction error on the test houses decreased as the complexity of the function increased, but only up to a point, after which the error slowly started to increase again.
This shows that a higher-order polynomial does not always predict sales prices better. Which function to use has to be determined from the available data in our machine learning code.
This explanation of predicting the sales price of a house via machine learning was based only on square footage; more parameters can influence the cost of a house, such as the number of bedrooms, the number of bathrooms, the type of furniture, etc. As more features are added, the model complexity increases.
This gives us an idea of how machine learning code helps us make house price predictions. The same approach applies not only to house prices but to other fields, like predicting stock prices (based on recent stock history, news events, etc.) or predicting the number of retweets a post can get (based on the author's total number of followers, the followers of those followers, the popularity of the hashtag, past retweets, etc.).
Classifiers
We saw how machine learning helps predict house prices. In other areas of life, regression may not be suitable for making predictions. "Classifiers" are yet another powerful method used to make predictions in fields like restaurant reviews, medical diagnosis, spam filtering, etc. Classifiers are perhaps one of the most commonly used techniques in machine learning.
For example, a patient has undergone a health checkup, and the doctor uses symptoms, test reports, and past records as inputs to diagnose whether the patient is healthy or suffers from a specific health condition. Given these same inputs, machine learning can be used to make the same personalized health predictions.
Figure 10
Spam filtering is another field where machine learning has been used extensively. Spam filters do a tremendously good job of identifying and removing spam from our inboxes. In earlier days, spam filters were not so effective, and spammers kept getting smarter, using many techniques to beat detection, such as using different mail IDs or using numbers instead of letters. Thanks to machine learning, our spam filters now filter spam much more effectively by analyzing the content of the email, email IDs, IP addresses, and many other such parameters. Life is more peaceful with spam sitting in the spam folder.
To understand classifiers better, we will use this method on a restaurant review system to help us pick a good place to have some tasty pancakes.
Going by the reviews for a given restaurant, we notice a mixed set: some negative, some positive, and some neutral. Here are some possible reviews for a given restaurant:
"I don't think I have ever eaten better pancakes anywhere else. Service is quick"
"My wife tried the strawberry cake and it was pretty forgettable"
"Pancakes were great but ambience was dull"
Since we are only interested in pancakes, we can ignore reviews of other food items. Going by the sample reviews, two reviews were positive about pancakes. This helps us identify that this restaurant is a good place to visit, provided we only care about the pancakes.
Thus, our machine learning algorithm should be able to tell us whether or not to go to a given restaurant by means of classification. We can lay it out as:
Figure 11
Linear Classifier
How does this work? How will our system recognize a positive and a negative review?
We list a set of words that indicate a good review, such as great, awesome, good, wonderful, etc., and a set of words that mark a review as bad, i.e. pathetic, bad, awful, etc.
Take a given review, break the sentence down into words, and look for positive and negative words. Count the total positive words and total negative words. If there are more positive words than negative words, we classify the review as positive.
Figure 12
If we consider only the pancake experience, we know that it is positive on average. This type of classification is fine, but in reality different words carry different weights. For example, great is better than good, and awesome is better than fine. We give each of these words a weight and calculate the overall weight of the sentence.
Let's consider the five words in our scope of analysis, each given a certain weight.
Let's say a review reads:
"Pancakes were great, the food was awesome, but the service was terrible."
Total score = 1.5 + 1.7 − 2.1 = 1.1
Overall, this is a good review.
Hence, by knowing the weight of each word, we know whether the review is good or bad; this is linear classification.
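Here is a minimal sketch of such a weighted linear classifier; the word weights are illustrative values drawn from the examples in this section, not a published lexicon.

```python
# A minimal sketch of a weighted-word linear classifier (illustrative weights).
WEIGHTS = {
    "good": 1.0, "great": 1.5, "awesome": 1.7, "wonderful": 1.6,
    "fine": 0.8, "bad": -1.0, "dull": -1.2, "terrible": -2.1, "awful": -3.3,
}

def score(review: str) -> float:
    """Sum the weights of the sentiment words found in the review."""
    return sum(WEIGHTS.get(word, 0.0) for word in review.lower().split())

review = "Pancakes were great the food was awesome but the service was terrible"
s = score(review)
print(f"score = {s:+.1f} ->", "positive review" if s > 0 else "negative review")
# score = 1.5 + 1.7 - 2.1 = +1.1 -> positive review
```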
Evaluating a classifier and classifier error
We have developed our machine learning program using the classifier model, and we want to understand how accurate the model is. If it's inaccurate, we want our program to adjust and make modifications to its system.
We have a given set of sentences: sentence 1, sentence 2, … sentence N. We know whether the review in each sentence is positive or negative.
From this data set, we remove some sentences and call them the test set. We call the remaining sentences the training set.
Figure 13
Just as in the regression model, we have split our available data into two categories: a training set and a test set. The training set is used to build our classifier, and the test set is used to evaluate it. We also know whether each sentence's review is positive or negative; this is called the actual label. We hide the actual labels of the test set. Using the training set, our model learns to classify a sentence as positive or negative, building up weights for the words, for example good = 1.0, awesome = 1.7, bad = -1.0, awful = -3.3, etc.
Now we run the test set sentences through our classifier and check what the model predicts, comparing each prediction with the actual label. If the prediction matches the actual label, the review was correctly predicted; if not, it's a wrong prediction. We count the total number of sentences correctly reviewed and incorrectly reviewed. From this we can compute the error rate of our classifier:
Error rate = (number of sentences incorrectly classified) / (total number of test sentences)
And similarly, accuracy is calculated as:
Accuracy = (number of sentences correctly classified) / (total number of test sentences)
Types of Errors
In our restaurant review case, we have either a positive or a negative prediction (two expected classes: positive or negative).
With two classes, we could achieve 50% accuracy just by guessing at random, so our machine learning classifier should exceed at least 50% accuracy. If there are more than two classes, then for K classes random guessing is correct 1/K of the time. In that case, our classifier should do better than 1/K to be considered reliable.
Let's look at the types of errors we could make, represented as a matrix called a confusion matrix.
Figure 14
Suppose our system is dealing with only two classes (positive or negative). If we predict positive and the
actual label is positive then we have True Positive. If we predict negative and the actual label is negative
then we have True Negative. Similarly, if our system predicted negative and actual label is positive then
we have False Negative (FN). And if our system predicted positive and actual label is negative then we
have False Positive (FP).
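A minimal sketch of tallying these four outcome types from predictions; the labels here are invented examples.

```python
# A minimal sketch of counting TP/TN/FP/FN for a two-class classifier.
predicted = ["pos", "pos", "neg", "neg", "pos", "neg"]
actual    = ["pos", "neg", "neg", "pos", "pos", "neg"]

tp = sum(p == "pos" and a == "pos" for p, a in zip(predicted, actual))
tn = sum(p == "neg" and a == "neg" for p, a in zip(predicted, actual))
fp = sum(p == "pos" and a == "neg" for p, a in zip(predicted, actual))
fn = sum(p == "neg" and a == "pos" for p, a in zip(predicted, actual))

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")              # TP=2 TN=2 FP=1 FN=1
print(f"accuracy = {(tp + tn) / len(actual):.2f}")     # 4/6 = 0.67
```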
Let's apply this to two applications, spam filtering and medical diagnosis, and look at the impact of each type of error.
In the case of spam filters, treating "spam" as the positive class, a false positive means an email that was not spam was regarded as spam. The cost of this kind of miss can be annoying: the sender gets no response because the email went straight to the receiver's spam folder.
A false negative, an email predicted as good that was actually spam, may turn out to be an expensive miss for an organization, i.e. the organization's security could be compromised.
In medical diagnosis, a false negative (FN) means the system did not detect a disease that was present. A false positive (FP) means it detected something that was not there, leading to unnecessary treatments that are costly to the patient. Both cases, FN and FP, seem to carry a heavy price; which is more severe depends on the type of disease that went undetected, the side effects of the unnecessary treatment, and so forth. Hence, we notice that different types of applications have different degrees of tolerance to FN and FP.
Calculating the accuracy when more than 2 classes exist
Let's say we have 100 test cases and want to detect whether each person is healthy, suffers from a cold, or suffers from a more dangerous case of flu.
Suppose that of the 100 cases, 70 are healthy, 20 have a cold, and 10 have the flu. Of these, 60 were correctly classified as healthy, 12 as having a cold, and 8 as having the flu. The confusion matrix (rows are actual conditions, columns are predictions) looks like this:

                      Predicted healthy   Predicted cold   Predicted flu
Actual healthy (70)          60                  8                2
Actual cold (20)              4                 12                4
Actual flu (10)               0                  2                8

Let's look at the FN and FP cases. In row 1, 60 out of 70 were correctly predicted as healthy, 8 were wrongly diagnosed with a cold, and 2 were wrongly diagnosed with the flu.
In row 2, while detecting colds, 8 cases were difficult to diagnose: 4 were mistaken for flu and the remaining 4 were labeled healthy.
In row 3, of the 10 flu cases, 2 were wrongly diagnosed as a cold.
Here our accuracy is (total cases correctly diagnosed) / (total number of cases) = 80/100 = 80%.
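A minimal sketch that reproduces the 80% figure from the confusion matrix above, assuming NumPy.

```python
# A minimal sketch: accuracy from a 3-class confusion matrix
# (rows = actual class, columns = predicted class).
import numpy as np

classes = ["healthy", "cold", "flu"]
confusion = np.array([
    [60,  8, 2],   # actual healthy
    [ 4, 12, 4],   # actual cold
    [ 0,  2, 8],   # actual flu
])

correct = np.trace(confusion)        # diagonal = correctly diagnosed cases
total = confusion.sum()
print(f"accuracy = {correct}/{total} = {correct / total:.0%}")  # 80/100 = 80%
```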
Bias of Machine learning
We know that if we have less data to build our classifier, the error it makes is high; as the amount of data increases, the error decreases. Plotted on a graph, it looks as shown in Figure 14.
Figure 14
The gap between the classifier's error and zero error is the bias. Even if the amount of data were infinite, our model would still make errors in its predictions; this residual error is the bias of the classifier.
This classifier was built by analyzing single words, so reviews like "Pancakes were good" and "Pancakes were not good" would both be classified as positive, since the classifier only looks at the word "good". If the classifier used pairs of words, it would know the difference between "good" and "not good": assign a weight of +1 for "good" and -1.5 for "not good" and build the classifier. If we now plot this classifier's error against the amount of data, it looks like the graph in Figure 15. Even with two-word analysis there is still some bias, but having more data helps, and the more complex model has less error.
Figure 15
Probabilities
When we make a prediction, how confident we are in it is quite important. For example, a review like "pancakes were awesome and everything else is great" is definitely positive.
If the review is "pancakes were great but service was just ok", the confidence is lower.
We represent this confidence here as a probability:
P(Y|X) = a value between 0 and 1
where X is the input statement and Y is the output label. If confidence is high, we might have P(Y|X) = 0.99; if confidence is low, P(Y|X) might be close to 0.5.
We can use these probabilities to influence our judgment in deciding whether or not to go to a given restaurant.
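The article does not specify how P(Y|X) is computed; a common choice for linear classifiers, assumed here purely for illustration, is the logistic (sigmoid) function applied to the weighted-word score from earlier.

```python
# Assumption: map the linear classifier score to a probability with the
# logistic (sigmoid) function; the article does not name a specific method.
import math

def probability_positive(score: float) -> float:
    """Map a linear classifier score to a confidence between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))

print(f"{probability_positive(3.2):.2f}")  # strongly positive review -> ~0.96
print(f"{probability_positive(0.1):.2f}")  # borderline review        -> ~0.52
```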
Document Retrieval and Clustering
Suppose a person reads an article online. If the reader likes the article, he would like to read more similar articles. Say we show a list of suggested articles and, interestingly, notice that these articles are to the reader's liking as well. How did we come up with a recommendation of articles the reader might like? This is yet another field where machine learning has made its mark.
Now suppose that over the years the reader develops his writing skills and writes a wonderful article that gets published online, but forgets to tag the article with the category it belongs to, like sports, politics, or economics. Not to worry: a machine learning algorithm can tag the article to the appropriate group based on its content.
How does this happen? How can a machine that only understands 0's and 1's understand complex topics like economics, history, or politics? Let's look at yet another amazing machine learning model.
Consider two statements from one of the articles: "I love to walk in evening. Evening walk helps me burn those excess calories." There are many articles covering a wide range of topics.
Any article is made up of words; for our program to be intelligent enough to give suggestions, it needs to break the entire article down into words and count the number of times each word is repeated. A given article can then be represented by a word count vector.
Figure 16
A word count vector keeps a count of each word that appears in the article. Here, "evening" and "walk" each appear twice, while words like "helps", "me", "love", "I", "calories", etc. appear once. Other words like cat, group, gym, dumbbells, etc. are vocabulary words that didn't appear in our article.
Now that we have broken an article down into a bag of words using the word count vector, we can do the same for all the other articles in our collection. The statement used ("I love to walk in evening. Evening walk helps me burn those excess calories") suggests the article belongs to the health and fitness category. Now we compare our word count vector with those of the other articles in the collection. Similar articles should have non-zero counts on keywords like calories, walks, running, etc. (words most likely used in a health and fitness article).
Thus, comparing the word count vectors of two similar articles looks more like this:
Figure 17
If we multiply the word count vectors element by element and sum the products,
Figure 18
5×3 + 3×2 + 1×1 = 15 + 6 + 1 = 22
we get a non-zero value, which signifies that the articles were similar.
Let's compare the health and fitness word count vector with the word count vector of an article on economics.
Figure 19
Just like before, multiply the word count vectors and sum the products:
Total sum = 0 (indicating the articles were not similar)
This shows that dissimilar topics have a zero sum of element-wise products, while similar articles have a non-zero sum.
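A minimal sketch of this comparison, with invented sentences: the dot product of two word count vectors is non-zero for similar topics and zero for dissimilar ones.

```python
# A minimal sketch of bag-of-words similarity via a dot product of counts.
from collections import Counter

def word_counts(text: str) -> Counter:
    return Counter(text.lower().split())

fitness1 = "i love to walk in evening evening walk helps me burn those excess calories"
fitness2 = "a brisk evening walk helps me burn calories and walk off stress"
economics = "central banks adjust interest rates and manage inflation"

def dot(a: Counter, b: Counter) -> int:
    """Multiply the counts of each shared word and sum the products."""
    return sum(a[w] * b[w] for w in a)

print(dot(word_counts(fitness1), word_counts(fitness2)))   # 10: similar topics
print(dot(word_counts(fitness1), word_counts(economics)))  # 0: dissimilar
```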
Flaw in comparing similar word counts
Going by the approach above, similar topics tend to produce a non-zero result when we compare word count vectors. However, a given article also contains words like "the", "this", and "are" that are common to all articles, similar or not. If we compare dissimilar articles that share these very common words, even dissimilar topics will seem similar, because the comparison yields a non-zero result.
Thus, we need to lower the impact of these commonly occurring words and increase the impact of the important words that define the type of article.
We have
Common words – like "this", "that", "the", "us", "are", "hence", "so", etc.
Rare words – like "Tiger Woods", "Michael Schumacher", "Messi"
Important words – like "Football", "Basketball", "Golf", etc.
We need to ignore the common words and strike a balance between important words and rare words; both are required to define the article or document type. Someone reading an article on Michael Schumacher will not necessarily dislike articles on Messi, and someone reading an article on football may well like articles on other sports. So we need a balance of both rare and important words to reach a conclusion.
Normalizing Vector
Now let's say we read two articles in the sports category. We can compare their word count vectors.
Figure 20
The sum of their element-wise products is:
15 + 6 + 1 = 22
Now suppose we read a document (Doc 3) that is nothing but Doc 1 written twice, and another article (Doc 4) that is nothing but Doc 2 written twice. Comparing their word count vectors:
Figure 21
Sum of element-wise products = 60 + 24 + 4 = 88
However, just because Doc 3 and Doc 4 yield a higher sum of products (88 vs. 22), it does not mean they are more similar to each other than Doc 1 and Doc 2 are. Both pairs should show the same interest factor or grade of liking. To fix this, instead of simply summing the products, we normalize the vectors.
Norm (length) of Doc 1's word count vector = √(1² + 5² + 3² + 1²) = √36 = 6
Norm of Doc 2's word count vector = √(1² + 3² + 1² + 2² + 1²) = √16 = 4
We divide each word count by the vector's norm, putting documents on an equal footing regardless of their length.
Figure 22
Sum of products of the normalized word counts = (5/6)(3/4) + (3/6)(2/4) + (1/6)(1/4) ≈ 0.916
Similarly, we calculate the norms for Doc 3 and Doc 4:
Norm of Doc 3's word count vector = √(2² + 10² + 6² + 2²) = √144 = 12
Norm of Doc 4's word count vector = √(2² + 6² + 2² + 4² + 2²) = √64 = 8
Comparing Doc 3 and Doc 4 normalized vector count,
Figure 23
Sum of products of the normalized word counts = (10/12)(6/8) + (6/12)(4/8) + (2/12)(2/8) ≈ 0.916
This way, irrespective of document length, each document gets a normalized word count vector, making it just as likable as a document twice its size with the same content.
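A minimal sketch reproducing these numbers; the count vectors are laid out over a shared six-word vocabulary chosen to match the worked example.

```python
# A minimal sketch of normalized (cosine-style) similarity of count vectors.
import math

def normalized_similarity(u, v):
    """Dot product of two count vectors divided by both vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

doc1 = [1, 5, 3, 1, 0, 0]            # norm 6, as in the example above
doc2 = [0, 3, 2, 1, 1, 1]            # norm 4
doc3 = [2 * x for x in doc1]         # Doc 1 written twice, norm 12
doc4 = [2 * x for x in doc2]         # Doc 2 written twice, norm 8

print(f"{normalized_similarity(doc1, doc2):.3f}")  # 0.917
print(f"{normalized_similarity(doc3, doc4):.3f}")  # 0.917: length no longer matters
```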
Term frequency and Inverse document frequency (TF and IDF)
A given article is broken down into words, and each word is counted by the number of times it appears. This gives us the word count vector; this is the term frequency (TF).
Figure 24
To neutralize the impact of words like "the", "that", and "this" that occur commonly in all documents, we apply a base-2 logarithm, called the inverse document frequency:
IDF(word) = log2(total # of documents / (1 + # of documents containing the word))
Let's say we have 128 documents in total and the word "the" appears in all of them; then,
IDF of the word "the" = log2(128 / (1 + 128)) = log2(≈1) ≈ 0
Let's say the word "Eminem" appears in 3 documents out of the 128-document collection:
IDF of the word "Eminem" = log2(128 / (1 + 3)) = log2(32) = 5
Now we have word count vector of TF and IDF
Figure 25
"The" appeared 1,000 times across all 128 documents; "Eminem" appeared 5 times in 3 documents.
Multiplying TF by IDF, we get
Figure 26
In this way, common words are down-weighted and rare words are up-weighted. Now we know how to represent a document (bag of words) and how to find similar articles (using TF, IDF, and comparing the word count vectors). To suggest another article, we go through all the articles or documents in our collection, compare each with our test article, and find the most similar one. The most similar article can be suggested as output, or we can suggest a set of similar articles.
Clustering
Suppose an article has been written; how do we place it into the category it fits best? A sports article needs to be tagged to the sports category, an entertainment article to the entertainment category, and so on. As humans we can easily recognize these, but how would a machine perform the same task? Grouping documents into categories of similar content is the process of clustering. With clustering, we can achieve faster search results and surface other important related content from documents in the same cluster.
There are two ways to achieve this clustering of documents: supervised learning and unsupervised learning.
Supervised Learning
We have a set of articles that have been labeled into their respective categories.
Figure 27
Now a new article needs to be categorized into one of the above categories. Our program simply compares its TF×IDF vector with those of all the other documents to find the closest matching article, and labels the new article accordingly, as in the sketch below.
What happens when we don't have these labels? We deal with that in unsupervised learning.
Unsupervised Learning
In cases where the data is unlabeled, we want our program to group the data into clusters.
Note: here we don't know which data point belongs to which cluster.
Figure 28
Here, clustering can be done using a very popular algorithm: K-means clustering.
K stands for the number of clusters to be created.
Given the data set, if we want to form two clusters, we begin by choosing two cluster centroids at random locations, as shown in the picture below.
Figure 29 and 30
One cluster center is labeled the Red cluster centroid and the other the Blue cluster centroid. Next, all the data points closest to the Red centroid are labeled red, and all the data points closest to the Blue centroid are labeled blue (Figure 30).
Using all the red data points, the mean (average) location is calculated; similarly, the mean location is calculated for the blue cluster's data points.
Next, the Red and Blue cluster centroids are moved to the calculated average locations, as shown in Figure 31.
Figure 31 and 32
Once the centroids have moved, we again assign all the data points closest to the new Red centroid to the red cluster and all the points closest to the new Blue centroid to the blue cluster (Figure 32).
With the new cluster memberships, the mean location is calculated again for each cluster, the centroids are moved to these new means, and so on, as shown in Figure 33.
Figure 33 and 34
This cycle of calculating means and moving centroids continues until the newly calculated mean locations are the same as in the previous iteration. At that point the two cluster centroids stop moving, and we have separated the available data into two clusters.
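A minimal sketch of this K-means loop on made-up 2-D points, assuming NumPy; real document clustering would run the same loop on TF-IDF vectors instead.

```python
# A minimal sketch of K-means with K = 2 on two made-up blobs of points.
import numpy as np

rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0, 1, (20, 2)),     # one blob of points
                    rng.normal(5, 1, (20, 2))])    # a second blob

centroids = points[rng.choice(len(points), 2, replace=False)]  # random start
for _ in range(100):
    # Assign each point to the nearest centroid (the red/blue labeling).
    dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
    labels = dists.argmin(axis=1)
    # Move each centroid to the mean location of its assigned points.
    new_centroids = np.array([points[labels == k].mean(axis=0) for k in (0, 1)])
    if np.allclose(new_centroids, centroids):      # means stopped moving
        break
    centroids = new_centroids

print("Final cluster centroids:\n", centroids.round(2))
```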
We have talked about clustering via supervised and unsupervised learning methods. Besides helping in
document retrieval, there are other fields where clustering can be useful.
For example, if you have a website with hundreds of thousands of images, you can group them into clusters of similar content, i.e. groups for ocean pictures, animal pictures, nature pictures, etc. That way, search results on the website can list similar pictures from the same group.
Another example where clustering can be useful is monitoring the seizure activity of patients. By monitoring the seizures of different patients, we can group patients with similar seizure recordings and help treat them more effectively.
Deep Learning
When an image like the one shown below is given to us (humans), we can quickly identify its contents and say that it's a dog in the picture or, more precisely, a golden retriever.
Figure as seen by a human eye vs. image data as 0's and 1's
Figure 35 and 36
To us, identifying an image seems a trivial task; it's not the same for machines, even though they can calculate much faster than humans. To a machine, this picture is just 0's and 1's. A traditional program is a set of instructions in a machine language that tell the computer exactly what to do, and computers running such programs simply follow the code; this does not help a machine learn. When we teach a child that the image above is a dog, the child learns it quickly: the next time an image of a dog is shown, the child recalls what it learned and identifies it as a dog. It's not that simple for machines.
We humans are gifted with a supercomputer-like brain that has evolved over millions of years. Our brains are composed of billions of well-connected neurons that can hold information. Neurons communicate with other neurons by sending electrical or chemical signals via synapses.
Figure 37
Neurons form the core of the brain and spinal cord and give us consciousness.
By studying how our brains work, programmers have come up with a similar way of programming called neural networks. Just like neurons, we have nodes, and these nodes trigger other nodes, and so on, until we reach the desired output.
Figure 38
Input nodes trigger other nodes. Each node influences other nodes to different degrees, represented by weights (the thickness of the lines). When the output nodes don't produce the desired output, these weights are adjusted until the desired output is obtained. Figure 38 depicts a simple single-layer neural network.
Each node triggers one or more nodes, and a given node can accept or reject an input. Let's say we have a simple single-layer neural network, as shown below.
Figure 39
We use colors to represent the value a node takes based on the input it receives. Node A (green) and Node B (orange) trigger Node C with weights W0 and W1 respectively. Assume weight W1 has more impact than weight W0; then Node C takes on a value more similar to Node B's and hence becomes a lighter orange. Node C then triggers output Node D, which comes out purple. However, purple is not the desired output; it should be blue. By comparing the obtained output with the desired output, a simple subtraction tells us that the error value was light blue.
Figure 40
After calculating the output error, the weights are adjusted to compensate, and the output is calculated again. This time output Node D should be blue, but we notice it's not exactly the blue we predicted; it is light purple. Again the difference between the desired and obtained output is calculated, and the weights are adjusted to compensate for the error. This continues until output D is the blue we desire. In neural networks, this process doesn't finish in just one or two iterations; it is normally repeated millions of times until the desired output is obtained.
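A minimal sketch of this adjust-and-retry loop, not the article's color example: a single neuron learning the OR function by repeatedly nudging its weights against the output error, assuming NumPy.

```python
# A minimal sketch: one sigmoid neuron trained by iteratively adjusting
# weights to shrink the difference between obtained and desired output.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs A, B
y = np.array([0.0, 1.0, 1.0, 1.0])                           # desired output (OR)

w = rng.normal(0, 0.1, size=2)   # weights W0, W1
b = 0.0                          # bias term

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(5000):         # many iterations, as described above
    out = sigmoid(X @ w + b)     # node output for every input pair
    error = out - y              # difference from the desired output
    # Nudge each weight against the error (gradient descent on squared error).
    grad = error * out * (1 - out)
    w -= 0.5 * X.T @ grad
    b -= 0.5 * grad.sum()

print(sigmoid(X @ w + b).round(2))   # close to the desired [0, 1, 1, 1]
```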
As in our earlier machine learning models, the available datasets are split into a training set used to build the neural network, with the remaining data used to evaluate it. Here, we need huge amounts of data to get accurate results, and since even simple outputs require millions of iterations, fast processing is imperative. While neural networks have existed for more than 50 years, interest in them has surged over the past 10 years thanks to better computing and GPU development.
When machines need to recognize simple patterns, tools like regression should be enough. But as task complexity increases (more and more parameters), neural networks tend to be more efficient. As the number of layers between the input and output nodes increases, the neural network becomes a deep neural network, and we get deep learning.
Image recognition
Earlier we talked about recognizing an image; let’s look more into how deep learning helps in
recognizing an image. Consider the image shown in Figure 42.
Figure 42
In traditional programming, if we could write an eye detector, a nose detector, and a mouth detector, then running these detectors on an image would help identify whether it shows a face. In reality, however, there are no such nose, eye, or mouth detectors used to detect a face.
To solve this problem of identifying an image, we use feature extractor code. Many hand-designed feature detectors have been written to find specific interest points; they capture certain statistical properties of the image. One such hand-designed feature detector is SIFT (Scale-Invariant Feature Transform).
Figure 43
A given image is run through the SIFT algorithm, which looks for key points in the image where SIFT features are found. A vector is created to represent where these key points/SIFT features were detected; this can be viewed much like a bag-of-words vector. With the SIFT features identified, we can run the vector through a simple classifier to detect whether the image is a face.
To explain the SIFT algorithm pictorially, let's examine a set of images that we need to find in a test image. We are given the following:
Here is the picture we have to look in to identify the above images:
After running the images through SIFT feature detection, the result is:
Figures 44–49
The big rectangles are the detected images, and the small squares are the features detected within each image, also showing the orientation in which each feature was detected.
After detecting SIFT features, we can track images, detect images, and identify objects whenever needed.
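A minimal sketch of extracting SIFT keypoints with OpenCV, assuming the opencv-python package (version 4.4 or later, where SIFT is included) and a hypothetical local image file face.jpg.

```python
# A minimal sketch: detect SIFT keypoints and draw them with their scale
# and orientation (assumes opencv-python >= 4.4 and a file "face.jpg").
import cv2

img = cv2.imread("face.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
print(f"{len(keypoints)} keypoints, descriptor shape: {descriptors.shape}")

# Draw each keypoint with its size and orientation, as in the figures above.
out = cv2.drawKeypoints(img, keypoints, None,
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite("face_sift.png", out)
```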
Along with SIFT, there are SPIN image, Textons, RIFT, GLOH, and HOG feature detectors that are used in various other areas of computer vision and deep learning.
Image recognition via deep learning can also be incorporated into other fields, i.e. online shopping. When selecting a dress for purchase, we are often shown images of similar-looking dresses that might also be of interest. Big online retailers that don't sell anything tangible make profits largely owing to how they present, organize, and display data. Another area where image recognition helps is in recognizing hand-written script and converting it to digital form; many smartphones can interpret what is written with a stylus and convert it to digitized text.
Using deep neural networks, there are apps that can read signboards in a foreign language: they convert images containing text in a language we cannot read or speak into a language we are more comfortable with.
Conclusion
We have just looked at some of the available machine learning algorithms. There are many other algorithms making an impact by letting machines do tasks that cannot be done through traditional coding. The machine learning algorithms explained here can make predictions that basic analytics cannot, and they work more efficiently as the amount of data fed to them increases. With a validation system in place, machine learning code can learn from the errors it makes by comparing its output with the expected output and readjusting itself until the outputs are closer to expectations. As data grows geometrically, we will have to deal with large amounts of unlabeled data. When we upload an image online, it is classified by machine learning code that parses the image and identifies what it could be showing by comparing it with what was learned from already available images. For example, there are hundreds of pictures of birds available online, and when we upload a new image of a bird, the search engine can identify similar-looking images and recognize that the new image shows a bird (deep learning).
Grouping data into similar clusters helps improve search engine results. If we search for a keyword in any popular search engine today, we not only get the output we asked for but also anything similar to it in other fields. For example, a search for the keyword 'storage' results in the search engine showing not only computer storage drives but also other storage-related results the user might be looking for, i.e. furniture storage, container storage, food storage, etc. We never specified what we were looking for beyond the keyword 'storage'; what helped us here were machine learning techniques like clustering and classification.
Imagine a future where doctors are unavailable for some reason to diagnose our conditions; instead, our phones or other handheld portable devices use our symptoms to help identify whether we are suffering from a medical condition. The idea of a machine telling us our medical condition sounds a bit unrealistic now, and that is a natural reaction, but with more and more data and better machine learning models, prediction errors are being reduced drastically. Competitions like Kaggle's second annual data science bowl for 2016, "Transforming How We Diagnose Heart Disease", are a shining example of what is helping us achieve machine-based diagnosis.
It's safe to say that, thanks to machine learning, we have a better future ahead.
Glossary
AI - Artificial intelligence
FN - False Negative
FP - False Positive
IDF - Inverse Document Frequency
ML - Machine Learning
RSS - Residual Sum of Squares
TF - Term Frequency
References
Machine Learning a practical introduction
http://www.infoworld.com/article/3010401/big-data/machine-learning-a-practical-introduction.html
Practical Machine Learning Problems
http://machinelearningmastery.com/practical-machine-learning-problems/
Machine Learning
http://www.sas.com/en_us/insights/analytics/machine-learning.html
SIFT
http://aishack.in/tutorials/sift-scale-invariant-feature-transform-introduction/
From feature descriptors to deep learning: 20 years of computer vision
http://www.computervisionblog.com/2015/01/from-feature-descriptors-to-deep.html
Deep Learning
https://en.wikipedia.org/wiki/Deep_learning
Advancing Business with Advanced Analytics
https://www.gartner.com/doc/3090420?ref=SiteSearch&sthkw=machine%20learning&fnl=search&srcId=1-3478922254
Deep Learning
http://neuralnetworksanddeeplearning.com/chap6.html
Machine learning foundation
https://www.coursera.org
Kaggle's second annual data science bowl
https://www.kaggle.com/c/second-annual-data-science-bowl
Dell EMC believes the information in this publication is accurate as of its publication date. The
information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” DELL EMC MAKES NO
REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS
PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS
FOR A PARTICULAR PURPOSE.
Use, copying and distribution of any Dell EMC software described in this publication requires an
applicable software license.
Dell, EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries.