Machine Learning (Domingos' paper)


Questions from paper

"A Few Useful Things to Know about Machine Learning"Reference: http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf

By: Akhilesh Joshi

mail: [email protected]


1. What is the definition of ML?

Machine Learning is the art of using existing data (historical and present) to forecast or predict solutions by applying statistical models with little or no manual intervention. Although machine learning techniques are still developing, it is one of the most important concepts in data science, with many applications that will be helpful to mankind.

2. What is a classifier?

A classifier is a system that takes inputs (which may be discrete or continuous) and produces an output. The data we provide to the classifier is called training data. The aim of the classifier is to learn from the training data an output rule that will also classify our test data correctly.

3. What are the 3 components of a learning system, according to the author? Explain them briefly.

The author describes 3 components of a learning system. They are as follows.

a. Representation
Representation is a very important aspect of applying ML to our data: we must choose how the classifier (and hence the hypotheses it can express) is represented so that it fits the data well. For example, a decision tree might suit one dataset perfectly, whereas a neural network is better suited to another.

b. Evaluation
Evaluation helps us distinguish good classifiers from bad ones. Good classifiers are those whose hypotheses are best suited to our test data. For example, for student data we might need a "likelihood" evaluation measure for predicting whether a student gets a job, rather than "precision and recall"; the evaluation step helps us make that determination.

c. Optimization
Out of all the candidate hypotheses, we have to search for the one that scores highest under the evaluation function, i.e. the hypothesis that gives the optimal solution for our data. That best-suited hypothesis is the one we use to arrive at the final solution.

4. What is information gain?

Given a set of attributes, we choose the attribute with the maximum information gain. For each candidate attribute we compute the weighted average entropy of the subsets created by splitting on it and compare that with the entropy of the original set. This is how a decision tree is built: the attribute with the highest information gain goes at the root node, and we then keep subdividing the tree nodes by comparing the information gains of the remaining attributes. The order of splits in a decision tree is therefore in decreasing order of information gain.


Formula for information gain:

IG(A) = H(S) - Σ_v (|S_v| / |S|) · H(S_v)

IG(A): information gain of attribute A
H(S): entropy of the full set of examples S
H(S_v): entropy of the subset S_v obtained after partitioning S on value v of A
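As a rough illustration, here is a minimal Python sketch of this calculation (the toy dataset and the "outlook" attribute are invented for the example, not taken from the paper):

    import math
    from collections import Counter

    def entropy(labels):
        # H(S): entropy of a list of class labels
        total = len(labels)
        return -sum((c / total) * math.log2(c / total)
                    for c in Counter(labels).values())

    def information_gain(examples, attribute, label_key="label"):
        # IG(A) = H(S) - sum_v (|S_v|/|S|) * H(S_v)
        labels = [ex[label_key] for ex in examples]
        base = entropy(labels)
        remainder = 0.0
        for v in set(ex[attribute] for ex in examples):
            subset = [ex[label_key] for ex in examples if ex[attribute] == v]
            remainder += (len(subset) / len(examples)) * entropy(subset)
        return base - remainder

    # Toy data: how much does "outlook" tell us about the label?
    data = [
        {"outlook": "sunny", "label": "no"},
        {"outlook": "sunny", "label": "no"},
        {"outlook": "rain",  "label": "yes"},
        {"outlook": "rain",  "label": "yes"},
        {"outlook": "rain",  "label": "no"},
    ]
    print(information_gain(data, "outlook"))  # about 0.42 bits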

5. Why is generalization more important than just getting a good result on training data i.e. the data that was used to train the classifier?

The training data only gives us an insight into what the data we have already seen looks like, so training a machine learning algorithm on that particular set does not guarantee that it will work correctly on the test data. The test data may be quite different from the training data, and the output may then not be as desired. We therefore need the algorithm to work well on both the training data and unseen data; hence the concept of generalization.

6. What is cross-validation? What are its advantages?

Given training data S and a hypothesis class H (containing all possible hypotheses), we have to find h, the hypothesis that is correct for our data. To find h while making maximum use of the data, we use cross-validation: we randomly split the training data into folds, hold out each fold in turn as test data while training on the rest, and average the results.

Advantages of cross-validation

Every example is used both for training and for testing (in different folds), which gives a clearer picture of how the algorithm will behave on the kind of data it will actually see.

We can set aside part of the training data to act as test data, and use it to check whether the algorithm produces the desired results.

Since a portion of the data is already set aside as test data, we do not have to worry about obtaining a separate test set.

Illustration of cross-validation: (original figure not reproduced in this transcript)
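In its place, a minimal sketch of k-fold cross-validation in Python (the synthetic data, the choice of 5 folds, and the nearest-centroid base learner are illustrative assumptions, not from the paper):

    import numpy as np

    def k_fold_cv(X, y, k=5, seed=0):
        # Estimate accuracy by averaging over k train/test splits.
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(X)), k)
        scores = []
        for i in range(k):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            # "Learner": nearest class centroid, fit on the training folds only.
            centroids = {c: X[train_idx][y[train_idx] == c].mean(axis=0)
                         for c in np.unique(y[train_idx])}
            preds = np.array([min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))
                              for x in X[test_idx]])
            scores.append(np.mean(preds == y[test_idx]))
        return np.mean(scores)

    # Synthetic two-class data, purely for illustration.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)
    print(k_fold_cv(X, y, k=5))  # cross-validated accuracy estimate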


7. How is generalization different from other optimization problems?

Ordinary optimization problems work with data that is already known, whereas in generalization we only have access to the training data and must use the errors and findings on it to infer, or at least try to infer, something about the test data. Because optimization deals with situations where most things are already known, we can expect the desired outputs; that is not the case with generalization, where we cannot directly measure the quantity we really care about (test error) and must optimize a proxy (training error) instead.

8. If you have a scenario where the function involves 10 Boolean variables, how many possible examples (called instance space) can there be? If you see 100 examples, what percentage of the instance space have you seen?

The number of possible instances is 2^N, where N is the number of Boolean variables. So in our case the instance space contains 2^10 = 1024 instances. If we see only 100 examples, we have seen 100/1024, or about 9.77%, of the instance space.
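A two-line check of the arithmetic (plain Python):

    n = 10
    instance_space = 2 ** n              # 1024 possible examples
    print(100 * 100 / instance_space)    # 9.765625 -> ~9.77% of the space seen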

9. What is the "no free lunch" theorem in machine learning? You can do a Google search if the paper isn't clear enough.

The "no free lunch" theorems say that no learning algorithm is inherently superior to any other. If an algorithm performs well on one class of problems, it must perform correspondingly worse on another class; its performance is compensated. Averaged over all possible problems, the difference in expected error between any two algorithms is zero.

10. What general assumptions allow us to carry out the machine learning process? What is the meaning of induction?

Induction is a knowledge lever: it uses a small amount of input knowledge to produce a large amount of output knowledge. The general assumptions that allow this are quite mild, such as smoothness, similar examples having similar classes, limited dependences, or limited complexity, and they are usually enough to carry out the machine learning process.

11. How is learning like farming?

Farming is largely a dependent activity: it relies on nature. Farmers combine seeds with nutrients and let nature grow the crops. In the same way, to grow programs (like crops), a learner combines knowledge with data and lets the learning process do most of the work.

12. What is overfitting? How does it lead to a wrong idea that you have done a really good job on training dataset?

Overfitting is when a model learns the training data too closely, including its noise and errors, rather than the underlying pattern. The model then looks excellent on the training set, which gives the false impression that we have done a really good job. But when the learned model is applied to new data, the results are not as expected and it may perform poorly, because overfitting damages the model's ability to generalize; it is highly unlikely that the test data will be exactly the same as the training data.
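A minimal sketch of overfitting in action, fitting polynomials of increasing degree to noisy data (the sine curve, noise level, and degrees are arbitrary choices for illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    def make_data(n):
        # Noisy samples of an underlying sine curve.
        x = rng.uniform(0, 1, n)
        return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

    x_train, y_train = make_data(20)
    x_test, y_test = make_data(200)

    for degree in (1, 3, 9):
        coeffs = np.polyfit(x_train, y_train, degree)
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        # A high-degree fit typically drives training error toward zero while
        # test error rises: the model has memorized the noise, not the curve.
        print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")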


13. What is meant by bias and variance? You don't have to be really precise in defining them, just get the idea.

Bias: the learner's tendency to consistently learn the same wrong thing, arising from the erroneous assumptions built into the learning algorithm. High bias → strong (many) assumptions; low bias → few assumptions.

Variance: the amount by which the model's estimates change when different training data is used, regardless of the true signal.
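A rough way to see the trade-off numerically: refit two model classes on many fresh training sets and measure how far their average prediction is from the truth (bias) and how much their predictions vary (variance). The sine target, sample sizes, and degrees below are invented for the example:

    import numpy as np

    rng = np.random.default_rng(0)
    true_f = lambda x: np.sin(2 * np.pi * x)
    x_eval = np.linspace(0.1, 0.9, 50)

    for degree in (1, 9):
        preds = []
        for _ in range(200):                 # 200 fresh training sets
            x = rng.uniform(0, 1, 25)
            y = true_f(x) + rng.normal(0, 0.3, 25)
            preds.append(np.polyval(np.polyfit(x, y, degree), x_eval))
        preds = np.array(preds)
        bias_sq = np.mean((preds.mean(axis=0) - true_f(x_eval)) ** 2)
        variance = np.mean(preds.var(axis=0))
        # Degree 1 (rigid): high bias, low variance. Degree 9 (flexible): the reverse.
        print(f"degree {degree}: bias^2 = {bias_sq:.3f}, variance = {variance:.3f}")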

14. What are some of the things that can help combat overfitting?

The following techniques can help combat overfitting:

- cross-validation
- adding a regularization term to the evaluation function (see the sketch below)
- performing a statistical significance test such as chi-square before adding new structure
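As a concrete instance of the regularization idea above, here is a minimal sketch of ridge (L2-penalized) least squares; the data and the penalty values are made up for illustration:

    import numpy as np

    def ridge_fit(X, y, lam):
        # Minimize ||Xw - y||^2 + lam * ||w||^2  (closed-form solution).
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 10))
    w_true = np.zeros(10)
    w_true[:2] = [2.0, -1.0]                 # only 2 informative features
    y = X @ w_true + rng.normal(0, 0.5, 30)

    for lam in (0.0, 1.0, 10.0):
        w = ridge_fit(X, y, lam)
        print(f"lambda = {lam:4.1f}  ||w|| = {np.linalg.norm(w):.2f}")
    # A larger penalty shrinks the weights, trading a little bias for less variance.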

15. Why do algorithms that work well in lower dimensions fail at higher dimensions? Think about the number of instances possible in higher dimensions and the cost of similarity calculation

As the number of dimensions increases, the amount of data required to train a model grows exponentially, because the instance space grows exponentially and a fixed-size training set covers a vanishing fraction of it; similarity calculations also become less meaningful and more expensive. Algorithms working in lower dimensions can therefore keep training and test behaviour in sync (generalize) far more easily than they can in higher dimensions. This phenomenon is known as the "curse of dimensionality".
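A minimal sketch of why similarity breaks down in high dimensions, using random points in the unit hypercube (the point counts and dimensions are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)

    for d in (2, 10, 100, 1000):
        points = rng.uniform(size=(500, d))
        query = rng.uniform(size=d)
        dists = np.linalg.norm(points - query, axis=1)
        # As d grows, the nearest and farthest neighbours become almost
        # equally far away, so "nearest" carries less and less information.
        print(f"d = {d:4d}  min/max distance ratio = {dists.min() / dists.max():.2f}")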

16. What is meant by "blessing of non-uniformity"?

This refers to the fact that observations from real-world domains are usually not distributed uniformly over the instance space, but are grouped or clustered in useful and meaningful ways, which makes learning easier than the worst case would suggest.

17. What has been one of the major developments in the recent decades about results of induction?

One of the major developments is that we can have guarantees on the results of induction, particularly if we’re willing to settle for probabilistic guarantees.

18. What is the most important factor that determines whether a machine learning project succeeds?

The most important factor is the features used. If we have many independent features that each correlate well with the class, learning is easy. On the other hand, if the class is a very complex function of the features, we may not be able to learn it.

19. In a ML project, which is more time consuming – feature engineering or the actual learning process? Explain how ML is an iterative process?

Feature engineering is the more time-consuming part of machine learning, since it involves many things such as gathering the data, cleaning it, and preprocessing it into features the learner can use.

ML is an iterative process because we have to carry out certain tasks repeatedly: running the learner, analyzing the results, modifying the data or the learner, and running it again.


20. What, according to the author, is one of the holy grails of ML?

According to the author, automating more and more of the feature engineering process is one of the holy grails of ML. One way to do this is to generate a large number of candidate features and select the best ones based on their information gain with respect to the class, but this approach has its limitations.

21. If your ML solution is not performing well, what are two things that you can do? Which one is a better option?

When an ML solution does not perform well, we have two main choices:

- design a better learning algorithm, or
- gather more data.

It is usually better to collect more data, because a dumb algorithm with lots and lots of data beats a clever algorithm with a modest amount of data.

22. What are the 3 limited resources in ML computations? What is the bottleneck today? What is one of the solutions?

The 3 limited resources in ML computations are:

- time
- memory
- training data

The bottleneck has changed from decade to decade, and today it is time: when there is a lot of data, it takes very long to process it and learn a complex classifier. The solution is to come up with faster ways to learn complex classifiers.

23. A surprising fact mentioned by the author is that all representations (types of learners) essentially "all do the same". Can you explain? Which learners should you try first?

All learners work by grouping nearby examples into the same class; the key difference is in the meaning of "nearby". With non-uniformly distributed data, learners can produce widely different frontiers while still making the same predictions in the regions that matter. It is better to try the simplest learners first. Complex learners are usually harder to use, because they have more knobs you need to turn to get good results, and because their internals are more opaque.

24. The author divides learners into two types based on their representation size. Write a brief summary.

According to the author there are two types of learners based on representation size.

1) Learners with fixed representation size

2) Learners whose representation size grows with data


Fixed-size learners can only take advantage of so much data. Variable-size learners can in principle learn any function given sufficient data, but in practice they may not, because of limitations of the algorithm or computational cost, or because of the curse of dimensionality. For these reasons, clever algorithms (those that make the most of the data and computing resources available) often pay off in the end.

25. Is it better to have variation of a single model or a combination of different models, known as ensemble or stacking? Explain briefly.

Researchers noticed that, if instead of selecting the best variation found we combine many variations, the results are often much better, at little extra effort for the user. In bagging, the simplest such technique, we generate random variations of the training set by resampling, learn a classifier on each, and combine the results by voting. This works because it greatly reduces variance while only slightly increasing bias.
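A minimal sketch of bagging with a deliberately simple base learner, a one-feature threshold stump; the synthetic data and the number of resamples are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)

    def fit_stump(X, y):
        # Base learner: threshold on feature 0, placed between the class means.
        t = (X[y == 0, 0].mean() + X[y == 1, 0].mean()) / 2
        flip = X[y == 1, 0].mean() < X[y == 0, 0].mean()
        return lambda Z: ((Z[:, 0] > t) ^ flip).astype(int)

    def bagging_predict(X, y, X_new, n_models=25):
        # Resample the training set, learn a stump on each copy, combine by voting.
        votes = np.zeros((n_models, len(X_new)), dtype=int)
        for m in range(n_models):
            idx = rng.integers(0, len(X), len(X))   # bootstrap resample
            votes[m] = fit_stump(X[idx], y[idx])(X_new)
        return (votes.mean(axis=0) > 0.5).astype(int)

    # Two overlapping Gaussian classes, purely for illustration.
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(1.5, 1, (100, 2))])
    y = np.array([0] * 100 + [1] * 100)
    X_test = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(1.5, 1, (200, 2))])
    y_test = np.array([0] * 200 + [1] * 200)
    print("bagged accuracy:", np.mean(bagging_predict(X, y, X_test) == y_test))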

26. Read the last paragraph and explain why it makes sense to prefer simpler algorithms and hypotheses.

When complexity is measured relative to the size of the hypothesis space, smaller spaces allow hypotheses to be represented by shorter codes. However, a learner with a larger hypothesis space that tries fewer hypotheses from it is less likely to overfit than one that tries more hypotheses from a smaller space. It still makes sense to prefer simpler algorithms and hypotheses, because the more assumptions an explanation requires, the more unlikely it is.

27. It has been established that correlation between independent variables and predicted variables does not imply causation, still correlation is used by many researchers. Explain briefly the reason.

In a prediction study, the goal is to develop a formula for making predictions about the dependent variable, based on the observed values of the independent variables. In a causal analysis, the independent variables are regarded as causes of the dependent variable. Many learning algorithms can potentially extract causal information from observational data, but their applicability is rather restricted. To find causation, you generally need experimental data, not observational data. Correlation is a necessary but not sufficient condition for causation. Correlation is a valuable type of scientific evidence in fields such as medicine, psychology, and sociology. But first correlations must be confirmed as real, and then every possible causative relationship must be systematically explored. In the end correlation can be used as powerful evidence for a cause-and-effect relationship between a treatment and benefit, a risk factor and a disease, or a social or economic factor and various outcomes.