AML 2264: MLA Documentation Review Research Paper Assignments Novel Presentation Assignment
Research paper presentation
-
Upload
akshat-sharma -
Category
Data & Analytics
-
view
56 -
download
0
Transcript of Research paper presentation
![Page 1: Research paper presentation](https://reader035.fdocuments.us/reader035/viewer/2022070515/587b02471a28ab93488b6acb/html5/thumbnails/1.jpg)
1
![Page 2: Research paper presentation](https://reader035.fdocuments.us/reader035/viewer/2022070515/587b02471a28ab93488b6acb/html5/thumbnails/2.jpg)
2
![Page 3: Research paper presentation](https://reader035.fdocuments.us/reader035/viewer/2022070515/587b02471a28ab93488b6acb/html5/thumbnails/3.jpg)
3
Understanding Decision Tree Algorithm using R Programming
Language
Presented By: Akshat Sharma#1, Anuj Srivastava#2
#Computer Science Department, Integral University Lucknow, Uttar Pradesh, India [email protected] [email protected]
![Page 4: Research paper presentation](https://reader035.fdocuments.us/reader035/viewer/2022070515/587b02471a28ab93488b6acb/html5/thumbnails/4.jpg)
4
AbstractDecision Tree is one of the most efficient technique to carry out
data mining, which can be easily implemented by using R, a powerful statistical tool which is used by more than 2 million statisticians and data scientists worldwide. Decision trees can be used in a variety of disciplines, such as for predicting which patient characteristics are associated with high risk of a disease. When used together, we can find the relevant set of data, from the large data stored in the Enterprise Data Warehouses (EDWs). This provides new opportunities for organizations to derive new value from their most valuable and abundantly available asset: information, to create a competitive edge.
![Page 5: Research paper presentation](https://reader035.fdocuments.us/reader035/viewer/2022070515/587b02471a28ab93488b6acb/html5/thumbnails/5.jpg)
5
Introduction• The first three phases of Data Analytics Lifecycle- discovery, data preparation, and
model planning, involve various aspects of data exploration. The success of a data analysis project requires a deep understanding of the data, it requires a tool for data mining and presenting the data. We can use decision tree as a tool for data mining and R for presenting the data.
• Data Mining is an analytic process designed to explore data (usually large amounts of data - also known as "big data") in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction. The most common type of data mining is Predictive data mining and it is the one that has the most direct business applications. [1]
![Page 6: Research paper presentation](https://reader035.fdocuments.us/reader035/viewer/2022070515/587b02471a28ab93488b6acb/html5/thumbnails/6.jpg)
6
• Fundamental learning methods that appear in applications related to data mining are-
1. Clustering2. Association Rules3. Regression4. Classification5. Time Series Analysis6. Text Analysis
![Page 7: Research paper presentation](https://reader035.fdocuments.us/reader035/viewer/2022070515/587b02471a28ab93488b6acb/html5/thumbnails/7.jpg)
7
Introduction to ‘R’• The ‘R’ language is a project designed to create a free, open source
language which can be used as a replacement for the ‘S’ language, ‘R’ was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team, of which Chambers is a member. ‘R’ first appeared in 1993; 23 years ago.• ‘R’ is named partly after the first names of the first two ‘R’ authors and
partly as a play on the name of ‘S’.• ‘R’ is an open source implementation of ‘S’, and differs from S-plus largely in
its command-line only format.• ‘S’ is a statistical programming language developed primarily by John
Chambers and Rick Becker and Allan Wilks of Bell Laboratories.
![Page 8: Research paper presentation](https://reader035.fdocuments.us/reader035/viewer/2022070515/587b02471a28ab93488b6acb/html5/thumbnails/8.jpg)
8
Figure 1. A schematic view of how ‘R’ works. [2]
![Page 9: Research paper presentation](https://reader035.fdocuments.us/reader035/viewer/2022070515/587b02471a28ab93488b6acb/html5/thumbnails/9.jpg)
9
Installing R• R is freely downloadable from http://cran.r-project.org/ , as is a wide
range of documentation. • Most users should download and install a binary version. This is a
version that has been translated (by compilers) into machine language for execution on a particular type of computer with a particular operating system. Installation on Microsoft Windows is straightforward. • A binary version is available for windows 98 or above from the web
page http://cran.r-project.org/bin/windows/base .
![Page 10: Research paper presentation](https://reader035.fdocuments.us/reader035/viewer/2022070515/587b02471a28ab93488b6acb/html5/thumbnails/10.jpg)
10
![Page 11: Research paper presentation](https://reader035.fdocuments.us/reader035/viewer/2022070515/587b02471a28ab93488b6acb/html5/thumbnails/11.jpg)
11
Packages in ‘R’• The most important single innovation in ‘R’ is the
package system, which provides a cross-platform system for distributing and testing code and data.• The Comprehensive R Archive Network (http://cran.r-
project.org) distributes public packages, but packages are also useful for internal distribution.• A package consists of a directory with a DESCRIPTION
file and subdirectories with code, data, documentation, etc. The Writing R Extensions manual documents the package system, and package.skeleton() simplifies package creation. [3]
![Page 12: Research paper presentation](https://reader035.fdocuments.us/reader035/viewer/2022070515/587b02471a28ab93488b6acb/html5/thumbnails/12.jpg)
12
Package Descriptionbase Base R functions
datasets Base R datasetsgrDevices Graphics devices for base and grid graphics
graphics R functions for base graphicsmethods Formally defined methods and classes for R
objects
stats R statistical functionsutils R utility functions
Table 1. ‘R’ packages, loaded on startup. [4]
The libraries shown in Table 1 are loaded on the ‘R’ startup.
![Page 13: Research paper presentation](https://reader035.fdocuments.us/reader035/viewer/2022070515/587b02471a28ab93488b6acb/html5/thumbnails/13.jpg)
13
Applications of ‘R’1. Clinical Trials2. Cluster Analysis3. Computational Physics4. Differential Equations5. Econometrics6. Environmental Studies7. Experimental Design8. Finance9. Genetics10. Graphical Models
11. Graphics and Visualizations12. High Performance Computing13. High Throughput Genomics14. Machine Learning15. Medical Imaging16. Meta Analysis17. Multivariate Statistics18. Natural Language Processing19. Official Statistics20. Optimization
![Page 14: Research paper presentation](https://reader035.fdocuments.us/reader035/viewer/2022070515/587b02471a28ab93488b6acb/html5/thumbnails/14.jpg)
14
Prominent Users of ‘R’
![Page 15: Research paper presentation](https://reader035.fdocuments.us/reader035/viewer/2022070515/587b02471a28ab93488b6acb/html5/thumbnails/15.jpg)
15
INTRODUCTION TO DECISION TREE
• A decision tree is also called a prediction tree. A decision tree uses a structure to specify sequences of decisions and consequences. Given input X={X1, X2,..,Xn}, the goal is to predict a response or output variable Y. [5] Each member of the set {X1,X2,…,Xn} is called an input variable.• A decision tree consists of nodes, and thus form a rooted tree, this
means that it is a directed tree with a node called root. There are no incoming edges on root node, all other nodes in a decision tree have exactly one incoming edge. An internal node is a node with an incoming edge and outgoing edges, internal node is also known as test node. Nodes with no outgoing edges are known as leaves or terminal nodes.
![Page 16: Research paper presentation](https://reader035.fdocuments.us/reader035/viewer/2022070515/587b02471a28ab93488b6acb/html5/thumbnails/16.jpg)
16
• Figure 2 describes a decision tree that reasons whether to send an alert or to send a warning when an error is triggered.
• In Figure 2, to depict the tree in more clear way, we can represent the internal nodes as circles, and leaves as triangles, to tell them apart in one glance. Each node is labeled with the attribute it tests, and its branches are labeled with its corresponding values. The leaf nodes represent class labels, i.e. the outcome of all the prior decisions.Figure 2
![Page 17: Research paper presentation](https://reader035.fdocuments.us/reader035/viewer/2022070515/587b02471a28ab93488b6acb/html5/thumbnails/17.jpg)
17
THE GENERAL DECISION TREE ALGORITHM
• The objective of a decision tree algorithm is to construct a tree ‘T’ from a training set ‘S’. If all the records in ‘S’ belong to some class ‘C’, or if ‘S’ is sufficiently pure, then that node is considered a leaf node and assigned the label ‘C’. The purity of a node is defined as its probability of the corresponding class. [6] • The algorithm constructs subtrees for the subsets of S recursively until
one of the following criteria is met: [7]1. All the leaf nodes in the tree satisfy the minimum purity threshold.2. The tree cannot be further split with the preset minimum purity
threshold.3. Any other stopping criterion is satisfied (such as the maximum depth of
the tree).
![Page 18: Research paper presentation](https://reader035.fdocuments.us/reader035/viewer/2022070515/587b02471a28ab93488b6acb/html5/thumbnails/18.jpg)
18
DECISION TREES IN ‘R’• In ‘R’, ‘rpart’ is for modelling decision trees, and an optional package
‘rpart.plot’ enables the plotting of a tree. [8] The rpart package can be used for classification by decision trees and can also be used to generate regression trees. • To grow a tree, use: [9] rpart (formula, data=, method=, control=)
Where,
![Page 19: Research paper presentation](https://reader035.fdocuments.us/reader035/viewer/2022070515/587b02471a28ab93488b6acb/html5/thumbnails/19.jpg)
19
• Entropy and information gain are common technical terms/notations used in decision trees. • Entropy is a measure of the number of random ways in which a
system may be arranged. For a data set ‘S’ containing ‘n’ records the information entropy is defined as,[10]• Let Hx is entropy of X defined as: [11]
(Here ‘Px’is the proportion of ‘S’)
![Page 20: Research paper presentation](https://reader035.fdocuments.us/reader035/viewer/2022070515/587b02471a28ab93488b6acb/html5/thumbnails/20.jpg)
20
Conclusion• ‘R’ is the best at what it does- letting experts quickly and easily
interpret, interact with, and visualize data.• The use of ‘R’ for implementing decision trees for classification in data
mining is highly beneficial over SAS and Matlab because:i. ‘R’ is open source, ii. ‘R’ has a large library support, iii. ‘R’ supports visualization,iv. all of this, while being inexpensive in comparison to both. • Thanks to the worldwide community effort, ‘R’ keeps growing with
thousands of user-created packages built and added to enhance ‘R’ functionality.
![Page 21: Research paper presentation](https://reader035.fdocuments.us/reader035/viewer/2022070515/587b02471a28ab93488b6acb/html5/thumbnails/21.jpg)
21
References[1] Data Mining Techniques. [Online]. Available:•http://documents.software.dell.com/Statistics/Textbook/Data-Mining-Techniques
[Accessed 21 November 2015].
[2] Emmanuel Paradis, “R for Beginners”, CRAN. Figure 1, pp. 8. [Online]. Available:•https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf
[Accessed 22 August 2015].
[3] Thomas Lumley, “R Fundamentals andProgramming Techniques,” in R Packages, January 2015, pg. 416. [Online].
[4] Adedin Culhane, Harvard School of Public Health, BIO503, January 2013, “Introduction to Programming and Statistical Modelling in R”. [Online]. Available:
http://isites.harvard.edu/fs/docs/icb.topic1202070.files/Bio503_winter13.pdf [Accessed 22 August 2015].
[5] EMC Education Services, “Data Science and Big Data Analytics,” in Advanced Analytical Theory and Methods: Classification, January 2015, pp. 192.
[6] EMC Education Services, “Data Science and Big Data Analytics,” in Advanced Analytical Theory and Methods: Classification, January 2015, pg. 197.
[7] EMC Education Services, “Data Science and Big Data Analytics,” in Advanced Analytical Theory and Methods: Classification, January 2015, pg. 197.
[8] EMC Education Services, “Data Science and Big Data Analytics,” in Advanced Analytical Theory and Methods: Classification, January 2015, pg. 206.
![Page 22: Research paper presentation](https://reader035.fdocuments.us/reader035/viewer/2022070515/587b02471a28ab93488b6acb/html5/thumbnails/22.jpg)
22
[9] Data Mining Algorithms In R / Classification / Decision Trees. [Online]. Available:
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/Decision_Trees [Accessed 27 November 2015].
[10] Venkatadri.M, Lokanatha C. Reddy. (2010, Apr.-2010, Sept.). A comparative study on decision tree classification algorithms in data mining. International Journal of Computer Application in Engineering, Technology and Sciences (IJ-CA-ETS). [Online]. 1.1, pg. 1. Available:https://www.academia.edu/
[11] Anuj Srivastava, Dr. Vinodini Katiyar, Navjot Singh. (2015, June). AReview of Decision Tree Algorithm: Big Data Analytics. International Journal of Informative & Futuristic Research. ISSN (Online): 2347-1697, pg. 6.
![Page 23: Research paper presentation](https://reader035.fdocuments.us/reader035/viewer/2022070515/587b02471a28ab93488b6acb/html5/thumbnails/23.jpg)
23