Scalable Regression Tree Learning on Hadoop using OpenPlanet Wei Yin.

Transcript of Scalable Regression Tree Learning on Hadoop using OpenPlanet Wei Yin.

  • Slide 1

Scalable Regression Tree Learning on Hadoop using OpenPlanet
Wei Yin

  • Slide 2

Contributions
We implement OpenPlanet, an open-source implementation of the PLANET regression tree algorithm using the Hadoop MapReduce framework.
We tune and analyze the impact of two parameters, the HDFS block size and the threshold value between the ExpandNode and InMemoryWeka tasks of OpenPlanet, to improve on the default performance.

  • Slide 3

Motivation for large-scale Machine Learning
Models operate on large data sets
Large number of forecasting models
New data arrives constantly, with a real-time training requirement

  • Slide 4

Regression Tree
Classification algorithm maps features to a target variable (prediction)
Classifier uses a Binary Search Tree structure
Each non-leaf node is a binary classifier with a decision condition
The condition on one numeric or categorical feature sends a record left or right in the tree
Leaf nodes contain the regression function or a single prediction value
Intuitive to understand by domain users
Effect for each feature

  • Slide 5

Google's PLANET Algorithm
Use distributed worker nodes coordinated by a master node to build the regression tree
(Diagram labels: Master, Worker)
21-Sep-11, USC DR Technical Forum

  • Slide 6

OpenPlanet
Give an introduction about OpenPlanet
Introduce the differences between OpenPlanet and PLANET
Give specific re-implementation details
(Diagram labels: Controller, InitHistogram, ExpandNode, InMemoryWeka, Model File, Threshold Value (60000))

  • Slide 7

Controller
Controller {
    /* Read user-defined parameters, such as input file path, test data file, model output file, etc. */
    ReadParameters(arguments[]);
    /* Initialize 3 job sets: MRExpandSet, MRInMemWekaSet, CompletionSet, each of which contains the nodes that need the relevant processing */
    JobSetsInit(ExpandSet, InMemWekaSet, CompletionSet);
    /* Initialize the Model File instance containing a regression tree structure with the root node only */
    InitModelFile(modelfile);
    Do {
        /* If any set is not empty, continue the loop */
        /* Populate each set using modelfile */
        populateSets(modelfile, ExpandSet, InMemWekaSet, CompletionSet);
        if (ExpandSet != 0) {
            processing_nodes