HPCC Systems Engineering Summit Presentation - Collaborative Research with FAU and LexisNexis Deck 2


Transcript of HPCC Systems Engineering Summit Presentation - Collaborative Research with FAU and LexisNexis Deck 2

Page 1: HPCC Systems Engineering Summit Presentation - Collaborative Research with FAU and LexisNexis Deck 2

Developing Machine Learning Algorithms on HPCC/ECL Platform

Industry/University Cooperative Research

LexisNexis/Florida Atlantic University

Victor Herrera – FAU – Fall 2014

Page 2

Developing ML Algorithms on the HPCC/ECL Platform

LexisNexis/Florida Atlantic University Cooperative Research

HPCC ECL-ML Library:
• High-level, data-centric declarative language
• Scalability
• Dictionary approach
• Open source

LexisNexis HPCC Platform:
• Big Data management
• Optimized code
• Parallel processing

Machine Learning Algorithms


Page 3

Project Team

FAU
Principal Investigator: Taghi Khoshgoftaar (Data Mining and Machine Learning)
Researchers: Victor Herrera, Maryam Najafabadi

LexisNexis
Flavio Villanustre, VP Technology & Product
Edin Muharemagic, Data Scientist and Architect


Page 4


Project Scope

Machine Learning: a learning algorithm takes training data (labeled examples such as constructor, businessman, doctor) and produces a classification model; the model then assigns a class (e.g., "constructor") to new instances.

Implement Machine Learning Algorithms in the HPCC ECL-ML library, focusing on:

• Supervised Learning for Classification
• Big Data
• Parallel Computing

Page 5

ECL-ML Learning Library: Project Starting Point (April 2012)

Supervised Learning
• Classifiers: Naïve Bayes, Perceptron, Logistic Regression
• Regression: Linear Regression

Unsupervised Learning
• Clustering: KMeans
• Association Analysis: AprioriN, EclatN


Page 6

ECL-ML Open Source Blueprint

Community: uses and contributes to the ECL-ML Library, supported by HPCC-ECL training, HPCC-ECL documentation, and forums.

ML-ECL Library Contributor: commits code and opens pull requests, processes review comments, and writes machine learning documentation.

ML-ECL Library Administrator: reviews code, comments, and merges.


Page 7


Experience Progress

Algorithm Complexity: Naïve Bayes → Decision Trees → KNN, Random Forest

Language Knowledge: transformations, aggregations, data distribution, parallel optimization

Implementation Complexity: understand existing code → fix existing modules → add functionality → design modules

Data: categorical (discrete) → continuous → mixed (future work)

Page 8

Example: Classifying Iris-Setosa Decision Tree vs. Random Forest


Page 9

// Learning Phase
minNumObj := 2;
maxLevel  := 5;   // DecTree parameters
learner1  := ML.Classify.DecisionTree.C45Binary(minNumObj, maxLevel);
tmodel    := learner1.LearnC(indepData, depData);   // DecTree model

// Classification Phase
results1  := learner1.ClassifyC(indepData, tmodel);

// Comparing to Original
compare1  := Classify.Compare(depData, results1);
OUTPUT(compare1.CrossAssignments, ALL);   // CrossAssignment results

// Learning Phase
treeNum  := 25;
fsNum    := 3;
Pure     := 1.0;
maxLevel := 5;   // RndForest parameters
learner2 := ML.Classify.RandomForest(treeNum, fsNum, Pure, maxLevel);
rfmodel  := learner2.LearnC(indepData, depData);   // RndForest model

// Classification Phase
results2 := learner2.ClassifyC(indepData, rfmodel);

// Comparing to Original
compare2 := Classify.Compare(depData, results2);
OUTPUT(compare2.CrossAssignments, ALL);   // CrossAssignment results

[Figure: learned decision tree for the Iris data, with internal nodes testing attributes A3 and A4 (thresholds 0.6, 1.5, 1.7, 1.8, and 4.7) and leaves assigning classes 0, 1, and 2.]

Example: Classifying Iris-Setosa Decision Tree vs. Random Forest


Page 10


Example: Classifying Iris-Setosa Decision Tree vs. Random Forest


Page 11


Supervised Learning

Naïve Bayes: probabilistic model with maximum-likelihood parameter estimation
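To illustrate the maximum-likelihood probabilistic model behind Naïve Bayes, here is a minimal conceptual sketch in Python (not the ECL-ML implementation; the toy data and helper names are invented for the example):

```python
from collections import Counter, defaultdict

def train_nb(X, y):
    """Maximum-likelihood Naive Bayes for categorical features:
    class priors and per-class, per-feature value counts."""
    priors = Counter(y)
    counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for xs, c in zip(X, y):
        for j, v in enumerate(xs):
            counts[(c, j)][v] += 1
    return priors, counts

def classify_nb(priors, counts, xs):
    """Pick the class maximizing P(c) * prod_j P(x_j | c),
    with add-one smoothing for unseen values."""
    n = sum(priors.values())
    best, best_p = None, -1.0
    for c, pc in priors.items():
        p = pc / n
        for j, v in enumerate(xs):
            cnt = counts[(c, j)]
            p *= (cnt[v] + 1) / (pc + len(cnt))
        if p > best_p:
            best, best_p = c, p
    return best

# toy training data: (weather, temperature) -> play
X = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
y = ["no", "no", "yes", "yes"]
model = train_nb(X, y)
print(classify_nb(*model, ("rain", "mild")))  # -> "yes"
```

Counting value frequencies per class is exactly the maximum-likelihood estimate of the conditional probabilities, which is what makes Naïve Bayes cheap to train on distributed data.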

Decision Tree: top-down induction by recursive data partitioning; produces a decision tree model

Case-Based (KNN): KD-tree partitioning and nearest-neighbor search; lazy algorithm with no explicit model
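A conceptual Python sketch of the lazy KNN idea (the ECL-ML version accelerates the neighbor search with a KD-tree; a plain linear scan is used here for clarity, and the toy data is invented):

```python
import math
from collections import Counter

def knn_classify(train_X, train_y, query, k=3):
    """Lazy KNN: no training phase. Rank training points by Euclidean
    distance to the query and take a majority vote among the k nearest."""
    dists = sorted((math.dist(x, query), c) for x, c in zip(train_X, train_y))
    votes = Counter(c for _, c in dists[:k])
    return votes.most_common(1)[0][0]

# toy 2-D data: class "a" near the origin, class "b" near (5, 5)
X = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
y = ["a", "a", "a", "b", "b", "b"]
print(knn_classify(X, y, (0.5, 0.5)))  # -> "a"
```

"No model" means all the work happens at query time, which is why a spatial index such as a KD-tree matters at Big Data scale.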

Ensemble (Random Forest): data sampling (tree bagging) and feature sampling (random subspaces); produces an ensemble tree model
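The bagging-plus-random-subspace idea can be sketched in a few lines of Python. This is a deliberately simplified illustration (one-level stumps instead of full trees, invented toy data), not the ECL-ML RandomForest implementation:

```python
import random
from collections import Counter

def best_stump(X, y, feat_idx):
    """Find the single (feature, threshold) split over the allowed features
    that minimises misclassifications, using midpoints between sorted values."""
    best, best_err = None, float("inf")
    for j in feat_idx:
        vals = sorted({x[j] for x in X})
        for a, b in zip(vals, vals[1:]):
            t = (a + b) / 2
            left = [c for x, c in zip(X, y) if x[j] <= t]
            right = [c for x, c in zip(X, y) if x[j] > t]
            ml = Counter(left).most_common(1)[0][0]
            mr = Counter(right).most_common(1)[0][0]
            err = sum(c != ml for c in left) + sum(c != mr for c in right)
            if err < best_err:
                best_err, best = err, (j, t, ml, mr)
    if best is None:  # degenerate bootstrap sample: predict its majority class
        maj = Counter(y).most_common(1)[0][0]
        best = (feat_idx[0], float("inf"), maj, maj)
    return best

def random_forest(X, y, tree_num=25, fs_num=1, seed=0):
    """Each 'tree' (a one-level stump here, for brevity) is trained on a
    bootstrap sample of the rows (bagging) and a random subset of the
    features (random subspace)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(tree_num):
        rows = [rng.randrange(n) for _ in range(n)]   # bootstrap rows
        feats = rng.sample(range(d), fs_num)          # random feature subset
        forest.append(best_stump([X[i] for i in rows],
                                 [y[i] for i in rows], feats))
    return forest

def classify(forest, x):
    """Majority vote over all stumps in the ensemble."""
    votes = Counter(ml if x[j] <= t else mr for j, t, ml, mr in forest)
    return votes.most_common(1)[0][0]

# toy 1-feature data echoing the slide's petal-width split for Iris-Setosa
X = [(0.2,), (0.4,), (0.3,), (1.5,), (1.8,), (2.1,)]
y = ["setosa", "setosa", "setosa", "other", "other", "other"]
forest = random_forest(X, y, tree_num=25, fs_num=1)
print(classify(forest, (0.3,)))
```

Because every tree sees an independent sample of rows and features, both sampling steps and the final vote parallelize naturally, which is what makes the ensemble a good fit for the HPCC platform.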

Common classifier outputs: classification, class probability, classification with missing values, area under ROC

Project Contribution

Page 12

Overall Summary

• Development of Machine Learning algorithms in the ECL language, focused on the implementation of classification algorithms, Big Data, and parallel processing.

• Added functionality to the Naïve Bayes classifier and implemented Decision Tree, KNN, and Random Forest classifiers in the ECL-ML library.

• Currently working on Big Data learning and evaluation:
  – Experiment design and implementation of a Big Data case study
  – Analysis of results
  – Classifier enhancements based on the analysis of results

Page 13

Thank you.
