Apache Spark Use case for Education Industry


Page 1: Apache Spark Use case for Education Industry

© 2016 IBM Corporation 1

Academic Alert System

Presenter: Vinayak Agrawal

[email protected]

Page 2: Apache Spark Use case for Education Industry

Agenda

Use Case

Use Case Architecture/Work Flow in Weka

Data Volume

Problem Statement

Our Analytical Platform

Spark Workflow

Result Comparison between Weka and Spark

Spark Challenges

Q&A

Page 3: Apache Spark Use case for Education Industry

Use Case: Academic Alert System

Academic institutions receive performance-based funding on parameters* such as:

Student Retention – Retention Rates

Student Graduating – Completion Rates

Academic institutions want to be proactive in providing academic feedback to students BEFORE they sit the final exam.

*Source: http://www.ncsl.org/research/education/performance-funding.aspx

Develop an ML model that can predict at-risk students (those who might fail) and provide this feedback to students and professors so that they can take appropriate action.

Page 4: Apache Spark Use case for Education Industry

Use Case: Academic Alert System in Weka

Page 5: Apache Spark Use case for Education Industry

Data Volume (in Prod)

Learning Management Systems

1) Student Activity data
   Total = ~350 million records
   Research = 15-18 million records

2) Student Gradebook data
   Total = ~1.5 million
   Research = 100,000 per semester

Student Information Systems

1) Demographics
   Research = 5500 students per semester x 3

2) Enrollment
   Research = 27000 per semester x 3

3) Course
   Research = ~2000 per semester x 3

Page 6: Apache Spark Use case for Education Industry

Problem Statement

Small universities have fewer students, so Weka might work.

Weka: it may be impossible to train models from large datasets using the Weka Explorer graphical user interface, even when the Java heap size has already been increased, because the Explorer always loads the entire dataset into the computer's main memory.

To scale out for larger universities: how do I process 45000 students with 20 features?

Page 7: Apache Spark Use case for Education Industry

Analytical Platform

Hardware:

3 Virtual Machines on IBM PureFlex

• 8 cores per VM

• 32 GB RAM, 100GB per VM

Software:

3 node Hadoop cluster

• Spark 1.5.2: Zeppelin, Python, Scala

• Oozie, Hive and Sqoop

Page 8: Apache Spark Use case for Education Industry

Spark Work Flow

[Workflow diagram. Labels: Data; Training; Test; Sampling; Imputation; Train_Data; Test_Data; Fit; Model; Transform; Predictions]

Page 9: Apache Spark Use case for Education Industry

What does our Data Look like?

Data Sources: Derived from ETL stage

19 features from Learning Management System & Student Demographics

Count:
Training: 9923
Testing: 5145

Page 10: Apache Spark Use case for Education Industry

Sampling

TRAINING DATA (1.0 = Student At Risk)

Before sampling:

Label   Count
0.0     9267
1.0     656

After sampling:

Label   Count
0.0     9267
1.0     9184

The training data was skewed, with only 656 at-risk students, so we duplicated the at-risk rows.
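The duplication step can be sketched in plain Python. The slides do not show how the rows were duplicated; this sketch assumes each at-risk row is repeated an integer number of times (majority count divided by minority count, rounded down), which is consistent with the counts on the slide since 9184 = 14 x 656. The function name and the row layout are hypothetical.

```python
from collections import Counter


def oversample_by_duplication(rows, label_key="label", minority=1.0):
    """Repeat every minority-class row so both classes are roughly balanced.

    Assumption: the repeat factor is the integer ratio of majority to
    minority rows; the slides do not state how the factor was chosen.
    """
    minority_rows = [r for r in rows if r[label_key] == minority]
    majority_rows = [r for r in rows if r[label_key] != minority]
    if not minority_rows:
        return list(rows)
    factor = max(1, len(majority_rows) // len(minority_rows))
    return majority_rows + minority_rows * factor


# Synthetic rows that only reproduce the class counts from the slide.
rows = [{"label": 0.0}] * 9267 + [{"label": 1.0}] * 656
balanced = oversample_by_duplication(rows)
counts = Counter(r["label"] for r in balanced)
print(counts[0.0], counts[1.0])
```

Only the training data is resampled here; the test data keeps its original class balance so that the reported metrics reflect the real population.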

Page 11: Apache Spark Use case for Education Industry

Imputation

Numerical columns: fill with the mean value
• Age
• SAT scores

Categorical columns: fill with the mode value
• Enrollment Status
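A minimal plain-Python sketch of mean and mode imputation follows. It is not the code from the talk: the column names AGE, SAT_SCORE and ENROLLMENT_STATUS, the helper names, and the choice to compute fill values on the training rows and reuse them for the test rows are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean


def fit_fill_values(rows, numeric_cols, categorical_cols):
    """Compute one fill value per column from the non-missing entries."""
    fill = {}
    for col in numeric_cols:
        fill[col] = mean(r[col] for r in rows if r[col] is not None)
    for col in categorical_cols:
        observed = Counter(r[col] for r in rows if r[col] is not None)
        fill[col] = observed.most_common(1)[0][0]  # mode of the column
    return fill


def apply_fill(rows, fill):
    """Return copies of the rows with None replaced by the fill value."""
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in rows
    ]


# Hypothetical sample rows; None marks a missing value.
train = [
    {"AGE": 20, "SAT_SCORE": 1200, "ENROLLMENT_STATUS": "FT"},
    {"AGE": None, "SAT_SCORE": 1000, "ENROLLMENT_STATUS": "FT"},
    {"AGE": 24, "SAT_SCORE": None, "ENROLLMENT_STATUS": None},
]
test = [{"AGE": None, "SAT_SCORE": None, "ENROLLMENT_STATUS": "PT"}]

fill = fit_fill_values(train, ["AGE", "SAT_SCORE"], ["ENROLLMENT_STATUS"])
train_filled = apply_fill(train, fill)
test_filled = apply_fill(test, fill)
print(fill)
```

Reusing the training fill values on the test rows keeps information from the test set out of the preprocessing step.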

Page 12: Apache Spark Use case for Education Industry

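Modelling Using Spark ML Package

Why?

[Diagram labels: DataFrame; String Indexer for Categorical Variables; Vector Assembler; Build the Pipeline; Model; Use Model]

4 Lines of Code

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

# The *_indexer stages, assembler, trainData and testData are assumed to be defined beforehand.
lr = LogisticRegression(maxIter=100, regParam=0.01)
pipeline_lr = Pipeline(stages=[ONLINE_FLAG_indexer, RC_GENDER_indexer,
                               RC_ENROLLMENT_STATUS_indexer, RC_CLASS_CODE_indexer,
                               STANDING_indexer, assembler, lr])
model_lr = pipeline_lr.fit(trainData)         # fit every stage on the training data
prediction_lr = model_lr.transform(testData)  # score the test data with the fitted pipeline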
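To make the two preprocessing stages in the pipeline concrete, here is a plain-Python sketch of what a string indexer and a vector assembler do conceptually. It does not call Spark. Spark's StringIndexer assigns index 0 to the most frequent label by default; the alphabetical tie-break, the helper names, the sample rows, and the column names (RC_GENDER and ONLINE_FLAG inferred from the indexer variable names, AGE invented) are assumptions.

```python
from collections import Counter


def fit_string_index(values):
    """Map each category to a float index, most frequent category first.

    Assumption: ties are broken alphabetically to keep the result stable.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {category: float(i) for i, category in enumerate(ordered)}


def assemble(row, index_maps, numeric_cols):
    """Concatenate indexed categorical values and numeric values into one list."""
    indexed = [index_maps[col][row[col]] for col in index_maps]
    return indexed + [float(row[col]) for col in numeric_cols]


# Hypothetical sample rows.
rows = [
    {"RC_GENDER": "F", "ONLINE_FLAG": "N", "AGE": 19},
    {"RC_GENDER": "M", "ONLINE_FLAG": "N", "AGE": 22},
    {"RC_GENDER": "F", "ONLINE_FLAG": "Y", "AGE": 21},
]

# "Fit": learn one category-to-index map per categorical column.
index_maps = {
    col: fit_string_index(r[col] for r in rows)
    for col in ("RC_GENDER", "ONLINE_FLAG")
}

# "Transform": turn each row into a single numeric feature vector.
features = [assemble(r, index_maps, ["AGE"]) for r in rows]
print(features)
```

The learned index maps play the role of the fitted indexer stages: they are built once from the training rows and then applied unchanged to any new rows.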

Page 13: Apache Spark Use case for Education Industry

Logistic Regression Results

Spark: Test Data count: 5145, 19 Features

            Predicted 0   Predicted 1
Actual 0    4065          720
Actual 1    51            309

309 Students at Risk
85.01 % Accuracy
85.83 % Recall
Time: 20 seconds

Weka: Test Data count: 5145, 19 Features

            Predicted 0   Predicted 1
Actual 0    4093          692
Actual 1    49            311

311 Students at Risk
85.6 % Accuracy
86.4 % Recall
Time: 49 seconds
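The accuracy and recall figures on this slide follow directly from the two confusion matrices, reading rows as actual labels and columns as predicted labels. The short check below recomputes them; the function and variable names are illustrative only.

```python
def accuracy_and_recall(tn, fp, fn, tp):
    """Return (accuracy, recall) for a binary confusion matrix.

    tn/fp are the "actual 0" row, fn/tp the "actual 1" row.
    """
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    recall = tp / (tp + fn)  # share of actual at-risk students that were flagged
    return accuracy, recall


# First matrix on the slide (reported as 85.01 % accuracy, 85.83 % recall).
first = accuracy_and_recall(tn=4065, fp=720, fn=51, tp=309)
# Second matrix on the slide (reported as 85.6 % accuracy, 86.4 % recall).
second = accuracy_and_recall(tn=4093, fp=692, fn=49, tp=311)

print(f"{first[0]:.2%} {first[1]:.2%}")
print(f"{second[0]:.1%} {second[1]:.1%}")
```

Recall is the more relevant figure for an alert system, since a missed at-risk student (a false negative) receives no feedback before the final exam.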

Page 14: Apache Spark Use case for Education Industry

Random Forest Comparison

Spark: Data count: 5145, 19 Features

            Predicted 0   Predicted 1
Actual 0    4065          720
Actual 1    51            309

309 Students at Risk
85.01 % Accuracy
85.83 % Recall
Time: 16 seconds

Weka: Data count: 5145, 19 Features

            Predicted 0   Predicted 1
Actual 0    4186          599
Actual 1    83            277

277 Students at Risk
86.7 % Accuracy
76.9 % Recall
Time: 30 seconds

Page 15: Apache Spark Use case for Education Industry

Naive Bayes Comparison

Spark: Data count: 5145, 19 Features

            Predicted 0   Predicted 1
Actual 0    4279          506
Actual 1    158           202

202 Students at Risk
87.1 % Accuracy
56.1 % Recall
Time: 9 seconds

Weka: Data count: 5145, 19 Features

            Predicted 0   Predicted 1
Actual 0    4093          692
Actual 1    67            293

293 Students at Risk
85.2 % Accuracy
81.4 % Recall
Time: 30 seconds

Page 16: Apache Spark Use case for Education Industry

Why is this Better?

[Workflow diagram. Labels: Data; Training; Test; Sampling; Imputation; Train_Data; Test_Data; Fit; Model; Transform; Predictions]

• Complete Work Flow in one Environment: Zeppelin on Spark

• Java/Scala or Python to choose from

• Distributed Computing

Page 17: Apache Spark Use case for Education Industry

Spark Challenges

No Python support to save and load pipeline model yet

• SPARK-6725, SPARK-13032

ML StringIndexer does not protect itself from column name duplication

• SPARK-12874

PySpark CrossValidatorModel does not support avgMetrics

• SPARK-12810

• You have to create an RDD and then extract the metrics

PMML Export not supported yet

• SPARK-11171

Page 18: Apache Spark Use case for Education Industry

Q&A

Page 19: Apache Spark Use case for Education Industry

LOGISTIC REGRESSION MODEL

Page 20: Apache Spark Use case for Education Industry

Random Forest Code

Page 21: Apache Spark Use case for Education Industry

Naïve Bayes Code

Page 22: Apache Spark Use case for Education Industry

Appendix

Page 23: Apache Spark Use case for Education Industry

IBM Open Platform for Apache Hadoop (IOP)

Includes Spark

100% Open Source

Implement with help from IBM Lab Services

Production Support Offering Available

Apache Open Source Components: HDFS, YARN, MapReduce, Ambari, HBase, Spark, Flume, Hive, Pig, Sqoop, HCatalog, Solr/Lucene

IBM Open Platform with Apache Hadoop

Page 24: Apache Spark Use case for Education Industry

Questions??