Apache Spark Use case for Education Industry

© 2016 IBM Corporation

Academic Alert System

Presenter: Vinayak Agrawal



Agenda

Use Case

Use Case Architecture/Work Flow in Weka

Data Volume

Problem Statement

Our Analytical Platform

Spark Workflow

Result Comparison between Weka and Spark

Spark Challenges

Q&A


Use Case: Academic Alert System

Academic institutions receive performance-based funding on parameters* such as:

Student Retention – Retention Rates

Student Graduating – Completion Rates

Academic institutions want to be proactive in providing academic feedback to students BEFORE they sit the final exam.

*Source: http://www.ncsl.org/research/education/performance-funding.aspx

Develop an ML model that can predict at-risk students (those who might fail) and provide this feedback to students and professors so that they can take appropriate action.


Use Case: Academic Alert System in Weka


Data Volume (in Prod)

Learning Management Systems

1) Student Activity data
   Total = ~350 million records
   Research = 15-18 million records

2) Student Gradebook data
   Total = ~1.5 million
   Research = 100,000 per semester

Student Information Systems

1) Demographics
   Research = 5500 students per semester x 3

2) Enrollment
   Research = 27000 per semester x 3

3) Course
   Research = ~2000 per semester x 3


Problem Statement

Small universities have fewer students, so Weka might work.

Weka: It may be impossible to train models from large datasets using the Weka Explorer graphical user interface, even when the Java heap size has already been increased, because the Explorer always loads the entire dataset into the computer's main memory.

To scale out for larger universities: How do I process 45000 students with 20 features?


Analytical Platform

Hardware:

3 Virtual Machines on IBM PureFlex

• 8 cores per VM

• 32 GB RAM, 100GB per VM

Software:

3 node Hadoop cluster

• Spark 1.5.2: Zeppelin, Python, Scala

• Oozie, Hive and Sqoop


Spark Work Flow

[Workflow diagram: Data is split into Training and Test.
Training path stages: Sampling, Train_Data, Imputation, Fit, Model.
Test path stages: Imputation, Test_Data, Transform, Predictions.]


What does our Data Look like?

Data Sources: Derived from ETL stage

19 features from Learning Management System & Student Demographics

Count:
Training: 9923
Testing: 5145


Sampling

TRAINING DATA (1.0 = Student At Risk)

Before sampling:

Label   Count
0.0     9267
1.0     656

After sampling:

Label   Count
0.0     9267
1.0     9184

The training data was skewed, with only 656 at-risk students, so we duplicated the at-risk rows.
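The counts on this slide are consistent with each at-risk row appearing 14 times (656 x 14 = 9184). The deck does not state the factor, so the value 14 below is inferred from the counts. The following is a minimal plain-Python sketch of that kind of duplication, not the presenter's Spark code; the function name and the row layout are hypothetical.

```python
from collections import Counter

def oversample_minority(rows, label_key, minority_label, factor):
    """Return a new list in which each minority-class row appears `factor` times."""
    out = []
    for row in rows:
        copies = factor if row[label_key] == minority_label else 1
        out.extend([row] * copies)
    return out

# Synthetic rows that only reproduce the slide's class counts (features omitted).
train = [{"label": 0.0}] * 9267 + [{"label": 1.0}] * 656

# factor=14 is an assumption inferred from 9184 / 656.
balanced = oversample_minority(train, "label", 1.0, factor=14)
counts = Counter(r["label"] for r in balanced)
print(counts[0.0], counts[1.0])
```

Plain duplication keeps every original row and adds no synthetic feature values. The sketch applies it to the training set only, which matches the slide, where the sampled counts are labelled TRAINING DATA.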


Imputation

Filling with mean value for numerical columns

Age

SAT scores

Filling with Mode value for Categorical columns

Enrollment Status
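As an illustration of the two fill rules above, here is a small plain-Python sketch (not the Spark code used in the talk). The column names, the sample values, and both helper functions are hypothetical. Computing the fill values once and reusing them is an assumption, since the deck does not say how the test set was imputed.

```python
from collections import Counter
from statistics import mean

def fit_fill_values(rows, numeric_cols, categorical_cols):
    """Compute one fill value per column: mean for numeric, mode for categorical."""
    fills = {}
    for col in numeric_cols:
        fills[col] = mean(r[col] for r in rows if r[col] is not None)
    for col in categorical_cols:
        present = [r[col] for r in rows if r[col] is not None]
        fills[col] = Counter(present).most_common(1)[0][0]
    return fills

def impute(rows, fills):
    """Return copies of the rows with None replaced by the column's fill value."""
    return [
        {col: (fills[col] if val is None and col in fills else val)
         for col, val in row.items()}
        for row in rows
    ]

# Hypothetical sample rows, for illustration only.
train = [
    {"age": 20, "sat": 1100, "enrollment_status": "FT"},
    {"age": None, "sat": 1300, "enrollment_status": "FT"},
    {"age": 24, "sat": None, "enrollment_status": None},
    {"age": 22, "sat": 1200, "enrollment_status": "PT"},
]
fills = fit_fill_values(train, ["age", "sat"], ["enrollment_status"])
train_filled = impute(train, fills)
```

One option is to pass the same `fills` dictionary to `impute` for the test rows, so that both sets are filled with identical values.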


Modelling Using Spark ML Package

Why?

DataFrame

Build the Pipeline

Model

String Indexer for Categorical Variables

Vector Assembler

Use Model

4 Lines of Code

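# The *_indexer stages and the assembler are defined elsewhere (not shown on the slide).
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(maxIter=100, regParam=0.01)
pipeline_lr = Pipeline(stages=[ONLINE_FLAG_indexer, RC_GENDER_indexer,
                               RC_ENROLLMENT_STATUS_indexer, RC_CLASS_CODE_indexer,
                               STANDING_indexer, assembler, lr])
model_lr = pipeline_lr.fit(trainData)              # fit all stages on the training data
prediction_lr = model_lr.transform(testData)       # apply the fitted stages to the test data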
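The pipeline relies on two preprocessing ideas: string indexing of categorical columns and assembling columns into one feature vector. Spark's StringIndexer assigns indices by label frequency, with the most frequent label getting index 0. The sketch below imitates both ideas in plain Python so it runs without a cluster; it is not Spark's implementation. The helper functions, the `AGE` column, and the sample values are hypothetical. The `RC_GENDER` and `ONLINE_FLAG` column names are guessed from the indexer variable names on the slide. The alphabetical tie-break is a choice made here for determinism.

```python
from collections import Counter

def fit_string_index(values):
    """Map each category to a float index, most frequent first (ties alphabetical here)."""
    freq = Counter(values)
    ordered = sorted(freq, key=lambda v: (-freq[v], v))
    return {v: float(i) for i, v in enumerate(ordered)}

def assemble(row, columns):
    """Concatenate the named columns of a row into one feature list."""
    return [float(row[c]) for c in columns]

# Hypothetical rows, for illustration only.
rows = [
    {"RC_GENDER": "F", "ONLINE_FLAG": "N", "AGE": 21},
    {"RC_GENDER": "M", "ONLINE_FLAG": "N", "AGE": 19},
    {"RC_GENDER": "F", "ONLINE_FLAG": "Y", "AGE": 23},
]
gender_index = fit_string_index(r["RC_GENDER"] for r in rows)
online_index = fit_string_index(r["ONLINE_FLAG"] for r in rows)

features = []
for r in rows:
    indexed = dict(r,
                   RC_GENDER_idx=gender_index[r["RC_GENDER"]],
                   ONLINE_FLAG_idx=online_index[r["ONLINE_FLAG"]])
    features.append(assemble(indexed, ["ONLINE_FLAG_idx", "RC_GENDER_idx", "AGE"]))
```

In the Spark pipeline these steps are stages, so the mappings learned during `fit` are reused unchanged when the fitted model transforms the test data.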


Logistic Regression Results

Spark (test data count: 5145, 19 features)

            Predicted 0   Predicted 1
Actual 0    4065          720
Actual 1    51            309

309 Students at Risk
85.01% Accuracy
85.83% Recall
Time: 20 seconds

Weka (test data count: 5145, 19 features)

            Predicted 0   Predicted 1
Actual 0    4093          692
Actual 1    49            311

85.6% Accuracy
86.4% Recall
Time: 49 seconds
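The accuracy and recall figures can be recomputed from the confusion matrices. The sketch below assumes rows are actual classes, columns are predicted classes, and class 1 is "at risk". It also assumes the first matrix on the slide belongs to Spark and the second to Weka, following the order of the labels. The function name is hypothetical.

```python
def accuracy_and_recall(matrix):
    """matrix[actual][predicted], class 1 = at risk; returns (accuracy, recall) as fractions."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total      # share of all students classified correctly
    recall = tp / (tp + fn)           # share of truly at-risk students that were flagged
    return accuracy, recall

spark_lr = [[4065, 720], [51, 309]]
weka_lr = [[4093, 692], [49, 311]]

for name, m in [("Spark", spark_lr), ("Weka", weka_lr)]:
    acc, rec = accuracy_and_recall(m)
    print(f"{name}: accuracy {acc:.2%}, recall {rec:.2%}")
```

Recall is the more informative of the two for this use case: an at-risk student the model misses receives no early feedback.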


Random Forest Comparison

Spark (data count: 5145, 19 features)

            Predicted 0   Predicted 1
Actual 0    4065          720
Actual 1    51            309

85.01% Accuracy
85.83% Recall
Time: 16 seconds

Weka (data count: 5145, 19 features)

            Predicted 0   Predicted 1
Actual 0    4186          599
Actual 1    83            277

86.7% Accuracy
76.9% Recall
Time: 30 seconds


Naive Bayes Comparison

Spark (data count: 5145, 19 features)

            Predicted 0   Predicted 1
Actual 0    4279          506
Actual 1    158           202

87.1% Accuracy
56.1% Recall
Time: 9 seconds

Weka (data count: 5145, 19 features)

            Predicted 0   Predicted 1
Actual 0    4093          692
Actual 1    67            293

85.2% Accuracy
81.4% Recall
Time: 30 seconds


Why is this Better?


• Complete Work Flow in one Environment

Zeppelin on Spark

• Java/Scala or Python to choose from

• Distributed Computing


Spark Challenges

No Python support to save and load pipeline model yet

• SPARK-6725, SPARK-13032

ML StringIndexer does not protect itself from column name duplication

• SPARK-12874

PySpark CrossValidatorModel does not support avgMetrics

• SPARK-12810

• You have to create an RDD and then extract the metrics

PMML Export not supported yet

• SPARK-11171


Q&A


LOGISTIC REGRESSION MODEL


Random Forest Code


Naïve Bayes Code


Appendix


IBM Open Platform for Apache Hadoop (IOP)

Includes Spark

100% Open Source

Implement with help from IBM Lab Services

Production Support Offering Available

Apache Open Source Components

HDFS

YARN

MapReduce

Ambari

HBase

Spark

Flume

Hive

Pig

Sqoop

HCatalog

Solr/Lucene

IBM Open Platform with Apache Hadoop


Questions??
