Apache Spark Use case for Education Industry
Uploaded by: vinayak-agrawal
Category: Data & Analytics
© 2016 IBM Corporation 2
Agenda
Use Case
Use Case Architecture/Work Flow in Weka
Data Volume
Problem Statement
Our Analytical Platform
Spark Workflow
Result Comparison between Weka and Spark
Spark Challenges
Q&A
Use Case: Academic Alert System
Academic institutions receive performance-based funding on parameters* such as:
Student Retention – Retention Rates
Student Graduating – Completion Rates
Academic institutions want to be proactive in providing academic feedback to students BEFORE they sit the final exam.
*Source: http://www.ncsl.org/research/education/performance-funding.aspx
Develop an ML model that can predict at-risk students (those who might fail) and provide this feedback to students and professors so that they can take appropriate action.
Use Case: Academic Alert System in Weka
Data Volume (in Prod)
Learning Management Systems
1) Student Activity data
Total = ~350 million records
Research = 15-18 million records
2) Student Gradebook data
Total = ~1.5 million
Research = 100,000 per semester
Student Information Systems
1) Demographics
Research = 5,500 students per semester x 3
2) Enrollment
Research = 27,000 per semester x 3
3) Course
Research = ~2,000 per semester x 3
Problem Statement
Small universities have fewer students, so Weka might work.
Weka: it may be impossible to train models from large datasets using the Weka Explorer graphical user interface, even when the Java heap size has already been increased, because the Explorer always loads the entire dataset into the computer's main memory.
To scale out for larger universities: how do I process 45,000 students with 20 features?
Analytical Platform
Hardware:
3 Virtual Machines on IBM PureFlex
• 8 cores per VM
• 32 GB RAM, 100 GB per VM
Software:
3-node Hadoop cluster
• Spark 1.5.2: Zeppelin, Python, Scala
• Oozie, Hive and Sqoop
Spark Work Flow
[Workflow diagram. Data is split into Training and Test. Training path: Sampling, Train_Data, Imputation, Fit, Model. Test path: Test_Data, Imputation, Transform with the fitted Model, Predictions.]
What does our data look like?
Data sources: derived from the ETL stage
19 features from the Learning Management System and Student Demographics
Count:
Training: 9923
Testing: 5145
Sampling
TRAINING DATA
Before sampling:
Label   Count
0.0     9267
1.0      656
After sampling:
Label   Count
0.0     9267
1.0     9184
1.0 = Student At Risk
The training data was skewed, with only 656 at-risk students, so we duplicated the at-risk rows until that class reached 9184 rows (14 x 656).
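The duplication step can be illustrated with a small plain-Python sketch. It is not the presenters' Spark code, which the slides do not show. The function name, the `label` key, and the synthetic rows are assumptions made for illustration; the factor of 14 is inferred from the counts on the slide.

```python
from collections import Counter


def oversample_minority(rows, label_key="label", minority=1.0, factor=14):
    """Return a new list in which every minority-class row appears `factor` times."""
    out = []
    for row in rows:
        copies = factor if row[label_key] == minority else 1
        out.extend([row] * copies)
    return out


# Synthetic stand-in for the training set: only the label column matters here.
train = [{"label": 0.0}] * 9267 + [{"label": 1.0}] * 656

balanced = oversample_minority(train)
counts = Counter(row["label"] for row in balanced)
print(counts[0.0], counts[1.0])
```

Duplicating rows balances the classes without discarding any majority-class data, at the cost of repeating the same 656 examples many times.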
Imputation
Fill missing values in numerical columns with the column mean
• Age
• SAT scores
Fill missing values in categorical columns with the column mode
• Enrollment Status
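A minimal plain-Python sketch of mean and mode imputation follows. It only illustrates the idea and is not the Spark code used in the talk. The column names `AGE`, `SAT`, and `ENROLLMENT_STATUS` and the sample rows are hypothetical.

```python
from statistics import mean, mode


def impute(rows, numeric_cols, categorical_cols):
    """Fill None values: column mean for numeric columns, column mode for categorical ones."""
    fill = {}
    for col in numeric_cols:
        fill[col] = mean([r[col] for r in rows if r[col] is not None])
    for col in categorical_cols:
        fill[col] = mode([r[col] for r in rows if r[col] is not None])
    # Build new rows so the input list is left unchanged.
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in rows
    ]


# Hypothetical sample rows; None marks a missing value.
rows = [
    {"AGE": 20, "SAT": 1100, "ENROLLMENT_STATUS": "FT"},
    {"AGE": None, "SAT": 1300, "ENROLLMENT_STATUS": "FT"},
    {"AGE": 24, "SAT": None, "ENROLLMENT_STATUS": None},
    {"AGE": 22, "SAT": 1200, "ENROLLMENT_STATUS": "PT"},
]

filled = impute(rows, ["AGE", "SAT"], ["ENROLLMENT_STATUS"])
print(filled[1]["AGE"], filled[2]["SAT"], filled[2]["ENROLLMENT_STATUS"])
```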
Modelling Using Spark ML Package
Why? DataFrame
Build the Pipeline Model
• String Indexer for Categorical Variables
• Vector Assembler
Use Model
4 Lines of Code
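# Imports needed by the four lines below
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
# The *_indexer stages, assembler, trainData and testData are defined earlier (not shown on the slide)
lr = LogisticRegression(maxIter=100, regParam=0.01)
pipeline_lr = Pipeline(stages=[ONLINE_FLAG_indexer, RC_GENDER_indexer,
    RC_ENROLLMENT_STATUS_indexer, RC_CLASS_CODE_indexer, STANDING_indexer, assembler,
    lr])
model_lr = pipeline_lr.fit(trainData)
prediction_lr = model_lr.transform(testData)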
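To show what the indexer and assembler stages contribute, here is a plain-Python imitation of the two transformations. It is a sketch, not Spark code. Spark's StringIndexer orders labels by frequency so that the most frequent label gets index 0; the alphabetical tie-break below is an assumption. The `AGE` column and the sample rows are hypothetical, while `RC_GENDER` is taken from the indexer name on the slide.

```python
from collections import Counter


def fit_string_index(values):
    """Map each category to a float index, most frequent category first."""
    counts = Counter(values)
    # Assumption: ties in frequency are broken alphabetically.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {v: float(i) for i, v in enumerate(ordered)}


def assemble(row, cols):
    """Collect the named numeric columns of one row into a single feature list."""
    return [float(row[c]) for c in cols]


# Hypothetical sample rows.
rows = [
    {"RC_GENDER": "F", "AGE": 20},
    {"RC_GENDER": "M", "AGE": 23},
    {"RC_GENDER": "F", "AGE": 21},
]

gender_index = fit_string_index(r["RC_GENDER"] for r in rows)
for r in rows:
    r["RC_GENDER_idx"] = gender_index[r["RC_GENDER"]]

features = [assemble(r, ["RC_GENDER_idx", "AGE"]) for r in rows]
print(gender_index)
print(features)
```

In the pipeline on the slide, each `*_indexer` stage plays the role of `fit_string_index` for one categorical column, and `assembler` plays the role of `assemble`, producing the single feature vector that the classifier consumes.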
Logistic Regression Results
Spark: Test Data count: 5145, 19 Features
            Predicted 0   Predicted 1
Actual 0        4065          720
Actual 1          51          309
309 Students at Risk
85.01 % Accuracy
85.83 % Recall
Time: 20 Seconds
Weka: Test Data count: 5145, 19 Features
            Predicted 0   Predicted 1
Actual 0        4093          692
Actual 1          49          311
311 Students at Risk
85.6 % Accuracy
86.4 % Recall
Time: 49 Seconds
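The accuracy and recall figures follow directly from the confusion matrices, reading rows as actual labels and columns as predicted labels. A short Python check, with a helper name chosen for illustration:

```python
def accuracy_and_recall(tn, fp, fn, tp):
    """Compute accuracy and recall (for the positive, at-risk class) from confusion-matrix cells."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    recall = tp / (tp + fn)
    return accuracy, recall


# Cells read from the logistic regression matrices: actual 0 row, then actual 1 row.
spark_acc, spark_rec = accuracy_and_recall(tn=4065, fp=720, fn=51, tp=309)
weka_acc, weka_rec = accuracy_and_recall(tn=4093, fp=692, fn=49, tp=311)

print(f"Spark: {spark_acc:.2%} accuracy, {spark_rec:.2%} recall")
print(f"Weka:  {weka_acc:.2%} accuracy, {weka_rec:.2%} recall")
```

Recall is the figure to watch for an alert system, since a missed at-risk student (a false negative) receives no feedback before the exam.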
Random Forest Comparison
Spark: Data count: 5145, 19 Features
            Predicted 0   Predicted 1
Actual 0        4065          720
Actual 1          51          309
309 Students at Risk
85.01 % Accuracy
85.83 % Recall
Time: 16 Seconds
Weka: Data count: 5145, 19 Features
            Predicted 0   Predicted 1
Actual 0        4186          599
Actual 1          83          277
277 Students at Risk
86.7 % Accuracy
76.9 % Recall
Time: 30 Seconds
Naive Bayes Comparison
Spark: Data count: 5145, 19 Features
            Predicted 0   Predicted 1
Actual 0        4279          506
Actual 1         158          202
202 Students at Risk
87.1 % Accuracy
56.1 % Recall
Time: 9 Seconds
Weka: Data count: 5145, 19 Features
            Predicted 0   Predicted 1
Actual 0        4093          692
Actual 1          67          293
293 Students at Risk
85.2 % Accuracy
81.4 % Recall
Time: 30 Seconds
Why is this Better?
[Workflow diagram. Data is split into Training and Test. Training path: Sampling, Train_Data, Imputation, Fit, Model. Test path: Test_Data, Imputation, Transform with the fitted Model, Predictions.]
• Complete Work Flow in one Environment
Zeppelin on Spark
• Java/Scala or Python to choose from
• Distributed Computing
Spark Challenges
No Python support to save and load pipeline model yet
• SPARK-6725, SPARK-13032
ML StringIndexer does not protect itself from column name duplication
• SPARK-12874
PySpark CrossValidatorModel does not support avgMetrics
• SPARK-12810
• You have to create an RDD and then extract the metrics
PMML Export not supported yet
• SPARK-11171
Q&A
LOGISTIC REGRESSION MODEL
Random Forest Code
Naïve Bayes Code
Appendix
IBM Open Platform for Apache Hadoop (IOP)
• Includes Spark
• 100% Open Source
• Implement with help from IBM Lab Services
• Production Support Offering Available
Apache Open Source Components: HDFS, YARN, MapReduce, Ambari, HBase, Spark, Flume, Hive, Pig, Sqoop, HCatalog, Solr/Lucene
Questions?