IBM Strategy for Spark

Introduction to Spark

Introductions

Garrett Young

1) Introduction to Spark (10 mins)

2) IBM's Commitment to Spark (5 mins)

3) How Predictive Analytic Lifecycles Typically Work (10 mins)

4) Using Spark to Predict Hospital Readmissions (15 mins)

5) How you can get a free-trial Spark environment from IBM (5 mins)

6) Q&A (15 mins)

What is Spark?

• In-memory data processing engine

• Open Source Apache Project

• Cluster Computing Framework

• Can use Scala, Python or R Languages

• Horizontally/Vertically Scalable

• Not a data store

IBM | SPARK – The Analytics Operating System

“Enabling New Classes of Intelligent Applications Embedded with Analytics”

• Spark unifies data, enabling real-time insights

• Spark processes and analyzes data from any data source

• Spark is complementary to Hadoop, but faster with in-memory performance

• Build models quickly. Iterate faster. Apply intelligence.

How Spark Works

• Traditional Approach: MapReduce jobs for complex jobs, interactive query, and online event-hub processing involves lots of (slow) disk I/O

[Diagram: Input → HDFS Read → Iteration 1 (CPU, Memory) → HDFS Write → HDFS Read → Iteration 2 (CPU, Memory) → HDFS Write → Result]

• Solution: Keep more data in-memory with a new distributed execution engine

[Diagram: Input → HDFS Read → Iteration 1 (CPU, Memory) → Iteration 2 (CPU, Memory). Memory is faster than network & disk. Zero read/write disk bottleneck.]

How Spark Works

Chain Job Output into New Job Input
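A rough way to see why chaining job output straight into the next job's input helps: the sketch below is plain Python, not Spark, and only counts file operations. The function names (add_one, run_with_disk, run_in_memory, compare) and the JSON file are hypothetical, chosen for illustration.

```python
import json
import os
import tempfile


def add_one(values):
    """Stand-in for one iteration of real work."""
    return [v + 1 for v in values]


def run_with_disk(path, iterations):
    """Each iteration reads its input from disk and writes its output back."""
    data, disk_ops = None, 0
    for _ in range(iterations):
        with open(path) as f:
            data = json.load(f)
        disk_ops += 1
        data = add_one(data)
        with open(path, "w") as f:
            json.dump(data, f)
        disk_ops += 1
    return data, disk_ops


def run_in_memory(path, iterations):
    """Read once, then chain each iteration's output into the next in memory."""
    with open(path) as f:
        data = json.load(f)
    disk_ops = 1
    for _ in range(iterations):
        data = add_one(data)
    return data, disk_ops


def compare(values, iterations):
    """Run both variants on the same input; return (disk_result, memory_result)."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "input.json")
        with open(path, "w") as f:
            json.dump(values, f)
        on_disk = run_with_disk(path, iterations)
        with open(path, "w") as f:  # reset the input; run_with_disk overwrote it
            json.dump(values, f)
        in_memory = run_in_memory(path, iterations)
    return on_disk, in_memory
```

With three iterations, compare([1, 2, 3], 3) returns the same data both ways, but the disk variant performs six file operations against one for the in-memory variant.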

General Spark Architecture Overview

• Driver Uses Spark Context to talk to the Cluster Manager

• Executors run their own JVM Processes

• Cluster manager distributes the workload based on information from the Worker

Key Reasons for the Interest in Spark

Performant
• In-memory architecture greatly reduces disk I/O
• Anywhere from 20-100x faster for common tasks

Productive
• Concise and expressive syntax, especially compared to prior approaches
• Single programming model across a range of use cases and steps in data lifecycle
• Integrated with common programming languages – Java, Python, Scala, R
• New tools continually reduce skill barrier for access (e.g. SQL for analysts)

Leverages existing investments
• Works well within existing Hadoop ecosystem

Improves with age
• Large and growing community of contributors continuously improve full analytics stack and extend capabilities

What is SparkML?

MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as:

1. ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering

2. Featurization: feature extraction, transformation, dimensionality reduction, and selection

3. Persistence: saving and loading algorithms, models, and Pipelines

4. Utilities: linear algebra, statistics, data handling, etc.

What is scikit-learn?

• Used for Data Mining and Data Analysis

• Open Source

• Various classification, regression and clustering algorithms

Watson Machine Learning

• Uses both Spark ML and Scikit-Learn plus others

• Built on SPSS platform

• Can pull from many different data sources

• Integrates with DSX (Beta)

Data Access:
• Easily connect to Behind-the-Firewall and Public Cloud Data
• Catalogued and Governed Controls through Watson Data Platform

Creating Models:
• Single UI and API for creating ML Models on various Runtimes
• Auto-Modelling and Hyperparameter Optimization

Web Service:
• Real-time, Streaming, and Batch Deployment
• Continuous Monitoring and Feedback Loop

Intelligent Apps:
• Integrate ML models with apps, websites, etc.
• Continuously Improve and Adapt with Self-Learning

IBM DSX Machine Learning

IMS

IBM Machine Learning in Data Science Experience

[Screenshots: API for Jupyter Notebooks; Wizard GUI]

IBM Machine Learning is provisioned by default in Data Science Experience
• Enables Data Scientists to deploy machine learning models as web services
• Single UI for creating, collaborating, deploying, monitoring, and feedback
• Accessible via API, Wizard GUI, and Canvas

IBM's Commitment to Spark

Spark Tech Center (STC): IBM’s Commitment to Spark

[Bar chart: Top 7 Contributing Companies to Spark 2.0.0 (Databricks, IBM, Hortonworks, Cloudera, Intel, IVU Traffic Technologies, Tencent); vertical axis 0 to 1000]

25,600 Spark LOC

606 Spark JIRAs

253 SystemML JIRAs

64 Speakers at events

… and all that with 1 Team

1.5 Years

[Chart legend: Databricks, Hortonworks, Cloudera, Intel, Tencent, NTT, Other]

IBM Spark Technology Center – San Francisco, CA

As of March 10, 2016

See what we’re up to …

IBM Spark Technology Center

http://www.spark.tc/blog/

Fixing lots of issues reported by others

Using Spark to Predict Hospital Readmissions & How Predictive Analytic Lifecycles Typically Work

Reducing Hospital Readmissions with Predictive Analytics

An Example ‘Proof of Concept’ Using Open Data

Outline

Problem

Solution

Details

Results

Summary

Problem

Problem: 30-Day Hospital Readmissions Cost $41B Annually

Source: http://www.hcup-us.ahrq.gov/reports/statbriefs/sb172-Conditions-Readmissions-Payer.pdf

Medicare HRRP – Penalties to Hospitals

Source: Kaiser Family Foundation
http://kff.org/medicare/issue-brief/aiming-for-fewer-hospital-u-turns-the-medicare-hospital-readmission-reduction-program/

Solution

Get Data: Diabetes Readmissions Dataset

• University of California Irvine – Machine Learning Repository

• Open Data

• 130 Hospitals, 1999-2008

• 101,766 rows, 50 columns of data

Diabetes Readmissions

• Top ten for Medicaid, Private Insurance and Uninsured

• Not in top ten for Medicare

https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008

Build a Predictive Model: Conceptual View

Step 1: Model Development

Historical Data → Machine Learning (Mathematical Algorithm) → Model

Step 2: Perform Predictions

New Case → Model → Prediction
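The two steps can be sketched in a few lines of plain Python. This is a toy illustration only, not the model built in the talk: the "algorithm" is just a per-category readmission rate, and the feature name prior_visits and the records in the usage note are made up.

```python
from collections import defaultdict


def develop_model(historical_data, feature):
    """Step 1: learn from labeled history.

    historical_data is a list of (record, readmitted) pairs, where record
    is a dict. The 'model' is the observed readmission rate for each value
    of the chosen feature, plus the overall rate as a fallback.
    """
    if not historical_data:
        raise ValueError("need at least one historical record")
    counts = defaultdict(lambda: [0, 0])  # value -> [readmitted, total]
    for record, readmitted in historical_data:
        counts[record[feature]][0] += int(readmitted)
        counts[record[feature]][1] += 1
    total_readmitted = sum(r for r, _ in counts.values())
    total = sum(n for _, n in counts.values())
    return {
        "feature": feature,
        "rates": {value: r / n for value, (r, n) in counts.items()},
        "default": total_readmitted / total,
    }


def predict(model, new_case):
    """Step 2: score a new case with the model from step 1."""
    value = new_case.get(model["feature"])
    return model["rates"].get(value, model["default"])
```

For example, with three records at prior_visits = 0 (one readmitted) and three at prior_visits = 2 (two readmitted), predict returns 2/3 for a new case with prior_visits = 2 and falls back to the overall rate of 0.5 for a value it has never seen.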

IBM Bluemix

• Infrastructure, Watson, software and services on Bluemix Cloud Platform

• Services such as BigInsights (Hadoop), Data Connect (ETL), and Spark can be almost instantly provisioned

Data Science Experience (DSX)

• Easily execute Scala, Python and R notebooks

• Share notebooks with your data science team

Bluemix Services Architecture in the Cloud

[Diagram components: BigInsights HDFS (Hadoop), Data Connect, DashDB, Data Science Experience, Cloudant, Node.js Web Form. Flow labels: Training Data, Convert to CSV, New Records, Predictions]

Details

A Look at The Raw Data

Data Science Experience – Python Code

Results

First Pass Results – Are they any good?

AUC = Area Under the ROC Curve

AUC Score 0.6514

0.50 = Random Guessing

1.00 = Perfect Prediction
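One way to read an AUC score: it is the chance that a randomly chosen readmitted patient gets a higher predicted score than a randomly chosen non-readmitted patient. The sketch below computes it directly from that definition in plain Python; the function name auc_score is hypothetical, and the numbers in the usage note are toy values, not results from the talk.

```python
def auc_score(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for a positive case (e.g. readmitted), 0 for a negative case.
    scores: the model's predicted score for each case.
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A model that gives every case the same score lands at 0.5, one that ranks every positive above every negative lands at 1.0, and auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]) is 0.75 because three of the four positive/negative pairs are ordered correctly. The pairwise loop is quadratic, which is fine for a sketch but slow on a dataset of 100,000 rows.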

2nd Pass Results – Are they any good?

AUC = Area Under the ROC Curve

AUC Score 0.6750

0.50 = Random Guessing

1.00 = Perfect Prediction

How Do Other Readmission Models Perform?

“A comparison of models for predicting early hospital readmissions”

Journal of Biomedical Informatics, Volume 56, August 2015, Pages 229–238

Source: http://www.sciencedirect.com/science/article/pii/S1532046415000969

Which Factors Affect Diabetes Readmission?

Data: Feature Importance from Random Forest Algorithm

The Algorithm can tell us which features (columns) it found important during the training process.

22 columns from original 50
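To give a feel for what "importance" means, here is a sketch of permutation importance in plain Python. Note that this is a different, model-agnostic technique swapped in for illustration, not the Random Forest's built-in importance used in the talk: shuffle one column at a time and measure how much accuracy drops. The function names and the toy data in the usage note are hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows where the model's prediction matches the label."""
    correct = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return correct / len(rows)


def permutation_importance(predict, rows, labels, repeats=5, seed=0):
    """Average drop in accuracy when each column is shuffled in turn.

    rows is a list of equal-length lists (one value per feature).
    A column the model does not rely on shows a drop of zero.
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / repeats)
    return importances
```

With a toy model that predicts from column 0 only, for example predict = lambda row: row[0], shuffling column 0 hurts accuracy while shuffling any other column leaves it unchanged, so column 0 gets a positive importance and the others get 0.0.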

Summary

• Readmissions Prediction is an important area of research for using Predictive Analytics in Healthcare

• Patient: Improved Outcome

• Hospital Providers: Avoid Penalties

• Payers: Reduce Costs

• In a short amount of time we were able to develop results comparable to leading research studies

How you can get a free-trial Spark Cluster from IBM

Sign Up for Free Account

Data Science Experience with IBM ML

https://ibm.box.com/s/y2zvpzk8pje56lto0oja0372tnbydbom

http://datascience.ibm.com/

Notebook Samples