Spark-Zeppelin-ML on HWX
Transcript of Spark-Zeppelin-ML on HWX
Data Science at Scale: Spark – Zeppelin – ML
Kirk Haslbeck, Sr. Solution Engineer HWX
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Kirk Haslbeck - Hortonworks
Sr. Solution Engineer @ Hortonworks
Lead Architect for Trade Surveillance @ Morgan Stanley
Masters in Data Mining @UMBC
Computer Science Degree @ Mount Saint Mary’s University
github.com/kirkhas/zeppelin-notebooks
Spark – Apache Open Source Project
Why do we need Spark?
Distributed – Multi-threading is hard to get right in Java, and even if you do, it isn't distributed. It is limited to a single JVM.
Horizontal – Spark can take advantage of a modern data architecture. It scales out as a function of hardware.
Data Science – R and Python are both growing in popularity and are great for statistical workloads, but they suffer from their single-threaded nature.
Need for a top level computing language – SQL is great and provides a lot of what we need, but not everything. SQL is better for some operations and a full programming language for others, which forces tradeoffs. Spark satisfies both!
Spark API Languages
Spark - Functional + Distributed = Concise and Powerful
Spark Map Function vs. Java Thread Pool
Objective: we have a list of tasks and we want to pad each project timeline with a 20% time buffer
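A minimal sketch of the objective above, written in plain Python rather than the Spark or Java code shown on the slide. The task names and durations are made-up illustration data. With Spark, the same function would be passed to a distributed map over the dataset.

```python
# Hypothetical task list: (task name, estimated days). Illustration data only.
tasks = [("design", 10.0), ("build", 20.0), ("test", 5.0)]

BUFFER = 0.20  # the 20% time buffer from the objective


def pad(task):
    """Return a new (name, days) pair with the buffer added; the input is not modified."""
    name, days = task
    return (name, round(days * (1 + BUFFER), 2))


# map applies pad to every element independently, which is what lets an
# engine such as Spark run the same function on many machines at once.
padded = list(map(pad, tasks))
print(padded)
```

No thread pool, locks, or shared counters are needed because pad depends only on its own input.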
Why did Spark choose Scala?
Functional – Map, Filter, Fold, GroupBy. 5-10X code reduction.
Immutable – No state management, less headache; each operation is fully encapsulated.
Thread Safety is the Biggest Challenge
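The functional operations named above can be sketched in plain Python as a rough analogue of the Scala collection methods. The trade records are hypothetical, and tuples stand in for immutable values.

```python
from functools import reduce
from itertools import groupby

# Hypothetical immutable records: (symbol, price). Tuples cannot be modified in place.
trades = (("AAPL", 100.0), ("MSFT", 50.0), ("AAPL", 110.0), ("MSFT", 55.0))

# Filter: keep trades priced at 55 or more.
large = tuple(t for t in trades if t[1] >= 55.0)

# Map: extract the prices.
prices = tuple(map(lambda t: t[1], trades))

# Fold: combine all prices into one total, starting from 0.0.
total = reduce(lambda acc, p: acc + p, prices, 0.0)

# GroupBy: itertools.groupby needs its input sorted by the grouping key.
by_symbol = {
    sym: tuple(price for _, price in group)
    for sym, group in groupby(sorted(trades, key=lambda t: t[0]), key=lambda t: t[0])
}
```

Every step builds a new value and leaves trades untouched, so there is no shared state for threads to corrupt.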
RDDs, DataFrames and DataSets
Resilient Distributed Dataset (RDD) – Good for schema – case class Trade(sym: String, price: Double)
DataFrame – SQL-like operations, higher level object – aggregations, ordering
Interoperability – Finally interop between Tables, Classes, and Vectors for Data Science, borrowing the best from R, Scala and SQL. Impedance mismatch solved: no need for a Domain Layer or Data Access Layer.
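As an analogy only, not the Spark API, the class view and the table view of the same data can be sketched with the Python standard library. The Trade dataclass mirrors the case class above. The sample rows and the use of an in-memory SQLite table are assumptions made for illustration.

```python
import sqlite3
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class Trade:
    # Mirrors the slide's `case class Trade (sym: String, price: Double)`.
    sym: str
    price: float


# Hypothetical sample data.
trades = [Trade("AAPL", 100.0), Trade("MSFT", 50.0), Trade("AAPL", 110.0)]

# Class view -> table view: load the typed records into an in-memory SQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (sym TEXT, price REAL)")
conn.executemany("INSERT INTO trades VALUES (?, ?)", [astuple(t) for t in trades])

# SQL-like operations: aggregation and ordering.
rows = conn.execute(
    "SELECT sym, AVG(price) FROM trades GROUP BY sym ORDER BY sym"
).fetchall()

# Table view -> class view: turn result rows back into typed records.
averages = [Trade(sym, avg) for sym, avg in rows]
conn.close()
```

Here the conversion in each direction is written by hand; the slide's point is that Spark provides this interop without a separate data access layer.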
RDD (low level) vs. DataFrames (new API)
Spark 101 – Execution Model
Spark Driver – Client side application that creates Spark Context
Spark Context – Talks to Spark Driver, Cluster Manager to launch Spark Executors
Cluster Manager – E.g. YARN, Spark Standalone, Mesos
Executors – Spark worker bees
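The division of labour above can be imitated on one machine as a rough sketch. The names driver and run_task are hypothetical, and a local thread pool stands in for the cluster manager and executors; this is not how Spark itself is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(partition):
    """Work done by one 'executor' on its slice of the data."""
    return sum(x * x for x in partition)


def driver(data, num_executors=3):
    """'Driver' role: split the data, hand out tasks, combine the results."""
    # Slice the data into one partition per executor.
    partitions = [data[i::num_executors] for i in range(num_executors)]
    # The pool plays the part of the cluster manager plus executors.
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_sums = list(pool.map(run_task, partitions))
    return sum(partial_sums)


result = driver(list(range(10)))
```

The driver never touches individual elements; it only partitions the work and merges the partial results.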
Spark Engine in the HDP Stack
Spark is first-class citizen of Hadoop
Demo
Show me the Code!
Model Inputs
Data Gathering
Custom Logic
Process Flow
Evaluate Results
What About Machine Learning?
Machine Learning and Big Data
Machine learning has advanced to the point where it more or less goes hand-in-hand with Big Data. Indeed, so popular is the technology that over a third of developers – some 36 percent – who are working on Big Data or advanced analytics projects use elements of machine learning, says a new study by Evans Data Corp.
Machine Learning involves creating and improving complex algorithms that are able to analyze data automatically and identify patterns or predict outcomes based on the knowledge they have “learned”. As such, it has great potential for helping companies to better understand what their data is telling them.
Where Can We Use Data Science?
Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates
Financial services
• Fraud detection/prevention
• Predict underwriting risk
• New account risk screens
Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security
Retail
• Product recommendation
• Inventory management
• Price optimization
Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis
Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels
Customer Use Cases with Spark
Web Analytics - WebTrends
Web Analytics for Marketing
• Ingesting 13 billion events/day
• Use Spark Streaming & Samza for data ingest
• Extremely low latency: 40 milliseconds
• Need more metrics for Spark Streaming
• Wants 2-way SSL for Kafka Spark receiver
Bank/Credit Card
Real time monitoring and Fraud Detection
• Monitor ATM with NiFi
• Start with log aggregation
• Tackle fraud detection next
Railroad Company
Real time view of state of track
• Optimize the train maintenance
• Large volume of track data, down to feel granularity
• GeoSpatial analytics is critical
Cable Company
Optimize Advertising
• Monitor channel changes with Spark Streaming
• Correlate changes with Ads/Programming
• Allocate ads in real time: show ads to users who are watching a show and will stay for over 20 seconds
• How to optimize Spark App development
Example: Credit Card Fraud Detection
Building a Model
Show of hands, how many have built a “Model”?
What are some limitations?
– Conditional based logic: if/else binary decisions
If you need a lot of data to build a good model, what tools can you use?
– Data volumes can eliminate the possibility of desktop tools
Sampling?
– Well… we'd better get an even distribution of true and false positives in each sample, but wait, that requires data munging, which brings us back to what tools we can use.
Security Concerns?
– Extracting data from its secure resting place and pushing it into other environments, oftentimes unsecured files or desktops where Matlab or R can be installed.
Collaboration
– Push processing to the data using modern distributed tooling.
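One way to read the sampling concern above is as stratified sampling: draw the same number of records from each label so the rare class is not lost. A minimal sketch follows, with hypothetical data and a hypothetical helper name stratified_sample.

```python
import random


def stratified_sample(records, label_of, per_class, seed=0):
    """Draw the same number of records from each label group."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    groups = {}
    for rec in records:
        groups.setdefault(label_of(rec), []).append(rec)
    sample = []
    for label in sorted(groups):
        members = groups[label]
        # Never ask for more records than the group contains.
        sample.extend(rng.sample(members, min(per_class, len(members))))
    return sample


# Hypothetical data: 95 legitimate transactions and 5 fraudulent ones.
records = [(i, False) for i in range(95)] + [(i, True) for i in range(95, 100)]
balanced = stratified_sample(records, label_of=lambda r: r[1], per_class=5)
```

Grouping by label requires a pass over the full dataset, which is the data munging step the slide refers to.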
“All models are wrong, some are useful”
George E. P. Box
The most limiting factor is the data. With modern systems we are now able to capture more data and hopefully produce better insights.
Credit Card Fraud
Requirement: Detect fraudulent transactions.
Goal: Save the card company money and build trust amongst card users. Cut down on fraudulent crime.
Functional Requirement: Detect fraud in under 2 seconds at point of sale. Learn, adapt and make smarter decisions over time.
Design
– Distance: How far can one travel over a period of time before it is fraudulent?
– Category: How can we detect a purchase that a customer wouldn’t likely make?
– Frequency: How can we detect purchasing patterns that do not resemble the card holder?
Ideas?
– White board some conditional logic, egregiousness vs binary
– Back test the data
– Build a model per card holder?
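The Distance question above can be sketched as an "impossible travel" check: compute the great-circle distance between two consecutive purchases and compare the implied speed with a ceiling. The 900 km/h ceiling, the function names, and the tuple layout are assumptions for illustration, not the demo's actual logic.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_SPEED_KMH = 900.0  # assumed ceiling, roughly a commercial flight


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def impossible_travel(prev, curr, max_speed_kmh=MAX_SPEED_KMH):
    """prev and curr are (lat, lon, hours_since_epoch) for consecutive purchases."""
    distance = haversine_km(prev[0], prev[1], curr[0], curr[1])
    elapsed = curr[2] - prev[2]
    if elapsed <= 0:
        # Same timestamp: only a zero distance is physically possible.
        return distance > 0
    return distance / elapsed > max_speed_kmh
```

For example, a purchase in New York followed one hour later by a purchase in Los Angeles would be flagged, while the same pair ten hours apart would not.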
Rules, Statistics, Machine Learning
Rule Based Logic
– Great for checking conditions that can prove to be 100% accurate. Easy to build and no reason to over engineer.
– Example: Spending Limit. Card holder limit = $2,000
• If (currentPurchaseAmount + balance > 2,000) then deny transaction
Statistics
– Mean, median, mode, variance, deviation
– Anomaly detection. Outliers. (e.g. women's retail example)
Machine Learning
– Supervised
– Unsupervised
– Trainable
– Adapt over time
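The spending-limit rule above translates directly into code. This sketch uses the $2,000 limit from the example; the function name approve is hypothetical.

```python
CARD_LIMIT = 2000.00  # card holder limit from the example above


def approve(current_purchase_amount, balance, limit=CARD_LIMIT):
    """Rule-based check: deny when the purchase would push the balance over the limit."""
    return current_purchase_amount + balance <= limit
```

A rule like this is exact and needs no training data, which is why it is a poor fit only for questions that have no hard threshold.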
Discovery
Gathered all Credit Card Transactions
– Problem is they didn’t make sense
– No identifiable patterns, no log normal curves
– Gas $45, Chipotle $8.50, Steak dinner $88, Amazon shoes $55
Classification
Outlier Detection: identify abnormal patterns
Example: identify anomalies
Features:
- Time frequency
- Category
- Amount
- Distance
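A simple statistical baseline for the Amount feature is a z-score test. This is a sketch under assumptions: the threshold of 2.0 and the purchase amounts are illustrative, and the method presumes the amounts for one card holder cluster around a typical value.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standard score of each value relative to the sample mean and deviation."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]


def outliers(values, threshold=2.0):
    """Values whose absolute z-score exceeds the (assumed) threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]


# Hypothetical purchase amounts for one card holder, with one unusually large purchase.
amounts = [45.0, 8.5, 88.0, 55.0, 40.0, 60.0, 30.0, 2500.0]
flagged = outliers(amounts)
```

With small samples a single extreme value inflates the standard deviation, so the threshold has to be chosen with the sample size in mind.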
Hortonworks Data Flow
Machine Learning Continued
Classification: predicting a category
Some techniques:
- Naïve Bayes
- Decision Tree
- Logistic Regression
- SGD
- Support Vector Machines
- Neural Network
- Ensembles
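To make two of the listed techniques concrete, here is a minimal sketch of logistic regression trained with stochastic gradient descent (SGD) on a single feature. The toy data, learning rate, and epoch count are assumptions chosen for illustration; a real workload would use a library implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_sgd(xs, ys, lr=0.5, epochs=200):
    """Fit p(y=1|x) = sigmoid(w*x + b) on one feature, one example at a time."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # gradient of the log loss w.r.t. z
            w -= lr * error * x
            b -= lr * error
    return w, b


def predict(w, b, x):
    """Predicted category: 1 when the modelled probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


# Hypothetical, linearly separable toy data: negative x -> class 0, positive x -> class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
w, b = train_logistic_sgd(xs, ys)
```

The model outputs a probability; the category comes from thresholding it, which is what separates classification from regression.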
Regression: predict a continuous value
Some techniques:
- Linear Regression / GLM
- Decision Trees
- Support vector regression
- SGD
- Ensembles
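As an illustration of the first technique, ordinary least squares for a single feature has a closed-form solution. The data below is hypothetical and lies exactly on a line so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (single feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
prediction = slope * 4.0 + intercept
```

The output is a continuous value rather than a category, here the predicted y for x = 4.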
Unsupervised Learning: detect natural patterns
Age   State   Annual Income   Marital status
25    CA      $80,000         M
45    NY      $150,000        D
55    WA      $100,500        M
18    TX      $85,000         S
…     …       …               …
No labels
Model – Naturally occurring (hidden) structure
Clustering: detect similar instance groupings
Some techniques:
- k-means
- Spectral clustering
- DBSCAN
- Hierarchical clustering
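A minimal sketch of k-means (Lloyd's algorithm) on two-dimensional points. The starting centroids are passed in explicitly so the run is deterministic, and the (age, income) points are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    centroids = list(centroids)
    assignment = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        assignment = [
            min(range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2 + (p[1] - centroids[k][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if no point was assigned to it
                centroids[k] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, assignment


# Hypothetical (age, income in $1000s) points forming two obvious groups.
points = [(25, 40), (27, 42), (26, 38), (55, 150), (57, 148), (56, 152)]
centroids, assignment = kmeans(points, centroids=[points[0], points[3]])
```

No labels are supplied; the grouping comes only from distances between points. In practice features on different scales would be normalized first so that one of them does not dominate the distance.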
Getting the Proper Fit
Over-fitting: Model over-fits training set, but does not generalize well to new inputs
Under-fitting: Model can’t predict accurately
Business Intelligence vs. Data Science
R and Matplotlib now available
R and Matlab Visuals in Zeppelin
Matplotlib with Python
Appendix – Links to content
Github: https://github.com/kirkhas/zeppelin-notebooks
Credit Card Fraud (real-time ML): https://community.hortonworks.com/articles/38457/credit-fraud-prevention-demo-a-guided-tour.html
Monte Carlo / VaR: https://community.hortonworks.com/articles/39096/predicting-stock-portfolio-gains-using-monte-carlo.html
Stock Variance: https://community.hortonworks.com/repos/32713/stock-variance-using-zeppelin.html