Scaling Analytics with Apache Spark


Transcript of Scaling Analytics with Apache Spark

Page 1: Scaling Analytics with Apache Spark

Location:

QuantUniversity Meetup

August 8th 2016

Boston MA

Scaling Analytics with Apache Spark

2016 Copyright QuantUniversity LLC.

Presented By:

Sri Krishnamurthy, CFA, CAP

www.QuantUniversity.com

[email protected]

Page 2: Scaling Analytics with Apache Spark


Slides and Code will be available at: http://www.analyticscertificate.com/SparkWorkshop/

Page 3: Scaling Analytics with Apache Spark

• Analytics Advisory services
• Custom training programs
• Architecture assessments, advice and audits

Page 4: Scaling Analytics with Apache Spark

• Founder of QuantUniversity LLC. and www.analyticscertificate.com

• Advisory and Consultancy for Financial Analytics

• Prior Experience at MathWorks, Citigroup and Endeca and 25+ financial services and energy customers (Shell, Firstfuel Software etc.)

• Regular Columnist for the Wilmott Magazine

• Author of the forthcoming book "Financial Modeling: A case study approach", published by Wiley

• Chartered Financial Analyst and Certified Analytics Professional

• Teaches Analytics in the Babson College MBA program and at Northeastern University, Boston

Sri Krishnamurthy
Founder and CEO


Page 5: Scaling Analytics with Apache Spark


Quantitative Analytics and Big Data Analytics Onboarding

• Trained more than 500 students in Quantitative methods, Data Science and Big Data Technologies using MATLAB, Python and R

• Launching the Analytics Certificate Program in September

Page 6: Scaling Analytics with Apache Spark

(MATLAB version also available)

Page 7: Scaling Analytics with Apache Spark


Quantitative Analytics and Big Data Analytics Onboarding

• Apply at: www.analyticscertificate.com

• Program starting September 18th

• Module 1:
▫ Sep 18th, 25th, Oct 2nd, 9th

• Module 2:
▫ Oct 16th, 23rd, 30th, Nov 6th

• Module 3:
▫ Nov 13th, 20th, Dec 4th, Dec 11th

• Capstone + Certification Ceremony
▫ Dec 18th

Page 8: Scaling Analytics with Apache Spark


Events of Interest

• August
▫ 14th-20th: ARPM in New York www.arpm.co (QuantUniversity presenting on Model Risk on August 14th)
▫ 18th-21st: Big-data Bootcamp http://globalbigdataconference.com/68/boston/big-data-bootcamp/event.html

• September
▫ 1st: QuantUniversity Meetup (AnalyticsCertificate program open house)
▫ 11th, 12th: Spark Workshop, Boston
▫ 19th, 20th: Anomaly Detection Workshop, New York

Page 9: Scaling Analytics with Apache Spark


Page 10: Scaling Analytics with Apache Spark

Agenda

1. A quick introduction to Apache Spark

2. A sample Spark Program

3. Clustering using Apache Spark

4. Regression using Apache Spark

5. Simulation using Apache Spark

Page 11: Scaling Analytics with Apache Spark

Apache Spark: Soaring in Popularity

Ref: Wall Street Journal http://www.wsj.com/articles/newer-software-aims-to-crunch-hadoops-numbers-1434326008

Page 12: Scaling Analytics with Apache Spark

What is Spark?

• Apache Spark™ is a fast and general engine for large-scale data processing.

• Came out of U.C. Berkeley’s AMP Lab

Lightning-fast cluster computing

Page 13: Scaling Analytics with Apache Spark

Why Spark?

Speed

• Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

• Spark has an advanced DAG (directed acyclic graph) execution engine that supports acyclic data flow and in-memory computing.
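A minimal sketch of what in-memory computing buys you, assuming a SparkContext named sc and a hypothetical ratings.csv whose third column is numeric (both are illustrative assumptions, not from the slides): the RDD is cached once and then reused by several computations without re-reading the data from disk.

    # parse a (hypothetical) ratings.csv whose third column is a numeric value
    values = sc.textFile("ratings.csv").map(lambda line: float(line.split(",")[2]))
    values.cache()                      # keep the parsed RDD in cluster memory

    n = values.count()                  # first action: reads the file and fills the cache
    avg = values.sum() / n              # later actions reuse the in-memory partitions
    top = values.max()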

Page 14: Scaling Analytics with Apache Spark

Why Spark?

Ease of Use

• Write applications quickly in Java, Scala, Python, or R.

• Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala and Python shells.

• R support was recently added.

• Word count in Spark's Python API:

    # sc is the SparkContext available in the PySpark shell
    text_file = sc.textFile("hdfs://...")
    counts = (text_file.flatMap(lambda line: line.split())
                       .map(lambda word: (word, 1))
                       .reduceByKey(lambda a, b: a + b))

Page 15: Scaling Analytics with Apache Spark

Why Spark?

• Generality: Combine SQL, streaming, and complex analytics.

• Spark powers a stack of high-level tools including:
1. Spark Streaming: processing real-time data streams
2. Spark SQL and DataFrames: support for structured data and relational queries
3. MLlib: built-in machine learning library
4. GraphX: Spark's new API for graph processing

Page 16: Scaling Analytics with Apache Spark

Why Spark?

• Runs Everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

• You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos.

• Access data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source.

Page 17: Scaling Analytics with Apache Spark

Key Features of Spark

• Handles batch, interactive, and real-time within a single framework

• Native integration with Java, Python, Scala, R

• Programming at a higher level of abstraction

• More general: map/reduce is just one set of supported constructs

Page 18: Scaling Analytics with Apache Spark

Secret Sauce: RDD, Transformation, Action

Page 19: Scaling Analytics with Apache Spark

How does it work?

• Resilient Distributed Datasets (RDD) are the primary abstraction in Spark – a fault-tolerant collection of elements that can be operated on in parallel.

• Transformations create a new dataset from an existing one. All transformations in Spark are lazy: they do not compute their results right away – instead they remember the transformations applied to some base dataset.

• Actions return a value to the driver program after running a computation on the dataset.
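A minimal sketch of this lazy-evaluation model, assuming a SparkContext named sc and a hypothetical logs.txt file (the file name and the filter condition are illustrative): transformations only record the lineage, and work happens when an action is called.

    lines = sc.textFile("logs.txt")                   # an RDD; nothing is read yet
    errors = lines.filter(lambda l: "ERROR" in l)     # transformation: recorded, not computed
    words = errors.map(lambda l: l.split()[0])        # transformation: still lazy
    n = errors.count()                                # action: a job runs; the count returns to the driver
    sample = words.take(5)                            # action: another job; returns a Python list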

Page 20: Scaling Analytics with Apache Spark

How is Spark different?

• Map-Reduce: Hadoop

Page 21: Scaling Analytics with Apache Spark

Problems with this MR model

• Difficult to code

Page 22: Scaling Analytics with Apache Spark

Getting started

• http://spark.apache.org/docs/latest/index.html

• http://datascience.ibm.com/

• https://community.cloud.databricks.com

Page 23: Scaling Analytics with Apache Spark

Quick Demo

• Test_Notebook.ipynb

Page 24: Scaling Analytics with Apache Spark

Machine learning with Spark

Page 25: Scaling Analytics with Apache Spark

Machine learning with Spark

Page 26: Scaling Analytics with Apache Spark


Machine learning with Spark

Page 27: Scaling Analytics with Apache Spark

Use case 1 : Segmenting stocks

• If we have a basket of stocks and their price history, how do we segment them into different clusters?

• What metrics could we use to measure similarity?

• Can we evaluate the effect of changing the number of clusters?

• Do the results seem actionable?

Page 28: Scaling Analytics with Apache Spark

K-means

Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of squares (WCSS). In other words, its objective is to find:

$\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$

where μi is the mean of points in Si.

http://shabal.in/visuals/kmeans/2.html
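A minimal sketch of k-means on Spark for the stock-segmentation use case, assuming a SparkContext named sc and a hypothetical returns.csv where each row holds one stock's comma-separated return history (the file name, layout, and the choice of k=5 are illustrative assumptions, not from the slides):

    from numpy import array
    from pyspark.mllib.clustering import KMeans

    # each line of the hypothetical returns.csv: one stock's comma-separated return history
    data = sc.textFile("returns.csv")
    vectors = data.map(lambda line: array([float(x) for x in line.split(",")])).cache()

    # cluster the stocks into k groups
    model = KMeans.train(vectors, k=5, maxIterations=20)

    # within-cluster sum of squares (WCSS), useful for comparing different values of k
    def wcss_contribution(point):
        center = model.clusterCenters[model.predict(point)]
        return float(sum((point - center) ** 2))

    wcss = vectors.map(wcss_contribution).reduce(lambda a, b: a + b)
    print("WCSS for k=5: %f" % wcss)

Re-running the fit for several values of k and comparing the resulting WCSS is one simple way to address the "effect of changing the number of clusters" question on the previous slide.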

Page 30: Scaling Analytics with Apache Spark

Use-case 2 – Regression

• Given historical weekly interest-rate data for AAA bond yields, 10-year Treasuries, 30-year Treasuries and the Federal Funds rate, build a regression model that fits:

• Changes to AAA = function of (Changes to 10-year rates, Changes to 30-year rates, Changes to FF rates)

Page 31: Scaling Analytics with Apache Spark

Linear regression

• Linear regression investigates the linear relationship between variables and predicts one variable based on one or more other variables. It can be formulated as:

$Y = \beta_0 + \sum_{i=1}^{p} \beta_i X_i$

where $Y$ and $X_i$ are random variables, $\beta_i$ are the regression coefficients and $\beta_0$ is a constant.

• In this model, the ordinary least squares estimator is used to minimize the sum of squared differences between the observed values of the dependent variable and the values predicted by the model.
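A minimal sketch of this regression on Spark using the DataFrame-based ML API, assuming a SparkSession named spark and a hypothetical rates_weekly.csv of weekly changes (the file name and the column names chg_aaa, chg_10yr, chg_30yr, chg_ff are illustrative assumptions):

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    # weekly changes in rates; file and column names are illustrative
    df = spark.read.csv("rates_weekly.csv", header=True, inferSchema=True)

    # assemble the explanatory rate changes into a single feature vector
    assembler = VectorAssembler(inputCols=["chg_10yr", "chg_30yr", "chg_ff"],
                                outputCol="features")
    train = assembler.transform(df).withColumnRenamed("chg_aaa", "label")

    # ordinary least squares fit (no regularization)
    lr = LinearRegression(featuresCol="features", labelCol="label", regParam=0.0)
    model = lr.fit(train)

    print("Coefficients: %s" % model.coefficients)
    print("Intercept: %f" % model.intercept)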


Page 32: Scaling Analytics with Apache Spark

Ordinary Least Squares Regression

Page 33: Scaling Analytics with Apache Spark

Demo

• Regression.ipynb

Page 34: Scaling Analytics with Apache Spark

Scaling Monte-Carlo simulations

Page 35: Scaling Analytics with Apache Spark

Example:

• Portfolio Growth

• Given:
▫ INVESTMENT_INIT = 100000 # starting amount
▫ INVESTMENT_ANN = 10000 # yearly new investment
▫ TERM = 30 # number of years
▫ MKT_AVG_RETURN = 0.11 # percentage
▫ MKT_STD_DEV = 0.18 # standard deviation

• Run 10,000 Monte Carlo simulation paths and compute the expected value of the portfolio at the end of 30 years (see the sketch after the reference below).

Ref: https://cloud.google.com/solutions/monte-carlo-methods-with-hadoop-spark
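A minimal sketch of how these 10,000 independent paths could be distributed with Spark, assuming a SparkContext named sc; the per-path random walk draws one normally distributed return per year, which is an illustrative modeling assumption rather than a specification from the slides:

    import random

    INVESTMENT_INIT = 100000.0   # starting amount
    INVESTMENT_ANN = 10000.0     # yearly new investment
    TERM = 30                    # number of years
    MKT_AVG_RETURN = 0.11        # mean annual return
    MKT_STD_DEV = 0.18           # annual standard deviation
    NUM_PATHS = 10000            # number of Monte Carlo paths

    def grow_one_path(seed):
        rng = random.Random(seed)            # independent random stream per path
        value = INVESTMENT_INIT
        for _ in range(TERM):
            annual_return = rng.normalvariate(MKT_AVG_RETURN, MKT_STD_DEV)
            value = value * (1 + annual_return) + INVESTMENT_ANN
        return value

    # distribute the independent paths across the cluster and average the outcomes
    paths = sc.parallelize(range(NUM_PATHS), 100)
    expected_value = paths.map(grow_one_path).mean()
    print("Expected portfolio value after %d years: %.2f" % (TERM, expected_value))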

Page 36: Scaling Analytics with Apache Spark


HyperLogLog

• The count-distinct problem is the problem of finding the number of distinct elements in a data stream with repeated elements.

• HyperLogLog is an algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset.

• Calculating the exact cardinality of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Probabilistic cardinality estimators, such as the HyperLogLog algorithm, use significantly less memory than this, at the cost of obtaining only an approximation of the cardinality.

Ref: https://en.wikipedia.org/wiki/HyperLogLog

Page 37: Scaling Analytics with Apache Spark


HyperLogLog

The basis of the HyperLogLog algorithm is the observation that the cardinality of a multiset of uniformly distributed random numbers can be estimated by calculating the maximum number of leading zeros in the binary representation of each number in the set. If the maximum number of leading zeros observed is n, an estimate for the number of distinct elements in the set is 2^n

Ref: https://en.wikipedia.org/wiki/HyperLogLog
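A toy sketch of this leading-zeros intuition in Python (the 32-bit hash width and the sample data are illustrative assumptions; a real HyperLogLog also averages over many registers and applies bias corrections, so this single-observation estimate is deliberately rough):

    import hashlib

    def leading_zeros_32(value):
        # hash to a 32-bit integer and count its leading zero bits
        h = int(hashlib.md5(value.encode()).hexdigest(), 16) & 0xFFFFFFFF
        return 32 - h.bit_length()

    items = ["AAPL", "MSFT", "GOOG", "AAPL", "IBM", "MSFT"] * 1000
    max_zeros = max(leading_zeros_32(x) for x in items)
    print("Rough distinct-count estimate: 2^%d = %d" % (max_zeros, 2 ** max_zeros))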

Page 38: Scaling Analytics with Apache Spark


Demo from Databricks’s blog

• Approximate algorithms:
▫ approxCountDistinct: returns an estimate of the number of distinct elements
▫ approxQuantile: returns approximate percentiles of numerical data

Refer: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/8599738367597028/4196864626084292/3601578643761083/latest.html
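A minimal sketch of both functions, assuming a SparkSession named spark and a hypothetical DataFrame of trades with symbol and price columns (the data and column names are illustrative):

    from pyspark.sql import functions as F

    # hypothetical DataFrame of trades; spark is an existing SparkSession
    df = spark.createDataFrame(
        [("AAPL", 101.2), ("MSFT", 52.3), ("AAPL", 99.8), ("IBM", 160.1)],
        ["symbol", "price"])

    # estimate of the number of distinct symbols, with a 5% relative error target
    df.agg(F.approxCountDistinct("symbol", rsd=0.05).alias("n_symbols")).show()

    # approximate 25th, 50th and 75th percentiles of price, within 1% relative error
    print(df.approxQuantile("price", [0.25, 0.5, 0.75], 0.01))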

Page 39: Scaling Analytics with Apache Spark


Spark’s implementation

• As per Databricks’s blog: “Spark strives at implementing approximate algorithms that are deterministic (they do not depend on random numbers to work) and that have proven theoretical error bounds: for each algorithm, the user can specify a target error bound, and the result is guaranteed to be within this bound, either exactly (deterministic error bounds) or with very high confidence (probabilistic error bounds)”

Page 40: Scaling Analytics with Apache Spark
Page 41: Scaling Analytics with Apache Spark


www.analyticscertificate.com/SparkWorkshop

Page 42: Scaling Analytics with Apache Spark


Q&A

Slides, code and details about the Apache Spark Workshop at: http://www.analyticscertificate.com/SparkWorkshop/

Page 43: Scaling Analytics with Apache Spark

Thank you!
Members & Sponsors!

Sri Krishnamurthy, CFA, CAP
Founder and CEO

QuantUniversity LLC.

srikrishnamurthy

www.QuantUniversity.com

Contact

Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be distributed or used in any other publication without the prior written consent of QuantUniversity LLC.
