Top 3 Considerations for Machine Learning on Big Data
![Page 1: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/1.jpg)
© 2013 Datameer, Inc. All rights reserved.
![Page 2: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/2.jpg)
Top 3 Things to Consider with Machine Learning on Big Data
Karen Hsu and Elliott Cordo
![Page 3: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/3.jpg)
About our Speakers: Karen Hsu
• Karen is Senior Director, Product Marketing at Datameer. With over 15 years of experience in enterprise software, she has co-authored 4 patents and worked in a variety of engineering, marketing and sales roles.
• Most recently she came from Informatica, where she worked with the start-ups Informatica purchased to bring data quality, master data management, B2B and data security solutions to market.
• Karen has a Bachelor of Science degree in Management Science and Engineering from Stanford University.
![Page 4: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/4.jpg)
About our Speakers: Elliott Cordo
• Elliott is a data warehouse and information management expert. He brings more than a decade of experience in implementing data solutions, with hands-on experience in every component of the data warehouse software development lifecycle.
• At Caserta Concepts, Elliott oversees large-scale major technology projects, including those involving business intelligence, data analytics, Big Data and data warehousing.
![Page 5: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/5.jpg)
• Drivers & Challenges
• Use Cases
• Key Criteria
• Best Practices
• Next Steps
![Page 6: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/6.jpg)
Drivers & Challenges
![Page 7: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/7.jpg)
Big Data Drives Results
[Figure: two charts with a y-axis from $0 to $300 and an x-axis of quarterly dates from 12/31/09 to 03/21/13. Panel titles: Amazon vs Barnes & Noble; NetFlix vs Blockbuster]
Big Data Analytics Drives Results
![Page 8: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/8.jpg)
Alternatives Are Lacking
Data Mining
• Hard to use
• Requires PhD experts
• Must write code
• Expensive
Traditional BI
• Fixed DW models
• Must write code for analytics
• Very high IT labor costs
• Not agile
Visualization
• Easy for small teams
• Can’t manage large data volume
• Lacks support for advanced analytics
![Page 9: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/9.jpg)
Costs of Building Can be $1M+

$1M+ in Salaries

| Job Title | Bay Area | New York |
|---|---|---|
| IT Project Manager | $140,000.00 | $126,000.00 |
| System Administrator | $117,000.00 | $105,000.00 |
| Network Administrator | $119,000.00 | $107,000.00 |
| Database Administrator | $125,000.00 | $119,000.00 |
| IT Security Manager | $116,000.00 | $104,000.00 |
| Business Intelligence Analyst | $137,000.00 | $133,000.00 |
| Data Scientist | $138,000.00 | $133,000.00 |
| Java Developer | $136,000.00 | $133,000.00 |
| QA Engineer | $120,000.00 | $114,000.00 |
| Total | $1,148,000.00 | $1,074,000.00 |

$1M+ in Capital

| Solution | Cost / 100TB |
|---|---|
| Teradata EDW | $1,650,000.00 |
| Oracle Exadata | $1,400,000.00 |
| IBM Netezza | $1,000,000.00 |
![Page 10: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/10.jpg)
Use Cases
![Page 11: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/11.jpg)
Use Cases

| Use Case | What is Revealed |
|---|---|
| Profiling and segmentation | Customer, product, market characteristics and segments |
| Acquisition and retention | What leads a person to become a customer or stop being a customer |
| Product development and operations optimization | What led to product or network failure |
| Campaign management | Patterns of successful campaigns |
| Cross-sell / up-sell | Recommendations on services, products, or advisors for a given user/customer profile |
![Page 12: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/12.jpg)
Customer Examples (use cases by industry)
Financial Services
• Show correlation between services purchased and investments/trades made
• Identify customer segments
• Recommendations for research articles to drive trading
eCommerce
• Show types of events a person will like
• Decision tree based on likelihood to click through
• Recommendations for a large “cold start” population
Gaming
• Clustering for user profiles
• Correlation between attributes of a game and behavior
• Churn analysis
Healthcare
• Recommend tests or other offerings
• Identify factors/trends that lead to disease
![Page 13: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/13.jpg)
Polling Question I
![Page 14: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/14.jpg)
Key Criteria
![Page 15: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/15.jpg)
• Ease of Use
• Quality
![Page 16: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/16.jpg)
Clustering
![Page 17: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/17.jpg)
Clustering Overview
K-Means
• K-means is a popular and versatile general-purpose clustering algorithm.
• Commonly used to group people and objects together to form segments
• Often leveraged to enhance recommendation and search systems
How it works
1. Treats items as coordinates
2. Places a number of random “centroids” and assigns the nearest items
3. Moves the centroids around based on average location
4. Process repeats until the assignments stop changing
*Diagram from Programming Collective Intelligence by Toby Segaran
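The four K-means steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the code behind any tool in this deck. The function name `kmeans`, the toy `points`, and the choice to seed the centroids from randomly picked items (a common variant of "random centroids") are all assumptions.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster coordinate tuples into k groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Step 2: pick k random items as the starting centroids (an assumed variant).
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Assign every item to its nearest centroid (squared Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the average location of its items.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Step 1: items are treated as coordinates (made-up toy data, two obvious groups).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
```

Cluster labels are arbitrary: what matters is which items share a label, not whether a group is called 0 or 1.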
![Page 18: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/18.jpg)
Ease of Use
First, the set up...
And then run the results...
And write additional code to scale...
In Datameer, you select the columns... And get the results
And the quality of results increases with larger data sets…
![Page 19: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/19.jpg)
Ease of Use
First, you have to set up...
# Principal components of the four iris measurements
pca <- princomp(iris[1:4])
# K-means with 3 clusters; the cluster ids double as plot colors
colors <- kmeans(iris[1:4], 3)$cluster
# Plot the first two components, colored by cluster
plot(pca$scores[,1], pca$scores[,2], col=colors, pch=5)
And then run the results...
And then write more code to scale...
In Datameer, you select the columns... And get the results
![Page 20: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/20.jpg)
Ease of Use
First, select the data...
Second, you need to create the cluster...
And then see the results
In Datameer, you select the columns... And get the results
![Page 21: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/21.jpg)
Ease of Use
1. First, a dataset’s attributes must be converted to numeric representations

| User | Location | Company | Favorite Algo |
|---|---|---|---|
| Elliott | New Jersey | Caserta | K-Means |
| Karen | California | Datameer | K-Means |

| User | Location | Company | Favorite Algo |
|---|---|---|---|
| 1001 | 1 | 101 | 1001 |
| 1002 | 2 | 102 | 1001 |

2. This numeric dataset is then converted to a sequence file, then to sparse vectors, using seqdirectory and seq2sparse
3. Mahout is called, and the number of clusters and the distance calculation are specified:
bin/mahout kmeans \
  -i /user/kmeans/vectors \
  -c /user/kmeans/input \
  -o /user/kmeans/output \
  -k 200 \
  -dm CosineSimilarity \
  -x 20 \
  -ow
4. The sparse vector output is then converted back to a delimited format
5. Textual attributes will be appended back to the record; numeric values are preserved for ad-hoc distance comparison of members within a cluster
In Datameer, you select the columns... And get the results
*Diagram from Programming Collective Intelligence by Toby Segaran
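Steps 1 and 5 (text to numbers, then numbers back to text) can be sketched as below. This is an illustration only. The helpers `encode_rows` and `decode_row` are hypothetical names, and the per-column starting codes in `first_code` are an assumed numbering scheme chosen so the output matches the numeric table on this slide.

```python
def encode_rows(rows, columns, first_code):
    """Step 1: replace each text value with an integer code, column by column."""
    codebooks = {col: {} for col in columns}
    encoded = []
    for row in rows:
        numeric = {}
        for col in columns:
            book = codebooks[col]
            value = row[col]
            if value not in book:
                # Hand out the next unused code for this column.
                book[value] = first_code[col] + len(book)
            numeric[col] = book[value]
        encoded.append(numeric)
    return encoded, codebooks

def decode_row(numeric, codebooks):
    """Step 5: look the original text values back up from their codes."""
    decoded = {}
    for col, code in numeric.items():
        reverse = {c: v for v, c in codebooks[col].items()}
        decoded[col] = reverse[code]
    return decoded

columns = ["User", "Location", "Company", "Favorite Algo"]
rows = [
    {"User": "Elliott", "Location": "New Jersey", "Company": "Caserta", "Favorite Algo": "K-Means"},
    {"User": "Karen", "Location": "California", "Company": "Datameer", "Favorite Algo": "K-Means"},
]
# Assumed starting code per column, picked to reproduce the slide's numbers.
first_code = {"User": 1001, "Location": 1, "Company": 101, "Favorite Algo": 1001}

encoded, codebooks = encode_rows(rows, columns, first_code)
for numeric in encoded:
    print([numeric[col] for col in columns])
```

Keeping the codebooks around is what makes step 5 possible: without them the clustered numeric rows cannot be turned back into readable records.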
![Page 22: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/22.jpg)
Quality Comparison
![Page 23: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/23.jpg)
Column Dependencies
![Page 24: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/24.jpg)
Column Dependencies Overview

| A | B |
|---|---|
| a | x |
| b | y |
| b | y |
| a | x |
| c | z |
| a | y |

Column Dependency ~ 0.99

| C | D |
|---|---|
| a | x |
| b | x |
| b | y |
| a | z |
| c | y |
| a | y |

Column Dependency ~ 0.01

Value
• See how data is related after joining multiple sets of data
• See column dependencies on multiple types of data
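The deck does not say how Datameer computes its Column Dependency score, so the sketch below swaps in a different, well-known measure: mutual information normalized to the range 0 to 1. It is meant only to show what a dependency score between two columns can look like, and it should not be expected to reproduce the ~0.99 and ~0.01 values above. The function names and toy columns are assumptions.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def dependency(col_a, col_b):
    """Normalized mutual information between two columns, from 0 to 1."""
    h_a, h_b = entropy(col_a), entropy(col_b)
    h_ab = entropy(list(zip(col_a, col_b)))  # entropy of the value pairs
    mutual_info = h_a + h_b - h_ab
    smallest = min(h_a, h_b)
    # A constant column carries no information, so report no dependency.
    return mutual_info / smallest if smallest > 0 else 0.0

col_a = ["a", "a", "b", "b"]
dependent = dependency(col_a, ["x", "x", "y", "y"])    # B is determined by A
independent = dependency(col_a, ["x", "y", "x", "y"])  # B is unrelated to A
print(dependent, independent)
```

Because values are treated as categories, the same function accepts strings and numbers alike; continuous numeric columns would need to be binned first.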
![Page 25: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/25.jpg)
Quality Comparison
[Figure: six scatter plots of Column B against Column A, each labeled with its ColumnDependency(A,B) value. Four numeric panels are labeled 0, 0.5, 0.5 and 1. Two panels that plot Column B (STRING) against Column A (NUMBER) are labeled 0.5 and 1.]
![Page 26: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/26.jpg)
Decision Tree
![Page 27: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/27.jpg)
Decision Tree Overview
Goal: Create a model that predicts the value of a target based on several inputs.
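A decision tree is built from repeated single splits, so the smallest useful illustration is one split (a "stump"). The sketch below picks the input and threshold that best separate the target using Gini impurity. It is a generic textbook approach under stated assumptions, not the algorithm used by R, Weka, or Datameer, and the toy measurements are made up.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 when all labels agree, higher when they are mixed."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature index, threshold) with the lowest weighted impurity."""
    best, best_score = None, float("inf")
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must put items on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, threshold), score
    return best

def predict(row, split, rows, labels):
    """Predict the majority training label on the row's side of the split."""
    f, threshold = split
    side = [y for r, y in zip(rows, labels) if (r[f] <= threshold) == (row[f] <= threshold)]
    return Counter(side).most_common(1)[0][0]

# Made-up inputs (two measurements per item) and the target to predict.
rows = [(1.4, 0.2), (1.3, 0.2), (1.5, 0.3), (4.7, 1.4), (4.5, 1.5), (4.9, 1.5)]
labels = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]

split = best_split(rows, labels)
print(split, predict((1.2, 0.2), split, rows, labels), predict((5.0, 1.6), split, rows, labels))
```

A full tree would apply `best_split` again to the items on each side until the groups are pure or too small to split.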
![Page 28: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/28.jpg)
Ease of Use
First, you need to code...
# Install and load the rpart package
install.packages("rpart")
library(rpart)
# Read the data and fit a tree that predicts class from the four measurements
treeInput <- read.csv("/PathToData/iris.csv")
fit <- rpart(class ~ sepalLength + sepalWidth + petalLength + petalWidth, data=treeInput)
# Draw the fitted tree and label each node with its counts
par(mfrow=c(1,2), xpd=NA)
plot(fit)
text(fit, use.n=TRUE)
And then run the results...
And then write more code to scale...
In Datameer, you select the columns... And get the results
![Page 29: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/29.jpg)
Ease of Use
First, select the data...
Second, you configure the settings...
And then see the results
In Datameer, you select the columns... And get the results
![Page 30: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/30.jpg)
Quality Comparison

| Tool | Iris | Wine | Breast Cancer Wisconsin |
|---|---|---|---|
| R | 92.66% | 86.47% | 92.86% |
| Weka | 95.33% | 89.33% | 93.5% |
| Datameer | 93.33% | 91.18% | 93.04% |
![Page 31: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/31.jpg)
Recommendations
![Page 32: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/32.jpg)
Recommendations Overview
• Increased revenue
• Your customers expect them
• What makes a good recommendation?
• The combination of algorithms and Hadoop makes an effective recommendations platform achievable
![Page 33: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/33.jpg)
Ease of Use
First, the set up...
# run factorization of ratings matrix
$MAHOUT parallelALS --input ${WORK_DIR}/dataset/trainingSet/ --output ${WORK_DIR}/als/out \
  --tempDir ${WORK_DIR}/als/tmp --numFeatures 20 --numIterations 10 --lambda 0.065 --numThreadsPerSolver 2

# compute recommendations
$MAHOUT recommendfactorized --input ${WORK_DIR}/als/out/userRatings/ --output ${WORK_DIR}/recommendations/ \
  --userFeatures ${WORK_DIR}/als/out/U/ --itemFeatures ${WORK_DIR}/als/out/M/ \
  --numRecommendations 6 --maxRating 5 --numThreads 2
And then run the results...
1 [845:5.0,550:5.0,546:5.0,25:5.0,531:5.0,529:5.0,527:5.0,31:5.0,515:5.0,514:5.0]
2 [546:5.0,288:5.0,11:5.0,25:5.0,531:5.0,527:5.0,515:5.0,508:5.0,496:5.0,483:5.0]
3 [137:5.0,284:5.0,508:4.832,24:4.82,285:4.8,845:4.75,124:4.7,319:4.703,29:4.67,591:4.6]
4 [748:5.0,1296:5.0,546:5.0,568:5.0,538:5.0,508:5.0,483:5.0,475:5.0,471:5.0,876:5.0]
5 [732:5.0,550:5.0,9:5.0,546:5.0,11:5.0,527:5.0,523:5.0,514:5.0,511:5.0,508:5.0]
6 [739:5.0,9:5.0,546:5.0,11:5.0,25:5.0,531:5.0,528:5.0,527:5.0,526:5.0,521:5.0]
In Datameer, you select the columns... And get the results
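Once the ratings matrix has been factorized into user features (U) and item features (M), a predicted rating is typically the dot product of one user vector with one item vector. The sketch below illustrates that idea and prints a line in the same `user [item:score,...]` shape as the output above. It is not Mahout's implementation. The function names, the tiny 2-feature vectors, and the reading of `--maxRating` as a cap on the predicted score are all assumptions.

```python
def predict_rating(user_vec, item_vec, max_rating=5.0):
    """Predicted rating: dot product of the feature vectors, capped at max_rating."""
    score = sum(u * m for u, m in zip(user_vec, item_vec))
    return min(score, max_rating)

def recommend(user_vec, item_features, already_rated, n):
    """Top-n items the user has not rated, as (item_id, predicted rating) pairs."""
    scored = [
        (item_id, predict_rating(user_vec, vec))
        for item_id, vec in item_features.items()
        if item_id not in already_rated
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Hypothetical 2-feature factors (the run on this slide uses 20 features).
user_features = {1: (1.0, 2.0)}
item_features = {10: (1.0, 1.0), 11: (2.0, 2.0), 12: (0.5, 0.25), 13: (1.0, 0.5)}

top = recommend(user_features[1], item_features, already_rated={13}, n=2)
line = "1 [" + ",".join(f"{item}:{score}" for item, score in top) + "]"
print(line)
```

A cap like this would explain why many scores in the output above sit at exactly 5.0, though the deck itself does not confirm it.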
![Page 34: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/34.jpg)
Quality Comparison

| User | Shawshank | Godfather | Pulp Fiction | Fight Club |
|---|---|---|---|---|
| Dianna | 4.76 | 4.98 | 1.95 | 2.44 |
| Jon | 1.99 | 2.51 | 2.87 | 4.83 |
| Karen | 3.28 | 4.72 | 1.89 | 2.95 |
| Elliott | 2.92 | 3.64 | 2.97 | 4.83 |

Same Results
![Page 35: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/35.jpg)
Best Practices
![Page 36: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/36.jpg)
Big Data Analytics Process
[Diagram labels: Integrate; Prepare and Analyze; Visualize; Define; Deploy; Ad Hoc; Production]
![Page 37: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/37.jpg)
Clustering
• Leverage hierarchies
• If possible, use numbering schemes
• Scale the surrogate key of attributes
• Try different cluster sizes
• Avoid numeric similarities when building your data
![Page 38: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/38.jpg)
Recommendations
• Leverage a combination of algorithms
• Clustering is your friend!
• Treat cold start situations differently
• Think about ranking
• Don’t let recommendations go wild
[Diagram labels: Item-Based; K-Means: Similar; Item Similarity; Best Recommendations]
![Page 39: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/39.jpg)
Process Best Practices
• Iterate
• Map
• Chain
![Page 40: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/40.jpg)
Demonstration
![Page 41: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/41.jpg)
Polling Question II
![Page 42: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/42.jpg)
Return on Investment

| Use Case | Result |
|---|---|
| Funnel Optimization | Increase customer conversion by 3x |
| Behavioral Analytics | Increase revenue by 2x |
| Fraud Prevention | Identify $2B in potential fraud |
| EDW Optimization | 98% OpEx savings, $1M+ CapEx savings |
| Customer Segmentation | Lower customer acquisition costs by 30% |
![Page 43: Top 3 Considerations for Machine Learning on Big Data](https://reader033.fdocuments.us/reader033/viewer/2022052823/5555c5c1d8b42aaf158b479c/html5/thumbnails/43.jpg)
Call to Action
Workshop
Contact
• Elliott Cordo [email protected]
• Karen Hsu [email protected]